Evaluating 3D human face reconstruction from a frontal 2D image, focusing on facial regions associated with foetal alcohol syndrome

factor for diagnosis, alongside central nervous system impairments and growth abnormalities. Current methods for analysing the FAS facial phenotype rely on 3D facial image data, obtained from costly and complex surface scanning devices. An alternative is to use 2D images, which are easy to acquire with a digital camera or smart phone. However, 2D images lack the geometric accuracy required for accurate facial shape analysis. Our research offers a solution through the reconstruction of 3D human faces from single or multiple 2D images. We have developed a framework for evaluating 3D human face reconstruction from a single-input 2D image using a 3D face model for potential use in FAS assessment. We first built a generative morphable model of the face from a database of registered 3D face scans with diverse skin tones. Then we applied this model to reconstruct 3D face surfaces from single frontal images using a model-driven sampling algorithm. The accuracy of the predicted 3D face shapes was evaluated in terms of surface reconstruction error and the accuracy of FAS-relevant landmark locations and distances. Results show an average root mean square error of 2.62 mm. Our framework has the potential to estimate 3D landmark positions for parts of the face associated with the FAS facial phenotype. Future work aims to improve on the accuracy and adapt the approach for use in clinical settings.


Introduction
Early detection of foetal alcohol syndrome (FAS) allows for early intervention, mitigates the onset of secondary disorders such as mental breakdown or improper sexual behaviours, and leads to significantly better clinical outcomes. 1 The diagnosis of FAS is based on evidence of central nervous system abnormalities, evidence of growth abnormalities, and a characteristic pattern of facial anomalies, specifically short palpebral fissure length, smooth philtrum, flat upper lip, and flat midface. 2,3 The FAS facial phenotype has been emphasised clinically for diagnosis. [4][5][6][7] However, clinical evaluation requires the expertise of trained dysmorphologists. This requirement limits efforts for large-scale screening in regions with suspected high prevalence, such as South Africa, which has a prevalence rate estimated to be between 93 and 128 per 1000 live births 8 and a shortage of highly trained clinical personnel.
Alternative methods for assessing the FAS facial phenotype are possible but require careful acquisition of face data. Face data collection methods include direct anthropometry using handheld rulers and callipers. Indirect anthropometry, on the other hand, is possible through the acquisition of face data through 2D photogrammetry, 3D stereophotogrammetry, and 3D surface imaging scanners. 2,9,10 Direct anthropometry introduces inaccuracies due to the indentation of some features during contact measurements with physical instruments. For this reason, more effort has been put into indirect anthropometry, which has the added benefit of near-instantaneous patient data acquisition. Furthermore, with indirect approaches, measurements on the images can be repeated in the absence of subjects. Indirect evaluation on 3D image data is typically more accurate than on 2D images. 11 However, acquiring 3D face images using 3D surface scanners tends to be costly and precludes large-scale deployment in low-resource settings.
Reconstruction of the 3D human face from a single 2D image is a popular topic of research, with applications in face recognition, face tracking, face animation, and medical analysis of faces. 12 However, to date, there has not been any report on the quantitative suitability of 3D from 2D face reconstruction for FAS-related facial phenotype characterisation. In this study, our aim was to evaluate the geometric accuracy of a 3D human face reconstruction from a single 2D facial image, using a 3D morphable model of the face. 13 We focused on 3D reconstruction of the complete face to enable surface-based approaches, and to allow us to evaluate landmark and distance-based measurements. We tested if such a reconstruction algorithm could be suitable for automated analysis of facial features related to FAS.

Related work
Three-dimensional morphable models (3DMMs) are high-resolution generative models containing shape and texture variations from sample populations. [13][14][15][16][17] Typically, 3DMMs are built from a set of 3D face scans after establishing anatomical dense correspondences across the face data set. Establishing correspondences ensures that similar features across a set of 3D face scans match each other (e.g. the tip of the nose or the eye corners); we call this process 'registration'.
Several methods for building 3DMMs from a set of 3D face scans have been presented over the years. 12 In pioneering work, Blanz and Vetter 13 built a 3DMM from a set of face scans after computing dense correspondences with an optical flow-based registration technique. The shape and texture variations in a collection of face scans were then modelled using principal component analysis (PCA), resulting in a low dimensional representation. The learned face models were used to estimate a 3D face surface from a single 2D face image. Early 3DMMs were built using just hundreds of face scans. However, a recent study by Booth et al. 16 constructed a 3DMM known as the Large-Scale Facial Model using 9663 3D facial scans. Booth et al. 16 used the non-rigid iterative closest point (NICP) algorithm 18 for registration of the template face surface to each target face scan in the data set, aided by generalised Procrustes analysis (GPA) for similarity alignment of the registered face scans. They then used PCA 19 for statistical analysis of the registered face scans. 3DMMs have already been successfully applied in various application areas including face tracking, face recognition, face segmentation, and face reconstruction. 12 However, additional research focusing on human face variations would still be required before the morphable model could be used for medical purposes. 16
Blanz and Vetter's 13 work was seminal, but their approximate 3D face meshes were only qualitatively evaluated. Romdhani and Vetter 20 took a different approach, extracting multiple features from a single image. The extracted features were then used to estimate a 3D face surface by minimising a cost function. In 2009, a 3DMM, the Basel face model, was made available for research purposes and enabled the community to grow faster. 21 Aldrian and Smith 22 developed the first publicly available inverse graphics algorithm based on a 3DMM. Schönborn et al. 23 employed a sampling-based approach to fit a Gaussian process morphable model to a single 2D image. The face shape reconstruction error, measured as an average root mean square distance, was 3.79 mm. Recently, a first benchmark was established for 3D reconstruction from 2D images. 24 This benchmark is, however, strongly biased towards light skin tones, which represent a narrow subset of the world's population and might not be representative for general clinical application. The state-of-the-art method on this benchmark is a deep learning-based method for 3DMM reconstruction, with an average reconstruction error of 1.38 mm. 25
While reconstruction algorithms are reported in the literature, there is limited research evaluating the accuracy of these algorithms, which has implications for their performance in medical applications. Additionally, to the best of our knowledge, model-based 3D face reconstruction from 2D images has not been evaluated with a focus on FAS applications, perhaps because 3D ground truth data may not be available. A robust single image-based reconstruction approach could offer a cost-effective alternative to 3D surface capturing systems.

Data description
We based our experiments on the BU-3DFE face database, which is a publicly available data set of high-quality 3D scans, acquired using the 3dMD face system. 26 It consists of face scans of 98 subjects of different ethnicities (56 female and 42 male subjects aged between 18 and 70 years). We used only the facial scan with a neutral expression for each identity (see Figure 1 for an example of the images). The data were used with ethical approval from both the University of Cape Town and the State University of New York. To maximise the number of faces for training, we performed a leave-one-out cross-validation scheme for our experiments. From each face scan, we derived the 3D ground truth face shape, established correspondence to our model template, and rendered a frontal 2D image for our 2D to 3D reconstruction task. To reach maximal accuracy, we used 12 manual landmarks to initialise the 2D to 3D reconstruction process: right outer and inner canthi, glabella, left inner and outer canthi, right and left alares, pronasale, subnasale, right and left cheilions. We did not rely strictly on these landmarks as the fitting framework used has been shown to work with automatic landmark detection. This gave us a set of 2D images with known ground truth 3D shapes for learning and evaluating our model and reconstruction framework.

Rigid alignment of face scans
The goal of rigid alignment is to bring all the face scans into a common coordinate system without deformations. The inputs were a set of pre-processed 3D face scans (pre-processing involves trimming the face scans to remove unwanted regions such as the hair and neck) and a set of facial landmarks for each face scan. The facial landmarks were used to calculate a least-squares alignment that brought corresponding landmarks across scans as close together as possible (Procrustes alignment). The training face scans were mapped, using rigid transformations, to the mean of the Basel face model, 27 which represents a common reference face surface. The result of these alignments is a collection of rigidly aligned 3D face scans.
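The least-squares landmark alignment described above can be sketched as follows. This is a minimal illustration using the standard SVD-based (Kabsch) solution, not the code used in the study; the function name `rigid_align` and the synthetic landmarks are assumptions made for the example.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rotation R and translation t mapping source onto target.

    source, target: (n, 3) arrays of corresponding landmarks.
    Returns (R, t) such that source @ R.T + t approximates target.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    # Cross-covariance of the centred landmark sets.
    H = (source - mu_s).T @ (target - mu_t)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so that R is a proper rotation.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t

# Synthetic check: 12 hypothetical landmarks moved by a known rigid transform.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(12, 3))
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([5.0, -2.0, 10.0])
moved = landmarks @ R_true.T + t_true

R_est, t_est = rigid_align(landmarks, moved)
aligned = landmarks @ R_est.T + t_est
```

In practice the estimated transform would be applied to every vertex of the scan, not only to its landmarks.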

Registration of face scans
After rigid alignment, we used a deformable model to establish dense correspondences between a reference face surface and each target face scan in the training data. By dense correspondences, we mean finding the mappings between similar features across the data set. The goal of registration is to re-parameterise the face scans so that they share the same number of vertices and the same triangulation across the training set, with the key property that each vertex corresponds to the same point on each face. The reference face surface is fitted to each target face scan, using a Gaussian process fitting approach, 27 to obtain dense face surface deformations that best match the common reference face surface to the target face scan. The time for registering the reference to each target face was 5-8 minutes, computed on an Intel(R) Core(TM) i5-8350U CPU @ 1.7 GHz. This registration approach builds on a Gaussian process defined by mean and covariance functions to model smooth deformations of the template shape. 28,29 During registration, we searched for the optimal set of parameters of our Gaussian process model to match the 3D scan at hand. The results of applying the fitting approach are registered 3D face scans. As we wanted to build a 3DMM, at this stage we also extracted the colour per vertex from the closest vertex on the face scan, which enabled us to build not only a shape model but also a texture model.

Building face models
With dense correspondences established among the training data, we removed translations and rotations from the data to retain only shape deformation. To remove these non-shape-related transformations from the training data, we applied the Procrustes analysis approach. [30][31][32] After removing these transformations, the principal modes of variation were extracted from the training data using PCA 28,29 to build 3D morphable face models (3DMMs). The 3DMMs consisted of the face means and the principal components as modes of variation. The 3DMMs are expressed as linear combinations of shape and texture vectors in the face subspace. The time required to build a surface model from each registered face scan was 10-20 minutes, computed on an Intel(R) Core(TM) i5-8350U CPU @ 1.7 GHz. An example of registration results as well as a resulting 3DMM built from all 98 scans are illustrated in Figure 2.
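The PCA step can be illustrated with a short sketch. It assumes the registered scans are already aligned and stored as an array of shape (faces, vertices, 3); the function names, the random toy data, and the convention of expressing coefficients in units of standard deviations are assumptions for the example, not details taken from the study.

```python
import numpy as np

def build_shape_model(shapes):
    """Return (mean, components, variances) of a PCA shape model.

    shapes: (n_faces, n_vertices, 3) array of registered, aligned meshes.
    """
    X = np.asarray(shapes, dtype=float)
    X = X.reshape(X.shape[0], -1)            # one flattened row per face
    mean = X.mean(axis=0)
    # SVD of the centred data yields the principal directions directly.
    _, singular_values, components = np.linalg.svd(X - mean, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return mean, components, variances

def synthesise(mean, components, variances, alpha):
    """Generate a shape as the mean plus a linear combination of components.

    alpha: coefficients in units of standard deviations (an assumed convention).
    """
    alpha = np.asarray(alpha, dtype=float)
    k = alpha.shape[0]
    flat = mean + (alpha * np.sqrt(variances[:k])) @ components[:k]
    return flat.reshape(-1, 3)

# Toy data standing in for registered scans: 10 faces with 50 vertices each.
rng = np.random.default_rng(0)
toy_scans = rng.normal(size=(10, 50, 3))
mean, components, variances = build_shape_model(toy_scans)

mean_face = synthesise(mean, components, variances, np.zeros(3))
varied_face = synthesise(mean, components, variances, [2.0, 0.0, 0.0])
```

A texture model can be sketched in the same way by replacing vertex coordinates with per-vertex colour values.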

3D from 2D face reconstruction
The key application of a 3DMM that we were interested in was the estimation of 3D face surfaces and 3D landmark positions for FAS measurements from 2D images. In the reconstruction setting, the 3DMM acts as a prior of 3D face shape, and we searched for the most likely reconstruction given only 2D images. This is potentially useful because 2D images, in contrast to 3D scans, are easy to acquire using either a mobile phone or a portable camera. One of the goals of this study was to reconstruct a neutral 3D face with shape and texture information from a single frontal 2D image and evaluate how close that reconstruction was to the known ground truth.
Given single 2D images, we estimated 3D face reconstructions by fitting the morphable model. We applied an approach proposed by Schönborn et al. 23 to fit a 3DMM to a single 2D image. The reconstruction time, measured on an Intel(R) Core(TM) i5-8350U CPU @ 1.7 GHz, was 58 minutes. The fitting algorithm recovers a full posterior model of the face by simultaneously optimising facial shape and texture as well as illumination and camera parameters for a test face image. We used a spherical harmonics illumination model, which can recover a broad range of natural illumination conditions, in combination with a pinhole camera model. Illumination estimation is a critical step and is optimised early and regularly in the sampling process. The fitting algorithm tries to reconstruct the 2D image, producing a rendering from the 3D model that matches the 2D image as closely as possible. The results of fitting a morphable model to a single 2D image are 3D face shape and texture reconstructions, as illustrated in the pipeline in Figure 2.
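The fitting problem can be written compactly as follows. This is a hedged summary, not a formula from the study: R, p, α_n and β_n follow the notation of the pipeline figure, whereas the mean vectors, the basis vectors u_n and v_n, the combined parameter vector θ, and the likelihood ℓ are notation introduced here for illustration.

```latex
\begin{align}
  s(\alpha) &= \bar{s} + \sum_{n} \alpha_n\, u_n
    && \text{(shape as a linear combination of shape vectors)} \\
  c(\beta)  &= \bar{c} + \sum_{n} \beta_n\, v_n
    && \text{(colour as a linear combination of texture vectors)} \\
  I_{\mathrm{model}}(\theta) &= R\bigl(p,\, s(\alpha),\, c(\beta)\bigr),
    \qquad \theta = (p, \alpha, \beta)
    && \text{(rendered image)} \\
  P(\theta \mid I) &\propto \ell\bigl(\theta;\, I\bigr)\, P(\theta)
    && \text{(posterior over all parameters)}
\end{align}
```

Here the likelihood ℓ scores how well the rendered image matches the input image I, the 3DMM supplies the prior over α and β, and p collects the camera and illumination parameters.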

Evaluating the face shape model
Before applying the 3DMM in our downstream reconstruction of a 3D face surface from a single 2D image, it was necessary to evaluate the quality of the built face model in terms of generalisation, specificity, and compactness. The details of the model evaluation metrics are discussed by Styner et al. 33

Shape model generalisation:
This refers to the ability of the shape model to accurately represent an instance for which it was not trained. The leave-one-out approach 33 was used to evaluate the generalisation ability of the face shape model. For each iteration, a shape model was constructed from a set of training face surfaces, leaving out one face shape instance. With all the training data in correspondence, the left-out face instance was projected into the shape model space to generate a face estimate. To evaluate the geometric accuracy of the estimated face, the distance between the face estimate and the original left-out face instance was calculated. The average vertex-to-vertex root mean squared (RMS) distance between the left-out face instance and the estimated face instance was computed. The procedure was repeated until all the face instances in the training set were used and each time the evaluation metric was calculated. The model generalisation ability results are presented in Figure 3, which demonstrates the generalisation error represented as RMS distance (y axis), plotted against shape principal components (x axis). After 5 principal components, the generalisation accuracy was close to 1.5 mm, and with 50 principal components, we reached an accuracy of approximately 0.5 mm.

Figure caption (fragment): Step 3 illustrates the 3D from 2D reconstruction process and Step 4 presents the 3D reconstruction result based on the single input image. In the reconstruction step, R is the rendering function, p represents rendering parameters, α_n are shape parameters, and β_n are colour parameters. Note that the images used are for illustrative purposes only.

Shape model specificity:
This is defined as the ability of the shape model to randomly generate valid synthetic shape instances that are similar to real shape instances present in the training data set. 33 To evaluate model specificity, a set of 90 shape instances was randomly generated from a distribution of the 3D morphable face model. The RMS distance between the randomly generated shape instances and the closest face surfaces in the training set was calculated as a specificity estimate. Lower RMS deviations are desirable because they indicate that the synthesised shape instances are close to the real shape instances in the training set. Figure 4 shows the specificity results. The results are in common ranges for specificity. 15 Note that it is typical that the specificity decreases (distance increases) with greater model complexity (number of principal components).
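The generalisation and specificity measures can be sketched as below. This is a minimal sketch on synthetic data, assuming flattened shapes in dense correspondence and Gaussian sampling of the model coefficients; the helper names and the toy data are illustrative, not the study's implementation.

```python
import numpy as np

def fit_pca(X):
    """PCA of row-wise flattened shapes: returns (mean, components, variances)."""
    mean = X.mean(axis=0)
    _, s, components = np.linalg.svd(X - mean, full_matrices=False)
    return mean, components, s ** 2 / (len(X) - 1)

def rms_distance(a, b):
    """RMS of the per-vertex Euclidean distances between two flattened shapes."""
    diff = (a - b).reshape(-1, 3)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def generalisation(X, k):
    """Mean leave-one-out RMS error when using the first k components."""
    errors = []
    for i in range(len(X)):
        mean, components, _ = fit_pca(np.delete(X, i, axis=0))
        basis = components[:k]
        # Project the left-out face into the model space and back.
        estimate = mean + (X[i] - mean) @ basis.T @ basis
        errors.append(rms_distance(X[i], estimate))
    return float(np.mean(errors))

def specificity(X, k, n_samples, rng):
    """Mean RMS distance from random model samples to their closest training face."""
    mean, components, variances = fit_pca(X)
    errors = []
    for _ in range(n_samples):
        alpha = rng.standard_normal(k) * np.sqrt(variances[:k])
        sample = mean + alpha @ components[:k]
        errors.append(min(rms_distance(sample, x) for x in X))
    return float(np.mean(errors))

# Toy data: 20 shapes of 30 vertices with low-dimensional structure plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(20, 5))
basis_true = rng.normal(size=(5, 30 * 3))
toy_shapes = latent @ basis_true + 0.01 * rng.normal(size=(20, 30 * 3))

gen_2 = generalisation(toy_shapes, k=2)
gen_10 = generalisation(toy_shapes, k=10)
spec_5 = specificity(toy_shapes, k=5, n_samples=90, rng=rng)
```

Because the subspaces are nested, the generalisation error cannot increase as components are added, which matches the downward trend described for Figure 3.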

Shape model compactness:
This indicates the percentage of variability accounted for by increasing numbers of principal components. Fewer principal components capture variability in shape information more efficiently. To validate the model compactness, the cumulative variance accounted for by the shape model was plotted as a function of the number of principal components of the shape model (illustrated in Figure 5). The line reflecting cumulative variance flattens as the number of shape principal components increases. Using only the first 20 shape principal components, the shape model accounts for more than 90% of shape variation in the training data set. This implies that the shape model is compact as it describes the training data set using a small number of principal components.
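Compactness reduces to a cumulative sum over the PCA variances. The sketch below uses made-up variances purely for illustration; `compactness` and `components_needed` are hypothetical helper names.

```python
import numpy as np

def compactness(variances):
    """Cumulative fraction of total variance explained by the first k components."""
    variances = np.asarray(variances, dtype=float)
    return np.cumsum(variances) / variances.sum()

def components_needed(variances, threshold=0.90):
    """Smallest number of components whose cumulative variance reaches threshold."""
    return int(np.searchsorted(compactness(variances), threshold) + 1)

# Illustrative variances only; real values come from the PCA of the training scans.
toy_variances = [5.0, 3.0, 1.5, 0.3, 0.2]
curve = compactness(toy_variances)
n_for_90 = components_needed(toy_variances, 0.90)
```

Plotting `curve` against the component index gives the kind of flattening cumulative-variance line described for Figure 5.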

Evaluating 3D from 2D reconstruction results
To evaluate the geometric accuracy of face shape reconstructions from a single 2D image, the predicted 3D face surfaces were compared to the ground truth 3D face scans. We performed the reconstruction separately for each of the 98 2D images, each time building a separate 3DMM with that identity removed from the training data (leave-one-out cross-validation).
To measure the reconstruction error, we first rigidly aligned each predicted face mesh with the associated ground truth face shape. Following surface alignment, we computed the difference between the aligned face surfaces using the RMS distance metric. 34 The RMS metric gives the surface-to-surface assessment value for each pair of surface comparisons. To visualise the reconstruction error distribution on the face surface, we additionally generated the surface colour maps from the comparisons of the predicted face surfaces and the associated ground truth face scans.
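A surface comparison of this kind can be approximated as follows. The sketch assumes the two meshes are already rigidly aligned and approximates the surface-to-surface distance by the distance from each predicted vertex to its nearest ground truth vertex; the exact distance definition used in the study may differ, and the function names are illustrative.

```python
import numpy as np

def nearest_vertex_distances(predicted, truth):
    """Distance from each predicted vertex to its nearest ground truth vertex.

    Brute-force search, suitable only for small meshes; a spatial index
    would be needed for full-resolution scans.
    """
    predicted = np.asarray(predicted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    diff = predicted[:, None, :] - truth[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def rms_error(predicted, truth):
    """RMS of the nearest-vertex distances (same unit as the input, e.g. mm)."""
    d = nearest_vertex_distances(predicted, truth)
    return float(np.sqrt(np.mean(d ** 2)))

# Toy example: a flat patch and a copy lifted by 2 units, with shuffled vertex order.
truth_patch = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                        [0.0, 10.0, 0.0], [10.0, 10.0, 0.0]])
predicted_patch = truth_patch[[2, 0, 3, 1]] + np.array([0.0, 0.0, 2.0])

per_vertex = nearest_vertex_distances(predicted_patch, truth_patch)
patch_rms = rms_error(predicted_patch, truth_patch)
```

The per-vertex distances are the values that would be mapped to colours to produce a surface colour map.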
The overall average RMS error between the pairs of the predicted 3D face surfaces and the ground truth 3D face shapes is 2.62 mm with a deviation of 1.41 mm, with errors ranging from 1.00 mm to 6.75 mm. Furthermore, the visual shape comparisons between the predicted face surfaces and the corresponding ground truth face shapes are represented using colour gradients as illustrated in column (e) of Figure 6. Figure 6 shows the identities with the best and worst RMS values for the predicted face shapes, and their corresponding ground truth face shapes, including the face surface colour map comparison. We observe that the largest reconstruction errors are found in regions of the face that are not relevant when screening for facial phenotypes in FAS. However, we find that the philtrum, which is one of the discriminators for FAS facial analysis, shows larger errors in faces considered to be outliers.

Face surface analysis across skin tones
Foetal alcohol syndrome affects people of all ethnicities. Previous methods did not adequately study darker-skinned individuals, and structured light systems have acquisition issues when imaging darker skin tones. 35 Furthermore, previous models have a strong bias towards lighter skin tones. 36 We investigated what happens when darker and lighter skin tones are mixed. Figure 7 shows the distribution of skin tones across our data set, with a heavy tail in the low intensity range. We also investigated the relationship between skin tone and the reconstruction error per mesh, and observe that we reach comparable reconstruction accuracy for the heavy tail of low intensity skin tones even though they are underrepresented in the training data (Figure 8). Figure 9 shows faces with the lowest and highest reconstruction errors across skin tones. The poorest face reconstructions are also indicated in Figure 8 with orange circles, while the best face reconstructions are indicated in the same figure with green circles. We find that regions of the face that are not related to the FAS facial phenotype are most affected. We also present the average reconstruction error over the whole data set in Figure 10.

3D surface distance measurements
During FAS facial phenotype assessments, distances between facial features on the face surface can be measured using either a physical instrument or a computer-assisted tool. These surface measurements are used to confirm the diagnosis of facial syndrome. A study by Douglas et al. 37 extracted facial features and performed measurements on the following face distances related to FAS: palpebral fissure length, inner canthal distance, outer canthal distance and interpupillary distance. However, these measurements were conducted on 2D stereophotogrammetry images projected in 3D space. We extracted landmark points from our face reconstructions and can derive such measurements directly from our 3D reconstruction without manual interaction. We present reconstruction accuracy of those landmark points as well as distance reconstruction accuracy (measured in 3D) based on our 3D reconstructions compared to the ground truth 3D shapes.
Landmark estimation. Landmarks are essential when taking measurements on a face surface. We identified and selected a subset of 14 landmark points which are related to FAS facial phenotype assessments. These landmarks are described in a study by Mutsvangwa and Douglas 38 . The results of the landmark estimation errors were computed and are illustrated in Table 1.
The landmark error was computed by measuring the distance between each landmark position on the reference surface and the corresponding position on the reconstructed surface. As shown in Table 1, 11 of 14 landmark errors were lower than 3.5 mm. The large standard deviations are mainly a result of the few outliers observed in Figure 8.
Facial feature distances. The distance measurements characteristic of FAS facial features include the palpebral fissure length, outer canthal distance, and inner canthal distance, as illustrated in Figure 11b. The landmarks required for the distance measurements are described in Table 1 and illustrated in Figure 11a. The corresponding distances on the reconstructed 3D face surface and the 3D ground truth face scan were compared and the difference calculated. Table 2 shows the average absolute distance errors and their standard deviations for the palpebral fissure length, outer canthal distance, and inner canthal distance.
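Both evaluations can be illustrated with a few lines of code. The coordinates below are invented for the example (in mm) and the landmark names are hypothetical labels; only the eye-corner landmarks needed for the three FAS-related distances are included.

```python
import math

# Hypothetical 3D landmark positions in mm; not data from the study.
LANDMARKS_TRUTH = {
    "right_outer_canthus": (-45.0, 0.0, 0.0),
    "right_inner_canthus": (-15.0, 0.0, 0.0),
    "left_inner_canthus": (15.0, 0.0, 0.0),
    "left_outer_canthus": (45.0, 0.0, 0.0),
}
LANDMARKS_PREDICTED = {
    "right_outer_canthus": (-44.0, 0.0, 0.0),
    "right_inner_canthus": (-15.0, 1.0, 0.0),
    "left_inner_canthus": (15.0, 0.0, 0.0),
    "left_outer_canthus": (45.0, 0.0, 2.0),
}

# Each facial feature distance is defined by a pair of landmarks.
DISTANCES = {
    "right_palpebral_fissure_length": ("right_outer_canthus", "right_inner_canthus"),
    "left_palpebral_fissure_length": ("left_inner_canthus", "left_outer_canthus"),
    "inner_canthal_distance": ("right_inner_canthus", "left_inner_canthus"),
    "outer_canthal_distance": ("right_outer_canthus", "left_outer_canthus"),
}

def landmark_errors(predicted, truth):
    """Euclidean distance between each predicted landmark and its ground truth."""
    return {name: math.dist(predicted[name], truth[name]) for name in truth}

def distance_errors(predicted, truth, definitions):
    """Absolute difference between predicted and ground truth feature distances."""
    errors = {}
    for name, (a, b) in definitions.items():
        measured = math.dist(predicted[a], predicted[b])
        reference = math.dist(truth[a], truth[b])
        errors[name] = abs(measured - reference)
    return errors

point_errors = landmark_errors(LANDMARKS_PREDICTED, LANDMARKS_TRUTH)
feature_errors = distance_errors(LANDMARKS_PREDICTED, LANDMARKS_TRUTH, DISTANCES)
```

Note that a distance error can be small even when both of its landmarks are displaced, because displacements in the same direction partly cancel; this is one reason to report landmark and distance errors separately.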

Discussion
We constructed a 3D face model from 3D face scans and evaluated the accuracy of 3D face shape predictions from single images. The constructed morphable model of the face was evaluated for generalisation, specificity, and compactness. The lowest generalisation error was 0.5 mm, which suggests that the face shape model described unseen face shapes well when given data outside the training set. The generalisation results of the face shape model compare well with other results found in the literature. 15,16 The specificity results of the face shape model are in the range of 13.2 mm to 14.5 mm, which is in the common ranges for specificity. 15 The compactness results of our face shape model indicate that more than 90% of the variability in the training set is retained with just 20 principal components and this compares well with other results in the literature. For example, Booth et al. 16 report that the first 40 principal components retained more than 90% of variability in their training set. Overall, our morphable model construction and evaluation seem successful.
The numerical average reconstruction error between the reconstructed face shapes and the corresponding ground truth face shapes in our data set was 2.62 mm. These findings are comparable to other results in the literature. For example, Zollhofer et al. 39 compared reconstructed face surfaces obtained from 3D face scans via a Kinect sensor to ground truth face scans, reporting an average deviation of about 2 mm. Additionally, Feng et al. 40 reported a root mean square error of 2.83 mm from surface comparisons between the predicted 3D face meshes and the corresponding ground truth 3D face scans.
For FAS facial phenotype assessment, we are interested in specific regions of the face such as the eyes, the midface, the upper lip, and the philtrum. These regions provide cues to clinicians when examining the FAS facial phenotype. The whole face surface reconstruction was examined using the colourmap surface comparisons shown in Figures 6 and 9. Previous work 41 suggested that visual inspections of the 3D surfaces using heat maps can delineate and discriminate facial features. In Figure 8, we find that the reconstruction quality in our data set is not affected by skin tone. In addition to the heatmap representation, we also evaluated the landmarks and distances previously explored for the facial phenotype in FAS, as shown in Tables 1 and 2, respectively. The lowest errors were 2.57 mm for landmark localisation and 1.25 mm for distances. Similar results are reported in the literature. Regarding landmark localisation error, a study by Sukno et al. 42 reported an average error per landmark of below 3.4 mm. For inter-landmark distances, Douglas et al. 37 reported an average difference, between the manual and automated approaches, within 1 mm for palpebral fissure length, but with greater variations for outer canthal distance and inner canthal distance.
The highest face surface reconstruction errors belong to a relatively small set of 3D scans. Furthermore, the surface differences could imply that, during the model fitting phase of the reconstruction process, our statistical model did not completely capture all geometric cues in the 2D image of the face. We define a geometric cue as the information contained in a 2D image of the face, such as shading or contours.

Limitations and future research
Although we used a data set of scans of normal adult controls, with no known FAS indications, we assume that the framework is invariant to the data when built and applied to a population of interest. Ideally, training and test data sets would be collected from FAS and non-FAS control populations, with similar demographics. It is a challenge to access 3D data of individuals with FAS; however, the acquired face database (BU-3DFE) is very diverse. Future work could focus on reducing the reconstruction errors to acceptable clinical standards by collecting and analysing larger data sets, including more training data, especially from underrepresented populations. This would broaden the applicability of the morphable models of the face. To improve on the surface reconstruction performance, future developments could consider using multi-view 2D images of the face to provide more geometric cues during the model fitting of the face.

Conclusion
In this study, we aimed to evaluate whether an inverse graphics-based 3D from 2D reconstruction algorithm is suitable for acquiring 3D face data for FAS facial shape analysis. The reconstruction task was accomplished by fitting a 3DMM to a 2D image to recover a 3D face representation. The 3DMMs were built from a collection of 3D face scans with shape and texture information. We provided a performance evaluation of face reconstruction for future applications to FAS diagnosis. The resulting accuracies are promising for these future applications, even across different ethnicities.