How often have we wished for a version of ourselves to star as the hero character in a game, or to try on clothes in an e-commerce fashion store? Technology has come a long way: customized and engaging virtual experiences are now sought after by many industries, including augmented and virtual reality, gaming, fashion, and entertainment. Seeing oneself faithfully represented, in fact extended, into a virtual world has become crucial to the success of those experiences.
Standard procedures for generating realistic 3D face avatars typically require a combination of expensive head-capture systems and extensive manual work by skilled artists. The field also suffers from a scarcity of facial datasets with 3D ground-truth shapes, which has long hindered 3D reconstruction of the human face.
With this in mind, Didimo has developed a method for realistic 3D face reconstruction. Combining a convolutional neural network (CNN) with more classical computer vision tools, it directly estimates the shape coefficients of a custom 3D morphable model (3DMM) from a single frontal photo with a neutral expression, then refines the shape to match a set of automatically extracted facial landmarks for optimal likeness.

Method overview. We generate pairs of renders of varied head shapes to train a shape-regression CNN. During reconstruction, the head shape predicted by the CNN is refined by landmark-based deformation.
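Before diving into the details, here is a minimal Python sketch of the two-stage pipeline pictured above. The function names (cnn, detect_landmarks, refine_with_landmarks) and the linear 3DMM layout are illustrative assumptions, not our actual code.

```python
import numpy as np

def reconstruct_head(photo, cnn, mean_shape, shape_basis,
                     detect_landmarks, refine_with_landmarks):
    """Two-stage reconstruction: CNN shape regression, then landmark refinement."""
    # Stage 1: the CNN maps a single frontal photo to K shape coefficients
    # of the 3DMM; the head mesh is the mean shape plus a weighted sum of bases.
    coeffs = cnn(photo)                                            # shape (K,)
    head = mean_shape + np.tensordot(coeffs, shape_basis, axes=1)  # (V, 3) vertices

    # Stage 2: deform the predicted head so that its projected landmarks
    # match the landmarks automatically extracted from the photo.
    landmarks_2d = detect_landmarks(photo)                         # e.g. 140 (x, y) points
    return refine_with_landmarks(head, landmarks_2d)
```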
The first step was to build the 3DMM itself, combining a significant number of meaningful head shape identities and models. We then devised a method to simulate head shape/photo pairs by sampling the 3DMM and rendering the resulting head meshes in a photo-realistic manner, with the renders extracted using Blender. Both face and eye albedo textures were randomly selected from comprehensive databases, and we optionally simulated occlusions by hair, beards, and hats. Focal length and camera sensor size were randomized to match four different smartphones: the Apple iPhone 11, Xiaomi Mi Note 10, Google Pixel 5, and HUAWEI P40 Pro. Lighting was also key, so we used high-dynamic-range (HDR) images covering varied conditions (interior/exterior, bright/low light, and day/night), and we further varied head rotation and camera position. Each head shape was rendered twice, under different cameras and HDR images, so that the CNN could learn to reconstruct the same shape consistently from different perspectives, as sketched below.
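To make the randomization concrete, here is a simplified Python sketch of how one render configuration might be drawn. Every file name and numeric value below (camera intrinsics, texture counts, rotation ranges) is an illustrative placeholder, not a figure from our actual pipeline.

```python
import random

# Placeholder camera intrinsics (focal length mm, sensor width mm) for the
# four phones we randomize over; the numbers are illustrative only.
PHONE_CAMERAS = {
    "Apple iPhone 11":   (4.25, 5.6),
    "Xiaomi Mi Note 10": (4.70, 7.3),
    "Google Pixel 5":    (4.38, 6.0),
    "HUAWEI P40 Pro":    (5.58, 9.2),
}

HDR_IMAGES = ["interior_day.hdr", "exterior_noon.hdr", "exterior_night.hdr"]  # hypothetical paths
FACE_TEXTURES = [f"face_albedo_{i:03d}.png" for i in range(200)]  # hypothetical database
EYE_TEXTURES = [f"eye_albedo_{i:02d}.png" for i in range(40)]     # hypothetical database

def sample_render_config():
    """Draw one randomized configuration for a photo-realistic head render."""
    phone = random.choice(list(PHONE_CAMERAS))
    focal_mm, sensor_mm = PHONE_CAMERAS[phone]
    return {
        "focal_length_mm": focal_mm,
        "sensor_width_mm": sensor_mm,
        "hdr_environment": random.choice(HDR_IMAGES),
        "face_texture": random.choice(FACE_TEXTURES),
        "eye_texture": random.choice(EYE_TEXTURES),
        # Occlusions (hair, beard, hat) are only applied part of the time.
        "occlusion": random.choice([None, "hair", "beard", "hat"]),
        "head_yaw_deg": random.uniform(-20.0, 20.0),  # assumed rotation range
    }

# Each head shape is rendered twice under independently sampled configurations,
# so the CNN sees the same identity under two cameras and lighting setups.
def configs_for_head():
    return sample_render_config(), sample_render_config()
```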
Even though we are still honing its realism, the quality we achieved made it unnecessary to train the CNN on any real photo/3D-scan pairs or to pre-train it on a related task, such as facial recognition. The head shape is then refined using facial landmarks; in this project, we used 140 of them to outline the eyes, nose, mouth, jawline, cheekbones, and forehead. How does this differ from previous work? The way we employ these landmarks sets us apart: our method weights them so that, for instance, eye landmarks have a bigger influence on the final outcome than, say, jawline ones, since they are typically detected more accurately.
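Our exact deformation algorithm is described in the paper; as a rough illustration of how per-landmark weights enter such a fit, here is a generic weighted least-squares sketch, assuming a linearized landmark model. All names and shapes are assumptions for illustration.

```python
import numpy as np

def fit_deformation(landmark_basis, mean_landmarks, detected, weights):
    """Weighted least-squares fit of deformation coefficients to landmarks.

    landmark_basis : (2N, K) linear map from K coefficients to stacked
                     (x, y) landmark displacements after projection.
    mean_landmarks : (2N,) projected landmarks of the CNN-predicted head.
    detected       : (2N,) landmarks automatically extracted from the photo.
    weights        : (N,) per-landmark confidence; eye landmarks get larger
                     weights than jawline ones because they are more reliable.
    """
    w = np.repeat(np.sqrt(weights), 2)       # one weight per (x, y) coordinate
    A = landmark_basis * w[:, None]          # scale each row by sqrt(weight)
    b = (detected - mean_landmarks) * w
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs
```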
How did we validate it? By reconstructing faces from photos of six subjects (Caucasian, Asian, and African; both male and female) for which we have 3D ground-truth face meshes. Additionally, we used three off-the-shelf automated face recognition services, from AWS, Azure, and Face++, to compute similarity scores between the input photo and face renders produced before and after refinement. The results suggest that our facial reconstruction generally achieves good likeness and, despite the inherent discrepancies between scoring methods, all three registered a significant increase in similarity when the landmark-based refinement was applied. The figure below shows the reconstruction renders of two input photos, subjectively confirming that the facial shape can be accurately reconstructed from a single frontal photo.

3D face reconstructions generated from single frontal photos using a convolutional neural network to estimate the initial head shape followed by landmark-based deformation for optimal likeness.
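As an illustration of the scoring step, here is a minimal sketch using the AWS Rekognition compare_faces API, one of the three services we used; the file paths are hypothetical, and the other services expose similar face-comparison endpoints.

```python
import boto3

rekognition = boto3.client("rekognition")  # AWS credentials assumed configured

def similarity(photo_path, render_path):
    """Rekognition similarity score (0-100) between a photo and a render."""
    with open(photo_path, "rb") as src, open(render_path, "rb") as tgt:
        response = rekognition.compare_faces(
            SourceImage={"Bytes": src.read()},
            TargetImage={"Bytes": tgt.read()},
            SimilarityThreshold=0,  # report matches at any similarity level
        )
    matches = response["FaceMatches"]
    return matches[0]["Similarity"] if matches else 0.0

# Hypothetical usage: measure the gain from landmark-based refinement.
# gain = similarity("input.jpg", "after_refine.png") - similarity("input.jpg", "before_refine.png")
```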
Our team achieved comparable realism on real photos spanning a wide range of human faces, with a shape-reconstruction time of around 10 seconds on a CPU, making our method well suited to streamlining the creation of lifelike avatars at scale.
Going forward, we plan to enhance the 3DMM so that it covers an ever-wider selection of human faces, as well as facial expressions, so that non-neutral expressions can be used in the input photos. In a nutshell: a new world of avatar creation is dawning, and we are keen on being an essential part of what's next. Businesses, studios, and developers all need to harness the power of effective digital human twins, and we are helping to pave the way for them to make this a business priority, unlocking better experiences and new sources of revenue.
For more detailed information and technical depth, you can find the full paper here.