Facial expression is a critical step in Roblox’s march towards making the metaverse a part of people’s daily lives through natural and believable avatar interactions. However, animating virtual 3D character faces in real time is an enormous technical challenge. Despite numerous research breakthroughs, there are limited commercial examples of real-time facial animation applications. This is particularly challenging at Roblox, where we support a dizzying array of user devices, real-world conditions, and wildly creative use cases from our developers. In this post, we will describe a deep learning framework for regressing facial animation controls from video that both addresses these challenges and opens us up to a number of future opportunities. The framework described in this blog post was also presented as a talk at SIGGRAPH 2021.

There are various options to control and animate a 3D face rig. The one we use is the Facial Action Coding System, or FACS, which defines a set of controls (based on the placement of facial muscles) to deform the 3D face mesh. Despite being over 40 years old, FACS is still the de facto standard because its controls are intuitive and easily transferable between rigs. An example of a FACS rig being exercised can be seen below.

![]()
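To make the role of these controls concrete, here is a minimal sketch of how FACS-style weights could drive a mesh as a linear combination of per-control vertex offsets. The function name, array shapes, and toy values are illustrative assumptions, not our actual rig implementation.

```python
import numpy as np

def apply_facs_weights(neutral_verts, control_deltas, facs_weights):
    """Deform a neutral face mesh with FACS-style controls.

    neutral_verts:  (V, 3) vertex positions of the neutral face mesh
    control_deltas: (K, V, 3) per-vertex offsets, one set per control
                    (e.g. brow raiser, jaw drop), authored on the rig
    facs_weights:   (K,) activation of each control, typically in [0, 1]
    """
    # Linear combination of per-control offsets added to the neutral mesh.
    offsets = np.tensordot(facs_weights, control_deltas, axes=1)  # (V, 3)
    return neutral_verts + offsets

# Toy usage: a 4-vertex "mesh" with 2 controls, second one half-activated.
neutral = np.zeros((4, 3))
deltas = np.full((2, 4, 3), 0.01)
deformed = apply_facs_weights(neutral, deltas, np.array([0.0, 0.5]))
```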
The idea is for our deep learning-based method to take a video as input and output a set of FACS weights for each frame. To achieve this, we use a two-stage architecture: face detection and FACS regression.

To achieve the best performance, we implement a fast variant of the relatively well-known MTCNN face detection algorithm. The original MTCNN algorithm is quite accurate and fast, but not fast enough to support real-time face detection on many of the devices used by our users. To solve this, we tweaked the algorithm for our specific use case: once a face is detected, our MTCNN implementation only runs the final O-Net stage on successive frames, resulting in an average 10x speed-up. We also use the facial landmarks (the locations of the eyes, nose, and mouth corners) predicted by MTCNN to align the face bounding box prior to the subsequent regression stage. This alignment allows for a tight crop of the input images, reducing the computation of the FACS regression network.
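The detect-once-then-track idea can be sketched roughly as follows. The `p_net`, `r_net`, `o_net`, and `expand_box` callables are hypothetical stand-ins for the three MTCNN stages and a box-expansion helper, and the expansion factor is an assumption; this shows the control flow, not our production implementation.

```python
def detect_faces_fast(frames, p_net, r_net, o_net, expand_box, scale=1.5):
    """Run the full MTCNN cascade once, then only O-Net on later frames."""
    prev_box = None
    for frame in frames:
        if prev_box is None:
            # Full cascade: proposals -> refinement -> output stage.
            candidates = p_net(frame)
            candidates = r_net(frame, candidates)
            prev_box, landmarks = o_net(frame, candidates)
        else:
            # Tracking mode: only the final O-Net stage, evaluated on a
            # region slightly larger than the previous frame's detection.
            roi = expand_box(prev_box, scale)
            prev_box, landmarks = o_net(frame, [roi])
        # prev_box becomes None again if the face was lost, which triggers
        # a full re-detection on the next frame.
        yield prev_box, landmarks
```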
Our FACS regression architecture uses a multitask setup that co-trains landmarks and FACS weights with a shared backbone (known as the encoder) as the feature extractor. This setup allows us to augment the FACS weights learned from synthetic animation sequences with real images that capture the subtleties of facial expression. The FACS regression sub-network, trained alongside the landmark regressor, uses causal convolutions; these convolutions operate on features over time, as opposed to the convolutions in the encoder, which operate only on spatial features. This allows the model to learn the temporal aspects of facial animation and makes it less sensitive to inconsistencies such as jitter.
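A minimal PyTorch sketch of this kind of multitask setup is shown below. The toy encoder, head sizes, and the landmark and FACS control counts are illustrative assumptions; only the overall structure, a shared spatial encoder feeding a landmark head and a causal-convolution FACS head, reflects the description above.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1D convolution that only looks at current and past timesteps."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                                 # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))      # pad the past only
        return super().forward(x)

class FaceRegressor(nn.Module):
    """Shared spatial encoder with a landmark head and a temporal FACS head."""
    def __init__(self, num_landmarks=68, num_facs=50, feat_dim=128):
        super().__init__()
        # Toy spatial encoder; the real backbone is not specified here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.landmark_head = nn.Linear(feat_dim, num_landmarks * 2)
        self.facs_head = nn.Sequential(
            CausalConv1d(feat_dim, 64, kernel_size=3), nn.ReLU(),
            CausalConv1d(64, num_facs, kernel_size=3), nn.Sigmoid(),
        )

    def forward(self, clip):                              # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))          # (B*T, feat_dim)
        landmarks = self.landmark_head(feats).view(b, t, -1)
        facs = self.facs_head(feats.view(b, t, -1).transpose(1, 2))
        return landmarks, facs.transpose(1, 2)            # (B, T, num_facs)
```

Padding only on the left keeps the temporal convolutions causal: the FACS weights for a frame depend only on the current and past frames, which is what allows the temporal head to run in real time on a live video stream.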
We initially train the model for landmark regression only, using both real and synthetic images. After a certain number of steps, we start adding synthetic sequences to learn the weights for the temporal FACS regression subnetwork.
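The schedule can be sketched as follows. The step counts, losses, and data loaders are illustrative assumptions; the point is that the FACS branch, and the synthetic sequences that supervise it, only join the objective after the landmark-only warm-up.

```python
import torch.nn.functional as F

def train(model, frame_batches, sequence_batches, optimizer,
          facs_start_step=20_000, total_steps=100_000):
    """Two-phase schedule: landmarks first, then add FACS sequences."""
    for step in range(total_steps):
        if step < facs_start_step:
            # Phase 1: landmark regression only, on real + synthetic images
            # treated as length-1 clips so the same model interface applies.
            clips, gt_landmarks = next(frame_batches)
            pred_landmarks, _ = model(clips)
            loss = F.mse_loss(pred_landmarks, gt_landmarks)
        else:
            # Phase 2: synthetic animation sequences are added so the temporal
            # (causal-convolution) FACS subnetwork can learn its weights.
            clips, gt_landmarks, gt_facs = next(sequence_batches)
            pred_landmarks, pred_facs = model(clips)
            loss = (F.mse_loss(pred_landmarks, gt_landmarks)
                    + F.mse_loss(pred_facs, gt_facs))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```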
The synthetic animation sequences were created by our interdisciplinary team of artists and engineers. Our artist set up a normalized rig, used for all the different identities (face meshes), which was exercised and rendered automatically using animation files containing FACS weights. These animation files were generated using classic computer vision algorithms running on face-calisthenics video sequences, supplemented with hand-animated sequences for extreme facial expressions that were missing from the calisthenics videos.