StatMuse is a distributed team building cool things with sports data and artificial intelligence.
We need to create a system that can predict 51 "blend shape" coefficients per frame (at 60 fps) given an arbitrary new audio clip of our specific speaker talking.
Blend shapes are identifiers for specific facial features, e.g. eyeBlinkLeft, mouthSmileRight, noseSneerRight, etc. The full list can be found here: https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
Each blend shape has a floating point value indicating the current position of that feature relative to its neutral configuration, ranging from 0.0 (neutral) to 1.0 (maximum movement).
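As a concrete illustration, a single animation frame can be represented as a mapping from blend shape names to clamped coefficients. This is just a sketch: the names follow Apple's ARFaceAnchor.BlendShapeLocation identifiers, and the `clamp_frame` helper is hypothetical, not part of any existing API.

```python
# Hypothetical representation of one frame of ARKit blend shape coefficients.
# Names follow ARFaceAnchor.BlendShapeLocation; values live in [0.0, 1.0].

BLEND_SHAPE_NAMES = [
    "eyeBlinkLeft", "mouthSmileRight", "noseSneerRight",
    # ... the remaining shapes in the full ARKit set, 51 in total
]

def clamp_frame(raw: dict) -> dict:
    """Clamp raw capture values into the valid [0.0, 1.0] range,
    defaulting any missing shape to its neutral position (0.0)."""
    return {name: min(1.0, max(0.0, raw.get(name, 0.0))) for name in BLEND_SHAPE_NAMES}

frame = clamp_frame({"eyeBlinkLeft": 1.2, "mouthSmileRight": 0.35})
# eyeBlinkLeft is clamped to 1.0; unseen shapes default to the neutral 0.0
```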
Our training data will consist of 3-5 hours of a speaker's recorded speech audio, aligned with blend shape coefficients captured via motion capture, which reflect how the speaker moved their face as they spoke.
One obvious challenge will be handling the ambiguity within the data, which is largely driven by the speaker's emotions while delivering sounds (i.e., the same sound can produce a variety of different facial poses depending on the emotional state of the speaker).
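One preprocessing step the aligned data implies is slicing the audio into one fixed-length window per 60 fps animation frame. The sketch below assumes a 16 kHz sample rate and a ~0.5 s context window (similar in spirit to Nvidia's input windowing); both numbers are assumptions, not project requirements.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed audio capture rate
FPS = 60               # blend shape frame rate from the brief
WINDOW_MS = 520        # assumed per-frame audio context window

def frame_windows(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Return one fixed-length audio window centered on each animation frame."""
    half = int(SAMPLE_RATE * WINDOW_MS / 1000) // 2
    padded = np.pad(audio, half)  # zero-pad so edge frames still get full windows
    windows = []
    for i in range(n_frames):
        center = int(round(i * SAMPLE_RATE / FPS)) + half
        windows.append(padded[center - half : center + half])
    return np.stack(windows)

audio = np.zeros(SAMPLE_RATE)            # 1 second of (silent) audio
X = frame_windows(audio, n_frames=FPS)   # one window per frame
# X.shape == (60, 8320)
```

Each window would then be paired with the 51 motion-captured coefficients for that frame to form one training example.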
It appears a similar problem has been solved here by the Nvidia team: https://research.nvidia.com/publication/2017-07_Audio-Driven-Facial-Animation
From the paper:
"We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet."
Their solution predicts the relative positions of ~5K vertices of a fixed-mesh face model, whereas we want to predict 51 blend shape coefficients. Aside from that distinction, the ideal solution would closely follow Nvidia's approach.
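To make the analogy concrete, the model's interface could look like the sketch below: per-frame audio features concatenated with a small learned "emotion" code (analogous to Nvidia's latent code), mapped to 51 coefficients squashed into [0, 1]. This is a toy forward pass with random, untrained weights and assumed layer sizes, not Nvidia's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 256   # assumed size of a per-frame audio feature vector
N_LATENT = 16      # assumed size of the latent "emotion" code
N_SHAPES = 51      # ARKit blend shape coefficients from the brief

# Randomly initialized weights stand in for a trained network.
W1 = rng.standard_normal((N_FEATURES + N_LATENT, 128)) * 0.05
W2 = rng.standard_normal((128, N_SHAPES)) * 0.05

def predict(audio_features: np.ndarray, emotion_code: np.ndarray) -> np.ndarray:
    """One frame: concatenate audio features with the latent code,
    pass through a small MLP, and squash outputs into [0, 1]."""
    x = np.concatenate([audio_features, emotion_code])
    h = np.tanh(x @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))  # sigmoid keeps coefficients in range

coeffs = predict(rng.standard_normal(N_FEATURES), np.zeros(N_LATENT))
# coeffs.shape == (51,), each value in [0.0, 1.0]
```

At inference time, varying `emotion_code` while holding the audio fixed is what would let an animator dial in different emotional deliveries of the same line, mirroring the "face puppet" control described in the paper.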
Our ideal timeline is rather ambitious! We are hoping to get something working by end of January 2019. We're definitely open to discussion though; please mention in your proposal how long you think this will realistically take.
Please submit a short, informal proposal that includes:
If you have any clarifying questions you would like answered before submitting a proposal, please feel free to send them to [email protected] and we'll be happy to respond.
You can submit your proposal to [email protected], and we’ll be in touch.