
Speaker-independent 3D face synthesis driven by speech and text. (English) Zbl 1172.68605

Summary: In this study, a complete system that generates visual speech by synthesizing 3D face points has been implemented. The estimated face points drive MPEG-4 facial animation. The system is speaker independent and can be driven by audio alone or by audio and text together. Visual speech is synthesized with a codebook-based technique trained on audio-visual data from a single speaker. An audio-visual speech data set in Turkish was created using a 3D facial motion capture system developed for this study. The performance of the method was evaluated in three categories. First, audio-driven results were reported and compared with the time-delayed neural network (TDNN) and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field; it was found that TDNN performs best and RNN worst on this data set. Second, results for the codebook-based method after incorporating text information were given; text information together with the audio improves the synthesis performance significantly. For many applications, the donor speaker of the audio-visual data will not be available to provide audio for synthesis, so a speaker-independent version of the codebook technique was designed. The speaker-independent results are important because no comparative results have been reported for animating the face model from the speech of other speakers. It was observed that although the trajectory correlation degrades slightly (from 0.71 to 0.67) with respect to speaker-dependent synthesis, the performance remains quite satisfactory. Thus, the resulting system can animate faces realistically from the input speech of any Turkish speaker.
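The summary rests on two technical ideas: a codebook that maps audio feature frames to 3D face-point frames, and a trajectory-correlation score used for evaluation. The sketch below is not the authors' implementation; the array shapes, the use of per-frame feature vectors, and the nearest-neighbour matching rule are illustrative assumptions only.

```python
import numpy as np

def build_codebook(audio_feats, visual_feats):
    """Pair each training audio frame with its captured 3D face-point frame.

    audio_feats:  (N, Da) per-frame acoustic feature vectors (assumed)
    visual_feats: (N, Dv) stacked 3D face-point coordinates per frame (assumed)
    """
    return np.asarray(audio_feats, dtype=float), np.asarray(visual_feats, dtype=float)

def synthesize(codebook, query_audio):
    """For each input audio frame, return the visual entry of the closest audio code."""
    audio_codes, visual_codes = codebook
    out = []
    for frame in np.asarray(query_audio, dtype=float):
        idx = np.argmin(np.linalg.norm(audio_codes - frame, axis=1))
        out.append(visual_codes[idx])
    return np.asarray(out)

def trajectory_correlation(predicted, reference):
    """Mean Pearson correlation over face-point trajectories (one per coordinate)."""
    corrs = []
    for d in range(reference.shape[1]):
        p, r = predicted[:, d], reference[:, d]
        if p.std() > 0 and r.std() > 0:
            corrs.append(np.corrcoef(p, r)[0, 1])
    return float(np.mean(corrs))
```

Under this reading, the reported 0.71 versus 0.67 figures correspond to the mean trajectory correlation for speaker-dependent and speaker-independent synthesis, respectively.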

MSC:

68T10 Pattern recognition, speech recognition
Full Text: DOI