[Figure 5: bar chart of the % of correct recognition for the Face, Mouth, Lip Ellipse and Audio conditions.]

Figure 5. Rate of correct answers for CVC words (vowel recognition) when watching a part of the face (or synthetic lips) and listening to a –18 dB SNR acoustic stimulus.

It is not surprising that the whole face contains relevant information for perceiving speech. The smaller the visible part of the face, the fewer words were recognised correctly. Although the 2D ellipse representation of the lips is unusual, the subjects accepted it, and the results for the different stimuli show an error rate for the synthesised stimulus comparable to that for natural lips. These results, together with the robustness of the image-ellipse axes, encourage us to use them as the visual features of a speechreading system. To capture the brightness of the oral cavity (teeth and tongue visible or not), the intensity factor k was added to the a and b axes of the inner-lip ellipse. This is intended to bring the recognition rate of the ellipse model closer to that of a visible mouth.

In the automatic recognition experiment, the speechreading corpus consisted of the manually segmented 120 ms midsections of C1VC1 words, where C is one of b, v, t, l, j and k, and V is one of a (O), á (a:), e (E), é (e:), i (i), o (o), ö (2), u (u) and ü (y). The task is to identify the vowel from the sequence of three images (key-frames) taken from the manually segmented middle of the vowel. The axes a and b and the intensity factor k, calculated from the intensity of the oral cavity, were the features of the visual signal. (The visual features spreadness s, elongation e and intensity factor k were also tried and gave the same results.) Five series of 54 CVC words (six consonants by nine vowels) were pronounced by a female speaker. Three of the five utterances were used for training and two for testing; training patterns were excluded from testing.
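To make the feature extraction concrete, the following is a minimal Python sketch, not the authors' implementation: it shows how the image-ellipse axes a and b follow from the second-order central moments of a segmented inner-lip region, together with an intensity factor k for the oral cavity. The function name, the binary-mask and greyscale-frame inputs, and the normalisation of k by the maximum grey level are illustrative assumptions.

```python
import numpy as np

def ellipse_features(mask, frame):
    """Sketch: image-ellipse axes and intensity factor from moments.

    mask  -- boolean array marking the inner-lip region (assumed given)
    frame -- greyscale image as a uint8 array of the same shape
    """
    ys, xs = np.nonzero(mask)          # pixel coordinates of the region
    xc, yc = xs.mean(), ys.mean()      # centroid from the first moments

    # Normalised second-order central moments (region covariance).
    mu20 = ((xs - xc) ** 2).mean()
    mu02 = ((ys - yc) ** 2).mean()
    mu11 = ((xs - xc) * (ys - yc)).mean()

    # Eigenvalues of the covariance matrix: variances along the
    # principal axes of the equivalent image ellipse.
    root = np.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    lam1 = (mu20 + mu02 + root) / 2.0
    lam2 = max((mu20 + mu02 - root) / 2.0, 0.0)  # guard against round-off

    a = 2.0 * np.sqrt(lam1)            # semi-major axis
    b = 2.0 * np.sqrt(lam2)            # semi-minor axis
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)  # orientation

    # Intensity factor k: mean grey level inside the inner-lip region,
    # so visible teeth (bright) and a dark oral cavity pull k in
    # opposite directions. The exact normalisation is an assumption.
    k = frame[mask].mean() / 255.0

    return a, b, theta, k
```

For each vowel token, the (a, b, k) values of the three key-frames would then be concatenated into the input vector of the recogniser.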
A feed-forward neural network was trained on the visual signals. For each of the nine vowels there were 18 training patterns (six surrounding consonants by three utterances) and 12 test patterns (six consonants by two utterances); the 18 training patterns of each vowel were represented by three neurons in the hidden layer. The network was trained for 2,000 epochs with conjugate-gradient back-propagation with Powell-Beale restarts (MATLAB implementation). A recognition rate of 81% was obtained for the visual stimuli.

3. CONCLUSIONS

In this paper geometric moments are proposed as lip-shape descriptors. An image ellipse that represents the shape, orientation and position of the lips can be derived from the second-order moments, and an intensity factor represents the visibility of the teeth and tongue. Using these features, an automatic recogniser reached an 81% recognition rate in a vowel recognition task on visual features alone. The method was also evaluated by human perceivers, who judged the consonant or vowel in the middle of 75 VCV and 27 CVC words; their recognition rate was comparable for natural lips and for the synthesised 2D image-ellipse lip model. Semi-syllables are the key structures in Hungarian continuous speech recognition [9], and this work will be developed further towards Hungarian continuous audio-visual speech recognition.

4. REFERENCES

1. Massaro, D.W., and Stork, D.G. Speech recognition and sensory integration. American Scientist, May-June, 1998.
2. Nankaku, Y., Tokuda, K., and Kitamura, T. Intensity- and location-normalised training for HMM-based visual speech recognition. In Proceedings of Eurospeech'99, Budapest, pp. 1287-1290, 1999.
3. Petajan, E.D. Automatic lipreading to enhance speech recognition. In Proceedings of the Global Telecommunications Conference, Atlanta, GA: IEEE Communication Society, pp. 265-272, 1984.
4. Bregler, C., and Omohundro, S.M. Nonlinear image interpolation using manifold learning. In G. Tesauro, D.S. Touretzky and T.K. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, Cambridge, MA: MIT Press, pp. 973-980, 1995.
5. Yuille, A.L., Cohen, D.S., and Hallinan, P.W. Feature extraction from faces using deformable templates. In Proceedings of Computer Vision and Pattern Recognition, Washington, DC: IEEE Computer Society Press, pp. 104-109, 1989.
6. Luettin, J., Thacker, N.A., and Beet, S.W. Active shape models for visual speech feature extraction. In D.G. Stork and M.E. Hennecke (Eds.), Speechreading by Humans and Machines, Berlin: Springer-Verlag, pp. 383-390, 1996.
7. Hu, M.K. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, vol. 8, no. 1, pp. 179-187, 1962.
8. Mukundan, R., and Ramakrishnan, K.R. Moment Functions in Image Analysis. Singapore: World Scientific Press, pp. 11-24, 1998.
9. Vicsi, K., and Vigh, A. Text independent neural network/rule based hybrid, continuous speech recognition. In Proceedings of EUROSPEECH'95, Madrid, pp. 2201-2204, 1995.