CN102004549A - Automatic lip language identification system suitable for Chinese language - Google Patents

Automatic lip language identification system suitable for Chinese language

Info

Publication number
CN102004549A
CN102004549A (application CN201010558253A)
Authority
CN
China
Prior art keywords
lip
chinese character
module
matrix
image sequence
Prior art date
Legal status
Granted
Application number
CN 201010558253
Other languages
Chinese (zh)
Other versions
CN102004549B (en)
Inventor
吕坤
贾云得
张欣
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN2010105582532A
Publication of CN102004549A
Application granted
Publication of CN102004549B
Expired - Fee Related


Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an automatic lip language recognition system suitable for Chinese, comprising a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B and a canonical correlation discriminant analysis module. The head-mounted camera records Chinese character pronunciation image sequences and transmits them through the human-computer interaction module to the lip contour positioning module, which detects and tracks the lip contour with a convolutional virtual electrostatic field Snake model. The geometric vector acquisition module and the motion vector acquisition module extract geometric and motion features, respectively, from the lip contour; these are joined into the input feature matrix of the canonical correlation discriminant analysis module, which computes the similarity between feature matrices and, after further processing, yields the recognition result. Compared with traditional lip language recognition systems, the system has higher recognition accuracy.

Description

Automatic lip language recognition system suitable for Chinese
Technical Field
The invention relates to an automatic lip language recognition system, in particular to an automatic lip language recognition system suitable for Chinese, and belongs to the technical field of automatic lip language recognition.
Background
Lip language recognition, or lip reading, is an attractive field in Human-Computer Interaction (HCI) and plays an important role in Automatic Speech Recognition (ASR) systems. Human language perception is a naturally multimodal process: people with hearing impairment make full use of lip cues, and even people with normal hearing use visual information to enhance language understanding, particularly in noisy environments. Exploiting the information of the visual channel can therefore effectively improve the performance and robustness of modern automatic speech recognition systems.
The lip language recognition task generally comprises three main steps: first, detecting the face and lip regions in a pronunciation image sequence; second, extracting features suitable for classification from the lip region; and third, performing lip language recognition using the lip region features.
For the first step, existing methods mainly use image processing algorithms to locate the face and lip regions; such methods are easily affected by illumination, viewing angle, rotation, occlusion and the like, and introduce certain errors.
The lip language features mentioned in the second step fall into three categories in the existing literature: (1) low-level texture-based features; (2) high-level contour-based features; (3) a combination of the two. Among these, the lip geometry (e.g., the height, width and angles of the lips) and the lip motion characteristics, both contour-based, are considered the most useful visual information. Much recent work on lip contour segmentation uses deformable models; one effective approach is the Snake model and its variants, such as the Gradient Vector Flow (GVF) Snake model, the Virtual Electrostatic Field (VEF) Snake model and the convolutional VEF Snake model. Among these, the convolutional VEF Snake model locates the lip contour more quickly and accurately by using a virtual electrostatic field as the external force together with a convolution mechanism.
In the third step, lip language recognition using the lip region features, a widely used classification method is the Hidden Markov Model (HMM). Hidden Markov models are useful in speech recognition because they naturally model the temporal characteristics of language. Considering the essential nature of language, however, the piecewise-stationarity and independence assumptions of hidden Markov models are two limitations of the model.
An important prior art used in the present invention is a lip tracking algorithm based on the convolutional virtual electrostatic field Snake model.
A detailed design of this algorithm is given by Lü Kun et al. in the document "Lip tracking algorithm based on a convolutional virtual electrostatic field Snake model" (The Sixth Joint Academic Conference on Harmonious Human-Computer Environment, 2010).
Another important prior art used in the present invention is the canonical correlation discriminant analysis method, Discriminant-analysis of Canonical Correlations (DCC).
T.-K. Kim et al. proposed this method in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, 2007). By introducing a transformation matrix T, the method maximizes the similarity (represented by canonical correlation coefficients) of within-class sets and minimizes the similarity of between-class sets, thereby achieving a better recognition effect.
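For clarity, the canonical correlation coefficients on which DCC relies admit a compact standard formulation (textbook linear algebra, not quoted from the patent): if two feature sets are represented by orthonormal basis matrices Q_1 and Q_2 of the subspaces their columns span, the canonical correlations are the singular values of Q_1^T Q_2, and the similarity of the two sets is their sum:

```latex
Q_1^{\top} Q_2 = U \,\Sigma\, V^{\top}, \qquad
\Sigma = \operatorname{diag}(\cos\theta_1, \ldots, \cos\theta_r), \qquad
\operatorname{sim} = \sum_{k=1}^{r} \cos\theta_k ,
\quad 1 \ge \cos\theta_1 \ge \cdots \ge \cos\theta_r \ge 0 .
```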
In recent years, the canonical correlation discriminant analysis method has been successfully applied to image set matching, face and object recognition, and related fields, so it is in theory a simple and effective approach to the lip language recognition problem. To date, however, no literature or practical application of canonical correlation discriminant analysis for automatic lip language recognition has been found.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide an automatic lip language recognition system suitable for Chinese.
The purpose of the invention is realized by the following technical scheme.
An automatic lip language recognition system suitable for Chinese comprises: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B and a canonical correlation discriminant analysis module.
The connection relationship is as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the input of the canonical correlation discriminant analysis module; and the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module.
The main functions of each module and equipment are as follows:
the main functions of the head-mounted camera are: acquiring a Chinese character pronunciation image sequence sent by a testee.
The main functions of the man-machine interaction module are as follows: providing a closed contour curve for a testee to adjust the position of the head-mounted camera, so that the lip region of the testee acquired by the head-mounted camera is contained in the closed contour curve. Acquiring a Chinese character pronunciation image sequence shot by the head-mounted camera; thirdly, outputting the result of the typical relevant discriminant analysis module.
The main functions of the lip profile positioning module are: lip contour curves are obtained by sequentially positioning the lip contour on each frame of image in a Chinese character pronunciation image sequence by using a lip tracking algorithm proposed in a document 'lip tracking algorithm based on a convolution virtual electrostatic field Snake model' by Lukun et al, and the lip contour curves are output to a geometric vector acquisition module.
The main functions of the geometric vector acquisition module are: lip geometric characteristic vectors are obtained from lip contour curves of each frame of image in the Chinese character pronunciation image sequence output by the lip contour positioning module; and in order to compensate lip difference and image scaling difference between different testees, lip geometric feature vectors are subjected to normalization operation to obtain normalized lip geometric feature vectors, and the normalized lip geometric feature vectors are output to a motion vector acquisition module and a feature matrix construction module.
The main functions of the motion vector acquisition module are: and constructing lip motion characteristic vectors of each frame of image on the basis of the lip geometric characteristic vectors subjected to normalization operation, and then outputting the lip motion characteristic vectors to a characteristic matrix construction module.
The main functions of the feature matrix construction module are: and constructing a characteristic matrix of the Chinese character pronunciation image sequence, and then outputting the characteristic matrix of the Chinese character pronunciation image sequence to a transformation matrix T acquisition module and a conversion characteristic matrix acquisition module.
The main functions of the transformation matrix T acquisition module are: a feature matrix of a Chinese character pronunciation Image sequence of training data is processed by a typical correlation discriminant Analysis method provided by T. -K.Kim et al in the document "characterization Learning And Recognition of Image Set Classes Using Canonica Correlations" (IEEE Transactions On Pattern Analysis And Machine understanding, Vo1.29, No.6(2007)), so as to obtain a transformation matrix T, And the transformation matrix T is stored in a memory A.
The main functions of the conversion characteristic matrix acquisition module are as follows: and converting the feature matrix of the Chinese character pronunciation image sequence of the training data by using the transformation matrix T in sequence to obtain a conversion feature matrix, and storing the conversion feature matrix of the Chinese character pronunciation image sequence of the training data in the memory A.
A memory A: and storing the transformation matrix T and a conversion characteristic matrix of the Chinese character pronunciation image sequence of the training data.
A memory B: and storing the conversion characteristic matrix of the Chinese character pronunciation image sequence of the test data.
A typical correlation discriminant analysis module: and acquiring the typical correlation coefficient sum of the conversion feature matrix of the current test data and the conversion feature matrix of each training data in the memory A from the memory B, further processing the typical correlation coefficient sums to obtain the identification result of the current test data, and outputting the identification result to the human-computer interaction module.
The working process of the automatic lip language recognition system comprises a system training process and a system testing process:
the working flow of the system training process is as follows:
step 1.1: select m Chinese characters as training data, where m ≥ 5 and m is a positive integer;
step 1.2: the human-computer interaction module displays a closed contour curve.
Step 1.3: the subject fixes the head-mounted camera on the head and adjusts its position so that the camera directly captures the lower half of the subject's face; the captured image is sent to the human-computer interaction module for display; the subject then adjusts the camera position again so that the lip region falls inside the closed contour curve described in step 1.2.
Step 1.4: the subject pronounces the m Chinese characters of step 1.1 at a rate of 1 Chinese character per second while the head-mounted camera shoots at n frames per second, where n ≥ 25 and n is a positive integer; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, and the n-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the human-computer interaction module.
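As a concrete illustration of this step, the sketch below grabs one n-frame pronunciation sequence with OpenCV; the library choice, the device index 0 and the helper name are assumptions of this sketch, since the patent specifies no implementation.

```python
import cv2

def record_character_sequence(cap, n_frames=30):
    """Grab one n-frame Chinese character pronunciation image sequence
    (hypothetical helper; assumes the head-mounted camera is already
    positioned so the lips fall inside the closed contour curve)."""
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError("camera read failed")
        frames.append(frame)
    return frames

# cap = cv2.VideoCapture(0)  # device index 0 is an assumption
# seq = record_character_sequence(cap, n_frames=30)  # 30 fps, 1 character/s
```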
Step 1.5: the human-computer interaction module sends the closed contour curve of step 1.2 and the Chinese character pronunciation image sequences captured by the head-mounted camera in step 1.4 to the lip contour positioning module.
Step 1.6: the lip contour positioning module locates the lip contour on each frame of the Chinese character pronunciation image sequence in turn, using the lip tracking algorithm proposed by Lü Kun et al. in the document "Lip tracking algorithm based on a convolutional virtual electrostatic field Snake model", obtains the lip contour curves, and outputs them to the geometric vector acquisition module. When locating the lip contour of the first image of each Chinese character pronunciation image sequence, the initial curve of the convolutional virtual electrostatic field Snake model is the closed contour curve provided by the human-computer interaction module; when locating the lip contours of the remaining images, the initial curve is the lip positioning result curve of the preceding image.
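The per-frame initialization scheme of step 1.6 can be written as a short loop. In the sketch below, `snake_fit(image, init_curve) -> contour` is a hypothetical stand-in for the convolutional virtual electrostatic field Snake fit of Lü Kun et al.; its internals are in the cited document, not here.

```python
def track_lip_contours(frames, init_curve, snake_fit):
    """Step 1.6 in outline: locate the lip contour on every frame of one
    Chinese character pronunciation image sequence. The first frame is
    initialised with the closed contour curve from the human-computer
    interaction module, every later frame with the previous result."""
    contours = []
    curve = init_curve
    for frame in frames:
        curve = snake_fit(frame, curve)  # result seeds the next frame
        contours.append(curve)
    return contours
```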
Step 1.7: the geometric vector acquisition module sequentially gets the data fromObtaining lip geometric characteristic vector from lip contour curve of each frame image in Chinese character pronunciation image sequence, and using giI represents the sequence number of each frame image in a Chinese character pronunciation image sequence, i is more than or equal to 1 and less than or equal to n, and i is a positive integer; and in order to compensate lip shape difference and image scaling difference among different testees, lip geometric characteristic vector g is subjected toiCarrying out normalization operation to obtain lip geometric feature vector after normalization operation, and using gi' represents; and then outputting the lip geometric feature vector after the normalization operation to a motion vector acquisition module and a feature matrix construction module. The specific operation steps for obtaining the lip geometric feature vector after the normalization operation are as follows:
step 1.7.1: compute the extreme points of the lip contour curve in the horizontal direction to obtain the coordinates of the left and right mouth corners.
Step 1.7.2: connecting the left and right nozzle corner points by a straight line, taking the midpoint of the left and right nozzle corner points as the center of a circle, and taking the center of the circle as a point O, and rotating the straight line clockwise for 5 times, wherein the rotation is 30 degrees each time; two line segments of which the straight line intersects the lip-shaped curve are obtained every time the lip-shaped curve rotates once, and 12 line segments are obtained in total, and L is respectively used in the clockwise sequence from the left mouth corner1~L12The length of the 12 line segments is expressed, and the length L of the 12 line segments is called1~L12Is a radiation vector; when a straight line between the two points at the left and right mouth corners is rotated by 90 degrees, an upper intersection point and a lower intersection point intersecting the lip-shaped curve become a point a and a point B, respectively.
Step 1.7.3: selecting one point from two points of the left and right mouth corners, namely the point Q, and respectively connecting the point Q with a point A and a point B by straight lines; angle AQO is theta1Indicating that angle BQO is theta2Is represented by L1~L12To obtain theta1And theta2To thereby obtain theta1And theta2Cosine value of (d);
step 1.7.4: L_1 to L_12 and the cosines of θ_1 and θ_2 form the lip geometric feature vector of one frame. Since L_1 and L_7 are each half the length of the line connecting the left and right mouth corners, their values are equal, so L_7 is removed from the lip geometric feature vector; that is, the lip geometric feature vector of one frame is g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2];
Step 1.7.5: to compensate lip shape difference and image scaling difference between different testees, lip geometric feature vector g is subjected toiCarrying out normalization operation to obtain lip geometric feature vector after normalization operation, and using gi' represents; gi' is a 13-dimensional transverse vector, gi′=[L1′,…,L6′,L8′,…L12′,cosθ1,cosθ2](ii) a Wherein,
Figure BSA00000359669100061
j=1,2,…6,8,…,12,
Figure BSA00000359669100062
is the distance between the left and right corners of the mouth in the first frame of image of a sequence of Chinese character pronunciation images.
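A sketch of steps 1.7.1 to 1.7.5 follows, assuming the lip contour is an (N, 2) NumPy array of (x, y) points. Approximating each ray/contour intersection by the contour point nearest in polar angle around O, and taking the 4th and 10th radial directions as the 90-degree segments through A and B, are assumptions of this sketch rather than prescriptions of the patent.

```python
import numpy as np

def lip_geometric_vector(contour, d1):
    """Return the normalised 13-dimensional vector g_i' of one frame."""
    contour = np.asarray(contour, dtype=float)
    left = contour[np.argmin(contour[:, 0])]     # left mouth corner (step 1.7.1)
    right = contour[np.argmax(contour[:, 0])]    # right mouth corner
    O = (left + right) / 2.0                     # rotation centre (step 1.7.2)

    rel = contour - O
    ang = np.arctan2(rel[:, 1], rel[:, 0])       # polar angle of each point
    dist = np.hypot(rel[:, 0], rel[:, 1])        # distance of each point to O

    base = np.arctan2(left[1] - O[1], left[0] - O[0])
    L = []
    for k in range(12):                          # 6 line positions x 2 ends
        theta = base + np.pi * k / 6.0
        diff = np.angle(np.exp(1j * (ang - theta)))  # wrapped angle difference
        L.append(dist[np.argmin(np.abs(diff))])  # nearest-angle approximation
    L = np.asarray(L)                            # radial lengths L_1..L_12

    # A and B: intersections of the 90-degree line (indices are assumptions).
    A_len, B_len = L[3], L[9]
    half = L[0]                                  # |QO|, half corner distance
    cos1 = half / np.hypot(half, A_len)          # cos of angle AQO (step 1.7.3)
    cos2 = half / np.hypot(half, B_len)          # cos of angle BQO

    lengths = np.r_[L[:6], L[7:]] / d1           # drop L_7, normalise (1.7.5)
    return np.concatenate([lengths, [cos1, cos2]])
```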
Step 1.8: the motion vector acquisition module constructs lip motion characteristic vectors (p is used) of each frame of image on the basis of the lip geometric characteristic vectors subjected to normalization operationiRepresents) p)iIs a 13-dimensional transverse vector, pi=(gi′-gi-1')/Δ t, wherein g0′=g1', Δ t is the time interval of two consecutive frames; then the lip movement characteristic vector piOutputting the data to a feature matrix construction module;
step 1.9: the feature matrix construction module constructs the feature matrix of each Chinese character pronunciation image sequence of the training data, denoted Z_f, where f is the index of the training sequence, 1 ≤ f ≤ m and f is a positive integer, and outputs the feature matrices Z_f to the transformation matrix T acquisition module and the transformed feature matrix acquisition module. The specific steps for constructing the feature matrix of a Chinese character pronunciation image sequence are as follows:
step 1.9.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector, denoted v_i; v_i is a 26-dimensional column vector, v_i = [g_i′, p_i]^T.
step 1.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by combining the joint feature vectors v_i of all frames, so that the feature matrix of a training sequence is Z_f = {v_1, v_2, …, v_n} ∈ R^{26×n}.
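Steps 1.8 and 1.9 amount to a few lines of array manipulation; the sketch below stacks the normalized geometric vectors and the motion vectors derived from them into the 26 × n feature matrix Z (NumPy; the helper name is illustrative).

```python
import numpy as np

def sequence_feature_matrix(G, dt):
    """Build the 26 x n feature matrix Z of one pronunciation sequence.

    G is an (n, 13) array whose rows are the normalised geometric vectors
    g_i'; motion vectors follow step 1.8, p_i = (g_i' - g_{i-1}')/dt with
    g_0' = g_1', and each column of Z is the joint vector v_i = [g_i', p_i]^T."""
    G = np.asarray(G, dtype=float)
    prev = np.vstack([G[:1], G[:-1]])        # shift by one frame, g_0' = g_1'
    P = (G - prev) / dt                      # lip motion vectors p_i
    return np.hstack([G, P]).T               # shape (26, n)

# Z_f = sequence_feature_matrix(G, dt=1.0 / 30)  # 30 fps in the embodiment
```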
Step 1.10: the transformation matrix T acquisition module processes the feature matrices Z_f of the m Chinese character pronunciation image sequences of the training data with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. to obtain the transformation matrix T ∈ R^{26×r}, where r < 26, r is a positive integer and R denotes the set of real numbers; the transformation matrix T is stored in memory A.
Step 1.11: the transformed feature matrix acquisition module reads the transformation matrix T from memory A and uses it to transform the feature matrices Z_f of the Chinese character pronunciation image sequences of the training data in turn, obtaining the transformed feature matrices Z_f′ = T^T Z_f, which are stored in memory A.
Through the operation of the steps, the training of the automatic lip language recognition system can be completed.
The working flow of the system testing process is as follows:
step 2.1: select m′ Chinese characters from the m Chinese characters of the training data as test data, where m′ ≤ m and m′ is a positive integer.
Step 2.2: the human-computer interaction module displays a closed contour curve.
Step 2.3: the subject fixes the head-mounted camera on the head and adjusts its position so that the camera directly captures the lower half of the subject's face; the captured image is sent to the human-computer interaction module for display; the subject then adjusts the camera position again so that the lip region falls inside the closed contour curve described in step 2.2.
Step 2.4: the subject pronounces the m′ Chinese characters of step 2.1 at a rate of 1 Chinese character per second while the head-mounted camera shoots at n frames per second; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the human-computer interaction module.
Step 2.5: the human-computer interaction module sends the closed contour curve of step 2.2 and the Chinese character pronunciation image sequences of step 2.4 to the lip contour positioning module.
Step 2.6: the same as step 1.6 in the system training process.
Step 2.7: the same as step 1.7 in the system training process.
Step 2.8: the same as step 1.8 in the system training process.
Step 2.9: feature matrix construction module constructs feature matrix (using Z) of Chinese character pronunciation image sequence of test dataeRepresenting, wherein e represents the sequence number of the Chinese character pronunciation image sequence of the test data, e is more than or equal to 1 and less than or equal to m' and e is a positive integer), and then testing the characteristic matrix Z of the Chinese character pronunciation image sequence of the test dataeAnd outputting the data to a conversion feature matrix acquisition module. Specific method for constructing feature matrix of Chinese character pronunciation image sequenceThe operation steps are as follows:
step 2.9.1: the following operations are sequentially carried out on each frame image in the Chinese character pronunciation image sequence: connecting the lip geometric characteristic vector with the lip movement characteristic vector to form a combined characteristic vector vi,viIs a 26-dimensional column vector and,
Figure BSA00000359669100081
step 2.9.2: the feature matrix of the Chinese character pronunciation image sequence is composed of the joint feature vector v of each frame image in the Chinese character pronunciation image sequenceiCombined so that the feature matrix Z of the phonetic image sequence of Chinese characters of the test datae={v1,v2,...,vn}∈R26×n
Step 2.10: the conversion characteristic matrix acquisition module reads the transformation matrix T from the memory A and uses the transformation matrix T to test the characteristic matrix Z of the Chinese character pronunciation image sequence of the dataeConverting to obtain a conversion characteristic matrix Ze′=TTZeAnd converting the character feature matrix Z of the Chinese character pronunciation image sequence of the test datae' store to memory B.
Step 2.11: the typical correlation discriminant analysis module reads a conversion feature matrix Z of all training data from a memory Af' reading the conversion characteristic matrix Z of the Chinese character pronunciation image sequence of the current test data from the memory BeKim et al then calculates the transformation feature matrix Z of the test data Using a typical correlation discriminant Analysis method Set forth in the document "characterization Learning And correlation of Image Set Classes Using Canonica Correlations" (IEEE Transactions On Pattern Analysis And Machine Analysis, Vol.29, No.6(2007))e' conversion feature matrix Z with each training datafThe sum of typical correlation coefficients of'; because repeated Chinese characters may exist in the training data, the sum of typical correlation coefficients corresponding to the same Chinese character is 1 or more than 1, so thatAnd further calculating the average value of the typical correlation coefficient sum corresponding to each Chinese character in the training data, taking out the maximum value from the average values, and outputting the Chinese character corresponding to the maximum value in the training data to the man-machine interaction module.
Step 2.12: the man-machine interaction module displays the Chinese characters transmitted by the typical relevant discriminant analysis module.
Through the steps, the automatic identification of the test data can be completed.
Advantageous effects
Compared with traditional Chinese automatic lip language recognition systems, the invention has the following advantages:
First, the invention uses a head-mounted camera to acquire the lip image sequence directly; the camera position is adjusted interactively when the experiment begins, and the relative position of camera and face stays fixed during the experiment, so the subject can pronounce the Chinese characters naturally without deliberately holding the head pose and position. Compared with previous methods, this acquires the lip image sequence accurately, greatly reduces the early-stage computation, lessens the constraints on the subject and makes the experimental process more natural.
Second, the invention uses the convolutional virtual electrostatic field Snake model to locate the lip contour, which is faster and more accurate.
Third, the lip language features extracted by the invention combine lip geometric features with lip motion features, making the analysis more accurate.
Fourth, the invention applies the canonical correlation discriminant analysis method to automatic lip language recognition for the first time, overcoming the limitations of the hidden Markov model in language recognition.
Drawings
Fig. 1 is a schematic structural diagram of the automatic lip language recognition system for Chinese according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
An automatic lip language recognition system suitable for Chinese, whose structure is shown in Fig. 1, comprises: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B and a canonical correlation discriminant analysis module.
The connection relationship is as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the input of the canonical correlation discriminant analysis module; and the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module.
The main functions of each module and device are as follows:
The main functions of the head-mounted camera are: acquiring the Chinese character pronunciation image sequences produced by the subject.
The main functions of the human-computer interaction module are: first, providing a closed contour curve with which the subject adjusts the position of the head-mounted camera so that the subject's lip region captured by the camera falls inside the closed contour curve; second, acquiring the Chinese character pronunciation image sequences shot by the head-mounted camera; third, outputting the result of the canonical correlation discriminant analysis module.
The main functions of the lip contour positioning module are: locating the lip contour on each frame of a Chinese character pronunciation image sequence in turn, using the lip tracking algorithm proposed by Lü Kun et al. in the document "Lip tracking algorithm based on a convolutional virtual electrostatic field Snake model", to obtain lip contour curves, and outputting the lip contour curves to the geometric vector acquisition module.
The main functions of the geometric vector acquisition module are: obtaining a lip geometric feature vector from the lip contour curve of each frame of the Chinese character pronunciation image sequence output by the lip contour positioning module; and, to compensate for lip-shape differences between subjects and image scaling differences, normalizing the lip geometric feature vectors and outputting the normalized lip geometric feature vectors to the motion vector acquisition module and the feature matrix construction module.
The main functions of the motion vector acquisition module are: constructing the lip motion feature vector of each frame on the basis of the normalized lip geometric feature vectors, and then outputting the lip motion feature vectors to the feature matrix construction module.
The main functions of the feature matrix construction module are: constructing the feature matrix of a Chinese character pronunciation image sequence, and then outputting it to the transformation matrix T acquisition module and the transformed feature matrix acquisition module.
The main functions of the transformation matrix T acquisition module are: processing the feature matrices of the Chinese character pronunciation image sequences of the training data with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, 2007) to obtain the transformation matrix T, and storing T in memory A.
The main functions of the transformed feature matrix acquisition module are: transforming the feature matrices of the Chinese character pronunciation image sequences of the training data in turn with the transformation matrix T to obtain transformed feature matrices, and storing the transformed feature matrices of the training data in memory A.
Memory A: stores the transformation matrix T and the transformed feature matrices of the Chinese character pronunciation image sequences of the training data.
Memory B: stores the transformed feature matrices of the Chinese character pronunciation image sequences of the test data.
The canonical correlation discriminant analysis module: obtains from memory B the transformed feature matrix of the current test data, computes the sum of canonical correlation coefficients between it and the transformed feature matrix of each training datum in memory A, further processes these sums to obtain the recognition result for the current test data, and outputs the result to the human-computer interaction module.
The system was used to carry out experiments. 10 subjects (4 male, 6 female) were selected; each pronounced the 10 Chinese characters "zero, one, two, three, four, five, I, love, Bei, Jing" 20 times each, so 200 Chinese character pronunciation image sequences were obtained per character. For each Chinese character, 80% (160) of its 200 sequences were then randomly selected as training data and the remaining 20% (40) served as test data, giving 1600 training data and 400 test data in total.
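The 80/20 split described above is straightforward to reproduce; the sketch below assumes the 200 sequences of each Chinese character are grouped in a dictionary keyed by character, and the fixed seed is only for repeatability.

```python
import random

def split_per_character(sequences_by_char, n_train=160, seed=0):
    """For each Chinese character, randomly keep n_train of its 200
    sequences for training and the rest for testing (seed is an
    assumption of this sketch, for repeatability)."""
    rng = random.Random(seed)
    train, test = [], []
    for ch, seqs in sequences_by_char.items():
        seqs = list(seqs)
        rng.shuffle(seqs)
        train += [(ch, s) for s in seqs[:n_train]]
        test += [(ch, s) for s in seqs[n_train:]]
    return train, test   # 1600 training and 400 test items for 10 x 200
```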
The steps for obtaining 2000 Chinese character pronunciation image sequences are as follows:
step 1: the human-computer interaction module displays a closed contour curve.
Step 2: the 10 subjects in turn fix the head-mounted camera on the head; each subject adjusts the camera position so that it directly captures the lower half of the subject's face, the captured image is sent to the human-computer interaction module for display, and the subject adjusts the camera position again so that the lip region falls inside the closed contour curve described in step 1.
Step 3: each subject pronounces the 10 Chinese characters "zero, one, two, three, four, five, I, love, Bei, Jing" at a rate of 1 Chinese character per second, pronouncing each character 20 times, while the head-mounted camera shoots at 30 frames per second, so the video stream of each Chinese character pronunciation consists of a 30-frame image sequence; a 30-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence.
Through the above steps, 2000 Chinese character pronunciation image sequences of the 10 Chinese characters are obtained.
Then the experimenter trains the system with the 1600 randomly selected Chinese character pronunciation image sequences as training data; the process is as follows:
step 1: and sending the closed contour curve appearing in the man-machine interaction module and 1600 Chinese character pronunciation image sequences to the lip contour positioning module.
Step 2: the lip contour positioning module uses a lip tracking algorithm proposed by Lukun et al in the literature 'lip tracking algorithm based on convolution virtual electrostatic field Snake model' to sequentially position the lip contour on each frame of image in the Chinese character pronunciation image sequence to obtain a lip contour curve, and outputs the lip contour curve to the geometric vector acquisition module. When the lip outline of the first image in each Chinese character pronunciation image sequence is positioned, the initial curve of the convolution virtual electrostatic field Snake model adopts a closed outline curve provided by a man-machine interaction module; when the lip contours of other images in the Chinese character pronunciation image sequence are positioned, the initial curve of the convolution virtual electrostatic field Snake model adopts the lip positioning result curve of the previous image of the images.
Step 3: the geometric vector acquisition module obtains the lip geometric feature vectors, denoted g_1 to g_30, from the lip contour curve of each frame of the Chinese character pronunciation image sequence in turn; to compensate for lip-shape differences between subjects and image scaling differences, g_1 to g_30 are normalized to give the normalized lip geometric feature vectors g_1′ to g_30′; the normalized lip geometric feature vectors are then output to the motion vector acquisition module and the feature matrix construction module. The specific steps for obtaining the normalized lip geometric feature vectors are as follows:
Step 3.1: compute the extreme points of the lip contour curve in the horizontal direction to obtain the coordinates of the left and right mouth corners.
Step 3.2: connect the left and right mouth corner points with a straight line and take the midpoint of this line, denoted point O, as the center of rotation; rotate the line clockwise 5 times, by 30 degrees each time. The initial line and each rotated position yield the two line segments from point O to the line's intersection points with the lip contour curve, giving 12 segments in total; their lengths are denoted L_1 to L_12 in clockwise order starting from the left mouth corner, and the lengths L_1 to L_12 are called the radial vector. When the line through the two mouth corners has been rotated by 90 degrees, its upper and lower intersection points with the lip contour curve are denoted point A and point B, respectively.
Step 3.3: the left mouth corner is denoted point Q, and Q is connected to point A and to point B with straight lines; angle AQO is denoted θ_1 and angle BQO is denoted θ_2; θ_1 and θ_2 are obtained from L_1 to L_12, and thus the cosines of θ_1 and θ_2 are obtained;
Step 3.4: L_1 to L_12 and the cosines of θ_1 and θ_2 form the lip geometric feature vector of one frame. Since L_1 and L_7 are each half the length of the line connecting the left and right mouth corners, their values are equal, so L_7 is removed from the lip geometric feature vector; that is, the lip geometric feature vector of one frame is g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2], i = 1, 2, …, 30;
Step 3.5: to compensate for lip-shape differences between subjects and image scaling differences, normalize g_i to obtain the normalized lip geometric feature vector g_i′; g_i′ is a 13-dimensional row vector, g_i′ = [L_1′, …, L_6′, L_8′, …, L_12′, cos θ_1, cos θ_2], where L_j′ = L_j / d_1 for j = 1, 2, …, 6, 8, …, 12, and d_1 denotes the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence.
Step 4: the motion vector acquisition module constructs the lip motion feature vector p_i of each frame on the basis of the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i′ − g_{i−1}′)/Δt, where g_0′ = g_1′ and Δt is the time interval between two consecutive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module.
and 5: feature matrix construction module constructs feature matrix Z of Chinese character pronunciation image sequence of training datafF is 1, 2, …, 1600, and then training the feature matrix Z of the Chinese character pronunciation image sequence of the datafAnd respectively outputting the data to a transformation matrix T acquisition module and a conversion characteristic matrix acquisition module. The specific operation steps for constructing the feature matrix of the Chinese character pronunciation image sequence are as follows:
step 5.1: in turn, theThe following operations are carried out on each frame image in the Chinese character pronunciation image sequence: connecting the lip geometric characteristic vector with the lip movement characteristic vector to form a combined characteristic vector vi,viIs a 26-dimensional column vector and,
Figure BSA00000359669100133
step 5.2: the feature matrix of the Chinese character pronunciation image sequence is composed of the joint feature vector v of each frame image in the Chinese character pronunciation image sequenceiCombined so that the feature matrix Z of the image sequence of the pronunciation of Chinese characters of the training dataf={v1,v2,...,vn}∈R26×30
Step 1.6: transformation matrix T obtains the characteristic matrix Z of the Chinese character pronunciation image sequence of the module to 1600 training datafAnd processing by adopting a typical correlation discriminant analysis method proposed by T.
Step 1.7: the conversion characteristic matrix acquisition module reads the transformation matrix T from the memory A and uses the transformation matrix T to sequentially compare the characteristic matrix Z of the Chinese character pronunciation image sequence of the training datafConverting to obtain a conversion characteristic matrix Zf′=TTZfAnd training the conversion characteristic matrix Z of the Chinese character pronunciation image sequence of the dataf' store to memory a.
Through the operation of the steps, the training of the automatic lip language recognition system can be completed.
After the automatic lip language recognition system is trained, an experimenter uses 400 pieces of test data to test the system, and the process is as follows:
step 1: and sending the closed contour curve appearing in the human-computer interaction module and the 400 Chinese character pronunciation image sequences to the lip contour positioning module.
Step 2: the lip contour positioning module uses a lip tracking algorithm proposed by Lukun et al in the literature 'lip tracking algorithm based on convolution virtual electrostatic field Snake model' to sequentially position the lip contour on each frame of image in the Chinese character pronunciation image sequence to obtain a lip contour curve, and outputs the lip contour curve to the geometric vector acquisition module. When the lip outline of the first image in each Chinese character pronunciation image sequence is positioned, the initial curve of the convolution virtual electrostatic field Snake model adopts a closed outline curve provided by a man-machine interaction module; when the lip contours of other images in the Chinese character pronunciation image sequence are positioned, the initial curve of the convolution virtual electrostatic field Snake model adopts the lip positioning result curve of the previous image of the images.
Step 3: the geometric vector acquisition module obtains the lip geometric feature vectors g_1 to g_30 from the lip contour curve of each frame of the Chinese character pronunciation image sequence in turn; to compensate for lip-shape differences between subjects and image scaling differences, g_1 to g_30 are normalized to give the normalized lip geometric feature vectors g_1′ to g_30′; the normalized lip geometric feature vectors are then output to the motion vector acquisition module and the feature matrix construction module. The specific steps for obtaining the normalized lip geometric feature vectors are as follows:
Step 3.1: compute the extreme points of the lip contour curve in the horizontal direction to obtain the coordinates of the left and right mouth corners.
Step 3.2: connect the left and right mouth corner points with a straight line and take the midpoint of this line, denoted point O, as the center of rotation; rotate the line clockwise 5 times, by 30 degrees each time. The initial line and each rotated position yield the two line segments from point O to the line's intersection points with the lip contour curve, giving 12 segments in total; their lengths are denoted L_1 to L_12 in clockwise order starting from the left mouth corner, and the lengths L_1 to L_12 are called the radial vector. When the line through the two mouth corners has been rotated by 90 degrees, its upper and lower intersection points with the lip contour curve are denoted point A and point B, respectively.
Step 3.3: the left mouth corner is denoted point Q, and Q is connected to point A and to point B with straight lines; angle AQO is denoted θ_1 and angle BQO is denoted θ_2; θ_1 and θ_2 are obtained from L_1 to L_12, and thus the cosines of θ_1 and θ_2 are obtained;
Step 3.4: L_1 to L_12 and the cosines of θ_1 and θ_2 form the lip geometric feature vector of one frame. Since L_1 and L_7 are each half the length of the line connecting the left and right mouth corners, their values are equal, so L_7 is removed from the lip geometric feature vector; that is, the lip geometric feature vector of one frame is g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2], i = 1, 2, …, 30;
Step 3.5: to compensate for lip-shape differences between subjects and image scaling differences, normalize g_i to obtain the normalized lip geometric feature vector g_i′; g_i′ is a 13-dimensional row vector, g_i′ = [L_1′, …, L_6′, L_8′, …, L_12′, cos θ_1, cos θ_2], where L_j′ = L_j / d_1 for j = 1, 2, …, 6, 8, …, 12, and d_1 denotes the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence.
Step 4: the motion vector acquisition module constructs the lip motion feature vector p_i of each frame on the basis of the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i′ − g_{i−1}′)/Δt, where g_0′ = g_1′ and Δt is the time interval between two consecutive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module.
Step 5: the feature matrix construction module constructs the feature matrix Z_e of each Chinese character pronunciation image sequence of the test data, e = 1, 2, …, 400, and outputs the feature matrices Z_e to the transformed feature matrix acquisition module. The specific steps for constructing the feature matrix of a Chinese character pronunciation image sequence are as follows:
Step 5.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i; v_i is a 26-dimensional column vector, v_i = [g_i′, p_i]^T.
Step 5.2: the feature matrix of the Chinese character pronunciation image sequence is formed by combining the joint feature vectors v_i of all frames, so that the feature matrix of a test sequence is Z_e = {v_1, v_2, …, v_30} ∈ R^{26×30}.
Step 6: the transformed feature matrix acquisition module reads the transformation matrix T from memory A and uses it to transform the feature matrix Z_e of each Chinese character pronunciation image sequence of the test data, obtaining the transformed feature matrices Z_e′ = T^T Z_e, which are stored in memory B.
Step 7: the canonical correlation discriminant analysis module reads the transformed feature matrices Z_f′ of all training data from memory A and the transformed feature matrix Z_e′ of the current test data from memory B, then computes, with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al., the sum of canonical correlation coefficients between Z_e′ and the transformed feature matrix Z_f′ of each training datum. Because the training data contain repeated Chinese characters, one or more such sums correspond to the same character; the module therefore computes the average of the sums corresponding to each Chinese character in the training data, takes the maximum of these averages, and outputs the Chinese character corresponding to that maximum to the human-computer interaction module.
Step 8: the human-computer interaction module displays the Chinese character transmitted by the canonical correlation discriminant analysis module.
Through the above steps, automatic recognition of the test data is completed; the recognition accuracy of the system is shown in the second column of Table 1. To illustrate the effect of the invention, 2 further experiments were performed:
1. With the same experimental environment, training data and test data, the convolutional virtual electrostatic field Snake model used in the invention was replaced by the traditional Snake model, with all other functions unchanged; the resulting recognition accuracy is shown in the third column of Table 1.
2. With the same experimental environment, training data and test data, the canonical correlation discriminant analysis method used in the invention was replaced by a Continuous Hidden Markov Model (CHMM), with all other functions unchanged; the resulting recognition accuracy is shown in the fourth column of Table 1.
TABLE 1. Comparison of recognition accuracy (%) of the different methods

Chinese character   Invention   Traditional Snake   CHMM
"zero"              90.0        73.5                88.5
"one"               92.0        75.0                90.5
"two"               86.5        76.0                83.0
"three"             93.0        81.5                92.5
"four"              95.0        83.0                95.5
"five"              89.5        73.0                91.0
"I"                 96.0        82.0                95.0
"love"              97.0        82.5                95.5
"Bei"               93.5        81.5                94.0
"Jing"              90.0        75.5                88.0
The experiments show that the system provided by the invention has a higher recognition accuracy.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make several modifications, or substitute equivalents for some of its features, without departing from the scope of the present invention, and such modifications and substitutions shall also fall within the protection scope of the present invention.

Claims (1)

1. An automatic lip language recognition system suitable for Chinese, comprising: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B and a canonical correlation discriminant analysis module;
the connection relationship is as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the input of the canonical correlation discriminant analysis module; and the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module;
the main functions of each module and device are as follows:
the main functions of the head-mounted camera are: acquiring the Chinese character pronunciation image sequences produced by the subject;
the main functions of the human-computer interaction module are: first, providing a closed contour curve with which the subject adjusts the position of the head-mounted camera so that the subject's lip region captured by the camera falls inside the closed contour curve; second, acquiring the Chinese character pronunciation image sequences shot by the head-mounted camera; third, outputting the result of the canonical correlation discriminant analysis module;
the main functions of the lip contour positioning module are: locating the lip contour on each frame of a Chinese character pronunciation image sequence in turn, using the lip tracking algorithm proposed by Lü Kun et al. in the document "Lip tracking algorithm based on a convolutional virtual electrostatic field Snake model", to obtain lip contour curves, and outputting the lip contour curves to the geometric vector acquisition module;
the main functions of the geometric vector acquisition module are: obtaining a lip geometric feature vector from the lip contour curve of each frame of the Chinese character pronunciation image sequence output by the lip contour positioning module; and, to compensate for lip-shape differences between subjects and image scaling differences, normalizing the lip geometric feature vectors and outputting the normalized lip geometric feature vectors to the motion vector acquisition module and the feature matrix construction module;
the main functions of the motion vector acquisition module are: constructing the lip motion feature vector of each frame on the basis of the normalized lip geometric feature vectors, and then outputting the lip motion feature vectors to the feature matrix construction module;
the main functions of the feature matrix construction module are: constructing the feature matrix of a Chinese character pronunciation image sequence, and then outputting it to the transformation matrix T acquisition module and the transformed feature matrix acquisition module;
the main functions of the transformation matrix T acquisition module are: processing the feature matrices of the Chinese character pronunciation image sequences of the training data with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" to obtain the transformation matrix T, and storing the transformation matrix T in memory A;
the main functions of the transformed feature matrix acquisition module are: transforming the feature matrices of the Chinese character pronunciation image sequences of the training data in turn with the transformation matrix T to obtain transformed feature matrices, and storing the transformed feature matrices of the training data in memory A;
memory A: stores the transformation matrix T and the transformed feature matrices of the Chinese character pronunciation image sequences of the training data;
memory B: stores the transformed feature matrices of the Chinese character pronunciation image sequences of the test data;
the canonical correlation discriminant analysis module: obtains from memory B the transformed feature matrix of the current test data, computes the sum of canonical correlation coefficients between it and the transformed feature matrix of each training datum in memory A, further processes these sums to obtain the recognition result for the current test data, and outputs the recognition result to the human-computer interaction module;
the working process of the automatic lip language recognition system comprises a system training process and a system testing process:
the working flow of the system training process is as follows:
step 1.1: selecting m Chinese characters as training data, where m ≥ 5 and m is a positive integer;
step 1.2: the man-machine interaction module displays a closed contour curve;
step 1.3: the subject fixes the head-mounted camera on the head and adjusts its position so that the camera directly captures the lower half of the subject's face; the captured image is sent to the man-machine interaction module for display; the subject then adjusts the position of the head-mounted camera again so that the subject's lip region is contained in the closed contour curve described in step 1.2;
step 1.4: the subject pronounces the m Chinese characters of step 1.1 at a rate of 1 Chinese character per second, while the head-mounted camera shoots at n frames per second, where n ≥ 25 and n is a positive integer; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, and the n-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the man-machine interaction module;
step 1.5: the man-machine interaction module sends the closed contour curve of step 1.2 and the Chinese character pronunciation image sequences shot by the head-mounted camera in step 1.4 to the lip contour positioning module;
step 1.6: the lip contour positioning module uses the lip tracking algorithm proposed by Lü Kun et al. in the document "Lip tracking algorithm based on convolution virtual electrostatic field Snake model" to sequentially locate the lip contour on each frame image of the Chinese character pronunciation image sequence, obtaining a lip contour curve, and outputs the lip contour curve to the geometric vector acquisition module; when locating the lip contour of the first image of each Chinese character pronunciation image sequence, the initial curve of the convolution virtual electrostatic field Snake model is the closed contour curve provided by the man-machine interaction module; when locating the lip contours of the other images of the sequence, the initial curve of the convolution virtual electrostatic field Snake model is the lip positioning result curve of the preceding image;
step 1.7: the geometric vector acquisition module sequentially obtains a lip geometric feature vector, denoted g_i, from the lip contour curve of each frame image of a Chinese character pronunciation image sequence, where i denotes the sequence number of the frame image within the sequence, 1 ≤ i ≤ n, and i is a positive integer; to compensate for lip-shape differences and image-scaling differences between different subjects, the lip geometric feature vector g_i is normalized to obtain the normalized lip geometric feature vector, denoted g_i′; the normalized lip geometric feature vectors are then output to the motion vector acquisition module and the feature matrix construction module; the specific operations for obtaining the normalized lip geometric feature vector are as follows:
step 1.7.1: computing the extreme values of the lip contour curve in the horizontal direction to obtain the coordinates of the left and right mouth corner points;
step 1.7.2: connecting the left and right mouth corner points with a straight line and denoting the midpoint of the two corner points as point O; rotating the straight line clockwise about O 5 times, by 30 degrees each rotation; at each position the line intersects the lip contour curve at two points, yielding two line segments from O, so that together with the two segments along the original corner-to-corner line 12 line segments are obtained in total; in clockwise order starting from the left mouth corner their lengths are denoted L1~L12, and the lengths L1~L12 are called the radial vector; when the line connecting the two mouth corner points has been rotated by 90 degrees, its upper and lower intersection points with the lip contour curve are denoted point A and point B respectively;
step 1.7.3: selecting one of the two mouth corner points, denoted point Q, and connecting Q to point A and to point B with straight lines; denoting angle AQO by θ1 and angle BQO by θ2; θ1 and θ2 are obtained from L1~L12, and the cosine values of θ1 and θ2 are then computed;
step 1.7.4: l is1~L12And theta1And theta2The cosine value of the image forms a lip geometric feature vector in a frame of image; due to L1And L7Is half the length of the line connecting the left and right mouth corners, so that their values are equal, thus removing L from the geometric feature vector of the lips7I.e. geometric feature vector of lips in a frame of imagegi=[L1,…,L6,L8,…L12,cosθ1,cosθ2]t
step 1.7.5: to compensate for lip-shape differences and image-scaling differences between different subjects, the lip geometric feature vector g_i is normalized to obtain the normalized lip geometric feature vector, denoted g_i′; g_i′ is the 13-dimensional row vector g_i′ = [L1′, …, L6′, L8′, …, L12′, cosθ1, cosθ2], where L_j′ = L_j / d_0, j = 1, 2, …, 6, 8, …, 12, and d_0 denotes the distance between the left and right mouth corners in the first frame image of the Chinese character pronunciation image sequence;
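As an illustration only, the following Python sketch shows one way the radial-vector extraction of steps 1.7.1 through 1.7.5 could be realized; the function name, the ray-to-contour-point approximation, and the argument d0 (the mouth-corner distance of the first frame, assumed precomputed) are assumptions of this sketch, not part of the claim:

    import numpy as np

    def lip_geometric_vector(contour, d0):
        """Illustrative reading of steps 1.7.1-1.7.5: 12 radial lengths from
        the mouth-corner midpoint O (L7 dropped), plus cos(theta1) and
        cos(theta2), normalized by d0, the mouth-corner distance measured in
        the first frame. `contour` is an (N, 2) float array of lip points."""
        # Step 1.7.1: mouth corners are the horizontal extremes of the contour.
        left = contour[np.argmin(contour[:, 0])]
        right = contour[np.argmax(contour[:, 0])]
        O = (left + right) / 2.0

        # Step 1.7.2: 12 rays from O, 30 degrees apart, clockwise from the
        # left-corner direction; each radial length is approximated by the
        # distance to the contour point whose bearing about O is closest to
        # the ray direction.
        rel = contour - O
        bearing = np.arctan2(rel[:, 1], rel[:, 0])
        dist = np.linalg.norm(rel, axis=1)
        L = np.empty(12)
        for k in range(12):
            target = np.pi - k * np.pi / 6.0          # 180, 150, ..., -150 deg
            wrapped = np.angle(np.exp(1j * (bearing - target)))
            L[k] = dist[np.argmin(np.abs(wrapped))]

        # Step 1.7.3: with Q = left corner, OA (= L4) and OB (= L10) are
        # perpendicular to QO (= L1), so the angles follow by trigonometry.
        theta1 = np.arctan2(L[3], L[0])               # angle AQO
        theta2 = np.arctan2(L[9], L[0])               # angle BQO

        # Steps 1.7.4-1.7.5: drop L7 (it equals L1), scale the lengths by
        # 1/d0, and append the two cosines -> 13-dimensional g_i'.
        lengths = np.delete(L, 6) / d0
        return np.concatenate([lengths, [np.cos(theta1), np.cos(theta2)]])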
step 1.8: the motion vector acquisition module constructs the lip motion feature vector of each frame image, denoted p_i, from the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i′ − g_{i−1}′) / Δt, where g_0′ = g_1′ and Δt is the time interval between two consecutive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module;
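A minimal sketch of the finite difference in step 1.8, assuming the normalized geometric vectors are stacked row-wise in an array G and the frame rate is known (function and argument names are hypothetical):

    import numpy as np

    def lip_motion_vectors(G, fps):
        """Step 1.8 sketch: p_i = (g_i' - g_{i-1}') / dt with g_0' = g_1',
        so the first frame's motion vector is zero. G is an (n, 13) float
        array whose i-th row is g_i'; fps is the camera frame rate n,
        giving dt = 1 / fps."""
        dt = 1.0 / fps
        P = np.zeros_like(G)             # row 0: p_1 = (g_1' - g_0')/dt = 0
        P[1:] = (G[1:] - G[:-1]) / dt
        return P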
step 1.9: the feature matrix construction module constructs the feature matrix of each Chinese character pronunciation image sequence of the training data, denoted Z_f, where f denotes the sequence number of the Chinese character pronunciation image sequence within the training data, 1 ≤ f ≤ m, and f is a positive integer; the feature matrices Z_f of the Chinese character pronunciation image sequences of the training data are then output to the transformation matrix T acquisition module and the converted feature matrix acquisition module; the specific operations for constructing the feature matrix of a Chinese character pronunciation image sequence are as follows:
step 1.9.1: for each frame image of the Chinese character pronunciation image sequence in turn, concatenating the lip geometric feature vector and the lip motion feature vector into a joint feature vector, denoted v_i; v_i is the 26-dimensional column vector v_i = [g_i′, p_i]^T;
step 1.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by combining the joint feature vectors v_i of all frame images of the sequence, so that the feature matrix of a Chinese character pronunciation image sequence of the training data is Z_f = {v_1, v_2, …, v_n} ∈ R^(26×n);
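The assembly of Z_f in step 1.9 is a plain column-wise stacking; a one-function sketch under the same array conventions as above (names are hypothetical):

    import numpy as np

    def feature_matrix(G, P):
        """Step 1.9 sketch: column i of Z is the joint vector
        v_i = [g_i', p_i]^T, giving Z in R^(26 x n). G and P are (n, 13)
        arrays of geometric and motion feature vectors."""
        return np.concatenate([G, P], axis=1).T      # shape (26, n)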
step 1.10: the transformation matrix T acquisition module processes the feature matrices Z_f of the Chinese character pronunciation image sequences of the m training data with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. to obtain the transformation matrix T ∈ R^(26×r), r < 26, where r is a positive integer and R denotes the set of real numbers, and stores the transformation matrix T in the memory A;
step 1.11: the converted feature matrix acquisition module reads the transformation matrix T from the memory A and uses it to convert the feature matrices Z_f of the Chinese character pronunciation image sequences of the training data in turn, obtaining the converted feature matrices Z_f′ = T^T Z_f, and stores the converted feature matrices Z_f′ of the Chinese character pronunciation image sequences of the training data in the memory A;
through the above steps, the training of the automatic lip language recognition system is completed;
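Learning the transformation matrix T itself follows the canonical correlation discriminant analysis of T.-K. Kim et al. and is not reproduced here; the sketch below only shows the conversion of step 1.11, with T assumed given as a 26×r array (names are hypothetical):

    import numpy as np

    def convert_feature_matrix(T, Z):
        """Step 1.11 (and step 2.10) sketch: Z' = T^T Z. T is assumed
        given as a (26, r) array, r < 26, learned beforehand with the
        method of T.-K. Kim et al.; Z is a (26, n) feature matrix."""
        return T.T @ Z                               # shape (r, n)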
the working flow of the system testing process is as follows:
step 2.1: selecting m′ Chinese characters from the m Chinese characters of the training data as test data, where m′ ≤ m and m′ is a positive integer;
step 2.2: the man-machine interaction module displays a closed contour curve;
step 2.3: the subject fixes the head-mounted camera on the head and adjusts its position so that the camera directly captures the lower half of the subject's face; the captured image is sent to the man-machine interaction module for display; the subject then adjusts the position of the head-mounted camera again so that the subject's lip region is contained in the closed contour curve described in step 2.2;
step 2.4: the subject pronounces the m′ Chinese characters of step 2.1 at a rate of 1 Chinese character per second, while the head-mounted camera shoots at n frames per second; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, and the n-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the man-machine interaction module;
step 2.5: the man-machine interaction module sends the closed contour curve of step 2.2 and the Chinese character pronunciation image sequences of step 2.4 to the lip contour positioning module;
step 2.6: the same as the operation of step 1.6 in the system training process;
step 2.7: the same as the operation of step 1.7 in the system training process;
step 2.8: the same as the operation of step 1.8 in the system training process;
step 2.9: the feature matrix construction module constructs the feature matrix of each Chinese character pronunciation image sequence of the test data, denoted Z_e, where e denotes the sequence number of the Chinese character pronunciation image sequence within the test data, 1 ≤ e ≤ m′, and e is a positive integer; the feature matrix Z_e of the Chinese character pronunciation image sequence of the test data is then output to the converted feature matrix acquisition module; the specific operations for constructing the feature matrix of a Chinese character pronunciation image sequence are as follows:
step 2.9.1: for each frame image of the Chinese character pronunciation image sequence in turn, concatenating the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i; v_i is the 26-dimensional column vector v_i = [g_i′, p_i]^T;
step 2.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by combining the joint feature vectors v_i of all frame images of the sequence, so that the feature matrix of a Chinese character pronunciation image sequence of the test data is Z_e = {v_1, v_2, …, v_n} ∈ R^(26×n);
step 2.10: the converted feature matrix acquisition module reads the transformation matrix T from the memory A and uses it to convert the feature matrix Z_e of the Chinese character pronunciation image sequence of the test data, obtaining the converted feature matrix Z_e′ = T^T Z_e, and stores the converted feature matrix Z_e′ of the Chinese character pronunciation image sequence of the test data in the memory B;
step 2.11: the canonical correlation discriminant analysis module reads the converted feature matrices Z_f′ of all training data from the memory A and the converted feature matrix Z_e′ of the Chinese character pronunciation image sequence of the current test data from the memory B; it then computes, with the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations", the sum of canonical correlation coefficients between Z_e′ and the converted feature matrix Z_f′ of each training datum; because the training data may contain repeated Chinese characters, one or more such sums may correspond to the same Chinese character, so the module further computes the mean of the sums corresponding to each Chinese character of the training data, takes the maximum over these means, and outputs the Chinese character of the training data corresponding to this maximum to the man-machine interaction module;
step 2.12: the man-machine interaction module displays the Chinese character delivered by the canonical correlation discriminant analysis module;
through the above steps, automatic classification and recognition of the test data is completed.
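For illustration, the following sketch implements one standard way to obtain the canonical-correlation score of step 2.11: the canonical correlations between two linear subspaces are the singular values of Qa^T Qb, where Qa and Qb are orthonormal bases of the subspaces. The truncation to d leading components per sequence and all names are assumptions of this sketch; the claim itself defers the exact computation to the cited method of T.-K. Kim et al.:

    import numpy as np

    def subspace_basis(Z, d):
        """Orthonormal basis of the d leading left singular directions of
        a converted feature matrix Z' (the truncation to d is an
        assumption of this sketch)."""
        U, _, _ = np.linalg.svd(Z, full_matrices=False)
        return U[:, :d]

    def canonical_correlation_sum(Za, Zb, d=5):
        """Sum of canonical correlations between the subspaces spanned by
        two converted feature matrices: the canonical correlations are
        the singular values of Qa^T Qb for orthonormal bases Qa, Qb."""
        Qa, Qb = subspace_basis(Za, d), subspace_basis(Zb, d)
        return np.linalg.svd(Qa.T @ Qb, compute_uv=False).sum()

    def recognize(Ze, train):
        """Step 2.11 decision rule: `train` is a list of (character, Zf')
        pairs; average the correlation sums per character (the training
        data may repeat characters) and return the character with the
        largest average."""
        scores = {}
        for char, Zf in train:
            scores.setdefault(char, []).append(canonical_correlation_sum(Ze, Zf))
        return max(scores, key=lambda c: float(np.mean(scores[c])))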
CN2010105582532A 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language Expired - Fee Related CN102004549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105582532A CN102004549B (en) 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language

Publications (2)

Publication Number Publication Date
CN102004549A true CN102004549A (en) 2011-04-06
CN102004549B CN102004549B (en) 2012-05-09

Family

ID=43811953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105582532A Expired - Fee Related CN102004549B (en) 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language

Country Status (1)

Country Link
CN (1) CN102004549B (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tae-Kyun Kim et al., "Discriminative learning and recognition of image set classes using canonical correlations", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, June 2007. *
Lü Kun et al., "Lip tracking algorithm based on convolution virtual electrostatic field Snake model" (基于卷积虚拟静电场Snake模型的唇形跟踪算法), 6th Joint Conference on Harmonious Human-Machine Environment (第六届和谐人机环境联合学术会议), October 24, 2010. *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275685B2 (en) 2014-12-22 2019-04-30 Dolby Laboratories Licensing Corporation Projection-based audio object extraction from audio content
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN104808794B (en) * 2015-04-24 2019-12-10 北京旷视科技有限公司 lip language input method and system
CN105787428A (en) * 2016-01-08 2016-07-20 上海交通大学 Method for lip feature-based identity authentication based on sparse coding
CN106250829A (en) * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data
CN107025439B (en) * 2017-03-22 2020-04-24 天津大学 Lip region feature extraction and normalization method based on depth data
CN107122646A (en) * 2017-04-26 2017-09-01 大连理工大学 A kind of method for realizing lip reading unblock
CN107992812A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip reading recognition methods and device
CN108596107A (en) * 2018-04-26 2018-09-28 京东方科技集团股份有限公司 Lip reading recognition methods and its device, AR equipment based on AR equipment
US11527242B2 (en) 2018-04-26 2022-12-13 Beijing Boe Technology Development Co., Ltd. Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
CN110580336A (en) * 2018-06-08 2019-12-17 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
CN109389098A (en) * 2018-11-01 2019-02-26 重庆中科云丛科技有限公司 A kind of verification method and system based on lip reading identification
CN111326152A (en) * 2018-12-17 2020-06-23 南京人工智能高等研究院有限公司 Voice control method and device
CN109682676A (en) * 2018-12-29 2019-04-26 上海工程技术大学 A kind of feature extracting method of the acoustic emission signal of fiber tension failure
WO2021051603A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Coordinate transformation-based lip cutting method and apparatus, device, and storage medium
CN111898420A (en) * 2020-06-17 2020-11-06 北方工业大学 Lip language recognition system
CN112053160A (en) * 2020-09-03 2020-12-08 中国银行股份有限公司 Intelligent bracelet for lip language recognition, lip language recognition system and method
CN112053160B (en) * 2020-09-03 2024-04-23 中国银行股份有限公司 Intelligent bracelet for lip language identification, lip language identification system and method

Also Published As

Publication number Publication date
CN102004549B (en) 2012-05-09

Similar Documents

Publication Publication Date Title
CN102004549A (en) Automatic lip language identification system suitable for Chinese language
CN110866953B (en) Map construction method and device, and positioning method and device
Ko et al. Sign language recognition with recurrent neural network using human keypoint detection
Luettin et al. Speechreading using probabilistic models
Papandreou et al. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition
Schuckers et al. On techniques for angle compensation in nonideal iris recognition
Potamianos et al. Recent advances in the automatic recognition of audiovisual speech
Brown et al. Comparative study of coarse head pose estimation
Youssif et al. Arabic sign language (arsl) recognition system using hmm
Feng et al. Depth-projection-map-based bag of contour fragments for robust hand gesture recognition
Geetha et al. A vision based dynamic gesture recognition of indian sign language on kinect based depth images
Bao et al. Dynamic hand gesture recognition based on SURF tracking
Cappelletta et al. Viseme definitions comparison for visual-only speech recognition
Jiang et al. Improved face and feature finding for audio-visual speech recognition in visually challenging environments
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
Haq et al. Using lip reading recognition to predict daily Mandarin conversation
Lu et al. Review on automatic lip reading techniques
Watanabe et al. Lip reading from multi view facial images using 3D-AAM
Chiţu et al. Comparison between different feature extraction techniques for audio-visual speech recognition
Gao et al. Learning and synthesizing MPEG-4 compatible 3-D face animation from video sequence
Zheng et al. Review of lip-reading recognition
Shiraishi et al. Optical flow based lip reading using non rectangular ROI and head motion reduction
KR101621304B1 (en) Active shape model-based lip shape estimation method and system using mouth map
Reveret et al. Visual coding and tracking of speech related facial motion
Aharon et al. Representation analysis and synthesis of lip images using dimensionality reduction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120509

Termination date: 20171122