CN102004549A - Automatic lip language identification system suitable for Chinese language - Google Patents

Automatic lip language identification system suitable for Chinese language

Info

Publication number
CN102004549A
CN102004549A (application CN201010558253A)
Authority
CN
China
Prior art keywords
lip, Chinese character, module, image sequence, character pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010558253
Other languages
Chinese (zh)
Other versions
CN102004549B (en)
Inventor
吕坤
贾云得
张欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN2010105582532A
Publication of CN102004549A
Application granted
Publication of CN102004549B
Legal status: Expired - Fee Related

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an automatic lip-reading recognition system for Chinese, comprising a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B and a canonical correlation discriminant analysis module. The head-mounted camera records Chinese character pronunciation image sequences and transmits them to the lip contour positioning module through the human-computer interaction module, where lip contours are detected and tracked with a convolutional virtual electric field (VEF) Snake model. The geometric vector acquisition module and the motion vector acquisition module extract geometric and motion features, respectively, from the lip contours, and the two are concatenated to form the input feature matrix of the canonical correlation discriminant analysis module; that module computes the similarity between feature matrices and processes it to obtain the recognition result. Compared with traditional lip-reading recognition systems, the system achieves higher recognition accuracy.

Description

Automatic lip-reading recognition system applicable to Chinese
Technical field
The present invention relates to an automatic lip-reading recognition system, in particular to an automatic lip-reading recognition system applicable to Chinese, and belongs to the technical field of automatic lip-reading recognition.
Background technology
Lip-reading recognition, or speechreading, is a prominent topic in human-computer interaction (HCI) and plays an important role in automatic speech recognition (ASR) systems. Human speech perception is naturally multi-modal. Hearing-impaired people make full use of lip-reading cues, and even normal-hearing listeners use visual information to reinforce their understanding of speech, especially in noisy environments. Exploiting the visual channel can therefore effectively improve the performance and robustness of modern automatic speech recognition systems.
A lip-reading recognition task generally comprises three main steps: 1. detect the face and lip region in the pronunciation image sequence; 2. extract features suitable for classification from the lip region; 3. perform lip-reading recognition with the lip-region features.
For step 1, existing methods mainly locate the face and lip region with image-processing algorithms; such methods are easily affected by illumination, viewing angle, rotation and occlusion, all of which introduce errors.
The lip-reading features of step 2 fall into three broad classes in the literature: (1) low-level texture-based features; (2) high-level contour-based features; (3) combinations of the two. Among these, the contour-based lip geometric features (such as the height, width and angles of the lips) and the lip motion features are considered the most useful visual information. A large body of recent work on lip contour segmentation uses deformable models, among which the Snake model and its improved variants are particularly effective, e.g. the gradient vector flow (GVF) Snake model, the virtual electric field (VEF) Snake model, and the convolutional virtual electric field (convolutional VEF) Snake model. Comparatively, the convolutional VEF Snake model uses a virtual electric field as its external force and computes it by convolution, which lets it lock onto the lip contour more quickly and accurately.
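To make the convolution-based external force concrete, here is a minimal sketch of a VEF-style force field for a grayscale lip image; the kernel shape, the smoothing parameters and the function name are illustrative assumptions, not the implementation of the cited model:

```python
import numpy as np
from scipy import ndimage

def vef_external_force(image, gamma=2.0, ksize=61):
    """Sketch: snake external force obtained by convolving an edge map with
    a virtual-electric-field kernel (every edge pixel acts as a point charge
    that attracts the contour)."""
    smoothed = ndimage.gaussian_filter(image.astype(float), sigma=2.0)
    gy, gx = np.gradient(smoothed)
    edge = np.hypot(gx, gy)                      # edge map ("charge" density)

    # Kernel: unit vectors pointing at the kernel origin, weighted by 1/r^gamma.
    r = np.arange(ksize) - ksize // 2
    X, Y = np.meshgrid(r, r)
    dist = np.hypot(X, Y)
    dist[ksize // 2, ksize // 2] = 1.0           # avoid divide-by-zero at origin
    mag = dist ** (-gamma)
    kx, ky = -X / dist * mag, -Y / dist * mag    # field points toward the charge

    fx = ndimage.convolve(edge, kx, mode='nearest')
    fy = ndimage.convolve(edge, ky, mode='nearest')
    norm = np.hypot(fx, fy) + 1e-12
    return fx / norm, fy / norm                  # normalised force field
```

A snake iteration then moves each contour point under this field together with the usual internal tension and bending forces; because the field is a single convolution of the edge map, it can be precomputed once per frame, which is what gives the convolutional VEF Snake its speed.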
In step 3, performing lip-reading recognition with the lip-region features, the most widely used classifier is the hidden Markov model (HMM). HMMs are useful in speech recognition because they naturally model the temporal characteristics of speech. However, given the essential properties of speech, the piece-wise stationarity and independence assumptions of the HMM are two limitations of this model.
An important prior art used in the present invention is the lip tracking algorithm based on the convolutional VEF Snake model.
Lv Kun et al. disclose the detailed design of this algorithm in "Lip tracking algorithm based on the convolutional virtual electric field Snake model" (6th Joint Conference on Harmonious Human-Machine Environment, 2010).
The other important prior art used in the present invention is the discriminative analysis of canonical correlations (DCC) method.
T.-K. Kim et al. disclose the procedure of the DCC method in "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" (IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007). The method introduces a transformation matrix T that maximizes the similarity of within-class sets (expressed as sums of canonical correlation coefficients) while minimizing the similarity of between-class sets, thereby achieving a better recognition effect.
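The core primitive of the method is the similarity between two image sets, measured as the sum of canonical correlations (cosines of principal angles) between the subspaces the sets span. A minimal sketch, assuming each set is a feature matrix whose columns are per-frame feature vectors; the subspace dimension d is an illustrative parameter:

```python
import numpy as np

def canonical_correlation_sum(Z1, Z2, d=10):
    """Sketch: sum of canonical correlations between the d-dimensional
    dominant subspaces of two sets (columns of Z1, Z2 are frame features)."""
    def basis(Z):
        # Orthonormal basis of the column space, truncated to d directions.
        U, _, _ = np.linalg.svd(Z, full_matrices=False)
        return U[:, :d]
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    sigma = np.linalg.svd(basis(Z1).T @ basis(Z2), compute_uv=False)
    return float(np.sum(sigma))
```

DCC then learns the transformation T so that this similarity grows for set pairs of the same class and shrinks for pairs of different classes.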
In recent years the DCC method has been successfully applied to image-set matching, face recognition and object recognition, so applying it to the lip-reading recognition problem is, in theory, a simple but effective approach. To date, however, no literature or practical application has been found that uses the DCC method for automatic lip-reading recognition.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by proposing an automatic lip-reading recognition system applicable to Chinese.
The object of the invention is achieved through the following technical solution.
An automatic lip-reading recognition system applicable to Chinese comprises: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B, and a canonical correlation discriminant analysis module.
The connection relations are as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the inputs of the canonical correlation discriminant analysis module; the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module.
The main functions of the modules and equipment are:
The head-mounted camera acquires the Chinese character pronunciation image sequences produced by the subject.
The human-computer interaction module: 1. provides a closed contour curve so that the subject can adjust the position of the head-mounted camera until the subject's lip region captured by the camera falls inside this closed contour curve; 2. receives the Chinese character pronunciation image sequences captured by the head-mounted camera; 3. outputs the result of the canonical correlation discriminant analysis module.
The lip contour positioning module applies the lip tracking algorithm proposed by Lv Kun et al. in "Lip tracking algorithm based on the convolutional virtual electric field Snake model" to the lip contour in each frame of a Chinese character pronunciation image sequence in turn, obtains the lip contour curve, and outputs it to the geometric vector acquisition module.
The geometric vector acquisition module extracts a lip geometric feature vector from the lip contour curve of each frame output by the lip contour positioning module; to compensate for lip differences and image scaling differences between subjects, it normalizes the lip geometric feature vector, obtains the normalized lip geometric feature vector, and outputs it to the motion vector acquisition module and the feature matrix construction module.
The motion vector acquisition module constructs the lip motion feature vector of each frame from the normalized lip geometric feature vectors and outputs it to the feature matrix construction module.
The feature matrix construction module constructs the feature matrix of a Chinese character pronunciation image sequence and outputs it to the transformation matrix T acquisition module and the transformed feature matrix acquisition module.
The transformation matrix T acquisition module processes the feature matrices of the training Chinese character pronunciation image sequences with the discriminative analysis of canonical correlations (DCC) method proposed by T.-K. Kim et al. in "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" (IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007), obtains the transformation matrix T, and stores it in memory A.
The transformed feature matrix acquisition module uses the transformation matrix T to transform the feature matrix of each training Chinese character pronunciation image sequence in turn, obtains the transformed feature matrices, and stores the transformed feature matrices of the training data in memory A.
Memory A stores the transformation matrix T and the transformed feature matrices of the training Chinese character pronunciation image sequences.
Memory B stores the transformed feature matrices of the test Chinese character pronunciation image sequences.
The canonical correlation discriminant analysis module obtains the sums of canonical correlation coefficients between the transformed feature matrix of the current test datum in memory B and the transformed feature matrix of each training datum in memory A, further processes these sums to obtain the recognition result for the current test datum, and outputs this result to the human-computer interaction module.
The operation of the automatic lip-reading recognition system is divided into a system training process and a system testing process.
The workflow of the system training process is:
Step 1.1: choose m Chinese characters as training data, where m ≥ 5 and m is a positive integer;
Step 1.2: the human-computer interaction module displays a closed contour curve.
Step 1.3: the subject fixes the head-mounted camera on the head and adjusts its position so that it directly captures the lower half of the face, the captured images being sent to the human-computer interaction module for display; the subject then adjusts the camera position again until the lip region is contained within the closed contour curve described in step 1.2.
Step 1.4: the subject pronounces the m Chinese characters of step 1.1 at a rate of one character per second while the head-mounted camera captures n frames per second, where n ≥ 25 and n is a positive integer; the video stream of each character pronunciation therefore consists of an n-frame image sequence, called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured sequences to the human-computer interaction module.
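As a concrete illustration of this capture step, the following sketch slices the camera stream into n-frame per-character sequences; it assumes an OpenCV-visible head-mounted camera, and the device index and function name are placeholders:

```python
import cv2

def record_character_sequences(num_chars, n=30, device=0):
    """Sketch: grab n consecutive frames per spoken character, assuming the
    speaker pronounces one character per second and the camera runs at n fps."""
    cap = cv2.VideoCapture(device)
    sequences = []
    for _ in range(num_chars):
        frames = []
        while len(frames) < n:
            ok, frame = cap.read()
            if not ok:
                raise IOError("camera read failed")
            frames.append(frame)
        sequences.append(frames)  # one Chinese character pronunciation image sequence
    cap.release()
    return sequences
```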
Step 1.5: the human-computer interaction module sends the closed contour curve of step 1.2 and the Chinese character pronunciation image sequences captured in step 1.4 to the lip contour positioning module.
Step 1.6: the lip contour positioning module applies the lip tracking algorithm of Lv Kun et al. ("Lip tracking algorithm based on the convolutional virtual electric field Snake model") to the lip contour in each frame of the Chinese character pronunciation image sequence in turn, obtains the lip contour curve, and outputs it to the geometric vector acquisition module. For the first image of each sequence, the initial curve of the convolutional VEF Snake model is the closed contour curve provided by the human-computer interaction module; for every other image of the sequence, the initial curve is the lip positioning result of the preceding image.
Step 1.7: the geometric vector acquisition module extracts the lip geometric feature vector, denoted g_i, from the lip contour curve of each frame of the Chinese character pronunciation image sequence in turn, where i is the index of the frame within the sequence, 1 ≤ i ≤ n and i is a positive integer; to compensate for lip differences and image scaling differences between subjects, it normalizes g_i to obtain the normalized lip geometric feature vector, denoted g_i', and outputs the normalized vectors to the motion vector acquisition module and the feature matrix construction module. The concrete steps for obtaining the normalized lip geometric feature vector are:
Step 1.7.1: compute the horizontal extrema of the lip contour curve to obtain the coordinates of the left and right mouth corners.
Step 1.7.2: connect the two mouth corners with a straight line and take the midpoint of the corner-to-corner segment as a centre, called point O; rotate the line clockwise about O five times, 30 degrees each time; each rotation yields two segments where the line intersects the lip curve, giving 12 segments in total, whose lengths are denoted L_1 to L_12 in clockwise order starting from the left mouth corner; these lengths L_1 to L_12 are called the radiation vectors. When the line through the two mouth corners has been rotated by 90 degrees, its upper and lower intersections with the lip curve are called point A and point B respectively.
Step 1.7.3: choose one of the two mouth corners, called point Q, and connect Q to A and Q to B with straight lines; ∠AQO is denoted θ_1 and ∠BQO is denoted θ_2; from L_1 to L_12 the angles θ_1 and θ_2, and hence cos θ_1 and cos θ_2, can be obtained.
Step 1.7.4: L_1 to L_12 together with cos θ_1 and cos θ_2 constitute the lip geometric feature vector of a frame; since L_1 and L_7 are each half of the corner-to-corner segment, their values are equal, so L_7 is removed from the vector, giving the per-frame lip geometric feature vector g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2]^T.
Step 1.7.5: to compensate for lip differences and image scaling differences between subjects, g_i is normalized to obtain the normalized lip geometric feature vector g_i', a 13-dimensional row vector g_i' = [L_1', …, L_6', L_8', …, L_12', cos θ_1, cos θ_2], where
L_j' = L_j / d, j = 1, 2, …, 6, 8, …, 12,
and d is the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence.
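A minimal sketch of steps 1.7.1-1.7.4, assuming the contour arrives as an (N, 2) array of (x, y) points; approximating each ray/curve intersection by the contour point nearest the ray direction, and taking Q as the left corner, are illustrative simplifications:

```python
import numpy as np

def lip_geometric_vector(contour):
    """Sketch: 13-D geometric feature from a closed lip contour (N x 2 array)."""
    left = contour[np.argmin(contour[:, 0])]     # left mouth corner
    right = contour[np.argmax(contour[:, 0])]    # right mouth corner
    O = (left + right) / 2.0                     # midpoint of the corner segment

    rel = contour - O
    ang = np.arctan2(rel[:, 1], rel[:, 0])       # polar angle of each point
    dist = np.hypot(rel[:, 0], rel[:, 1])
    base = np.arctan2(left[1] - O[1], left[0] - O[0])

    L, pts = [], []
    for k in range(12):                          # rays every 30 degrees
        target = base + k * np.pi / 6.0          # clockwise when y points down
        diff = np.angle(np.exp(1j * (ang - target)))
        idx = np.argmin(np.abs(diff))            # contour point nearest the ray
        L.append(dist[idx])
        pts.append(contour[idx])

    A, B = pts[3], pts[9]                        # intersections of the 90-degree line
    Q = left
    def cos_at_Q(P):                             # cosine of the angle P-Q-O
        u, v = P - Q, O - Q
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # Drop L_7 (equal to L_1 by construction) and append the two cosines.
    return np.array(L[:6] + L[7:] + [cos_at_Q(A), cos_at_Q(B)])
```

Normalization per step 1.7.5 then divides the eleven retained lengths by the corner-to-corner distance measured in the first frame of the sequence.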
Step 1.8: the motion vector acquisition module constructs the lip motion feature vector of each frame, denoted p_i, from the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i' − g_{i−1}')/Δt, where g_0' = g_1' and Δt is the time interval between two successive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module.
Step 1.9: the feature matrix construction module constructs the feature matrix of each training Chinese character pronunciation image sequence, denoted Z_f, where f is the index of the training sequence, 1 ≤ f ≤ m and f is a positive integer, and then outputs Z_f to the transformation matrix T acquisition module and the transformed feature matrix acquisition module. The concrete steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 1.9.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i = [g_i', p_i]^T, a 26-dimensional column vector.
Step 1.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors of all its frames, so the feature matrix of a training sequence is Z_f = [v_1, v_2, …, v_n] ∈ R^{26×n}.
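A minimal sketch of steps 1.8-1.9 under the definitions above, assuming the n normalized geometric vectors of one sequence are stacked as an (n, 13) array:

```python
import numpy as np

def sequence_feature_matrix(g_norm, dt):
    """Sketch: 26 x n feature matrix of one pronunciation sequence from its
    normalized geometric vectors g_norm (n x 13) and frame interval dt."""
    g = np.asarray(g_norm, dtype=float)
    prev = np.vstack([g[:1], g[:-1]])            # g_0' := g_1', so p_1 = 0
    p = (g - prev) / dt                          # lip motion features
    v = np.hstack([g, p])                        # per-frame joint vectors, 26-D
    return v.T                                   # Z in R^{26 x n}
```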
Step 1.10: the transformation matrix T acquisition module processes the feature matrices Z_f of the m training Chinese character pronunciation image sequences with the DCC method of T.-K. Kim et al. and obtains the transformation matrix T ∈ R^{26×r}, where r < 26, r is a positive integer and R denotes the real numbers, and stores T in memory A.
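The full iterative update of T is given in the cited paper; as a plainly simplified stand-in (not the DCC algorithm itself), one can obtain a discriminative 26 × r projection with Fisher LDA on the pooled per-frame vectors:

```python
import numpy as np

def lda_transform(frames_by_class, r):
    """Simplified stand-in for learning T: Fisher LDA on the pooled 26-D frame
    vectors; frames_by_class maps a character to its list of 26 x n matrices."""
    X = np.hstack([np.hstack(mats) for mats in frames_by_class.values()])
    mu = X.mean(axis=1, keepdims=True)
    d = X.shape[0]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for mats in frames_by_class.values():
        Xc = np.hstack(mats)
        muc = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - muc) @ (Xc - muc).T                 # within-class scatter
        Sb += Xc.shape[1] * (muc - mu) @ (muc - mu).T   # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:r]]                     # columns span the projection
```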
Step 1.11: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, uses it to transform the feature matrix Z_f of each training Chinese character pronunciation image sequence in turn, obtaining the transformed feature matrix Z_f' = T^T Z_f, and stores the transformed feature matrices of the training data in memory A.
With these steps, the training of the automatic lip-reading recognition system is complete.
The workflow of the system testing process is:
Step 2.1: choose m' Chinese characters from the m training characters as test data, where m' ≤ m and m' is a positive integer.
Step 2.2: the human-computer interaction module displays a closed contour curve.
Step 2.3: the subject fixes the head-mounted camera on the head and adjusts its position so that it directly captures the lower half of the face, the captured images being sent to the human-computer interaction module for display; the subject then adjusts the camera position again until the lip region is contained within the closed contour curve described in step 2.2.
Step 2.4: the subject pronounces the m' Chinese characters of step 2.1 at a rate of one character per second while the head-mounted camera captures n frames per second; the video stream of each character pronunciation therefore consists of an n-frame image sequence, called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured sequences to the human-computer interaction module.
Step 2.5: the human-computer interaction module sends the closed contour curve of step 2.2 and the Chinese character pronunciation image sequences of step 2.4 to the lip contour positioning module.
Step 2.6: identical to the operation of step 1.6 in the system training process.
Step 2.7: identical to the operation of step 1.7 in the system training process.
Step 2.8: identical to the operation of step 1.8 in the system training process.
Step 2.9: the feature matrix construction module constructs the feature matrix of each test Chinese character pronunciation image sequence, denoted Z_e, where e is the index of the test sequence, 1 ≤ e ≤ m' and e is a positive integer, and then outputs Z_e to the transformed feature matrix acquisition module. The concrete steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 2.9.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i = [g_i', p_i]^T, a 26-dimensional column vector.
Step 2.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors of all its frames, so the feature matrix of a test sequence is Z_e = [v_1, v_2, …, v_n] ∈ R^{26×n}.
Step 2.10: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, uses it to transform the feature matrix Z_e of the test Chinese character pronunciation image sequence, obtaining the transformed feature matrix Z_e' = T^T Z_e, and stores the transformed feature matrices of the test data in memory B.
Step 2.11: the canonical correlation discriminant analysis module reads the transformed feature matrices Z_f' of all training data from memory A and the transformed feature matrix Z_e' of the current test Chinese character pronunciation image sequence from memory B, then uses the DCC method of T.-K. Kim et al. ("Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007) to compute the sum of canonical correlation coefficients between Z_e' and the transformed feature matrix Z_f' of each training datum. Since a Chinese character may occur more than once in the training data, there may be one or more such sums per character; the module therefore computes, for each character in the training data, the mean of its canonical correlation coefficient sums, takes the maximum over these means, and outputs the corresponding training character to the human-computer interaction module.
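A minimal sketch of this decision rule, reusing the canonical_correlation_sum similarity sketched earlier; the dictionary layout of the training data is an illustrative assumption:

```python
import numpy as np

def classify(Z_test_t, train_sets, d=10):
    """Sketch: Z_test_t is the transformed test matrix T^T Z_e; train_sets maps
    each character to its list of transformed training matrices T^T Z_f."""
    scores = {char: np.mean([canonical_correlation_sum(Z_test_t, Zf, d)
                             for Zf in mats])
              for char, mats in train_sets.items()}
    return max(scores, key=scores.get)           # character with the highest mean
```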
Step 2.12: the human-computer interaction module displays the Chinese character delivered by the canonical correlation discriminant analysis module.
With these steps, the automatic recognition of the test data is complete.
Beneficial effects
Compared with existing automatic lip-reading recognition systems for Chinese, the present invention has the following advantages:
1. The present invention uses a head-mounted camera to acquire the lip image sequences directly; the subject adjusts the camera position interactively at the start of each session, after which the relative position of camera and face stays fixed, so the subject can pronounce Chinese characters naturally without deliberately holding a head pose or position. Compared with previous methods, the present system acquires the lip image sequence very accurately, greatly reduces the early-stage computation, and relaxes the constraints on the subject, making the procedure more natural.
2. The present invention locates the lip contour with the convolutional VEF Snake model, which is faster and more accurate.
3. The lip-reading features extracted by the present invention combine lip geometric features and lip motion features, making the analysis more accurate.
4. The present invention applies the DCC method to automatic lip-reading recognition for the first time, overcoming the limitations of hidden Markov models in speech recognition.
Description of drawings
Fig. 1 is a structural diagram of the automatic lip-reading recognition system applicable to Chinese in the specific embodiment of the invention.
Embodiment
The present invention is described in detail below with reference to the drawings and a specific embodiment.
An automatic lip-reading recognition system applicable to Chinese, whose architecture, as shown in Fig. 1, comprises: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B, and a canonical correlation discriminant analysis module.
The connection relations are as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the inputs of the canonical correlation discriminant analysis module; the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module.
The main functions of the modules and equipment are:
The head-mounted camera acquires the Chinese character pronunciation image sequences produced by the subject.
The human-computer interaction module: 1. provides a closed contour curve so that the subject can adjust the position of the head-mounted camera until the subject's lip region captured by the camera falls inside this closed contour curve; 2. receives the Chinese character pronunciation image sequences captured by the head-mounted camera; 3. outputs the result of the canonical correlation discriminant analysis module.
The lip contour positioning module applies the lip tracking algorithm proposed by Lv Kun et al. in "Lip tracking algorithm based on the convolutional virtual electric field Snake model" to the lip contour in each frame of a Chinese character pronunciation image sequence in turn, obtains the lip contour curve, and outputs it to the geometric vector acquisition module.
The geometric vector acquisition module extracts a lip geometric feature vector from the lip contour curve of each frame output by the lip contour positioning module; to compensate for lip differences and image scaling differences between subjects, it normalizes the lip geometric feature vector, obtains the normalized lip geometric feature vector, and outputs it to the motion vector acquisition module and the feature matrix construction module.
The motion vector acquisition module constructs the lip motion feature vector of each frame from the normalized lip geometric feature vectors and outputs it to the feature matrix construction module.
The feature matrix construction module constructs the feature matrix of a Chinese character pronunciation image sequence and outputs it to the transformation matrix T acquisition module and the transformed feature matrix acquisition module.
The transformation matrix T acquisition module processes the feature matrices of the training Chinese character pronunciation image sequences with the DCC method proposed by T.-K. Kim et al. in "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" (IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007), obtains the transformation matrix T, and stores it in memory A.
The transformed feature matrix acquisition module uses the transformation matrix T to transform the feature matrix of each training Chinese character pronunciation image sequence in turn, obtains the transformed feature matrices, and stores the transformed feature matrices of the training data in memory A.
Memory A stores the transformation matrix T and the transformed feature matrices of the training Chinese character pronunciation image sequences.
Memory B stores the transformed feature matrices of the test Chinese character pronunciation image sequences.
The canonical correlation discriminant analysis module obtains the sums of canonical correlation coefficients between the transformed feature matrix of the current test datum in memory B and the transformed feature matrix of each training datum in memory A, further processes these sums to obtain the recognition result for the current test datum, and outputs this result to the human-computer interaction module.
An experiment was carried out with the above system. Ten subjects (4 male, 6 female) were selected, and each pronounced the 10 Chinese characters "zero, one, two, three, four, five, I, love, north, capital" 20 times, giving 200 pronunciation image sequences per character. For each character, 80% (160) of its 200 sequences were randomly chosen as training data and the remaining 20% (40) as test data, so there were 1600 training sequences and 400 test sequences in total.
The 2000 Chinese character pronunciation image sequences were obtained as follows:
Step 1: the human-computer interaction module displays a closed contour curve.
Step 2: each of the 10 subjects in turn fixes the head-mounted camera on the head and adjusts its position so that it directly captures the lower half of the face, the captured images being sent to the human-computer interaction module for display; the subject then adjusts the camera position again until the lip region is contained within the closed contour curve described in step 1.
Step 3: the subject pronounces the 10 Chinese characters "zero, one, two, three, four, five, I, love, north, capital" at a rate of one character per second, 20 times per character, while the head-mounted camera captures 30 frames per second, so the video stream of each character pronunciation consists of a 30-frame image sequence; the 30-frame image sequence of one character is called a Chinese character pronunciation image sequence.
With these steps, the 2000 pronunciation image sequences of the 10 Chinese characters are obtained.
The experimenter then trained the system with the 1600 randomly chosen training sequences, as follows:
Step 1: the closed contour curve shown by the human-computer interaction module and the 1600 Chinese character pronunciation image sequences are sent to the lip contour positioning module.
Step 2: the lip contour positioning module applies the lip tracking algorithm of Lv Kun et al. ("Lip tracking algorithm based on the convolutional virtual electric field Snake model") to the lip contour in each frame of the Chinese character pronunciation image sequence in turn, obtains the lip contour curve, and outputs it to the geometric vector acquisition module. For the first image of each sequence, the initial curve of the convolutional VEF Snake model is the closed contour curve provided by the human-computer interaction module; for every other image of the sequence, the initial curve is the lip positioning result of the preceding image.
Step 3: the geometric vector acquisition module extracts the lip geometric feature vectors g_1, …, g_30 from the lip contour curves of the 30 frames of each Chinese character pronunciation image sequence in turn; to compensate for lip differences and image scaling differences between subjects, it normalizes g_1, …, g_30 to obtain the normalized lip geometric feature vectors g_1', …, g_30', which are then output to the motion vector acquisition module and the feature matrix construction module. The concrete steps for obtaining the normalized lip geometric feature vectors are:
Step 3.1: compute the horizontal extrema of the lip contour curve to obtain the coordinates of the left and right mouth corners.
Step 3.2: connect the two mouth corners with a straight line and take the midpoint of the corner-to-corner segment as a centre, called point O; rotate the line clockwise about O five times, 30 degrees each time; each rotation yields two segments where the line intersects the lip curve, giving 12 segments in total, whose lengths are denoted L_1 to L_12 in clockwise order starting from the left mouth corner; these lengths are called the radiation vectors. When the line through the two mouth corners has been rotated by 90 degrees, its upper and lower intersections with the lip curve are called point A and point B respectively.
Step 3.3: the left mouth corner is called point Q; connect Q to A and Q to B with straight lines; ∠AQO is denoted θ_1 and ∠BQO is denoted θ_2; from L_1 to L_12 the angles θ_1 and θ_2, and hence cos θ_1 and cos θ_2, can be obtained.
Step 3.4: L_1 to L_12 together with cos θ_1 and cos θ_2 constitute the lip geometric feature vector of a frame; since L_1 and L_7 are each half of the corner-to-corner segment, their values are equal, so L_7 is removed, giving g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2]^T, i = 1, 2, …, 30.
Step 3.5: to compensate for lip differences and image scaling differences between subjects, g_i is normalized to obtain the normalized lip geometric feature vector g_i', a 13-dimensional row vector g_i' = [L_1', …, L_6', L_8', …, L_12', cos θ_1, cos θ_2], where
L_j' = L_j / d, j = 1, 2, …, 6, 8, …, 12,
and d is the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence.
Step 4: the motion vector acquisition module constructs the lip motion feature vector p_i of each frame from the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i' − g_{i−1}')/Δt, where g_0' = g_1' and Δt is the time interval between two successive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module.
Step 5: the feature matrix construction module constructs the feature matrix Z_f of each training Chinese character pronunciation image sequence, f = 1, 2, …, 1600, and then outputs Z_f to the transformation matrix T acquisition module and the transformed feature matrix acquisition module. The concrete steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 5.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i = [g_i', p_i]^T, a 26-dimensional column vector.
Step 5.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors of all its frames, so the feature matrix of a training sequence is Z_f = [v_1, v_2, …, v_30] ∈ R^{26×30}.
Step 6: the transformation matrix T acquisition module processes the feature matrices Z_f of the 1600 training Chinese character pronunciation image sequences with the DCC method of T.-K. Kim et al., obtains the transformation matrix T, and stores T in memory A.
Step 7: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, transforms the feature matrix Z_f of each training sequence in turn, obtaining the transformed feature matrix Z_f' = T^T Z_f, and stores the transformed feature matrices of the training data in memory A.
With these steps, the training of the automatic lip-reading recognition system is complete.
After the system was trained, the experimenter tested it with the 400 test sequences, as follows:
Step 1: the closed contour curve shown by the human-computer interaction module and the 400 Chinese character pronunciation image sequences are sent to the lip contour positioning module.
Step 2: the lip contour positioning module applies the lip tracking algorithm of Lv Kun et al. ("Lip tracking algorithm based on the convolutional virtual electric field Snake model") to the lip contour in each frame of the Chinese character pronunciation image sequence in turn, obtains the lip contour curve, and outputs it to the geometric vector acquisition module. For the first image of each sequence, the initial curve of the convolutional VEF Snake model is the closed contour curve provided by the human-computer interaction module; for every other image of the sequence, the initial curve is the lip positioning result of the preceding image.
Step 3: the geometric vector acquisition module extracts the lip geometric feature vectors g_1, …, g_30 from the lip contour curves of the 30 frames of each Chinese character pronunciation image sequence in turn; to compensate for lip differences and image scaling differences between subjects, it normalizes g_1, …, g_30 to obtain the normalized lip geometric feature vectors g_1', …, g_30', which are then output to the motion vector acquisition module and the feature matrix construction module. The concrete steps for obtaining the normalized lip geometric feature vectors are:
Step 3.1: compute the horizontal extrema of the lip contour curve to obtain the coordinates of the left and right mouth corners.
Step 3.2: connect the two mouth corners with a straight line and take the midpoint of the corner-to-corner segment as a centre, called point O; rotate the line clockwise about O five times, 30 degrees each time; each rotation yields two segments where the line intersects the lip curve, giving 12 segments in total, whose lengths are denoted L_1 to L_12 in clockwise order starting from the left mouth corner; these lengths are called the radiation vectors. When the line through the two mouth corners has been rotated by 90 degrees, its upper and lower intersections with the lip curve are called point A and point B respectively.
Step 3.3: the left mouth corner is called point Q; connect Q to A and Q to B with straight lines; ∠AQO is denoted θ_1 and ∠BQO is denoted θ_2; from L_1 to L_12 the angles θ_1 and θ_2, and hence cos θ_1 and cos θ_2, can be obtained.
Step 3.4: L_1 to L_12 together with cos θ_1 and cos θ_2 constitute the lip geometric feature vector of a frame; since L_1 and L_7 are each half of the corner-to-corner segment, their values are equal, so L_7 is removed, giving g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2]^T, i = 1, 2, …, 30.
Step 3.5: to compensate for lip differences and image scaling differences between subjects, g_i is normalized to obtain the normalized lip geometric feature vector g_i', a 13-dimensional row vector g_i' = [L_1', …, L_6', L_8', …, L_12', cos θ_1, cos θ_2], where
L_j' = L_j / d, j = 1, 2, …, 6, 8, …, 12,
and d is the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence.
Step 4: the motion vector acquisition module constructs the lip motion feature vector p_i of each frame from the normalized lip geometric feature vectors; p_i is a 13-dimensional row vector, p_i = (g_i' − g_{i−1}')/Δt, where g_0' = g_1' and Δt is the time interval between two successive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module.
Step 5: the feature matrix construction module constructs the feature matrix Z_e of each test Chinese character pronunciation image sequence, e = 1, 2, …, 400, and then outputs Z_e to the transformed feature matrix acquisition module. The concrete steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 5.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector into a joint feature vector v_i = [g_i', p_i]^T, a 26-dimensional column vector.
Step 5.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors of all its frames, so the feature matrix of a test sequence is Z_e = [v_1, v_2, …, v_30] ∈ R^{26×30}.
Step 6: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, transforms the feature matrix Z_e of each test sequence, obtaining the transformed feature matrix Z_e' = T^T Z_e, and stores the transformed feature matrices of the test data in memory B.
Step 7: the canonical correlation discriminant analysis module reads the transformed feature matrices Z_f' of all training data from memory A and the transformed feature matrix Z_e' of the current test sequence from memory B, then uses the DCC method of T.-K. Kim et al. to compute the sum of canonical correlation coefficients between Z_e' and each Z_f'; since a character may occur more than once in the training data, there may be one or more such sums per character, so the module computes, for each character in the training data, the mean of its canonical correlation coefficient sums, takes the maximum over these means, and outputs the corresponding training character to the human-computer interaction module.
Step 8: the human-computer interaction module displays the Chinese character delivered by the canonical correlation discriminant analysis module.
With these steps, the automatic recognition of the test data is complete; the recognition accuracy of the system is shown in column (1) of Table 1. To illustrate the effect of the present invention, two further experiments were also carried out:
1. With the same experimental environment, training data and test data, the convolutional VEF Snake model used in the present invention was replaced by the traditional Snake model, with all other functions unchanged; the resulting recognition accuracy is shown in column (2) of Table 1.
2. With the same experimental environment, training data and test data, the canonical correlation discriminant analysis method used in the present invention was replaced by the continuous hidden Markov model (CHMM), with all other functions unchanged; the resulting recognition accuracy is shown in column (3) of Table 1.
Table 1. Recognition accuracy (%) of the different methods

Chinese character    (1) Present system    (2) Traditional Snake    (3) CHMM
"zero"                     90.0                  73.5                 88.5
"one"                      92.0                  75.0                 90.5
"two"                      86.5                  76.0                 83.0
"three"                    93.0                  81.5                 92.5
"four"                     95.0                  83.0                 95.5
"five"                     89.5                  73.0                 91.0
"I"                        96.0                  82.0                 95.0
"love"                     97.0                  82.5                 95.5
"north"                    93.5                  81.5                 94.0
"capital"                  90.0                  75.5                 88.0
The experiments show that the system proposed by the present invention achieves higher recognition accuracy.
The above is only a preferred embodiment of the present invention. It should be understood that those skilled in the art can make improvements, or replace some of the technical features with equivalents, without departing from the principle of the invention; such improvements and replacements should also be regarded as falling within the protection scope of the invention.

Claims (1)

1. An automatic lip-reading recognition system applicable to Chinese, comprising: a head-mounted camera, a human-computer interaction module, a lip contour positioning module, a geometric vector acquisition module, a motion vector acquisition module, a feature matrix construction module, a transformation matrix T acquisition module, a transformed feature matrix acquisition module, a memory A, a memory B, and a canonical correlation discriminant analysis module;
The connection relations being as follows: the output of the head-mounted camera is connected to the input of the human-computer interaction module; the output of the human-computer interaction module is connected to the input of the lip contour positioning module; the output of the lip contour positioning module is connected to the input of the geometric vector acquisition module; the output of the geometric vector acquisition module is connected to the inputs of the motion vector acquisition module and the feature matrix construction module; the output of the motion vector acquisition module is connected to the input of the feature matrix construction module; the output of the feature matrix construction module is connected to the inputs of the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the transformation matrix T acquisition module is connected to memory A; the transformed feature matrix acquisition module is connected to memory A and memory B; memory A and memory B are also connected to the inputs of the canonical correlation discriminant analysis module; the output of the canonical correlation discriminant analysis module is connected to the input of the human-computer interaction module;
The major function of each module and equipment is:
The major function of head mounted image-sensing head is: obtain the Chinese character pronunciation image sequence that the subject sends;
The major function of human-computer interaction module is: 1. a closed contour curve is provided, and the position for the subject adjusts the head mounted image-sensing head makes the subject's that the head mounted image-sensing head obtains lip-region be comprised in this closed contour curve; 2. obtain the Chinese character pronunciation image sequence that the head mounted image-sensing head is taken; 3. the result to canonical correlation discriminatory analysis module exports;
Lip locations of contours module functions is: the lip track algorithm that people such as use Lv Kun propose in document " the lip track algorithm of virtual electrostatic field Snake model based on convolution " positions the lip profile on the every two field picture in the Chinese character pronunciation image sequence successively, obtain the lip contour curve, and export it to geometric vector acquisition module;
The main function of the geometric vector acquisition module is: extract the lip geometric feature vector from the lip contour curve of each frame of the Chinese character pronunciation image sequence output by the lip contour positioning module; to compensate for lip-shape differences between subjects and for image scaling differences, normalize the lip geometric feature vector, obtain the normalized lip geometric feature vector, and output it to the motion vector acquisition module and the feature matrix construction module;
The main function of the motion vector acquisition module is: construct the lip motion feature vector of each frame from the normalized lip geometric feature vectors, and then output the lip motion feature vectors to the feature matrix construction module;
The main function of the feature matrix construction module is: construct the feature matrix of each Chinese character pronunciation image sequence and output it to the transformation matrix T acquisition module and the transformed feature matrix acquisition module;
The main function of the transformation matrix T acquisition module is: apply the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" to the feature matrices of the training-data Chinese character pronunciation image sequences, obtain the transformation matrix T, and store it in memory A;
The main function of the transformed feature matrix acquisition module is: use the transformation matrix T to transform the feature matrix of each training-data Chinese character pronunciation image sequence in turn, obtain the transformed feature matrices, and store the transformed feature matrices of the training data in memory A;
Memory A: stores the transformation matrix T and the transformed feature matrices of the training-data Chinese character pronunciation image sequences;
Memory B: stores the transformed feature matrices of the test-data Chinese character pronunciation image sequences;
Canonical correlation discriminant analysis module: obtains from memory B the transformed feature matrix of the current test data, computes the sum of canonical correlation coefficients between it and the transformed feature matrix of each training datum in memory A, further processes these sums to obtain the recognition result of the current test data, and outputs this recognition result to the human-computer interaction module;
The operation of the automatic lip reading recognition system is divided into a system training process and a system testing process:
The workflow of the system training process is:
Step 1.1: choose m Chinese characters as training data, where m ≥ 5 and m is a positive integer;
Step 1.2: the human-computer interaction module displays a closed contour curve;
Step 1.3: the subject fixes the head-mounted camera on the head and adjusts its position so that it directly captures the lower half of the face, the captured image being sent to the human-computer interaction module for display; the subject then adjusts the camera position again so that the lip region is contained within the closed contour curve described in step 1.2;
Step 1.4: the subject pronounces the m Chinese characters described in step 1.1 at a rate of 1 Chinese character per second, while the head-mounted camera captures at n frames per second, where n ≥ 25 and n is a positive integer; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, and the n-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the human-computer interaction module;
Step 1.5: the human-computer interaction module sends the closed contour curve described in step 1.2 and the Chinese character pronunciation image sequences captured by the head-mounted camera in step 1.4 to the lip contour positioning module;
Step 1.6: the lip contour positioning module uses the lip tracking algorithm proposed by Lv Kun et al. in the document "Lip tracking algorithm based on the convolutional virtual electrostatic field Snake model" to locate the lip contour in each frame of the Chinese character pronunciation image sequence in turn, obtains the lip contour curves, and outputs them to the geometric vector acquisition module; when locating the lip contour in the first image of each Chinese character pronunciation image sequence, the initial curve of the convolutional virtual electrostatic field Snake model is the closed contour curve provided by the human-computer interaction module; when locating the lip contour in any other image of the sequence, the initial curve is the lip positioning result curve of the previous image;
Step 1.7: the geometric vector acquisition module extracts the lip geometric feature vector, denoted g_i, from the lip contour curve of each frame of the Chinese character pronunciation image sequence in turn, where i is the index of the frame within the sequence, 1 ≤ i ≤ n and i is a positive integer; to compensate for lip-shape differences between subjects and for image scaling differences, g_i is normalized to obtain the normalized lip geometric feature vector, denoted g_i'; the normalized lip geometric feature vectors are then output to the motion vector acquisition module and the feature matrix construction module; the specific steps for obtaining the normalized lip geometric feature vector are:
Step 1.7.1: compute the horizontal extrema of the lip contour curve to obtain the coordinates of the left and right mouth corners;
Step 1.7.2: connect the two mouth corners with a straight line and take the midpoint of the two corner points as the centre, called point O; rotate this line clockwise 5 times, 30 degrees each time; each rotation yields two line segments where the line intersects the lip curve, giving 12 segments in total; numbered clockwise starting from the left mouth corner, the lengths of these 12 segments are denoted L_1~L_12 and are called the radial vectors; when the line through the mouth corners has been rotated by 90 degrees, its upper and lower intersections with the lip curve are denoted point A and point B respectively;
Step 1.7.3: choose either of the two mouth corners and call it point Q; connect Q to A and to B with straight lines; denote ∠AQO by θ_1 and ∠BQO by θ_2; the angles θ_1 and θ_2, and hence their cosines, can be obtained from L_1~L_12;
Step 1.7.4: L_1~L_12 together with the cosines of θ_1 and θ_2 constitute the lip geometric feature vector of a frame; because L_1 and L_7 are each half the length of the segment connecting the left and right mouth corners, their values are equal, so L_7 is removed from the vector, i.e. the lip geometric feature vector of a frame is g_i = [L_1, …, L_6, L_8, …, L_12, cos θ_1, cos θ_2]^T;
Step 1.7.5: to compensate for lip-shape differences between subjects and for image scaling differences, g_i is normalized to obtain the normalized lip geometric feature vector g_i'; g_i' is a 13-dimensional vector, g_i' = [L_1', …, L_6', L_8', …, L_12', cos θ_1, cos θ_2], where L_j' = L_j / d, j = 1, 2, …, 6, 8, …, 12, and d is the distance between the left and right mouth corners in the first frame of the Chinese character pronunciation image sequence;
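To make steps 1.7.1 to 1.7.5 concrete, the following Python sketch extracts the 13-dimensional normalized geometric feature vector from one lip contour. It is a minimal illustration, not the patented implementation: the contour is assumed to be a closed polygon given as an (N, 2) NumPy array in image coordinates, the sign of the rotation angle depends on whether the y-axis points down, and the cosines of θ_1 and θ_2 are computed from the right triangles A-O-Q and B-O-Q, which follows from OA and OB being perpendicular to the mouth-corner line by construction.

import numpy as np

def ray_contour_distance(origin, direction, contour):
    """Distance from `origin` along the unit vector `direction` to the
    nearest crossing of a closed polygonal lip contour ((N, 2) array).
    Returns None only if the ray misses, which cannot happen while the
    origin lies inside the closed contour."""
    best = None
    n = len(contour)
    for k in range(n):
        p, q = contour[k], contour[(k + 1) % n]
        e, w = q - p, p - origin
        det = direction[0] * e[1] - direction[1] * e[0]        # cross(d, e)
        if abs(det) < 1e-12:
            continue                                           # edge parallel to the ray
        t = (w[0] * e[1] - w[1] * e[0]) / det                  # distance along the ray
        s = (w[0] * direction[1] - w[1] * direction[0]) / det  # position on the edge
        if t > 0.0 and 0.0 <= s <= 1.0 and (best is None or t < best):
            best = t
    return best

def radial_vectors(contour):
    """Steps 1.7.1-1.7.2: the 12 radial lengths L_1..L_12, measured in
    30-degree steps around the midpoint O of the two mouth corners,
    starting from the direction of the left corner."""
    left = contour[np.argmin(contour[:, 0])]    # horizontal extrema of the
    right = contour[np.argmax(contour[:, 0])]   # contour = the mouth corners
    o = (left + right) / 2.0
    d0 = (left - o) / np.linalg.norm(left - o)  # direction of L_1
    lengths = []
    for k in range(12):
        a = -np.deg2rad(30.0 * k)  # flip the sign if the y-axis points up
        rot = np.array([[np.cos(a), -np.sin(a)],
                        [np.sin(a),  np.cos(a)]])
        lengths.append(ray_contour_distance(o, rot @ d0, contour))
    return np.array(lengths)

def geometric_feature(L, d_first):
    """Steps 1.7.3-1.7.5: g_i' = [L_1'..L_6', L_8'..L_12', cos t1, cos t2],
    with d_first the mouth-corner distance in the sequence's first frame.
    OQ lies on the corner line and OA, OB are perpendicular to it, so
    e.g. tan(theta_1) = L_4 / L_1 in the right triangle A-O-Q."""
    cos_t1 = L[0] / np.hypot(L[0], L[3])   # A lies on the 90-degree ray (L_4)
    cos_t2 = L[0] / np.hypot(L[0], L[9])   # B lies on the 270-degree ray (L_10)
    kept = np.delete(L, 6)                 # drop L_7, which equals L_1
    return np.concatenate([kept / d_first, [cos_t1, cos_t2]])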
Step 1.8: the motion vector acquisition module constructs the lip motion feature vector of each frame from the normalized lip geometric feature vectors; it is denoted p_i and is a 13-dimensional vector, p_i = (g_i' − g_{i−1}')/Δt, where g_0' = g_1' and Δt is the time interval between two successive frames; the lip motion feature vectors p_i are then output to the feature matrix construction module;
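Step 1.8 reduces to a first-order temporal difference; a minimal sketch, assuming the normalized geometric feature vectors of one sequence are stacked row-wise in an (n, 13) NumPy array:

import numpy as np

def motion_features(G, dt):
    """Step 1.8: p_i = (g_i' - g_{i-1}') / dt for an (n, 13) array G whose
    rows are the normalized geometric feature vectors of one sequence;
    the convention g_0' = g_1' makes the first motion vector zero."""
    P = np.zeros_like(G)
    P[1:] = (G[1:] - G[:-1]) / dt
    return P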
Step 1.9: the feature matrix construction module constructs the feature matrix of each training-data Chinese character pronunciation image sequence, denoted Z_f, where f is the index of the training-data Chinese character pronunciation image sequence, 1 ≤ f ≤ m and f is a positive integer; the feature matrices Z_f are then output to the transformation matrix T acquisition module and the transformed feature matrix acquisition module; the specific steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 1.9.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector to form the joint feature vector, denoted v_i; v_i is a 26-dimensional vector, v_i = [g_i', p_i]^T;
Step 1.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors v_i of all its frames, so the feature matrix of a training-data Chinese character pronunciation image sequence is Z_f = [v_1, v_2, …, v_n] ∈ R^{26×n};
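Steps 1.9.1 and 1.9.2 then amount to concatenating the two 13-dimensional vectors frame by frame and collecting the results as columns; a sketch under the same array conventions as above:

import numpy as np

def feature_matrix(G, P):
    """Steps 1.9.1-1.9.2: Z = [v_1 ... v_n] in R^{26 x n}, where
    v_i = [g_i', p_i]^T is the joint feature vector of frame i."""
    return np.concatenate([G, P], axis=1).T   # (n, 26) -> (26, n)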
Step 1.10: the transformation matrix T acquisition module applies the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. to the feature matrices Z_f of the m training-data Chinese character pronunciation image sequences, obtains the transformation matrix T ∈ R^{26×r}, where r < 26, r is a positive integer and R denotes the real numbers, and stores T in memory A;
Step 1.11: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, uses it to transform the feature matrix Z_f of each training-data Chinese character pronunciation image sequence in turn, obtains the transformed feature matrix Z_f' = T^T Z_f, and stores the transformed feature matrices of the training data in memory A;
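Learning the transformation matrix T itself follows the cited method of Kim et al. and is not reproduced here; once T is available, step 1.11 is a single matrix product. The sketch below applies the transformation and models memory A as a dictionary from training characters to lists of transformed feature matrices, a layout that is purely an illustrative assumption:

import numpy as np

def store_training_sequence(memory_A, character, Z, T):
    """Step 1.11: transform the (26, n) feature matrix with the learned
    (26, r) matrix T and file the result under its Chinese character."""
    Zp = T.T @ Z                                   # Z' = T^T Z, shape (r, n)
    memory_A.setdefault(character, []).append(Zp)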
After the above steps, the training of the automatic lip reading recognition system is complete;
The workflow of the system testing process is:
Step 2.1: choose m' Chinese characters from the m training characters as test data, where m' ≤ m and m' is a positive integer;
Step 2.2: the human-computer interaction module displays a closed contour curve;
Step 2.3: the subject fixes the head-mounted camera on the head and adjusts its position so that it directly captures the lower half of the face, the captured image being sent to the human-computer interaction module for display; the subject then adjusts the camera position again so that the lip region is contained within the closed contour curve described in step 2.2;
Step 2.4: the subject pronounces the m' Chinese characters described in step 2.1 at a rate of 1 Chinese character per second, while the head-mounted camera captures at n frames per second; the video stream of each Chinese character pronunciation therefore consists of an n-frame image sequence, and the n-frame image sequence of one Chinese character is called a Chinese character pronunciation image sequence; the head-mounted camera sends the captured Chinese character pronunciation image sequences to the human-computer interaction module;
Step 2.5: the human-computer interaction module sends the closed contour curve described in step 2.2 and the Chinese character pronunciation image sequences described in step 2.4 to the lip contour positioning module;
Step 2.6: identical to the operation of step 1.6 in the system training process;
Step 2.7: identical to the operation of step 1.7 in the system training process;
Step 2.8: identical to the operation of step 1.8 in the system training process;
Step 2.9: the feature matrix construction module constructs the feature matrix of each test-data Chinese character pronunciation image sequence, denoted Z_e, where e is the index of the test-data Chinese character pronunciation image sequence, 1 ≤ e ≤ m' and e is a positive integer; the feature matrices Z_e are then output to the transformed feature matrix acquisition module; the specific steps for constructing the feature matrix of a Chinese character pronunciation image sequence are:
Step 2.9.1: for each frame of the Chinese character pronunciation image sequence in turn, concatenate the lip geometric feature vector and the lip motion feature vector to form the joint feature vector v_i; v_i is a 26-dimensional vector, v_i = [g_i', p_i]^T;
Step 2.9.2: the feature matrix of the Chinese character pronunciation image sequence is formed by assembling the joint feature vectors v_i of all its frames, so the feature matrix of a test-data Chinese character pronunciation image sequence is Z_e = [v_1, v_2, …, v_n] ∈ R^{26×n};
Step 2.10: the transformed feature matrix acquisition module reads the transformation matrix T from memory A, uses it to transform the feature matrix Z_e of the test-data Chinese character pronunciation image sequence, obtains the transformed feature matrix Z_e' = T^T Z_e, and stores the transformed feature matrix of the test data in memory B;
Step 2.11: the canonical correlation discriminant analysis module reads the transformed feature matrices Z_f' of all training data from memory A and the transformed feature matrix Z_e' of the current test-data Chinese character pronunciation image sequence from memory B, then uses the canonical correlation discriminant analysis method proposed by T.-K. Kim et al. in the document "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations" to compute the sum of canonical correlation coefficients between Z_e' and each training-data Z_f'; because a Chinese character may occur more than once in the training data, the same character may correspond to more than one such sum, so the mean of the sums corresponding to each training character is computed, the maximum of these means is found, and the training-data Chinese character corresponding to this maximum is output to the human-computer interaction module;
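Step 2.11 can be sketched as follows. In the cited method of Kim et al., two image sets are compared through the canonical correlations between low-dimensional linear subspaces of their feature matrices: with Q_e and Q_f orthonormal bases of the principal subspaces of Z_e' and Z_f', the canonical correlation coefficients are the singular values of Q_e^T Q_f. The subspace dimension d below is a free parameter chosen here for illustration, and the dictionary layout of memory A is the same illustrative assumption as above:

import numpy as np

def subspace_basis(Z, d):
    """Orthonormal basis of the d-dimensional principal subspace of the
    columns of Z (d must not exceed either dimension of Z)."""
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :d]

def canonical_correlation_sum(Z1, Z2, d=10):
    """Sum of canonical correlation coefficients between two transformed
    feature matrices: the singular values of Q1^T Q2 for orthonormal
    subspace bases Q1, Q2."""
    Q1, Q2 = subspace_basis(Z1, d), subspace_basis(Z2, d)
    return np.linalg.svd(Q1.T @ Q2, compute_uv=False).sum()

def recognise(Z_test, memory_A):
    """Step 2.11: return the training character whose sequences have the
    highest mean sum of canonical correlations with the test sequence."""
    means = {ch: np.mean([canonical_correlation_sum(Z_test, Zf) for Zf in mats])
             for ch, mats in memory_A.items()}
    return max(means, key=means.get)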
Step 2.12: the human-computer interaction module displays the Chinese character transmitted by the canonical correlation discriminant analysis module;
After the above steps, the automatic classification and recognition of the test data is complete.
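Putting the sketches together, the testing process of steps 2.4 to 2.12 reads roughly as follows; track_lip_contour stands for the convolutional virtual electrostatic field Snake tracker of step 2.6, which is outside the scope of these sketches, and all other names are the illustrative functions introduced above:

import numpy as np

def recognise_sequence(frames, T, memory_A, dt, track_lip_contour):
    """End-to-end sketch of the testing process for one Chinese character
    pronunciation image sequence."""
    contours = [track_lip_contour(f) for f in frames]   # step 2.6 (Snake tracker)
    radii = [radial_vectors(c) for c in contours]       # step 2.7
    d_first = 2.0 * radii[0][0]   # corner distance in frame 1, since L_1 is half of it
    G = np.array([geometric_feature(L, d_first) for L in radii])
    P = motion_features(G, dt)                          # step 2.8
    Zp = T.T @ feature_matrix(G, P)                     # steps 2.9-2.10
    return recognise(Zp, memory_A)                      # steps 2.11-2.12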
CN2010105582532A 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language Expired - Fee Related CN102004549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105582532A CN102004549B (en) 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105582532A CN102004549B (en) 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language

Publications (2)

Publication Number Publication Date
CN102004549A true CN102004549A (en) 2011-04-06
CN102004549B CN102004549B (en) 2012-05-09

Family

ID=43811953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105582532A Expired - Fee Related CN102004549B (en) 2010-11-22 2010-11-22 Automatic lip language identification system suitable for Chinese language

Country Status (1)

Country Link
CN (1) CN102004549B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN105787428A (en) * 2016-01-08 2016-07-20 上海交通大学 Method for lip feature-based identity authentication based on sparse coding
CN106250829A (en) * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data
CN107122646A (en) * 2017-04-26 2017-09-01 大连理工大学 A kind of method for realizing lip reading unblock
CN107992812A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip reading recognition methods and device
CN108596107A (en) * 2018-04-26 2018-09-28 京东方科技集团股份有限公司 Lip reading recognition methods and its device, AR equipment based on AR equipment
CN109389098A (en) * 2018-11-01 2019-02-26 重庆中科云丛科技有限公司 A kind of verification method and system based on lip reading identification
CN109682676A (en) * 2018-12-29 2019-04-26 上海工程技术大学 A kind of feature extracting method of the acoustic emission signal of fiber tension failure
US10275685B2 (en) 2014-12-22 2019-04-30 Dolby Laboratories Licensing Corporation Projection-based audio object extraction from audio content
CN110580336A (en) * 2018-06-08 2019-12-17 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
CN111326152A (en) * 2018-12-17 2020-06-23 南京人工智能高等研究院有限公司 Voice control method and device
CN111898420A (en) * 2020-06-17 2020-11-06 北方工业大学 Lip language recognition system
CN112053160A (en) * 2020-09-03 2020-12-08 中国银行股份有限公司 Intelligent bracelet for lip language recognition, lip language recognition system and method
WO2021051603A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Coordinate transformation-based lip cutting method and apparatus, device, and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, 2007-06-30, Tae-Kyun Kim et al., "Discriminative learning and recognition of image set classes using canonical correlations" (full text) *
Proceedings of the 6th Joint Conference on Harmonious Human-Machine Environment, 2010-10-24, Lv Kun et al., "Lip tracking algorithm based on the convolutional virtual electrostatic field Snake model" *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275685B2 (en) 2014-12-22 2019-04-30 Dolby Laboratories Licensing Corporation Projection-based audio object extraction from audio content
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN104808794B (en) * 2015-04-24 2019-12-10 北京旷视科技有限公司 lip language input method and system
CN105787428A (en) * 2016-01-08 2016-07-20 上海交通大学 Method for lip feature-based identity authentication based on sparse coding
CN106250829A (en) * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data
CN107025439B (en) * 2017-03-22 2020-04-24 天津大学 Lip region feature extraction and normalization method based on depth data
CN107122646A (en) * 2017-04-26 2017-09-01 大连理工大学 A kind of method for realizing lip reading unblock
CN107992812A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip reading recognition methods and device
CN108596107A (en) * 2018-04-26 2018-09-28 京东方科技集团股份有限公司 Lip reading recognition methods and its device, AR equipment based on AR equipment
US11527242B2 (en) 2018-04-26 2022-12-13 Beijing Boe Technology Development Co., Ltd. Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
CN110580336A (en) * 2018-06-08 2019-12-17 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
CN109389098A (en) * 2018-11-01 2019-02-26 重庆中科云丛科技有限公司 A kind of verification method and system based on lip reading identification
CN111326152A (en) * 2018-12-17 2020-06-23 南京人工智能高等研究院有限公司 Voice control method and device
CN109682676A (en) * 2018-12-29 2019-04-26 上海工程技术大学 A kind of feature extracting method of the acoustic emission signal of fiber tension failure
WO2021051603A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Coordinate transformation-based lip cutting method and apparatus, device, and storage medium
CN111898420A (en) * 2020-06-17 2020-11-06 北方工业大学 Lip language recognition system
CN112053160A (en) * 2020-09-03 2020-12-08 中国银行股份有限公司 Intelligent bracelet for lip language recognition, lip language recognition system and method
CN112053160B (en) * 2020-09-03 2024-04-23 中国银行股份有限公司 Intelligent bracelet for lip language identification, lip language identification system and method

Also Published As

Publication number Publication date
CN102004549B (en) 2012-05-09

Similar Documents

Publication Publication Date Title
CN102004549A (en) Automatic lip language identification system suitable for Chinese language
US10891472B2 (en) Automatic body movement recognition and association system
Bheda et al. Using deep convolutional networks for gesture recognition in american sign language
CN101964064B (en) Human face comparison method
Gao et al. Sign language recognition based on HMM/ANN/DP
Matthews et al. Extraction of visual features for lipreading
Luettin et al. Speechreading using probabilistic models
Geetha et al. A vision based dynamic gesture recognition of indian sign language on kinect based depth images
CN111223483A (en) Lip language identification method based on multi-granularity knowledge distillation
CN103092329A (en) Lip reading technology based lip language input method
CN113838174B (en) Audio-driven face animation generation method, device, equipment and medium
CN111178157A (en) Chinese lip language identification method from cascade sequence to sequence model based on tone
Zhang et al. BoMW: Bag of manifold words for one-shot learning gesture recognition from kinect
CN110110603A (en) A kind of multi-modal labiomaney method based on facial physiologic information
Borg et al. Phonologically-meaningful subunits for deep learning-based sign language recognition
CN111243065A (en) Voice signal driven face animation generation method
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN105335755A (en) Media segment-based speaking detection method and system
CN110096987B (en) Dual-path 3DCNN model-based mute action recognition method
Zheng et al. Review of lip-reading recognition
CN111950452A (en) Face recognition method
Yang et al. Sign language recognition system based on weighted hidden Markov model
Axyonov et al. Method of multi-modal video analysis of hand movements for automatic recognition of isolated signs of Russian sign language
Brock et al. Augmenting sparse corpora for enhanced sign language recognition and generation
Rahman et al. Lip Reading Bengali Words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120509

Termination date: 20171122