WO2020253051A1 - Lip language recognition method and device - Google Patents

Lip language recognition method and device

Info

Publication number
WO2020253051A1
WO2020253051A1 (PCT/CN2019/117773)
Authority
WO
WIPO (PCT)
Prior art keywords
lip language
lip
key points
facial
recognition model
Prior art date
Application number
PCT/CN2019/117773
Other languages
English (en)
French (fr)
Inventor
庞烨
王健宗
王义文
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020253051A1 publication Critical patent/WO2020253051A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of big data technology, and in particular to a lip language recognition method and device.
  • Lip language recognition technology can determine the content of lip language from the changes of a user's lips.
  • In the related art, lip language recognition algorithms have poor generalization ability when extracting image features: color, texture, shape and other external information in the image affects the final recognition result, making lip language recognition inaccurate.
  • In view of this, the embodiments of the present application provide a lip language recognition method and device to solve the problem of low lip language recognition accuracy in the prior art.
  • In one aspect, an embodiment of the present application provides a lip language recognition method.
  • The method includes: acquiring multiple frames of facial images of a user; determining multiple lip key points in each frame of the facial image and the coordinates corresponding to each lip key point; generating the lip language code corresponding to the multiple frames of facial images according to the coordinates corresponding to each lip key point in each frame of the facial image; and inputting the lip language code into a preset lip language recognition model to recognize the content of the lip language.
  • In another aspect, an embodiment of the present application provides a lip language recognition device.
  • The device includes: a first acquiring module, configured to acquire multiple frames of facial images of a user; a determining module, configured to determine multiple lip key points in each frame of the facial image and the coordinates corresponding to each lip key point; a first generating module, configured to generate the lip language code corresponding to the multiple frames of facial images according to the coordinates corresponding to each lip key point in each frame of the facial image; and a first input module, configured to input the lip language code into a preset lip language recognition model to recognize the content of the lip language.
  • In another aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory is configured to store information including program instructions and the processor is configured to control execution of the program instructions; when the program instructions are loaded and executed by the processor, the steps of the above lip language recognition method are implemented.
  • In another aspect, an embodiment of the present application provides a computer non-volatile readable storage medium that stores a program; when the program runs, the device on which the storage medium is located is controlled to execute the above lip language recognition method.
  • In the embodiments of the present application, the lip language code is generated from the coordinates of the lip key points in the multiple frames of facial images, and the content of the lip language is then recognized from the lip language code. This solves the problem of low lip language recognition accuracy in the prior art and improves the generalization ability and recognition accuracy of the lip language recognition method.
  • FIG. 1 is a schematic flowchart of an optional lip language recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of another lip language recognition method provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of the structure of a recurrent neural network provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a lip language recognition model provided by an embodiment of the application.
  • FIG. 5 is a schematic flowchart of another lip language recognition method proposed by an embodiment of the application.
  • FIG. 6 is a flowchart of an example of a lip language recognition method provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a lip language recognition device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of another lip language recognition device provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of another lip language recognition device provided by an embodiment of the application.
  • Fig. 10 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • Although the terms first, second, third, etc. may be used in the embodiments of the present application to describe preset ranges and the like, these preset ranges should not be limited by these terms. The terms are only used to distinguish the preset ranges from one another.
  • For example, without departing from the scope of the embodiments of the present application, a first preset range may also be referred to as a second preset range, and similarly, a second preset range may also be referred to as a first preset range.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting".
  • Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)".
  • As described above, in the related art, lip language recognition algorithms have poor generalization ability when extracting image features: color, texture, shape and other external information in the image affects the final recognition result, making lip language recognition inaccurate.
  • To address this problem, the embodiments of the present application provide a lip language recognition method that generates the corresponding lip language code from the coordinates of the lip key points in multiple frames of facial images and then recognizes the content of the lip language from the lip language code. This avoids the influence of skin color and texture in the facial images on lip language recognition and improves the generalization ability and recognition accuracy of the lip language recognition method.
  • FIG. 1 is a schematic flowchart of an optional lip language recognition method provided by an embodiment of the present application. As shown in Figure 1, the method includes:
  • Step S101 Acquire multiple frames of facial images of the user.
  • The multiple frames of facial images of the user are obtained from the user's facial video.
  • It should be understood that a facial video can be split into multiple consecutive frames of facial images in chronological order, as illustrated by the sketch below.
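  • The following is a minimal sketch of splitting a facial video into frames in chronological order; OpenCV (cv2) is an assumption of this example, since the application does not name a specific video library.

```python
import cv2


def video_to_frames(video_path):
    """Read a facial video and return its frames in chronological order."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()  # ok becomes False when the video ends
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames
```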
  • Step S102 Determine multiple lip key points in each frame of the facial image, and the coordinates corresponding to each lip key point.
  • It should be understood that each frame of the facial image contains the corresponding lip key points.
  • The coordinates corresponding to the same lip key point differ between frames as the user's pronunciation changes.
  • To determine the lip key points in each frame, one possible implementation is to input each frame of the facial image into a facial key point recognition model to identify and label multiple facial key points in the facial image, filter the multiple lip key points out of the facial key points according to the labeling information, and establish a plane coordinate system to determine the coordinates corresponding to the multiple lip key points.
  • It should be noted that Dlib is a facial key point recognition model with C++ and Python interfaces that can recognize and label facial key points. Specifically, the positions of 68 facial key points can be obtained, and each of the 68 facial key points is labeled separately. It can be understood that different facial images are labeled with the same facial key point labeling scheme.
  • For example, in facial image A the left eye includes six facial key points, labeled 37, 38, 39, 40, 41 and 42, and the six facial key points corresponding to the left eye in facial image B are also labeled 37, 38, 39, 40, 41 and 42. That is, in a facial image, the position of a facial key point on the face can be determined from the number used to label it.
  • Similarly, the facial key points of the lip region are labeled with fixed numbers.
  • For example, the 16 facial key points of the lip region labeled 49-64 can be used as the lip key points.
  • The plane coordinate system provided by the embodiments of the present application uses the center position of the multiple lip key points as the origin.
  • When the face moves, the origin of the plane coordinate system moves with it, which avoids the influence of face movement on lip language recognition.
  • After the origin of the plane coordinate system is determined, the coordinates corresponding to the multiple lip key points can be determined, as sketched below.
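  • A minimal sketch of step S102 in Python follows, assuming the Dlib 68-point shape predictor mentioned above; the model file name and the mapping of the 1-based labels 49-64 to Dlib's 0-based part indices 48-63 are assumptions of this example rather than details taken from the application.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

LIP_LABELS = range(49, 65)                # labels 49-64 in the 1-based numbering
LIP_PARTS = [i - 1 for i in LIP_LABELS]   # Dlib's part() uses 0-based indices


def lip_keypoint_coordinates(frame):
    """Return the lip key points of one facial image (a uint8 numpy image),
    expressed in a coordinate system centered on the lip key points."""
    faces = detector(frame, 1)
    if not faces:
        return None
    shape = predictor(frame, faces[0])
    points = np.array([[shape.part(i).x, shape.part(i).y] for i in LIP_PARTS],
                      dtype=np.float32)
    # Use the center of the lip key points as the coordinate origin, so that
    # movement of the whole face does not change the coordinates.
    return points - points.mean(axis=0)
```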
  • Step S103 Generate lip language codes corresponding to multiple frames of facial images according to the coordinates corresponding to each lip key point in each frame of facial images.
  • It should be understood that the coordinates corresponding to each lip key point form a two-dimensional coordinate pair, and each frame of the facial image contains multiple lip key points.
  • To generate the lip language code corresponding to the multiple frames of facial images, one possible implementation is to arrange the coordinates corresponding to each lip key point in each frame of the facial image in chronological order and in labeling order.
  • For example, suppose the first and second frames of the facial image each include 3 lip key points, labeled 40, 41 and 42 in the facial key point recognition model. The coordinates of the 3 lip key points in the first frame are 40 (0, 1), 41 (-2, -0.5) and 42 (2, -0.5), and the coordinates of the 3 lip key points in the second frame are 40 (0, 2), 41 (-1, -0.5) and 42 (1, -0.5).
  • The first possible lip language coding result is (0, 1, -2, -0.5, 2, -0.5, 0, 2, -1, -0.5, 1, -0.5); the second possible lip language coding result is (0, 1, 0, 2, -2, -0.5, -1, -0.5, 2, -0.5, 1, -0.5).
  • It should be noted that the foregoing example only illustrates how the coordinates are arranged in chronological order and labeling order, and does not limit the embodiments of the present application. A short sketch of this coding step follows below.
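  • The sketch below reproduces the first possible coding result from the example above: the labeled coordinates are flattened frame by frame (chronological order) and, within each frame, by label (labeling order). The dictionary layout of the input is an assumption of this example.

```python
import numpy as np

frames = [
    {40: (0, 1), 41: (-2, -0.5), 42: (2, -0.5)},   # first frame
    {40: (0, 2), 41: (-1, -0.5), 42: (1, -0.5)},   # second frame
]

lip_code = np.array([coord
                     for frame in frames           # chronological order
                     for label in sorted(frame)    # labeling order
                     for coord in frame[label]])   # (x, y) pairs
print(lip_code)
# -> [ 0.   1.  -2.  -0.5  2.  -0.5  0.   2.  -1.  -0.5  1.  -0.5]
```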
  • Step S104 Input the lip language code into a preset lip language recognition model to recognize the content of the lip language.
  • It can be understood that in the embodiments of the present application, the multiple lip key points in the user's facial video serve as markers of the user's lip movements, and the coordinates of the multiple lip key points in the multiple frames are recorded by the lip language code.
  • In other words, the lip language code records the user's lip movement information in the facial video in numerical form, so the lip language recognition model can recognize the content of the lip language from the input lip language code.
  • It should be understood that the lip language recognition model recognizes the content of the lip language in units of phonemes: after a group of phonemes corresponding to the multiple frames of facial images is determined, the group of phonemes is decoded to obtain the content of the lip language.
  • For example, the correct lip language content corresponding to a facial video, expressed in units of phonemes, consists of the three phonemes A1, B1 and C1. By recognizing the lip language code, the model determines that the probability of the first phoneme being A1 is 90%, the probability of it being A2 is 8%, and the probability of it being A3 is 2%, so the first phoneme of the lip language content is recognized as A1.
  • In summary, the lip language recognition method provided by the embodiments of the present application acquires multiple frames of facial images of a user and determines the multiple lip key points in each frame of the facial image and the coordinates corresponding to each lip key point. According to the coordinates corresponding to each lip key point in each frame of the facial image, the lip language code corresponding to the multiple frames of facial images is generated, and the lip language code is input into a preset lip language recognition model to recognize the content of the lip language.
  • In this way, the lip language code is generated from the coordinates of the lip key points in the multiple frames of facial images, and the content of the lip language is then recognized from the lip language code. This avoids the influence of skin color, texture and other factors in the facial images on lip language recognition, and improves the generalization ability and recognition accuracy of the lip language recognition method.
  • To explain more clearly how the lip language recognition model is generated, the embodiments of the present application further provide another lip language recognition method. FIG. 2 is a schematic flowchart of another lip language recognition method provided by an embodiment of the application. As shown in FIG. 2, based on the method flow shown in FIG. 1, the lip language recognition model can be generated through the following steps:
  • Step S201 Obtain multiple lip language videos and the lip language content corresponding to each lip language video.
  • It should be understood that the multiple lip language videos provided in the embodiments of the present application are used to train the lip language recognition model, so the lip language content corresponding to each lip language video is stored in advance.
  • The lip language content is the correct recognition result of the lip language video.
  • Step S202 Generate, according to the multiple lip language videos, the lip language code corresponding to each lip language video.
  • It should be noted that the method of generating the lip language code corresponding to each lip language video is the same as the method of generating the lip language code corresponding to the multiple frames of facial images described above, and will not be repeated here.
  • Step S203 Input the lip code corresponding to each lip language video into the lip language recognition model.
  • The lip language recognition model includes a convolutional neural network and a recurrent neural network.
  • To improve the performance of the lip language recognition model, one preferred implementation sets the number of layers of the convolutional neural network to 4 with a convolution kernel size of 3, and sets the number of layers of the recurrent neural network to 3 with 1024 hidden neurons in each layer.
  • It should be understood that the neurons of the convolutional neural network capture the characteristics of the data within a local region. For example, if the convolution kernel of a convolutional neuron covers five values, the kernel multiplies and sums the data at the current position with the two adjacent values on either side, and writes the result to the current position.
  • As shown in FIG. 3, the recurrent neural network is a memory network that computes its result from the hidden layer at the previous moment and the input at the current moment, and outputs the predicted probabilities. A sketch of this model structure follows below.
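  • A minimal sketch of the model described above (a 4-layer convolutional network with kernel size 3 followed by a 3-layer recurrent network with 1024 hidden units per layer) is given below. PyTorch, 1-D convolutions over the frame axis, GRU cells, 256 convolution channels and the phoneme class count are assumptions of this sketch; 32 inputs per frame corresponds to the 16 lip key points with two coordinates each.

```python
import torch
import torch.nn as nn


class LipReadingModel(nn.Module):
    def __init__(self, coords_per_frame=32, num_classes=50):
        super().__init__()
        layers, channels = [], coords_per_frame
        for _ in range(4):                       # 4 convolutional layers, kernel size 3
            layers += [nn.Conv1d(channels, 256, kernel_size=3, padding=1),
                       nn.ReLU()]
            channels = 256
        self.conv = nn.Sequential(*layers)
        self.rnn = nn.GRU(256, 1024, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, lip_code):
        # lip_code: (batch, frames, coords_per_frame) lip language codes
        x = self.conv(lip_code.transpose(1, 2))  # convolve along the frame axis
        x, _ = self.rnn(x.transpose(1, 2))       # 3-layer recurrent network
        return self.classifier(x)                # per-frame phoneme scores
```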
  • Step S204 Compare the output of the lip language recognition model with the lip language content corresponding to each lip language video to train the parameters of the lip language recognition model.
  • One possible implementation uses the CTC loss function to compare the output of the lip language recognition model with the lip language content corresponding to each lip language video, so as to quantify the difference between the two.
  • It should be understood that the output of the lip language recognition model is the result of the model processing the lip language code, while the lip language content corresponding to the lip language video is the correct recognition result. By comparing the two and continuously training the parameters of the lip language recognition model, the recognition accuracy of the model is continuously improved.
  • It should be noted that parameter training of the convolutional neural network mainly trains the coefficients of the convolution kernels, to optimize how features are extracted from the lip language code.
  • Parameter training of the recurrent neural network mainly trains the weight parameters of each hidden neuron, to improve the accuracy of lip language recognition. In this way, parameter training and performance optimization of the lip language recognition model are achieved.
  • As shown in FIG. 4, during training the lip language code is input into the convolutional neural network, the output of the convolutional neural network is input into the recurrent neural network, and the CTC loss function is used to compute the difference so as to optimize the parameters of the convolutional neural network and the recurrent neural network. In use, the lip language code to be recognized is input into the convolutional neural network and then the recurrent neural network to obtain the lip language recognition result. A minimal training sketch follows below.
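  • The sketch below shows one CTC training step for the LipReadingModel sketch above. PyTorch's nn.CTCLoss stands in for the CTC loss named in the application; the optimizer, the blank index and the target format (concatenated phoneme ids) are assumptions of this example.

```python
import torch
import torch.nn as nn

model = LipReadingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ctc_loss = nn.CTCLoss(blank=0)


def train_step(lip_codes, targets, input_lengths, target_lengths):
    """lip_codes: (batch, frames, 32); targets: 1-D tensor of phoneme ids."""
    optimizer.zero_grad()
    logits = model(lip_codes)                                # (batch, frames, classes)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # (frames, batch, classes)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                                          # trains CNN and RNN parameters
    optimizer.step()
    return loss.item()
```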
  • To further improve the recognition accuracy of the lip language recognition method provided by the embodiments of the present application, the embodiments of the present application further provide yet another lip language recognition method. FIG. 5 is a schematic flowchart of yet another lip language recognition method proposed by an embodiment of the application. As shown in FIG. 5, based on the method flow shown in FIG. 1, after step S104 of inputting the lip language code into the preset lip language recognition model to recognize the content of the lip language, the method further includes:
  • Step S301 Acquire voices corresponding to multiple frames of facial images.
  • It can be understood that the multiple frames of facial images provided by the embodiments of the present application are obtained from the user's facial video, so the corresponding voice can be extracted from the same video.
  • Step S302 Input the voice into a preset speech recognition model to recognize the content of the voice.
  • One possible way of generating the preset speech recognition model includes the following steps:
  • Step S11 Obtain multiple voices and the voice content corresponding to each voice.
  • It should be understood that the multiple voices provided in the embodiments of the present application are used to train the speech recognition model, so the voice content corresponding to each voice is stored in advance.
  • Step S12 Generate the voice features of the voice.
  • To generate the voice features, one possible implementation first denoises the speech to reduce the impact of environmental noise on the training results, and then uses Mel-frequency cepstral coefficients to extract the voice features from the denoised speech.
  • Specifically, the speech is divided into windows to obtain multiple frames of the speech signal.
  • A fast Fourier transform is performed on each frame of the speech signal to obtain its frequency spectrum.
  • Mel filtering is then applied to the spectrum of the speech signal to extract the voice features of the speech.
  • In a preferred implementation, the speech provided by the embodiments of the application is an 8 kHz speech signal, that is, there are 8000 sampling points per second.
  • Every 256 sampling points are treated as one processing unit, corresponding to a time length of 32 ms.
  • In other words, after windowing, every 32 ms is treated as one time window.
  • It should be noted that the human voice lies in a fixed frequency band, so other sounds in the speech can be filtered out by Mel filtering to obtain a clean human voice; feature extraction is then performed on the processed clean speech signal, as sketched below.
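  • The sketch below follows the steps described above for an 8 kHz signal with non-overlapping 256-sample (32 ms) windows: FFT of each window, Mel filtering, then a cepstral transform. librosa and scipy, the 26 Mel bands and the 13 retained coefficients are assumptions of this example.

```python
import numpy as np
import librosa
from scipy.fftpack import dct


def speech_features(audio, sr=8000, frame_len=256):
    """audio: 1-D float numpy array of denoised speech sampled at 8 kHz."""
    mel_bank = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=26)
    features = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]    # one 32 ms window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2           # FFT -> power spectrum
        mel_energies = mel_bank @ spectrum                   # Mel filtering
        cepstrum = dct(np.log(mel_energies + 1e-10), norm="ortho")
        features.append(cepstrum[:13])                       # keep 13 MFCCs
    return np.array(features)
```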
  • Step S13 Input the voice features into the speech recognition model.
  • The speech recognition model includes a deep neural network.
  • Step S14 Compare the output of the voice recognition model with the voice content corresponding to each voice to train the parameters of the voice recognition model.
  • One possible implementation is that Kaldi provides a package of speech recognition models based on deep neural networks; putting the clean speech signals into the deep neural network provided by Kaldi for training yields a good speech recognition model in a short time.
  • Step S303 Generate language content corresponding to multiple frames of facial images according to the content of the lip language and the content of the voice.
  • It should be noted that in this lip language recognition method, the content of the lip language and the content of the voice need to be combined to comprehensively determine the recognition result of the language content.
  • For example, the correct language content corresponding to a facial video, expressed in units of phonemes, consists of the three phonemes A1, B1 and C1.
  • The lip language recognition model determines that the probability of the first phoneme being A1 is 90%, the probability of it being A2 is 8%, and the probability of it being A3 is 2%; the speech recognition model determines that the probability of the first phoneme being A1 is 80%, the probability of it being A2 is 20%, and the probability of it being A3 is 0%.
  • After the probabilities from the two models are normalized, the probability of the first phoneme being A1 is 85%, the probability of it being A2 is 14%, and the probability of it being A3 is 1%, so the first phoneme of the language content is recognized as A1 according to the combined probability.
  • Similarly, the resulting group of phonemes is decoded to obtain the recognition result of the language content.
  • In this way, the recognition results of the lip language recognition model and the speech recognition model are combined, which further improves the recognition accuracy of the language content corresponding to the facial video.
  • To explain more clearly how the lip language recognition method provided by the embodiments of the present application combines the recognition results of the lip language recognition model and the speech recognition model, an example is given below.
  • As shown in FIG. 6, the multiple frames of facial images and the corresponding voice are extracted from the lip language video.
  • The coordinates corresponding to the lip key points are extracted from each frame of the facial image, the lip language code is obtained from the coordinates, and the lip language code is input into the lip language recognition model to obtain the recognized content and the corresponding probabilities. The voice is sampled and processed to obtain the voice features, which are input into the speech recognition model to obtain the recognized content and the corresponding probabilities. After the probabilities output by the two models are normalized, the recognition result of the language content is obtained, as sketched below.
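  • A minimal sketch of how the two model outputs could be combined, using the probabilities from the example above (lip model: A1 90%, A2 8%, A3 2%; speech model: A1 80%, A2 20%, A3 0%); equal-weight averaging followed by normalization is an assumption of this sketch.

```python
import numpy as np

phonemes = ["A1", "A2", "A3"]
lip_probs = np.array([0.90, 0.08, 0.02])      # lip language recognition model output
speech_probs = np.array([0.80, 0.20, 0.00])   # speech recognition model output

combined = (lip_probs + speech_probs) / 2
combined /= combined.sum()                    # normalize the fused distribution
print(dict(zip(phonemes, combined)))          # approx. {'A1': 0.85, 'A2': 0.14, 'A3': 0.01}
print("recognized:", phonemes[int(np.argmax(combined))])  # recognized: A1
```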
  • FIG. 7 is a schematic structural diagram of a lip language recognition device provided by an embodiment of the application. As shown in FIG. 7, the device includes: a first acquisition module 410, a determination module 420, a first generation module 430, and a first input module 440.
  • the first acquiring module 410 is configured to acquire multiple frames of facial images of the user.
  • the determining module 420 is used to determine multiple key lip points in each frame of the facial image, and the coordinates corresponding to each key lip point.
  • the first generating module 430 is configured to generate lip language codes corresponding to multiple frames of facial images according to the coordinates corresponding to each lip key point in each frame of facial images.
  • the first input module 440 is configured to input the lip language code into a preset lip language recognition model to recognize the content of the lip language.
  • Further, to determine the lip key points in each frame of the facial image, one possible implementation is that the determining module 420 includes: an input sub-module 421, configured to input each frame of the facial image into the facial key point recognition model to identify multiple facial key points in the facial image.
  • the labeling sub-module 422 is used for labeling multiple facial key points.
  • the screening sub-module 423 is used for screening multiple key points of lips from multiple key points of the face according to the label information.
  • the establishment sub-module 424 is used to establish a plane coordinate system to determine the coordinates corresponding to multiple key points of the lips.
  • Further, the establishing sub-module 424 includes: a calculation unit 4241, configured to calculate the center position of the multiple lip key points.
  • the setting unit 4242 is used to use the center position as the origin of the plane coordinate system.
  • the determining unit 4243 is configured to determine the coordinates corresponding to multiple key points of the lips according to the origin of the plane coordinate system.
  • Further, the first generating module 430 is configured to arrange the coordinates corresponding to each lip key point in each frame of the facial image in chronological order and labeling order, to generate the lip language code corresponding to the multiple frames of facial images.
  • In summary, the lip language recognition device provided by the embodiments of the present application acquires multiple frames of facial images of a user and determines the multiple lip key points in each frame of the facial image and the coordinates corresponding to each lip key point. According to the coordinates corresponding to each lip key point in each frame of the facial image, the lip language code corresponding to the multiple frames of facial images is generated, and the lip language code is input into a preset lip language recognition model to recognize the content of the lip language.
  • In this way, the lip language code is generated from the coordinates of the lip key points in the multiple frames of facial images, and the content of the lip language is then recognized from the lip language code. This avoids the influence of skin color, texture and other factors in the facial images on lip language recognition, and improves the generalization ability and recognition accuracy of the lip language recognition method.
  • FIG. 8 is a schematic structural diagram of another lip language recognition device provided by an embodiment of the application. As shown in FIG. 8, based on the device structure shown in FIG. 7, the device further includes: a second acquisition module 510, a second generation module 520, a second input module 530, and a first comparison module 540.
  • the second acquiring module 510 is configured to acquire a plurality of lip language videos and the lip language content corresponding to each lip language video.
  • the second generation module 520 is configured to generate a lip language code corresponding to each lip language video according to a plurality of lip language videos.
  • the second input module 530 is configured to input the lip code corresponding to each lip language video into the lip language recognition model, where the lip language recognition model includes a convolutional neural network and a recurrent neural network.
  • the first comparison module 540 is configured to compare the output of the lip language recognition model with the lip language content corresponding to each lip language video to train the parameters of the lip language recognition model.
  • Further, the first comparison module 540 is configured to use a loss function to compare the output of the lip language recognition model with the lip language content corresponding to each lip language video.
  • FIG. 9 is a schematic structural diagram of another lip language recognition device provided by an embodiment of the application. As shown in FIG. 9, based on the device structure shown in FIG. 7, the device further includes: a third acquisition module 610, a third input module 620, and a third generation module 630.
  • the third acquiring module 610 is configured to acquire voices corresponding to multiple frames of facial images.
  • the third input module 620 is configured to input the voice into a preset voice recognition model to recognize the content of the voice.
  • the third generation module 630 is configured to generate language content corresponding to multiple frames of facial images according to the content of the lip language and the content of the voice.
  • the device further includes: a fourth obtaining module 640, configured to obtain multiple voices and voice content corresponding to each voice.
  • the fourth generating module 650 is used to generate voice features of the voice.
  • the fourth input module 660 is used to input voice features into a voice recognition model, where the voice recognition model includes a deep neural network.
  • the second comparison module 670 is used to compare the output of the voice recognition model with the voice content corresponding to each voice to train the parameters of the voice recognition model.
  • the fourth generation module 650 includes: a noise reduction sub-module 651, which is configured to perform noise reduction on the voice.
  • the extraction sub-module 652 is configured to use the Mel frequency cepstrum coefficient to extract the voice features of the voice from the voice after noise reduction.
  • the extraction sub-module 652 includes: a windowing unit 6521, configured to perform windowing on the voice to obtain a multi-frame voice signal.
  • the transform unit 6522 is used to perform fast Fourier transform on each frame of speech signal to obtain the frequency spectrum of the speech signal.
  • the filtering unit 6523 is used to perform Mel filtering on the frequency spectrum of the speech signal to extract speech features of the speech.
  • In this way, the recognition results of the lip language recognition model and the speech recognition model are combined, which further improves the recognition accuracy of the language content corresponding to the facial video.
  • To implement the foregoing embodiments, the embodiments of the present application further provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the lip language recognition method of the foregoing method embodiments are implemented.
  • Fig. 10 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • As shown in FIG. 10, the computer device 50 of this embodiment includes a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and executable on the processor 51.
  • When the computer program 53 is executed by the processor 51, the lip language recognition method of the embodiments is implemented; to avoid repetition, the details are not repeated here.
  • Alternatively, when the computer program is executed by the processor 51, the functions of each module/unit in the lip language recognition device of the embodiments are implemented; to avoid repetition, the details are not repeated here.
  • the computer device 50 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device may include, but is not limited to, a processor 51 and a memory 52.
  • FIG. 10 is only an example of the computer device 50 and does not constitute a limitation on the computer device 50; it may include more or fewer components than shown, combine certain components, or have different components.
  • For example, the computer device may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 51 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 52 may be an internal storage unit of the computer device 50, such as a hard disk or memory of the computer device 50.
  • The memory 52 may also be an external storage device of the computer device 50, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 50.
  • the memory 52 may also include both an internal storage unit of the computer device 50 and an external storage device.
  • the memory 52 is used to store computer programs and other programs and data required by the computer equipment.
  • the memory 52 can also be used to temporarily store data that has been output or will be output.
  • To implement the foregoing embodiments, the embodiments of the present application further provide a computer non-volatile readable storage medium.
  • The computer non-volatile readable storage medium stores a computer program; when the computer program is executed by a processor, the steps of the lip language recognition method of the foregoing method embodiments are implemented.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • For example, the division into units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer non-volatile readable storage medium.
  • The above-mentioned software functional unit is stored in a computer non-volatile readable storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Provided are a lip language recognition method and device, relating to the field of big data technology. The lip language recognition method includes: acquiring multiple frames of facial images of a user (S101); determining multiple lip key points in each frame of the facial image and the coordinates corresponding to each lip key point (S102); generating, according to the coordinates corresponding to each lip key point in each frame of the facial image, the lip language code corresponding to the multiple frames of facial images (S103); and inputting the lip language code into a preset lip language recognition model to recognize the content of the lip language (S104). In this way, the lip language code is generated from the coordinates of the lip key points in the multiple frames of facial images, and the content of the lip language is then recognized from the lip language code, which avoids the influence of skin color and texture in the facial images on lip language recognition, improves the generalization ability and recognition accuracy of the lip language recognition method, and can solve the technical problem of low lip language recognition accuracy in the prior art.

Description

唇语的识别方法及其装置
本申请要求于2019年6月18日提交中国专利局、申请号为201910524060.6、申请名称为“唇语的识别方法及其装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
【技术领域】
本申请涉及大数据技术领域,尤其涉及一种唇语的识别方法及其装置。
【背景技术】
唇语识别技术可以根据用户的唇部变化,确定唇语的内容。
相关技术中,唇语识别算法对图像特征的提取时存在泛化能力较差的问题,图像中的颜色、纹理、形状等外部信息会对最后的识别结果产生影响,使得唇语识别不准确。
【申请内容】
有鉴于此,本申请实施例提供了一种唇语识别的方法和装置,用以解决现有技术中唇语识别准确度不高的问题。
一方面,本申请实施例提供了一种唇语识别的方法,所述方法包括:获取用户的多帧面部图像;确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;以及将所述唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
一方面,本申请实施例提供了一种唇语的识别装置,所述装置包括:第一获取模块,用于获取用户的多帧面部图像;确定模块,用于确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;第一生成模块,用于根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;第一输入模块,用于将所述唇语编码输入预设 的唇语识别模型,以识别出唇语的内容。
一方面,本申请实施例提供了一种计算机设备,包括存储器和处理器,所述存储器用于存储包括程序指令的信息,所述处理器用于控制程序指令的执行,所述程序指令被处理器加载并执行时实现上述的唇语的识别方法的步骤。
一方面,本申请实施例提供了一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质包括存储的程序,其中,在所述程序运行时控制所述计算机非易失性可读存储介质所在设备执行上述的唇语的识别方法。
在本申请实施例中,通过多帧面部图像中唇部关键点的坐标来生成对应的唇语编码,进而通过唇语编码识别出唇语的内容,解决了现有技术中唇语识别准确度不高的问题,达到了提高唇语识别方法的泛化能力和识别准确度的效果。
【附图说明】
为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其它的附图。
图1是本申请实施例所提供的一种可选的唇语的识别方法的流程示意图;
图2为本申请实施例所提供的另一种唇语的识别方法的流程示意图;
图3为本申请实施例所提供的递归神经网络的结构示意图;
图4为本申请实施例所提供的唇语识别模型的结构示意图;
图5为本申请实施例所提出的又一种唇语的识别方法的流程示意图;
图6为本申请实施例所提供的唇语的识别方法的一个示例的流程图;
图7为本申请实施例所提供的一种唇语的识别装置的结构示意图;
图8为本申请实施例所提供的另一种唇语的识别装置的结构示意图;
图9为本申请实施例所提供的又一种唇语的识别装置的结构示意图;以及
图10是本申请实施例提供的一种计算机设备的示意图。
【具体实施方式】
为了更好的理解本申请的技术方案,下面结合附图对本申请实施例进行详细描述。
应当明确,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在本申请实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。
应当理解,本文中使用的术语“和/或”仅仅是一种描述关联对象的相同的字段,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应当理解,尽管在本申请实施例中可能采用术语第一、第二、第三等来描述预设范围等,但这些预设范围不应限于这些术语。这些术语仅用来将预设范围彼此区分开。例如,在不脱离本申请实施例范围的情况下,第一预设范围也可以被称为第二预设范围,类似地,第二预设范围也可以被称为第一预设范围。
取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”或“响应于检测”。类似地,取决于语境,短语“如果确定”或“如果检测(陈述的条件或事件)”可以被解释成为“当确定时”或“响应于确定”或“当检测(陈述的条件或事件)时”或“响应于检测(陈述的条件或事件)”。
基于上述现有技术的描述可以知道,相关技术中,唇语识别算法对图像特征的提取时存在泛化能力较差的问题,图像中的颜色、纹理、形状等外部信息会对最后的识别结果产生影响,使得唇语识别不准确。
针对这一问题,本申请实施例提供了一种唇语的识别方法,通过多帧面部图像中唇部关键点的坐标来生成对应的唇语编码,进而通过唇语编码识别出唇语的内容,避免了面部图像中肤色、纹理等因素对唇语识别的影响,提高了唇语识别方法的泛化能力和识别准确度。
图1是本申请实施例所提供的一种可选的唇语的识别方法的流程示意图。如图1所示,该方法包括:
步骤S101,获取用户的多帧面部图像。
其中,用户的多帧面部图像是通过用户的面部视频获得的。
应当理解,面部视频可以按时间顺序拆解为多帧连续的面部图像。
步骤S102,确定每帧面部图像中的多个唇部关键点,以及每个唇部关键点对应的坐标。
应当理解,每帧面部图像都包括相应的唇部关键点。而不同帧的面部图像中同一个唇部关键点对应的坐标会随着用户发音的变化而不同。
为了确定每帧面部图像中的唇部关键点,一种可能的实现方式是,将每帧面部图像输入面部关键点识别模型,以识别出面部图像中的多个面部关键点,并对多个面部关键点进行标注。根据标注信息,从多个面部关键点中筛选出多个唇部关键点。建立平面坐标系,以确定多个唇部关键点对应的坐标。
需要说明的是,Dlib是一个面部关键点识别模型,拥有C++和python接口,可以进行面部关键点识别,并对面部关键点进行标注。具体地,可以获得68个面部关键点的位置,并对这68个面部关键点分别进行标注。可以理解,在对面部关键点进行标注时,对不同的面部图像采用相同的面部关键点标注方式进行标注。
举例来说,在面部图像A中左眼包括6个面部关键点,分别标注为37、38、39、40、41,而在面部图像B中左眼对应的6个面部关键点同样标注为37、38、39、40、41。也就是说,在面部图像中,可以通过标注所用的数字判断该面部关键点在面部所处的位置。
同样的,在面部图像中,嘴唇部位的面部关键点也是采用固定的数字进行标注,比如可以采用标注为49-64共16个嘴唇部位的面部关键点作为唇部关键点。
应当理解,在确定了多个唇部关键点之后,为了确定每个唇部关键点在不同帧的面部图像中的位置变化,需要建立平面坐标系,以确定多个唇部关键点对应的坐标。
需要特别说明的是,考虑到在不同的面部图像中人脸的位置会发生变化,为了避免人脸的移动对唇语的识别造成影响,一种可能的实现方式是,计算多个唇部关键点的中心位置,并将中心位置作为平面坐标系的原点。
本申请实施例所提供的平面坐标系以多个唇部关键点的中心位置作为原点,在人脸移动时,平面坐标系的原点也随之移动,可以避免人脸移动对唇语识别的影响。
在确定了平面坐标系的原点之后,可以进一步确定多个唇部关键点对应的坐标。
步骤S103,根据每帧面部图像中每个唇部关键点对应的坐标,生成多帧面部图像对应的唇语编码。
应当理解,每个唇部关键点对应的坐标是一个二维坐标,而每帧面部图像则包括多个唇部关键点。
为了生成多帧面部图像对应的唇语编码,一种可能的是实现方式是,将每帧面部图像中每个唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成多帧面部图像对应的唇语编码。
举例来说,第一帧和第二帧面部图像分别包括3个唇部关键点,在面部关键点识别模型中标注的数字分别为40,41,42,第一帧面部图像中3个唇部关键点的坐标分别为40(0,1),41(-2,-0.5),42(2,-0.5),第二帧面部图像中3个唇部关键点的坐标分别为40(0,2),41(-1,-0.5),42(1,-0.5)。第一种可能的唇语编码结果是(0,1,-2,-0.5,2,-0.5,0,2,-1,-0.5,1,-0.5),第二种可能的唇语编码结果是(0,1,0,2,-2,-0.5,-1,-0.5,2,-0.5,1,-0.5)。需要说明的是,上述例子仅仅是为了便于说明本申请实施例将坐标按照时间顺序和标注顺序进行排列的方式,不构成对本申请实施例的限制。
步骤S104,将唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
可以理解,本申请实施例将用户在面部视频中的多个唇部关键点作为用户唇部变化的标记方法,通过唇语编码记录多帧图像中多个唇部关键点的坐标。也就说是,唇部编码以数字编码的方式记录了面部视频中用户的唇部变化信息。因此,唇语识别模型可以通过输入的唇语编码识别出唇语的内容。
应当理解的是,在唇语识别模型识别出唇语的内容时,是以音素为单位进行的。在确定了多帧面部图像对应的一组音素后,再对该组音素进行解码处理,得到唇语的内容。
举例来说,面部视频对应的正确唇语内容,以音素为单位进行表示,为A1,B1,C1三个音素。通过对唇语编码进行识别,确定第一个音素为A1的概率为90%,为A2的概率为8%,为A3的概率为2%。根据概率识别出唇语内容的第一个音素为A1。
综上所述,本申请实施例所提供的唇语的识别方法,获取用户的多帧面部 图像,确定每帧面部图像中的多个唇部关键点,以及每个唇部关键点对应的坐标。根据每帧面部图像中每个唇部关键点对应的坐标,生成多帧面部图像对应的唇语编码,将唇语编码输入预设的唇语识别模型,以识别出唇语的内容。由此,实现了通过多帧面部图像中唇部关键点的坐标来生成对应的唇语编码,进而通过唇语编码识别出唇语的内容,避免了面部图像中肤色、纹理等因素对唇语识别的影响,提高了唇语识别方法的泛化能力和识别准确度。
为了更加清楚地说明本申请实施例所提供的唇语识别模型的生成方法,本申请实施例还提出了另一种唇语的识别方法,图2为本申请实施例所提供的另一种唇语的识别方法的流程示意图。如图2所示,基于图1所示的方法流程,唇语识别模型可以通过以下步骤生成:
步骤S201,获取多个唇语视频,以及每个唇语视频对应的唇语内容。
应当理解,本申请实施例所提供的多个唇语视频,是用来训练唇语识别模型的。因此,预先存储着每个唇语视频对应的唇语内容。
其中,唇语内容为唇语视频的正确识别结果。
步骤S202,根据多个唇语视频,生成每个唇语视频分别对应的唇语编码。
需要说明的是,此处生成每个唇语视频分别对应的唇语编码的方法与前述生成多帧面部图像对应的唇语编码的方法步骤相同,此处不再赘述。
步骤S203,将每个唇语视频对应的唇部编码输入唇语识别模型。
其中,唇语识别模型包括卷积神经网络和递归神经网络。
需要特别说明的是,为了让唇语识别模型的性能更优,一种优选的实现方式是,设置卷积神经网络的层数为4,卷积核大小为3,设置递归神经网络的层数为3,每层隐藏神经元数量为1024个。
应当理解,卷积神经网络的神经元能够获取到一定区域内的数据的特征。举例来说,若卷积神经元的卷积核的系数为5,该卷积核就与当前位置的数据和前后两个临近的数据进行相乘再相加,再将计算的结果放入当前数据位置。
如图3所示,递归神经网络是一个记忆网络,能够基于前一时刻的隐层和当前时刻的输入得到计算结果,显示预测概率。
步骤S204,将唇语识别模型的输出与每个唇语视频对应的唇语内容进行比较,以训练唇语识别模型的参数。
一种可能的实现方式是,使用损失函数CTC LOSS对唇语识别模型的输出与每个唇语视频对应的唇语内容进行比较,以量化二者之间的差异。
应当理解,唇语识别模型的输出是唇语识别模型对唇语编码进行处理后得到的结果,而唇语视频对应的唇语内容则是正确的识别结果,通过对二者进行比较,不断训练唇语识别模型的参数,可以使得唇语识别模型的识别准确度不断提高。
需要特别说明的是,对卷积神经网络的参数训练主要是训练卷积核的系数,以优化唇语编码的提取系数。
对递归神经网络的参数训练主要是训练每个隐藏神经元的权重参数,以提升唇语识别的准确度。
从而,实现了对唇语识别模型的参数训练和性能优化。
为了更加清楚地说明本申请实施例所提供的唇语识别模型的结构,下面进行举例说明。
如图4所示,在训练过程中,将唇语编码输入卷积神经网络,再将卷积神经网络的输出输入递归神经网络,使用损失函数CTC LOSS计算差值,以优化卷积神经网络和递归神经网络的参数。在使用过程中,将待识别的唇语编码先后输入卷积神经网络和递归神经网络,即可得到唇语的识别结果。
为了进一步提升本申请实施例所提供的唇语的识别方法的识别准确度,本申请实施例还提出了又一种唇语的识别方法,图5为本申请实施例所提出的又一种唇语的识别方法的流程示意图。如图5所示,基于图1所示的方法流程,在步骤S104,将唇语编码输入预设的唇语识别模型,以识别出唇语的内容之后,该方法还包括:
步骤S301,获取多帧面部图像对应的语音。
可以理解,本申请实施例所提供的多帧面部图像是通过用户的面部视频获得的,因此可以从中提取出对应的语音。
步骤S302,将语音输入预设的语音识别模型,以识别出语音的内容。
其中,本申请实施例所提供的语音识别模型的一种可能的生成方式包括以下步骤:
步骤S11,获取多个语音,以及每个语音对应的语音内容。
应当理解,本申请实施例所提供的多个语音是为了训练语音识别模型的。因此,预先存储着每个语音对应的语音内容。
步骤S12,生成语音的语音特征。
为了生成语音的语音特征,一种可能的实现方式是,对语音进行降噪,以 降低环境噪声对训练结果的影响,然后使用梅尔频率倒谱系数从降噪后的语音中提取语音的语音特征。
具体地,对语音进行分窗,以获得多帧语音信号。对每帧语音信号进行快速傅里叶变换,以获取语音信号的频谱。对语音信号的频谱进行梅尔滤波,以提取语音的语音特征。
一种优选的实现方式是,本申请实施例所提供的语音为8kHz的语音信号,即1秒钟有8000个采样点。本申请实施例将其中每256个采样点作为一个处理单位,对应的时间长度为32ms。也就是说,对语音进行分窗后每32ms作为一个时间窗。
需要说明的是,人类说话的声音在固定的频段内,因此通过梅尔滤波可以将语音中的其他声音进行滤除,得到干净的人类说话的声音。
最后对处理之后的干净的语音信号进行特征提取。
步骤S13,将语音特征输入语音识别模型。
其中,语音识别模型包括深度神经网络。
步骤S14,将语音识别模型的输出与每个语音对应的语音内容进行比较,以训练语音识别模型的参数。
一种可能的实现方式是,kaldi提供了一种基于深度神经网络的语音识别模型的程序包,将干净的语音信号放入kaldi提供的深度神经网络进行训练,即可在很短的时间内得到较好的语音识别模型。
步骤S303,根据唇语的内容和语音的内容,生成多帧面部图像对应的语言内容。
需要说明的是,在本申请实施例所提供的又一种唇语的识别方法中,需要结合唇语的内容和语音的内容,以综合判断语言内容的识别结果。
举例来说,面部视频对应的正确语言内容,以音素为单位进行表示,为A1,B1,C1三个音素。唇语识别模型识别的结果为第一个音素为A1的概率为90%,为A2的概率为8%,为A3的概率为2%,语音识别模型识别的结果为第一个音素为A1的概率为80%,为A2的概率为20%,为A3的概率为0%。对两个模型识别的概率进行归一化处理后,得到第一个音素为A1的概率为85%,为A2的概率为14%,为A3的概率为1%。根据综合概率识别出语言内容的第一个音素为A1。
同样的,对得到的一组音素进行解码处理,即可得到语言内容的识别结果。
从而,实现了结合唇语识别模型和语音识别模型的识别结果,进一步提升对面部视频对应的语言内容的识别准确度。
为了更加清楚地说明本申请实施例所提供的唇语的识别方法是如何结合唇语识别模型和语音识别模型的识别结果的,下面进行举例说明。
如图6所示,从唇语视频中分别提取多帧面部图像和对应的语音。从每帧面部图像中提取唇部关键点对应的坐标,根据坐标得到唇语编码,将唇语编码输入唇语识别模型,得到识别出的内容以及对应的概率。对于语音进行信号采样和信号处理得到语音特征,将语音特征输入语音识别模型,得到识别出的内容以及对应的概率。对两个模型输出的概率进行归一化处理后,得到语言内容的识别结果。
为了实现上述实施例,本申请实施例还提出一种唇语的识别装置,图7为本申请实施例所提供的一种唇语的识别装置的结构示意图。如图7所示,该装置包括:第一获取模块410,确定模块420,第一生成模块430,第一输入模块440。
第一获取模块410,用于获取用户的多帧面部图像。
确定模块420,用于确定每帧面部图像中的多个唇部关键点,以及每个唇部关键点对应的坐标。
第一生成模块430,用于根据每帧面部图像中每个唇部关键点对应的坐标,生成多帧面部图像对应的唇语编码。
第一输入模块440,用于将唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
进一步地,为了确定每帧面部图像中的唇部关键点,一种可能的实现方式是,确定模块420包括:输入子模块421,用于将每帧面部图像输入面部关键点识别模型,以识别出面部图像中的多个面部关键点。标注子模块422,用于对多个面部关键点进行标注。筛选子模块423,用于根据标注信息,从多个面部关键点中筛选出多个唇部关键点。建立子模块424,用于建立平面坐标系,以确定多个唇部关键点对应的坐标。
进一步地,为了确定每个唇部关键点在不同帧的面部图像中的位置变化,一种可能的实现方式是,建立子模块424包括:计算单元4241,用于计算多个唇部关键点的中心位置。设置单元4242,用于将中心位置作为平面坐标系的原点。确定单元4243,用于根据平面坐标系的原点,确定多个唇部关键点 对应的坐标。
进一步地,为了生成多帧面部图像对应的唇语编码,一种可能的是实现方式是,第一生成模块430,用于将每帧面部图像中每个唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成多帧面部图像对应的唇语编码。
需要说明的是,前述对唇语的识别方法实施例的解释说明也适用于该实施例的唇语的识别装置,此处不再赘述。
综上所述,本申请实施例所提供的唇语的识别装置,获取用户的多帧面部图像,确定每帧面部图像中的多个唇部关键点,以及每个唇部关键点对应的坐标。根据每帧面部图像中每个唇部关键点对应的坐标,生成多帧面部图像对应的唇语编码,将唇语编码输入预设的唇语识别模型,以识别出唇语的内容。由此,实现了通过多帧面部图像中唇部关键点的坐标来生成对应的唇语编码,进而通过唇语编码识别出唇语的内容,避免了面部图像中肤色、纹理等因素对唇语识别的影响,提高了唇语识别方法的泛化能力和识别准确度。
为了实现上述实施例,本申请实施例还提出另一种唇语的识别装置,图8为本申请实施例所提供的另一种唇语的识别装置的结构示意图。如图8所示,基于图7所示的装置结构,该装置还包括:第二获取模块510,第二生成模块520,第二输入模块530,第一比较模块540。
第二获取模块510,用于获取多个唇语视频,以及每个唇语视频对应的唇语内容。
第二生成模块520,用于根据多个唇语视频,生成每个唇语视频分别对应的唇语编码。
第二输入模块530,用于将每个唇语视频对应的唇部编码输入唇语识别模型,其中,唇语识别模型包括卷积神经网络和递归神经网络。
第一比较模块540,用于将唇语识别模型的输出与每个唇语视频对应的唇语内容进行比较,以训练唇语识别模型的参数。
进一步地,为了将唇语识别模型的输出与每个唇语视频对应的唇语内容进行比较,一种可能的实现方式是,第一比较模块540,用于使用损失函数对唇语识别模型的输出与每个唇语视频对应的唇语内容进行比较。
需要说明的是,前述对唇语的识别方法实施例的解释说明也适用于该实施例的唇语的识别装置,此处不再赘述。
从而,实现了对唇语识别模型的参数训练和性能优化。
为了实现上述实施例,本申请实施例还提出又一种唇语的识别装置,图9为本申请实施例所提供的又一种唇语的识别装置的结构示意图。如图9所示,基于图7所示的装置结构,该装置还包括:第三获取模块610,第三输入模块620,第三生成模块630。
第三获取模块610,用于获取多帧面部图像对应的语音。
第三输入模块620,用于将语音输入预设的语音识别模型,以识别出语音的内容。
第三生成模块630,用于根据唇语的内容和语音的内容,生成多帧面部图像对应的语言内容。
进一步地,为了生成语音识别模型,一种可能的实现方式是,该装置还包括:第四获取模块640,用于获取多个语音,以及每个语音对应的语音内容。第四生成模块650,用于生成语音的语音特征。第四输入模块660,用于将语音特征输入语音识别模型,其中,语音识别模型包括深度神经网络。第二比较模块670,用于将语音识别模型的输出与每个语音对应的语音内容进行比较,以训练语音识别模型的参数。
进一步地,为了生成语音的语音特征,一种可能的实现方式是,第四生成模块650包括:降噪子模块651,用于对语音进行降噪。提取子模块652,用于使用梅尔频率倒谱系数从降噪后的语音中提取语音的语音特征。
进一步地,为了提取语音的语音特征,一种可能的实现方式是,提取子模块652包括:分窗单元6521,用于对语音进行分窗,以获得多帧语音信号。变换单元6522,用于对每帧语音信号进行快速傅里叶变换,以获取语音信号的频谱。滤波单元6523,用于对语音信号的频谱进行梅尔滤波,以提取语音的语音特征。
需要说明的是,前述对唇语的识别方法实施例的解释说明也适用于该实施例的唇语的识别装置,此处不再赘述。
从而,实现了结合唇语识别模型和语音识别模型的识别结果,进一步提升对面部视频对应的语言内容的识别准确度。
为了实现上述实施例,本申请实施例还提出一种计算机设备,包括存储器、处理器以及存储在存储器中并可在处理器上运行的计算机程序,其特征在于,处理器执行计算机程序时实现如前述方法实施例的唇语的识别方法的步骤。
图10是本申请实施例提供的一种计算机设备的示意图。如图10所示, 该实施例的计算机设备50包括:处理器51、存储器52以及存储在存储器52中并可在处理器51上运行的计算机程序53,该计算机程序53被处理器51执行时实现实施例中的唇语的识别方法,为避免重复,此处不一一赘述。或者,该计算机程序被处理器51执行时实现实施例中唇语的识别装置中各模型/单元的功能,为避免重复,此处不一一赘述。
计算机设备50可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可包括,但不仅限于,处理器51、存储器52。本领域技术人员可以理解,图10仅仅是计算机设备50的示例,并不构成对计算机设备50的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如计算机设备还可以包括输入输出设备、网络接入设备、总线等。
所称处理器51可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
存储器52可以是计算机设备50的内部存储单元,例如计算机设备50的硬盘或内存。存储器52也可以是计算机设备50的外部存储设备,例如计算机设备50上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器52还可以既包括计算机设备50的内部存储单元也包括外部存储设备。存储器52用于存储计算机程序以及计算机设备所需的其他程序和数据。存储器52还可以用于暂时地存储已经输出或者将要输出的数据。
为了实现上述实施例,本申请实施例还提出一种计算机非易失性可读存储介质,计算机非易失性可读存储介质存储有计算机程序,其特征在于,计算机程序被处理器执行时实现如前述方法实施例的唇语的识别方法的步骤。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方 法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
上述以软件功能单元的形式实现的集成的单元,可以存储在一个计算机非易失性可读存储介质中。上述软件功能单元存储在一个计算机非易失性可读存储介质中,包括若干指令用以使得一台计算机装置(可以是个人计算机,服务器,或者网络装置等)或处理器(Processor)执行本申请各个实施例方法的部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上仅为本申请的较佳实施例而已,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请保护的范围之内。

Claims (20)

  1. 一种唇语的识别方法,其特征在于,所述方法包括:
    获取用户的多帧面部图像;
    确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;
    根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;以及
    将所述唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
  2. 如权利要求1所述的方法,其特征在于,所述确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标,包括:
    将每帧所述面部图像输入面部关键点识别模型,以识别出所述面部图像中的多个面部关键点,并对所述多个面部关键点进行标注;
    根据标注信息,从所述多个面部关键点中筛选出所述多个唇部关键点;
    建立平面坐标系,以确定所述多个唇部关键点对应的坐标。
  3. 如权利要求2所述的方法,其特征在于,所述建立平面坐标系,以确定所述多个唇部关键点对应的坐标,包括:
    计算所述多个唇部关键点的中心位置,并将所述中心位置作为所述平面坐标系的原点;
    根据所述平面坐标系的原点,确定所述多个唇部关键点对应的坐标。
  4. 如权利要求2或3所述的方法,其特征在于,所述根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码,包括:
    将每帧所述面部图像中每个所述唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成所述多帧面部图像对应的唇语编码。
  5. 如权利要求1所述的方法,其特征在于,所述唇语识别模型通过以下步骤生成:
    获取多个唇语视频,以及每个所述唇语视频对应的唇语内容;
    根据所述多个唇语视频,生成每个所述唇语视频分别对应的所述唇语编码;
    将每个所述唇语视频对应的所述唇部编码输入所述唇语识别模型,其中,所述唇语识别模型包括卷积神经网络和递归神经网络;
    将所述唇语识别模型的输出与所述每个所述唇语视频对应的唇语内容进行 比较,以训练所述唇语识别模型的参数。
  6. 一种唇语的识别装置,其特征在于,所述装置包括:
    第一获取模块,用于获取用户的多帧面部图像;
    确定模块,用于确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;
    第一生成模块,用于根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;
    第一输入模块,用于将所述唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
  7. 如权利要求6所述的装置,其特征在于,所述确定模块包括:
    输入子模块,用于将每帧所述面部图像输入面部关键点识别模型,以识别出所述面部图像中的多个面部关键点;
    标注子模块,用于对所述多个面部关键点进行标注;
    筛选子模块,用于根据标注信息,从所述多个面部关键点中筛选出所述多个唇部关键点;
    建立子模块,用于建立平面坐标系,以确定所述多个唇部关键点对应的坐标。
  8. 如权利要求7所述的装置,其特征在于,所述建立子模块包括:
    计算单元,用于计算所述多个唇部关键点的中心位置;
    设置单元,用于将所述中心位置作为所述平面坐标系的原点;
    确定单元,用于根据所述平面坐标系的原点,确定所述多个唇部关键点对应的坐标。
  9. 如权利要求7或8所述的装置,其特征在于,
    所述第一生成模块,用于将每帧所述面部图像中每个所述唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成所述多帧面部图像对应的唇语编码。
  10. 如权利要求6所述的装置,其特征在于,所述装置还包括:
    第二获取模块,用于获取多个唇语视频,以及每个所述唇语视频对应的唇语内容;
    第二生成模块,用于根据所述多个唇语视频,生成每个所述唇语视频分别对应的所述唇语编码;
    第二输入模块,用于将每个所述唇语视频对应的所述唇部编码输入所述唇语 识别模型,其中,所述唇语识别模型包括卷积神经网络和递归神经网络;
    第一比较模块,用于将所述唇语识别模型的输出与所述每个所述唇语视频对应的唇语内容进行比较,以训练所述唇语识别模型的参数。
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现以下步骤:
    获取用户的多帧面部图像;
    确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;
    根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;以及
    将所述唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
  12. 如权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机程序时还实现以下步骤:
    将每帧所述面部图像输入面部关键点识别模型,以识别出所述面部图像中的多个面部关键点,并对所述多个面部关键点进行标注;
    根据标注信息,从所述多个面部关键点中筛选出所述多个唇部关键点;
    建立平面坐标系,以确定所述多个唇部关键点对应的坐标。
  13. 如权利要求12所述的计算机设备,其特征在于,所述处理器执行所述计算机程序时还实现以下步骤:
    计算所述多个唇部关键点的中心位置,并将所述中心位置作为所述平面坐标系的原点;
    根据所述平面坐标系的原点,确定所述多个唇部关键点对应的坐标。
  14. 如权利要求12或13所述的计算机设备,其特征在于,所述处理器执行所述计算机程序时还实现以下步骤:
    将每帧所述面部图像中每个所述唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成所述多帧面部图像对应的唇语编码。
  15. 如权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机程序时还实现以下步骤:
    获取多个唇语视频,以及每个所述唇语视频对应的唇语内容;
    根据所述多个唇语视频,生成每个所述唇语视频分别对应的所述唇语编码;
    将每个所述唇语视频对应的所述唇部编码输入所述唇语识别模型,其中,所述唇语识别模型包括卷积神经网络和递归神经网络;
    将所述唇语识别模型的输出与所述每个所述唇语视频对应的唇语内容进行比较,以训练所述唇语识别模型的参数。
  16. 一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现以下步骤:
    获取用户的多帧面部图像;
    确定每帧所述面部图像中的多个唇部关键点,以及每个所述唇部关键点对应的坐标;
    根据每帧所述面部图像中每个所述唇部关键点对应的坐标,生成所述多帧面部图像对应的唇语编码;以及
    将所述唇语编码输入预设的唇语识别模型,以识别出唇语的内容。
  17. 如权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述计算机程序被处理器执行时还实现以下步骤:
    将每帧所述面部图像输入面部关键点识别模型,以识别出所述面部图像中的多个面部关键点,并对所述多个面部关键点进行标注;
    根据标注信息,从所述多个面部关键点中筛选出所述多个唇部关键点;
    建立平面坐标系,以确定所述多个唇部关键点对应的坐标。
  18. 如权利要求17所述的计算机非易失性可读存储介质,其特征在于,所述计算机程序被处理器执行时还实现以下步骤:
    计算所述多个唇部关键点的中心位置,并将所述中心位置作为所述平面坐标系的原点;
    根据所述平面坐标系的原点,确定所述多个唇部关键点对应的坐标。
  19. 如权利要求16或17所述的计算机非易失性可读存储介质,其特征在于,所述计算机程序被处理器执行时还实现以下步骤:
    将每帧所述面部图像中每个所述唇部关键点对应的坐标按照时间顺序和标注顺序进行排列,以生成所述多帧面部图像对应的唇语编码。
  20. 如权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述计算机程序被处理器执行时还实现以下步骤:
    获取多个唇语视频,以及每个所述唇语视频对应的唇语内容;
    根据所述多个唇语视频,生成每个所述唇语视频分别对应的所述唇语编码;
    将每个所述唇语视频对应的所述唇部编码输入所述唇语识别模型,其中,所述唇语识别模型包括卷积神经网络和递归神经网络;
    将所述唇语识别模型的输出与所述每个所述唇语视频对应的唇语内容进行比较,以训练所述唇语识别模型的参数。
PCT/CN2019/117773 2019-06-18 2019-11-13 唇语的识别方法及其装置 WO2020253051A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910524060.6 2019-06-18
CN201910524060.6A CN110415701A (zh) 2019-06-18 2019-06-18 唇语的识别方法及其装置

Publications (1)

Publication Number Publication Date
WO2020253051A1 true WO2020253051A1 (zh) 2020-12-24

Family

ID=68359220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117773 WO2020253051A1 (zh) 2019-06-18 2019-11-13 唇语的识别方法及其装置

Country Status (2)

Country Link
CN (1) CN110415701A (zh)
WO (1) WO2020253051A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011245A (zh) * 2021-01-28 2021-06-22 南京大学 基于超声波感知与知识蒸馏的唇语识别系统及方法
CN113096223A (zh) * 2021-04-25 2021-07-09 北京大米科技有限公司 图像生成方法、存储介质和电子设备
CN113179449A (zh) * 2021-04-22 2021-07-27 清华珠三角研究院 语音及动作驱动图像的方法、系统、装置及存储介质
CN116189271A (zh) * 2023-04-20 2023-05-30 深圳曼瑞德科技有限公司 基于智能手表识别唇语的数据处理方法及系统
WO2024001539A1 (zh) * 2022-06-30 2024-01-04 上海商汤智能科技有限公司 说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415701A (zh) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 唇语的识别方法及其装置
CN111178157A (zh) * 2019-12-10 2020-05-19 浙江大学 一种基于音调的级联序列到序列模型的中文唇语识别方法
CN114500900A (zh) * 2022-02-24 2022-05-13 北京云迹科技股份有限公司 丢失对象的寻找方法及装置
CN114821794A (zh) * 2022-04-29 2022-07-29 上海商汤临港智能科技有限公司 图像处理方法及模型生成方法、装置、车辆、存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637071A (zh) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 应用于多媒体输入设备的多媒体输入方法
CN105139503A (zh) * 2015-10-12 2015-12-09 北京航空航天大学 一种唇动口型识别门禁系统及识别方法
US20160148616A1 (en) * 2014-11-26 2016-05-26 Panasonic Intellectual Property Corporation Of America Method and apparatus for recognizing speech by lip reading
CN107992812A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 一种唇语识别方法及装置
CN108763897A (zh) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 身份合法性的校验方法、终端设备及介质
CN109409195A (zh) * 2018-08-30 2019-03-01 华侨大学 一种基于神经网络的唇语识别方法及系统
CN110415701A (zh) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 唇语的识别方法及其装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5917270B2 (ja) * 2011-05-27 2016-05-11 キヤノン株式会社 音検出装置及びその制御方法、プログラム
CN202110564U (zh) * 2011-06-24 2012-01-11 华南理工大学 结合视频通道的智能家居语音控制系统
CN104834900B (zh) * 2015-04-15 2017-12-19 常州飞寻视讯信息科技有限公司 一种联合声像信号进行活体检测的方法和系统
CN106157957A (zh) * 2015-04-28 2016-11-23 中兴通讯股份有限公司 语音识别方法、装置及用户设备
CN108491808B (zh) * 2018-03-28 2021-11-23 百度在线网络技术(北京)有限公司 用于获取信息的方法及装置
CN109389085B (zh) * 2018-10-09 2020-09-25 清华大学 基于参数化曲线的唇语识别模型训练方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637071A (zh) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 应用于多媒体输入设备的多媒体输入方法
US20160148616A1 (en) * 2014-11-26 2016-05-26 Panasonic Intellectual Property Corporation Of America Method and apparatus for recognizing speech by lip reading
CN105139503A (zh) * 2015-10-12 2015-12-09 北京航空航天大学 一种唇动口型识别门禁系统及识别方法
CN107992812A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 一种唇语识别方法及装置
CN108763897A (zh) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 身份合法性的校验方法、终端设备及介质
CN109409195A (zh) * 2018-08-30 2019-03-01 华侨大学 一种基于神经网络的唇语识别方法及系统
CN110415701A (zh) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 唇语的识别方法及其装置

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011245A (zh) * 2021-01-28 2021-06-22 南京大学 基于超声波感知与知识蒸馏的唇语识别系统及方法
CN113011245B (zh) * 2021-01-28 2023-12-12 南京大学 基于超声波感知与知识蒸馏的唇语识别系统及方法
CN113179449A (zh) * 2021-04-22 2021-07-27 清华珠三角研究院 语音及动作驱动图像的方法、系统、装置及存储介质
CN113179449B (zh) * 2021-04-22 2022-04-12 清华珠三角研究院 语音及动作驱动图像的方法、系统、装置及存储介质
CN113096223A (zh) * 2021-04-25 2021-07-09 北京大米科技有限公司 图像生成方法、存储介质和电子设备
WO2024001539A1 (zh) * 2022-06-30 2024-01-04 上海商汤智能科技有限公司 说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品
CN116189271A (zh) * 2023-04-20 2023-05-30 深圳曼瑞德科技有限公司 基于智能手表识别唇语的数据处理方法及系统
CN116189271B (zh) * 2023-04-20 2023-07-14 深圳曼瑞德科技有限公司 基于智能手表识别唇语的数据处理方法及系统

Also Published As

Publication number Publication date
CN110415701A (zh) 2019-11-05

Similar Documents

Publication Publication Date Title
WO2020253051A1 (zh) 唇语的识别方法及其装置
WO2021082941A1 (zh) 视频人物识别方法、装置、存储介质与电子设备
US11158324B2 (en) Speaker separation model training method, two-speaker separation method and computing device
WO2021208287A1 (zh) 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质
US9672829B2 (en) Extracting and displaying key points of a video conference
WO2019223457A1 (zh) 混合语音识别方法、装置及计算机可读存储介质
US20200211550A1 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
US20150325240A1 (en) Method and system for speech input
CN108804453B (zh) 一种视音频识别方法及装置
CN111524527B (zh) 话者分离方法、装置、电子设备和存储介质
CN109360572B (zh) 通话分离方法、装置、计算机设备及存储介质
CN110767218A (zh) 端到端语音识别方法、系统、装置及其存储介质
CN114422825A (zh) 音视频同步方法、装置、介质、设备及程序产品
US11495210B2 (en) Acoustic based speech analysis using deep learning models
CN112786052A (zh) 语音识别方法、电子设备和存储装置
Jachimski et al. A comparative study of English viseme recognition methods and algorithms
CN112507311A (zh) 一种基于多模态特征融合的高安全性身份验证方法
CN112017633B (zh) 语音识别方法、装置、存储介质及电子设备
CN111326139A (zh) 一种语种识别方法、装置、设备及存储介质
WO2021196390A1 (zh) 声纹数据生成方法、装置、计算机装置及存储介质
Xia et al. Audiovisual speech recognition: A review and forecast
CN111554279A (zh) 一种基于Kinect的多模态人机交互系统
Atila et al. Turkish lip-reading using Bi-LSTM and deep learning models
CN114581812B (zh) 视觉语言识别方法、装置、电子设备及存储介质
CN115132201A (zh) 唇语识别方法、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933962

Country of ref document: EP

Kind code of ref document: A1