WO2023208134A1

WO2023208134A1 - Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product

Info

Publication number: WO2023208134A1
Application number: PCT/CN2023/091298
Authority: WO
Inventors: 康硕; 李潇婕; 王飞; 钱晨
Original assignee: 上海商汤智能科技有限公司
Priority date: 2022-04-29
Filing date: 2023-04-27
Publication date: 2023-11-02
Also published as: CN114821794A

Abstract

Disclosed in embodiments of the present disclosure are an image processing method and apparatus, a model generation method and apparatus, a vehicle, a storage medium, and a computer program product. The image processing method comprises: obtaining an image frame sequence comprising a mouth object; extracting mouth key point features from each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features according to the mouth key point features of the plurality of image frames in the image frame sequence, wherein the syllable classification features each represent a syllable category corresponding to a mouth shape of the mouth object in the image frame sequence; and determining, in a preset keyword library, a keyword matched with the syllable classification features.

Description

Image processing method and model generation method, device, vehicle, storage medium and computer program product

Cross-references to related applications

This disclosed embodiment is based on a Chinese patent application with application number 202210476318.1, application date is April 29, 2022, and the application name is "Image processing method and model generation method, device, vehicle, storage medium", and claims the Chinese patent Priority of the application, the entire content of this Chinese patent application is hereby incorporated by reference into this disclosure.

Technical field

The present disclosure relates to but is not limited to the field of information technology, and in particular, to an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.

Background technique

Lip recognition technology can use computer vision technology to identify faces from video images, extract the changing features of the mouth area of the face, and thereby identify the text content corresponding to the video.

Contents of the invention

In view of this, embodiments of the present disclosure provide at least an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.

The technical solution of the embodiment of the present disclosure is implemented as follows:

An embodiment of the present disclosure provides an image processing method. The method includes: acquiring an image frame sequence containing a mouth object; extracting mouth key point features for each image frame in the image frame sequence to obtain each of the image frame sequences. Mouth key point features of an image frame; generate syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the mouth in the image frame sequence The syllable category corresponding to the mouth shape of the object is determined; the keyword matching the syllable classification feature is determined in the preset keyword database.

Embodiments of the present disclosure also provide a method for generating a lip recognition model. The method includes: obtaining a sample image frame sequence containing a mouth object; wherein the sample image frame sequence is marked with a keyword tag; Each sample image frame in the image frame sequence performs mouth key point feature extraction to obtain the mouth key point features of each sample image frame; using the model to be trained, according to multiple samples in the sample image frame sequence The mouth key point features of the image frame generate syllable classification features, and determine keywords matching the syllable classification features in the preset keyword library; wherein the syllable classification features represent the sample image frame sequence The syllable category corresponding to the mouth shape of the middle mouth object; based on the determined keywords and the keyword tags, update the network parameters of the model at least once to obtain a trained lip recognition model.

An embodiment of the present disclosure also provides an image processing device, which includes:

a first acquisition part configured to acquire a sequence of image frames containing the mouth object;

The first recognition part is configured to extract mouth key point features for each image frame in the image frame sequence, and obtain the mouth key point features of each image frame;

The first determining part is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification feature represents the mouth object in the image frame sequence The syllable category corresponding to the mouth shape;

The first matching part is configured to determine keywords matching the syllable classification features in the preset keyword library.

An embodiment of the present disclosure also provides a device for generating a lip recognition model. The device includes:

The second acquisition part is configured to acquire a sequence of sample image frames containing the mouth object; wherein the sequence of sample image frames is annotated with a keyword tag;

The second recognition part is configured to extract mouth key point features for each sample image frame in the sample image frame sequence, and obtain the mouth key point features of each sample image frame;

The second matching part is configured to use the model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine them in the preset keyword library Keywords matching the syllable classification feature; wherein the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;

The update part is configured to perform at least on the network parameters of the model based on the determined keywords and the keyword tags. Once updated, a trained lip recognition model is obtained.

An embodiment of the present disclosure also provides a computer device, including a memory and a processor. The memory stores a computer program that can be run on the processor. When the processor executes the program, some or all of the steps in the above method are implemented. .

An embodiment of the present disclosure also provides a vehicle, including:

a vehicle-mounted camera configured to capture a sequence of image frames containing a mouth object;

A vehicle machine, connected to the vehicle-mounted camera, is configured to obtain an image frame sequence containing a mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the Describe the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the image frame sequence The syllable category corresponding to the mouth shape of the middle mouth object; determine the keyword matching the syllable classification feature in the preset keyword library.

Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, some or all of the steps in the above method are implemented.

Embodiments of the present disclosure also provide a computer program, including computer readable code. When the computer readable code is run in a computer device, the processor in the computer device executes for implementing part or all of the above method. step.

Embodiments of the present disclosure also provide a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, part of the above method is implemented. or all steps.

In the embodiment of the present disclosure, first, an image frame sequence whose image content includes a mouth object is obtained. In this way, an image frame sequence that records the change process of the mouth object when the set object speaks can be obtained; secondly, each image frame sequence in the image frame sequence is obtained. Extract key point features of the mouth in one image frame to obtain the key point features of the mouth in each image frame of multiple image frames in the image frame sequence. Compared with the mouth region image sequence obtained by cropping the face image for lip language Recognition, using key point features of the mouth for lip language recognition can reduce the amount of calculation required in the image processing process, thereby reducing the hardware requirements for computer equipment that performs the image processing method; and, when using key point features of the mouth for lip language recognition , involving the extraction of mouth key point features, therefore, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, thereby improving the generalization ability of lip language recognition; again, according to the image frame sequence The mouth key point features in multiple image frames generate syllable classification features. The syllable classification features represent the syllable categories corresponding to the mouth shapes of the mouth objects in the image frame sequence. In this way, the syllable classification features are extracted from the mouth key point features. The syllable classification features Classification features can represent at least one syllable corresponding to the mouth shape of the mouth object in the image frame sequence. Using syllable classification features to assist lip recognition can improve the accuracy of lip recognition; finally, based on the syllable classification features, the preset The matching keywords are determined by matching in the keyword database. In this way, by representing the syllable classification features corresponding to the image frame sequence, the keywords corresponding to the syllables are determined according to the syllable classification features represented by the syllable classification features, thereby improving the key words obtained by image processing. word accuracy.

In the above scheme, the mouth key point features are obtained by extracting the mouth key point features of the image frames in the image frame sequence, and the mouth key point features are used to generate syllable classification features corresponding to the image frame sequence. Based on the syllable classification features, the preset key points are Keywords are obtained by matching in the lexicon. In this way, the amount of calculation required for the image processing process of lip recognition can be reduced, thereby reducing the hardware requirements for computer equipment; at the same time, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information. This improves the generalization ability of lip language recognition; in addition, by expressing the syllable classification features corresponding to the image frame sequence, and determining the keywords of the words corresponding to the syllables based on the syllable categories represented by the syllable classification features, the keywords obtained by image processing can be More accurate, thereby improving the accuracy of lip recognition.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the technical solution of the present disclosure.

Description of the drawings

The accompanying drawings herein are incorporated into and constitute a part of this specification. They illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the technical solutions of the disclosure.

Figure 1 is a schematic flow chart of an implementation of an image processing method provided by an embodiment of the present disclosure;

Figure 2 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure;

Figure 3 is a schematic diagram of facial key points provided by an embodiment of the present disclosure;

Figure 4 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure;

Figure 5 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure;

Figure 6 is a schematic flowchart of the implementation of a method for generating a lip language recognition model provided by an embodiment of the present disclosure;

Figure 7 is a schematic structural diagram of a lip language recognition model provided by an embodiment of the present disclosure;

Figure 8 is a schematic structural diagram of an image processing device provided by an embodiment of the present disclosure;

Figure 9 is a schematic structural diagram of a device for generating a lip language recognition model provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure are further elaborated below in conjunction with the accompanying drawings and examples. The described embodiments should not be regarded as limiting the present disclosure. Those of ordinary skill in the art will All other embodiments obtained without creative efforts belong to the scope of protection of this disclosure.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or a different subset of all possible embodiments, and Can be combined with each other without conflict.

The terms "first/second/third" involved are only used to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first/second/third" can be used interchangeably if permitted. The specific order or sequence may be changed so that the embodiments of the disclosure described herein can be implemented in an order other than that illustrated or described herein.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the disclosure only and is not intended to be limiting of the disclosure.

In scenes where the environmental noise is too loud or it is inconvenient to speak, lip recognition can make up for the limitations of speech recognition, thereby enhancing the robustness of human-computer interaction.

In the image processing process of lip language recognition in related technologies, first, find the position corresponding to the face in the image through face detection, then cut out the mouth area in the image to obtain an image sequence of the mouth area image, and finally , input the image sequence into a three-dimensional convolutional neural network (3D convolutional neural network) for feature extraction, and input the features into the time series prediction network for classification. However, since the image sequence of the mouth area image is not sensitive to mouth movement information, the accuracy of lip recognition is not high, and three-dimensional convolution consumes a lot of computing resources and has high hardware requirements, resulting in the above-mentioned lip recognition. The method is difficult to apply on a large scale.

Embodiments of the present disclosure provide an image processing method, which can be executed by a processor of a computer device. Among them, computer equipment can refer to cars, servers, laptops, tablets, desktop computers, smart TVs, set-top boxes, mobile devices (such as mobile phones, portable video players, personal digital assistants, dedicated messaging devices, portable gaming devices ) and other equipment with data processing capabilities.

Below, the technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present disclosure.

Figure 1 is a schematic flow chart of an image processing method provided by an embodiment of the present disclosure. As shown in Figure 1, the method includes the following steps S101 to S104:

Step S101: Obtain an image frame sequence containing a mouth object.

Here, the computer device acquires multiple image frames. The multiple image frames can be captured by a collection component such as a camera to capture the set object during the speaking process. The multiple image frames are sorted according to the time parameters corresponding to each image frame, and we obtain Raw image frame sequence. Wherein, the multiple image frames in the image frame sequence at least include the mouth object of the same setting object. The subjects are usually humans, but can also be other expressive animals, such as orangutans. In some embodiments, the image frame sequence at least covers the entire process of the setting object saying a sentence. For example, multiple image frames in the image frame sequence at least cover the entire process of the setting object saying "turn on the music." Moreover, the number of image frames included in the image frame sequence may not be fixed. For example, the number of frames in the image frame sequence may be 40 frames, 50 frames, or 100 frames. Here, the original image frame sequence can be directly used as the image frame sequence used for subsequent image processing; the original image sequence can also be further processed to obtain the image frame sequence used for subsequent image processing. For example, the original image sequence can be interpolated to obtain the settings. Frame number of image frame sequence. Therefore, the image frames in the image frame sequence in various embodiments of the present disclosure may be actually collected using the acquisition component, or may be generated based on the actual collected image frames.

In some embodiments, the computer device can obtain multiple image frames by calling a camera, or it can obtain them from other computer devices; for example, the computer device is a vehicle, and the vehicle can obtain it through a vehicle-mounted camera. To obtain images, you can also use wireless transmission with the mobile terminal to obtain images collected by the mobile terminal. It should be noted that at least one image frame in the image frame sequence can be derived from a video, where one video can include multiple video frames, each video frame corresponds to an image frame, and the image frames in the image frame sequence can be continuous. Video frames can also be discontinuous video frames selected from multiple video frames at fixed or non-fixed time intervals. During implementation, multiple image frames collected in advance can be obtained, or multiple image frames can be obtained by collecting images of the set object in real time, which is not limited here.

In this way, an image frame sequence can be obtained that records the change process of the mouth object when the set object speaks.

Step S102: Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame.

Here, when performing mouth key feature extraction on at least one image frame in the image frame sequence, the position information of the mouth key point associated with the mouth object is extracted from the facial key points of the image frame, and based on at least one image frame The position information of the mouth key points is determined to determine a mouth key point feature corresponding to each image frame, thereby obtaining at least one mouth key point feature of the image frame sequence. Among them, the mouth key point features are calculated from the position information of the mouth key points, and the position information of the mouth key points is related to the mouth shape of the mouth object contained in the image frame, that is, the same mouth key point in different The position information in the image frame is related to the mouth shape of the mouth object in this image frame.

In some embodiments, the method of determining the characteristics of the mouth key points corresponding to the image frame is based on the position information of the mouth key points in the image frame. The formula can be to sort the position information of each mouth key point in an image frame according to the key point number corresponding to each mouth key point, to obtain a position sequence, and then use the position sequence as the mouth key point feature. For example, each image frame includes 4 mouth key points, and the coordinates of the mouth key points are (x ₁ , y ₁ ), (x ₂ , y ₂ ), (x ₃ , y ₃ ), (x ₄ , y ₄ ), the determined key point features of the mouth corresponding to the image frame are [(x ₁ , y ₁ ), (x ₂ , y ₂ ), (x ₃ , y ₃ ), (x ₄ , y ₄ ) ]. Among them, the key point serial number corresponding to the mouth key point is the number corresponding to the mouth key point among the numbers preset for the facial key points. For example, in the schematic diagram of the facial key points shown in Figure 3, 106 is preset. key points, and each key point is numbered. The key points are numbered from 0 to 105. The key points No. 84-103 are the mouth key points used to describe the mouth.

In some embodiments, when the image frame sequence includes two image frames, or more than two image frames, the characteristics of the mouth key points corresponding to the image frames are determined based on the position information of the mouth key points in the image frames. The method can be to calculate the difference information of the position information of the mouth key points in each image frame and the adjacent frame of the image frame, and calculate each mouth key in an image frame according to the corresponding key point serial number. The difference information of the points is sorted, and the sorted sequence is used as the mouth key point feature corresponding to the image frame; where the adjacent frame can be the previous image frame and/or the subsequent image frame of the image frame in the image frame sequence. , that is to say, the difference information of the position information includes at least one of the following: difference information between this image frame and the previous image frame; difference information between this image frame and the next image frame. For example, when determining the mouth key point characteristics corresponding to the image frame based on the difference information between the image frame and the previous image frame, each image frame includes 4 mouth key points, and the mouth key point is in the first The coordinates in the image frame are (x ₁ , y ₁ ), (x ₂ , y ₂ ), (x ₃ , y ₃ ), (x ₄ , y ₄ ), and the mouth key point is in the second image frame The coordinates are (x' ₁ , y' ₁ ), (x' ₂ , y' ₂ ), (x' ₃ , y' ₃ ), (x' ₄ , y' ₄ ) respectively. In this way, the determined The mouth key point features corresponding to the two image frames are [(x' ₁ -x ₁ , y' ₁ -y ₁ ), (x' ₂ -x ₂ , y' ₂ -y ₂ ), (x' ₃ - x ₃ , y' ₃ -y ₃ ), (x' ₄ -x ₄ , y' ₄ -y ₄ )].

In this way, compared with using mouth region image sequences for lip recognition, using key point features of the mouth for lip recognition can reduce the amount of calculations required in the image processing process, thereby reducing the hardware requirements for computer equipment that performs the image processing method. requirements, thereby making the image processing method universally applicable to various computer devices. Moreover, lip recognition using key point features of the mouth involves the extraction of key point features of the mouth. Therefore, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, improving lip recognition. generalization ability and accuracy.

Step S103: Generate syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence.

Here, feature extraction can be performed on the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features, where the syllable classification features represent at least one preset syllable category corresponding to the image frame sequence, and each preset The syllable category represents at least one syllable with the same or similar mouth shape, that is, the syllable classification feature may represent a syllable category corresponding to the mouth shape of the mouth object in the image frame sequence. Each element in the syllable classification feature can be used to indicate whether there is a syllable type in the image frame sequence, thereby determining at least one syllable corresponding to the mouth shape contained in the image in the image frame sequence. Here, the syllable types can be divided into a set number of preset syllable categories in advance according to the similarity of the mouth shapes. Each preset syllable category includes at least one syllable type with the same or similar mouth shape. The set number can be based on the language. The type is set; among them, the degree of mouth shape similarity can be determined manually based on experience or through machine learning. Taking Chinese as an example, without considering tones, Chinese characters have a total of 419 syllable types. These 419 types of syllables can be divided into 100 categories according to the corresponding mouth shapes, and the length of the corresponding syllable classification feature is 100; for other Languages, such as English, can combine phonetic symbols to divide syllable types into a set number of preset syllable categories, and set the length of syllable classification features based on the correspondence between syllables and mouth shapes.

In some embodiments, spatio-temporal features corresponding to each mouth key point feature can be obtained by performing spatio-temporal feature extraction on at least two mouth key point features of the image frame sequence, and syllable classification features can be determined based on the spatio-temporal features. Here, the temporal prediction network and/or the fully convolutional network can be used to extract spatiotemporal features to obtain the spatiotemporal features corresponding to each mouth key point feature. In some implementations, a flatten layer or other methods can be used to splice at least two spatio-temporal features, and then the spliced spatio-temporal features can be classified to obtain syllable classification features.

In this way, syllable classification features are extracted from the mouth key point features. The syllable classification features can represent at least one syllable corresponding to the mouth shape of the mouth object in the image frame sequence. Then, the syllable classification features are used to assist lip language recognition, which can Improve the accuracy of lip recognition.

Step S104: Determine keywords matching the syllable classification features in the preset keyword database.

In some embodiments, a certain number of keywords are preset in the keyword library, and each keyword can be matched with a specific syllable classification feature, so that lip language recognition can be obtained based on the matching results of the keywords and the syllable classification features. Image processing results. Among them, after the keyword is determined, the keyword can be output directly, or the sequence number of the keyword in the keyword database can be output.

In some implementations, the preset keywords in the preset keyword library can be set according to specific application scenarios. For example, in a driving scenario, the preset keywords can be set to "turn on the audio", "turn on the left side". car windows" etc. It should be noted that the preset keyword library represents the storage form of keywords.

In some implementations, the matching keywords can be determined by combining the detection results obtained by speech detection and the recognition results obtained by lip recognition; for example, the weights of the detection results of speech detection and the recognition results of lip recognition can be set separately, and the weighted The calculation results are used as the basis for matching. Among them, speaking detection may include but is not limited to whether the mouth object is in a speaking state, the speaking interval in the speaking state, etc. The process of conducting at least one test.

In this way, by determining the syllable classification features corresponding to the image frame sequence, and determining the keywords of the words corresponding to the syllables based on the syllable categories represented by the syllable classification features, the accuracy of the keywords obtained by image processing is improved.

In the embodiment of the present disclosure, the mouth key point features are obtained by extracting the mouth key point features of the image frames in the image frame sequence, and the mouth key point features are used to generate syllable classification features corresponding to the image frame sequence. According to the syllable classification features, Keywords are obtained by matching in the preset keyword library. In this way, lip recognition results are obtained by extracting features from two-dimensional image frames, which can reduce the amount of calculation required for image processing of lip recognition and reduce the hardware requirements for computer equipment; at the same time, the appearance of different face shapes, textures, etc. Information facial images can achieve good recognition results, thereby improving the generalization ability of lip language recognition; in addition, by representing the syllable classification features corresponding to the image frame sequence, the syllable-corresponding word is determined based on the syllable category represented by the syllable classification features. Keywords of words can make the keywords obtained by image processing more accurate, thereby improving the accuracy of lip recognition.

In some implementations, the speaking interval of the set object in the video is detected through lip movement recognition processing, and an image frame sequence covering the speaking process of the set object is obtained. That is, the above step S101 can be implemented through the following steps S1011 and S1012:

Step S1011: Obtain a video in which the image frame includes the mouth object.

Here, the computer device captures the set object through a collection component such as a camera, and obtains a video in which the image frame includes the mouth object.

Step S1012: Perform lip movement recognition on the mouth object, and determine multiple video frames in which the mouth object is in a speaking state as an image frame sequence.

Here, first, lip motion recognition technology is used to crop the video to obtain a video recording the speaking process of the set object. The image of the video contains the mouth object in a speaking state; then, multiple video frames are selected from the cropped video. Image as a sequence of image frames.

In the above scheme, the image frame sequence can at least cover the complete process of the set object speaking, and the video is cropped through lip movement recognition technology, which can reduce the image frames in the image frame sequence that are not related to the speaking process. For the images obtained through this scheme Perform image processing on the frame sequence and obtain keywords matching the image sequence, which can further improve the accuracy of lip recognition and reduce the amount of calculation required for the image processing process of lip recognition.

As mentioned above, the number of image frames included in the image frame sequence used for image processing may not be fixed. In some implementations, frame interpolation processing can be performed on the original image sequence collected to obtain an image frame sequence including a preset number of image frames.

In some implementations, performing frame interpolation processing on the acquired original image sequence may include the following step S1013 or step S1014:

Step S1013: Perform image frame interpolation on the acquired original image sequence including the mouth object to obtain the image frame sequence.

The method of performing frame interpolation processing on the acquired original image sequence to obtain an image frame sequence including a preset number of image frames may be to perform image interpolation processing based on the image frames in the original image sequence to generate a preset number of image frames. , based on the generated image frames and/or collected image frames, an image frame sequence for subsequent mouth key point feature extraction is obtained.

Step S1014: Based on the obtained mouth key points in the original image sequence containing the mouth object, interpolate frames on the original image sequence to obtain the image frame sequence.

The method of performing frame interpolation processing on the collected original image sequence to obtain an image frame sequence including a preset number of image frames may be to generate newly inserted image frames based on the position information of the mouth key points in the original image sequence, Among them, the position information of the mouth key points in the newly inserted image frame is predicted based on the position information of the mouth key points in the original image sequence, thereby realizing the interpolation of the original image sequence and obtaining the preset corresponding to the image frame sequence. Amount of key point information to achieve subsequent mouth key point feature extraction.

Among them, the number of image frames can be preset based on experience. The larger the preset number of frames, the higher the accuracy of recognition, but the greater the computing resources consumed, which affects the hardware computing efficiency; in some implementations, the accuracy is comprehensively considered , hardware computing efficiency and the number of keywords, etc. In practical applications, the default frame number can be set to 60.

In this way, the image frame sequence after frame interpolation is used for lip recognition, and there is no requirement on the number of frames of the original image sequence collected, which can improve the robustness of the image recognition method for lip recognition.

In some implementations, the position information of the mouth key points in each image frame and adjacent frames is used to determine the mouth key point characteristics of the image frame. That is, the above step S102 can be implemented by the steps shown in Figure 2 .

Figure 2 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 2:

Step S201: Determine the position information of at least two mouth key points of the mouth object in each image frame.

The image frame sequence includes at least two image frames, and position information of mouth key points associated with the mouth object in each image frame is extracted. Among them, the number of mouth key points is at least two, and they are distributed at least on the upper and lower lips in the image. The number and distribution location of mouth key points are usually related to the key point identification algorithm. For example, in the 68-point key point detection algorithm, the number of mouth key points is 16. The position information of each mouth key point can be represented by a position parameter, for example, it can be represented by two-dimensional coordinates in the image coordinate system, which include width coordinates (abscissa) and height coordinates (ordinate). Here, the position information of the mouth key points is also related to the mouth shape of the mouth object in the image. The position information of the same mouth key point in different images changes as the mouth shape changes. Take the 106 facial key point diagram shown in Figure 3 as an example. The diagram includes a total of 106 key points numbered 0-105, which can describe the facial contour, eyebrows, eyes, nose, Mouth and other features, among which key points 84-103 are mouth key points used to describe the mouth. Here, the positions of key point No. 93 in the two frames of images corresponding to different speech contents are different. For example, when the ordinate of key point No. 93 is smaller in the image, it means that the mouth is opened to a greater degree. At this time, in Among the syllables "ah" and "oh", the possibility of corresponding to "ah" is higher.

Step S202: For each image frame in the image frame sequence, determine the mouth key corresponding to the image frame based on the position information of the mouth key point in the image frame and adjacent frames of the image frame. point features.

For each first image frame in the sequence of image frames, the position information of the mouth key points in at least two image frames including the first image frame may be used to calculate the mouth key points in the first image frame. Features, mouth key point features may include inter-frame difference information and/or intra-frame difference information. The first image frame may be any image frame in the image frame sequence. The inter-frame difference information can represent the difference information of the position information of the same mouth key point in different image frames, and the intra-frame difference information can represent the difference information between the position information of different mouth key points in the same image frame. Here, the position information of each mouth key point in the first image frame and the position information of the mouth key point in adjacent frames of the first image frame are used to calculate the position information of the mouth key point in different image frames. Inter-frame difference information; and/or, using the position information of at least two mouth key points including the mouth key point in the first image frame, calculate the frame of the mouth key point in the first image frame internal difference information.

Compared with using mouth region image sequences for lip recognition, embodiments of the present disclosure use the position information of multiple mouth key points in multiple image frames to obtain mouth key point features, so that the mouth key point features can represent images. The frame sequence corresponds to the changing process of the mouth key points during the speaking process, so as to better extract the mouth shape change characteristics of the set object during the speaking process; in this way, using the mouth key point features for lip language recognition can improve lip language Recognition accuracy.

In some implementations, the difference in position information of the mouth key points in adjacent frames and the difference in position information of the preset mouth key point pairs in the same image frame are used to determine the characteristics of the mouth key points, that is, the above step S202 This can be achieved through the following steps S2021 and S2022:

Step S2021, for each mouth key point, according to the position information of the mouth key point in the image frame, and the position of the mouth key point in adjacent image frames of the image frame Information, determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point.

In some embodiments, when calculating the mouth key point features corresponding to each first image frame, for each mouth key point, according to the position information of the mouth key point in the first image frame, and the mouth The position information of the mouth key point in each second image frame of at least one second image frame is calculated, and the difference information of the position information of the mouth key point in the first image frame and each second image frame is calculated. Wherein, the second image frame is an image frame adjacent to the first image frame, that is, an adjacent frame of the first image frame; the difference information may be the first height difference, the first width difference, or the first The combination of the height difference and the first width difference; the first width difference is the width difference of the mouth key point in the two image frames (the first image frame and the second image frame) (that is, the mouth key point is in the two image frames). The first height difference is the height difference of the mouth key point in the two image frames (that is, the difference in the ordinate of the mouth key point in the two image frames). ). In some implementations, when calculating the difference, the difference can be set to the position information of the subsequent image frame minus the position information of the previous image frame, or can be set to the position information of the previous image frame minus the subsequent image. Frame position information. In this way, for each mouth key point, using each second image frame of the first image frame and at least one second image frame, the same number of difference information as the second image frame can be obtained, and these difference information are determined is the inter-frame difference information of the mouth key point in the first image frame.

For example, the coordinates of a mouth key point in three consecutive image frames are (x ₁ , y ₁ ), (x' ₁ , y' ₁ ), (x" ₁ , y" ₁ ), and the second The first image frame is the first image frame, and the first image frame and the third image frame before and after are the second image frame. Calculate the first height difference and the first width difference to obtain the key point of the mouth in the first image frame. The inter-frame difference information in is (x' ₁ -x ₁ , y' ₁ -y ₁ ,x" ₁ -x' ₁ , y" ₁ -y' ₁ ).

Step S2022: For each mouth key point, determine the second height difference and/or the second width between the mouth key point in the image frame and other mouth key points of the same mouth object. Difference, determine the intra-frame difference information of the mouth key point.

In some embodiments, when determining the mouth key point characteristics corresponding to each first image frame, for each mouth key point, calculate the relationship between the mouth key point and other mouth key points of the same mouth object. the second height difference and/or the second width difference, and determine the second height difference and/or the second width difference as each mouth key point in the corresponding preset mouth key point pair in the first image frame Intra-frame difference information in . Among them, other mouth key points can be fixed mouth key points, such as the mouth key points corresponding to the lip beads, such as key point No. 98 shown in Figure 3; they can also be set to satisfy the settings of each mouth key point. Positional relationship between the key points of the mouth. Here, the two mouth key points are used as a preset mouth key point pair. Moreover, when setting the preset mouth key point pair, the position information of the mouth key point in the image can be considered. That is to say, the relationship between the two mouth key points belonging to the same preset mouth key point pair satisfies the set Determine the positional relationship; for example, determine the two mouth key points located on the upper and lower lips of the mouth object as a mouth key point pair; you can also determine the two mouth key points whose width difference information in the image is less than the preset value The key points are determined as preset mouth key point pairs. In this way, the mouth shape of the mouth object in the first image frame can be better represented by using the second height difference of the preset mouth key point pair.

In some implementations, one mouth key point can form a preset mouth key point pair with two or more mouth key points respectively. That is to say, each mouth key point can belong to multiple mouth key points. right. At this time, the second height difference of each mouth key point pair to which the mouth key point belongs is determined respectively, and at least two second height differences are used to perform a weighted sum to determine the position of the mouth key point in the first image frame. middle intra-frame difference information. Taking the schematic diagram of 106 facial key points shown in Figure 3 as an example, key point No. 86 can form a preset mouth key point pair with key point No. 103 and key point No. 94 respectively. That is to say, key point No. 86 belongs to The two key points of the mouth are correct. When calculating the intra-frame difference information of key point No. 86, first, calculate the second height difference of each mouth key point pair where key point No. 86 is located; then, perform a weighted sum of the two second height differences, Determine the intra-frame difference information of key point No. 86 in the first image frame. In this way, placing a mouth key point in the center of at least two mouth key points and calculating the intra-frame difference information of the mouth key point can improve the deviation of the mouth key point feature calculation caused by the recognition error of one key point. , Lip recognition based on such key point features of the mouth can improve the accuracy of lip recognition.

In some implementations, through steps S2021 and S2022, inter-frame difference information and intra-frame difference information of a mouth key point in the first image frame are obtained respectively, and the inter-frame difference information and intra-frame difference information can be spliced. , obtain an element in the mouth key point feature corresponding to the mouth key point in the first image frame, thereby based on the inter-frame difference information of each mouth key point in the first image frame and the first image frame Intra-frame difference information determines the mouth key point feature elements corresponding to each mouth key point in the first image frame, and determines the mouth corresponding to the first image frame based on the mouth key point feature elements corresponding to all mouth key points. Key point features.

In the embodiment of the present disclosure, the inter-frame difference information of the position information of each mouth key point in adjacent image frames and the intra-frame difference information of the position information of the mouth key point and the preset mouth key point are used, The mouth key point features are obtained so that the mouth key point features can represent the differences between the mouth key points that satisfy the set relationship, improving the accuracy of mouth shape determination in each frame of image; and, the mouth key point features are also It can represent the changing process of mouth key points between frames during speaking corresponding to the image frame sequence; in this way, the changing characteristics of the mouth shape during speaking can be better extracted, thereby improving the accuracy of lip recognition.

In some implementations, spatio-temporal features are extracted from the mouth key points in the image frame sequence to obtain the spatio-temporal features corresponding to the mouth object in each image frame, and syllable features are classified based on the spatio-temporal features to obtain the syllables corresponding to the mouth object. Classification features, that is, the above-mentioned step S103, can be implemented through the steps shown in Figure 4.

Figure 4 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 4:

Step S401: Perform spatial feature extraction on the key point features of the mouth in each image frame to obtain the spatial features of the mouth object in each image frame.

As mentioned before, at least one mouth key point feature of the image frame sequence can be obtained. Each mouth key point feature is calculated from the position information of the mouth key point. The position information of the mouth key point indicates that the mouth object is in an image. At the middle position of the frame, each mouth key point feature corresponds to an image frame. For each mouth key point feature, the spatial features of the mouth object in the corresponding image frame can be extracted from the mouth key point feature using any suitable feature extraction method. For example, convolutional neural networks, recurrent neural networks, etc. can be used for extraction to obtain spatial features.

In some implementations, the speaking interval of the set object in the video is detected through lip motion recognition processing, and an image frame sequence covering the speaking process of the set object is obtained. That is, the above step S401 can be implemented through the following steps S4011 and S4012:

Step S4011, fuse inter-frame difference information and intra-frame difference information of multiple mouth key points of the mouth object to obtain inter-frame difference features and intra-frame difference features of the mouth object in each image frame. Differential characteristics.

As mentioned earlier, each mouth key point feature is calculated from the position information of the mouth key point. The position information of the mouth key point represents the position of the mouth object in an image frame. Each mouth key point feature corresponds to a image frame. The inter-frame difference information can represent the difference information of the position information of the same mouth key point in different frames, and the intra-frame difference information can represent the difference information between the position information of different mouth key points in the same frame. In some embodiments, the inter-frame difference information of multiple mouth key points in each image frame is fused, and the intra-frame difference information of multiple mouth key points in each image frame is fused to obtain the above The inter-frame difference features and intra-frame difference features of the mouth object in each image frame; among them, the way to fuse the inter-frame difference information and/or the intra-frame difference information can be by using a convolutional neural network, a recurrent neural network, etc. In this way, a convolution kernel of a preset size is used to fuse the information of multiple mouth key points to achieve the fusion of inter-frame and/or frame difference information of multiple mouth key points.

For example, a mouth key point corresponds to an element in the mouth key point feature, and the mouth key point includes a 5-dimensional feature, where: the first 4 dimensions of the 5-dimensional feature are inter-frame difference information, respectively. The width difference between the key point in the first image frame and the previous image frame, the height difference between the first image frame and the previous image frame, the width difference between the first image frame and the next image frame, the difference in the first image frame and the previous image frame, The height difference between the image frame and the subsequent image frame; the fifth dimension is the intra-frame difference information, that is, the height difference between the mouth key point and other mouth key points of the same mouth object in the same image frame and/or Width difference. When fusing the inter-frame difference information and/or intra-frame difference information of multiple mouth key points of a specific image frame, each dimension of the 5-dimensional features is separately analyzed in at least two mouth key points (that is, the mouth (between the elements of the mouth key point feature) for feature extraction, and the first 4 dimensions of the obtained features are used as the inter-frame difference features of the mouth object in this specific image frame, and the fifth dimension is used as the mouth object in this specific image. Intra-frame difference features within frames.

Step S4012: Fusion of inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames to obtain spatial features of the mouth object in each image frame.

In some embodiments, the way to fuse the inter-frame difference features and intra-frame difference features of multiple image frames can be implemented by using a convolutional neural network, a recurrent neural network, etc., and using a convolution kernel of a preset size to fuse multiple image frames. Information on key points of the mouth is fused to achieve each The fusion between the inter-frame difference information and the intra-frame difference information of a mouth key point obtains the spatial characteristics of the mouth object in each image frame.

In the above steps S4011 to S4012, the inter-frame difference information and the intra-frame difference information of at least two mouth key points of the mouth object in each image frame are respectively fused to obtain the inter-frame difference information between the mouth key points. The inter-frame difference features of the difference information, and the intra-frame difference features representing the intra-frame difference information between the mouth key points, and then the inter-frame difference features and intra-frame difference features of the mouth key points in each image frame are characterized. Fusion can better extract the spatial features of the mouth object in each image frame and improve the accuracy of determining the mouth shape in each frame of image.

Step S402: Perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object.

In some embodiments, for each third image frame in at least one image frame, spatial features of the mouth object in at least two image frames including the third image frame can be used to perform feature extraction to obtain the mouth object. The corresponding spatiotemporal features in the third image frame. The spatiotemporal features of the mouth object can be extracted from the spatial features using any suitable feature extraction method. For example, convolutional neural networks, recurrent neural networks, etc. can be used to extract temporal features to obtain spatiotemporal features.

In some implementations, temporal feature extraction of the spatial features of the mouth object in multiple image frames can be performed multiple times. Taking one temporal feature extraction as an example, a 1×5 convolution kernel is used for feature extraction. The secondary convolution extracts the spatial features of two image frames before and after the third image frame, and the extracted spatiotemporal features include information of five image frames.

Since the more times temporal features are extracted and the larger the convolution kernel is used, the spatiotemporal features corresponding to each image frame can represent more information of the image frames, allowing the information between frames to be communicated, so the corresponding receptive field becomes larger. It is conducive to learning words composed of multi-frame images and the timing between different words, which can improve the accuracy of lip recognition, but it requires more computing resources and affects the hardware computing efficiency; comprehensive consideration of accuracy and hardware computing efficiency , in actual applications, the number of image feature extraction times can be set to 5 times.

Step S403: Extract syllable classification features based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.

In some embodiments, the syllable classification features of the mouth object are obtained by extracting syllable classification features from the spatiotemporal features corresponding to each image frame in at least two image frames; wherein, the syllable classification features can represent the same as the mouth object. At least one syllable corresponding to the mouth shape that appears during the speaking process of the subject, and each element in the syllable classification feature is used to determine whether there is a preset syllable type during the speaking process, thereby determining whether the image frame in the image frame sequence contains At least one syllable corresponding to the mouth shape. The syllable classification features of mouth objects can be extracted from spatio-temporal features using any suitable feature extraction method. For example, fully connected layers, global average pooling layers, and other methods can be used to extract syllable classification features from spatiotemporal features to obtain syllable classification features.

Embodiments of the present disclosure support the use of convolutional neural networks for spatiotemporal feature extraction; compared with using time series prediction networks such as recurrent neural networks (recursive neural networks) to extract spatiotemporal features, the amount of calculation required to extract spatiotemporal features through convolutional neural networks is less. It can reduce the consumption of computing resources and reduce the hardware requirements for computer equipment used to implement lip recognition. In particular, since the use of convolutional neural networks can reduce the requirements for chip computing capabilities, the image processing method provided by the embodiments of the present disclosure can be implemented with more lightweight chips, allowing more hardware to support the lip reading of the embodiments of the present disclosure. The image processing method in the recognition process improves the versatility of lip recognition. For example, computer equipment such as cars and machines can also realize lip recognition.

Embodiments of the present disclosure also provide an image processing method, which can be executed by a processor of a computer device. As shown in Figure 5, the method includes the following steps S501 to S504:

Step S501: Obtain an image frame sequence containing a mouth object.

Here, step S501 corresponds to the aforementioned step S101, and during implementation, reference may be made to the specific implementation of the aforementioned step S101.

Step S502: Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame.

Here, step S502 corresponds to the aforementioned step S102, and during implementation, reference may be made to the specific implementation of the aforementioned step S102.

Step S503: Use the trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features.

During implementation, the syllable feature extraction network can be any suitable network for feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc.; those skilled in the art can select a syllable feature extraction network based on the actual situation. The appropriate network structure is not limited by the embodiments of this disclosure.

Step S504: Use the trained classification network to determine keywords matching the syllable classification features in the preset keyword library.

When implemented, the classification network can be any suitable network for feature classification, it can be a global average pooling layer, a fully connected layer, etc. Those skilled in the art can select an appropriate network structure for the classification network according to the actual situation, which is not limited by the embodiments of the present disclosure.

In the embodiment of the present disclosure, a trained syllable feature extraction network is used to process the key point features of the mouth to obtain syllable classification features; the trained classification network is used to determine the key matching the syllable classification features in the preset keyword library word. In this way, since each neural network in the deep learning model can be learned and optimized, the accuracy of the extracted syllable classification features and the keywords matching the syllable classification features can be improved, thereby making the keywords obtained by image processing more accurate. Accurate, improve the accuracy of lip recognition.

In some implementations, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification The feature extraction sub-network, that is, the above step S503, can be implemented through the following steps S5031 to S5033:

Step S5031: Use the spatial feature extraction sub-network to perform spatial feature extraction on the key point features of the mouth in each image frame to obtain the spatial features of the mouth object in each image frame.

During implementation, the spatial feature extraction sub-network can be any suitable network used for image feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc. Those skilled in the art can select an appropriate network structure based on the actual spatial feature extraction method for each mouth key point feature, which is not limited by the embodiments of the present disclosure.

Step S5032: Use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object.

Here, the temporal feature extraction sub-network can be any suitable network used for image feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc. Those skilled in the art can select an appropriate network structure based on the actual method of performing at least one temporal feature extraction on the spatial features of the mouth object in at least one image frame, which is not limited by the embodiments of the present disclosure.

Step S5033: Use the classification feature extraction sub-network to extract syllable classification features based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.

Here, the classification feature extraction sub-network can be any suitable network for feature classification, it can be a global average pooling layer, a fully connected layer, etc. Those skilled in the art can select an appropriate network structure based on the actual classification feature extraction method for each spatio-temporal feature of the mouth object, which is not limited by the embodiments of the present disclosure.

Embodiments of the present disclosure also provide a method of generating a lip recognition model, which method can be executed by a processor of a computer device. As shown in Figure 6, the method includes the following steps S601 to S604:

Step S601: Obtain a sample image frame sequence including a mouth object.

In some embodiments, the computer device obtains a sequence of sample image frames that have been labeled with keyword tags. The sequence of sample image frames includes multiple sample image frames. The sample images in the sequence of sample image frames are arranged according to the time corresponding to each sample image frame. Parameter sorting. Furthermore, the number of sample image frames included in the sample image frame sequence may not be fixed. For example, the number of sample image frames included in the sample image frame sequence may be 40 frames, 50 frames, or 100 frames.

In this way, a sample image frame sequence that at least covers the complete process of the set subject speaking a sentence can be obtained.

Step S602: Perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain mouth key point features of each sample image frame.

Here, when performing mouth key point extraction on at least one sample image frame in the sample image frame sequence, the position information of the mouth key point associated with the mouth object is extracted from the facial key points of the sample image frame, and based on at least The position information of the mouth key point of a sample image frame is used to determine a mouth key point feature corresponding to each sample image frame, thereby obtaining at least one mouth key point feature of the sample image frame sequence. Among them, the mouth key point features are calculated from the position information of the mouth key points, and the position information of the mouth key points is related to the mouth shape of the mouth object contained in the sample image frame, that is, the same mouth key point in The position information of different sample image frames is related to the mouth shape of the mouth object in this sample image frame.

In some embodiments, the method of determining the characteristics of the mouth key points corresponding to the sample image frame based on the position information of the mouth key points of the sample image frame may be to calculate a sample according to the key point serial number corresponding to each mouth key point. The position information of each mouth key point in the image frame is sorted to obtain a position sequence, and the position sequence is used as the mouth key point feature.

In some embodiments, when the sample image frame sequence includes two sample image frames, or more than two sample image frames, the mouth corresponding to the sample image frame is determined based on the position information of the mouth key point in the sample image frame. The way to obtain the mouth key point features can be to calculate the difference information of the position information of the mouth key points in each sample image frame and the adjacent frame of the sample image frame, and calculate a sample image frame according to the corresponding key point serial number. The difference information of each mouth key point in is sorted, and the sorted sequence is used as the mouth key point feature corresponding to the image frame; where the adjacent frames can be the previous sample image frame and/or the subsequent sample image frame. A sample image frame.

Here, steps S601 to S602 respectively correspond to the aforementioned steps S101 to S102. During implementation, reference may be made to the specific implementation of the aforementioned steps S101 to S102.

Step S603: Using the model to be trained, generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine the syllable classification features in the preset keyword database Keywords for feature matching.

Wherein, the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence.

Here, the model to be trained can be any suitable deep learning model, and is not limited here. During implementation, those skilled in the art can use an appropriate network structure to construct the model to be trained according to the actual situation.

The model to be trained is used to process the mouth key point features of multiple sample image frames in the sample image frame sequence to generate syllable classification features. The syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence. The process of determining keywords matching the syllable classification features in the preset keyword library corresponds to the process of processing the key point features of the mouth in steps S103 to S104 in the previous embodiment. During implementation, you can refer to the above Specific implementation of step S103 to step S104.

In this way, syllable-assisted learning can effectively reduce the learning difficulty of keyword recognition and classification, thereby improving the accuracy of lip recognition.

Step S604: Update the network parameters of the model at least once based on the determined keywords and the keyword tags, Get a trained lip recognition model.

Here, it can be determined whether to update the network parameters of the model based on the determined keywords and keyword tags. When it is determined to update the network parameters of the model, an appropriate parameter learning difficulty update algorithm is used to update the network parameters of the model, and the model with updated parameters is used to re-determine the matching keywords based on the re-determined keywords. and keyword tags to determine whether to continue updating the network parameters of the model. When it is determined not to continue updating the network parameters of the model, the finally updated model is determined to be the trained lip recognition model.

In some implementations, the loss value can be determined based on the determined keywords and keyword tags, and when the loss value does not meet the preset conditions, the network parameters of the model are updated. When the loss value meets the preset conditions Or when the number of updates to the network parameters of the model reaches a set threshold, the update of the network parameters of the model is stopped, and the final updated model is determined as the trained lip recognition model. The preset conditions may include, but are not limited to, at least one of the loss value being less than the set loss threshold, the change in the loss value converging, and the like. During implementation, the preset conditions may be set according to actual conditions, which is not limited in the embodiments of the present disclosure.

The method of updating the network parameters of the model may be determined based on the actual situation, and may include but is not limited to at least one of the gradient descent method, Newton's momentum method, etc., which is not limited here.

In the embodiment of the present disclosure, during the model training process, syllable-assisted learning can effectively reduce the learning difficulty of keyword recognition and classification, thereby improving the accuracy of lip recognition by the trained lip recognition model. Moreover, since the syllable classification features are determined based on the key point features of the mouth, the syllable classification features can better reflect the syllables corresponding to the mouth shapes in the image frame sequence, and the syllable classification features can be used to assist lip language recognition, so that image processing can be achieved The keywords are more precise and improve the accuracy of lip recognition. Moreover, compared to using the mouth region image sequence obtained by cropping the face image for lip recognition, using the key point features of the mouth for lip recognition can reduce the amount of calculation required in the image processing process, thereby reducing the need to perform image processing. The hardware requirements of the computer equipment of the method; and, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, so that based on the key point features of the mouth, the recognition of face shapes, textures not involved in the model training process can be improved The recognition ability of image categories is improved, thereby improving the generalization ability of lip language recognition.

In some embodiments, the model includes a syllable feature extraction network and a classification network, and the above step S603 may include the following steps S6031 to S6032:

Step S6031: Use the syllable feature extraction network to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence.

Step S6032: Use the classification network to determine keywords matching the syllable classification features in the preset keyword library.

Here, steps S6031 to S6032 respectively correspond to the aforementioned steps S503 to S504. During implementation, reference may be made to the specific implementation of the aforementioned steps S503 to S504.

In some embodiments, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network. The above step S6031 may include the following steps S60311 to S60313:

Step S60311: Use the spatial feature extraction sub-network to perform spatial feature extraction on the key point features of the mouth in each sample image frame to obtain the spatial features of the mouth object in each sample image frame.

Step S60312: Use the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames to obtain the spatio-temporal features of the mouth object.

Step S60313: Use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.

Here, steps S60311 to S60313 respectively correspond to the aforementioned steps S5031 to S5033. During implementation, reference may be made to the specific implementation of the aforementioned steps S5031 to S5033.

The following describes the application of the image processing method provided by the embodiments of the present disclosure in actual scenarios, taking image processing for lip recognition in Chinese as an example.

Figure 7 is a schematic structural diagram of a lip recognition model provided by an embodiment of the present disclosure. As shown in Figure 7, the lip recognition model structure includes: a single-frame feature extraction network 701, an inter-frame feature fusion network 702, and a feature sequence classification network 703. Among them, the single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012, and the feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.

Embodiments of the present disclosure provide an image processing method that generates an image frame sequence of the subject speaking based on the lip movement recognition detection results, uses the characteristics of the key points of the face as the input of the lip language recognition model, and uses monosyllables to assist in detecting syllables in the speaking sequence. , and use the syllable feature layer to classify speech sequences. The image processing method according to the embodiment of the present disclosure will be described below with reference to FIG. 7 .

Embodiments of the present disclosure provide an image processing method, which can be executed by a processor of a computer device. Among them, computer equipment may refer to equipment with data processing capabilities such as vehicles and machines. The image processing method may include the following steps one to four:

Step 1, input preprocessing.

The input video sequence obtained by the computer device is a non-fixed frame, and the video sequence may include a non-fixed number of video frames. The key point sequence corresponds to 106 key points in each image frame. Take out the 20 key points of the mouth object, and then use the interpolation method (for example, bilinear interpolation method) to generate the 20 key points into a length of 60 The position sequence of the key points of the image frame. 20 mouth key points are used as feature dimensions. Each key point in the sequence corresponds to a feature of length 5 in each image frame, thereby obtaining mouth key point features 704 corresponding to 60 frames. Each mouth key point feature 704 corresponds to one image frame and 20 key points. Each key point in each image frame corresponds to a 5-dimensional feature.

In some implementations, the first four dimensions of the feature are obtained based on the coordinate difference between the current image frame and the previous and subsequent image frames, and the fifth dimension of the feature is obtained based on the height difference between the preset key point pairs in the current frame. Among them, the first 4 dimensions can reflect the mouth shape changes between the current image frame and the previous and subsequent image frames, and the fifth dimension reflects the mouth shape in the current image frame. Here, the collected videos can be processed through methods such as lip movement recognition, so that each video can at least cover the process of the set object (usually a person) speaking a sentence, and each sentence corresponds to a keyword. In this way, there is a one-to-one relationship between video and keywords. Moreover, no matter how many frames the video speech sequence is acquired, the interpolation method can be used to obtain a 60-frame position sequence.

Here, the more frames in the position sequence, the lower the computational efficiency, but the performance of lip recognition is improved. Considering the recognition performance, computing efficiency and the word count distribution of the keywords to be detected, the number of frames in the position sequence is set to 60 frames. Among them, the performance can be the accuracy of lip recognition.

Step 2: Single frame feature extraction.

The computer equipment implements single frame feature extraction through the single frame feature extraction network 701 in Figure 7. The single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012.

Input the key point features 704 of the mouth into the lip language recognition model, and use the spatial feature extraction network 7011 to independently extract features of the key point features 704 of the mouth in each image frame with a 1×1 convolution kernel, repeating 2 times. The above-mentioned convolution inputs the features extracted through two convolutions into the spatial feature fusion network 7012. In the spatial feature fusion network 7012, first, a 5×1 convolution kernel is used to fuse the 5-dimensional features of each key point to obtain the spatial features of each image frame. Then, each image frame is used to go through the spatial feature extraction network. 7011 extracts the features 705 and uses a 1×1 convolution kernel to fuse the features between the 20 key points to obtain the spatial features 706 of the image frame and complete the single frame feature extraction.

In some implementations, the convolution kernel may be a residual block kernel (Residual Block kernel).

Step 3: Inter-frame feature fusion.

The computer device implements inter-frame feature fusion of adjacent image frames through the inter-frame feature fusion network 702 in Figure 7.

Input the spatial features 706 of each image frame into the inter-frame feature fusion network 702, use a 1×5 convolution kernel to convolve in the sequence length dimension, and combine the spatial features 706 of each image frame with the two image frames before and after. Spatial features 706 are fused, and the above-mentioned convolution is repeated 5 times to increase the receptive field, so that information between frames can be communicated, and the correlation between adjacent frames can be strengthened, which is beneficial to learning the time sequence between keywords and Chinese characters composed of multiple frames.

This step will occupy a certain amount of computing resources. In order to improve the performance of lip recognition, the convolution kernel size can be increased and the number of repetitions will be increased, which will accordingly affect the computing efficiency. Considering the accuracy and hardware computing efficiency, in actual applications, the number of extractions can be set to 5 times, and the convolution kernel size can be set to 5.

Step 4: Feature sequence classification.

The computer device implements classification of the feature sequence through the feature sequence classification network 703 in Figure 7, and obtains the keyword sequence number corresponding to the video sequence. Among them, the feature sequence includes spatiotemporal features of multiple image frames. The feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.

The spatio-temporal features are input into the "flat layer + second linear layer + nonlinear activation (relu) layer" in the syllable feature layer 7031 for processing. The spatio-temporal features of all image frames are merged into one-dimensional vectors 707 to realize the spatio-temporal processing of multiple image frames. Feature fusion of features. The one-dimensional vector 707 is input into the third linear layer in the syllable feature layer 7031 for 100-class single syllable auxiliary classification to obtain syllable classification features. The syllable classification features are input into the first linear layer 7032 to output the keyword sequence number of the video sequence to be detected. Among them, the third linear layer can be a normalized exponential function (Softmax function) and is trained with a binary cross-entropy loss (BCEloss) function as the loss function. The first linear layer 7032 can be trained using the focal loss function as the loss function, and the softmax function can be used for prediction; in practical applications, the first linear layer 7032 can be a margin linear layer, consisting of a fully connected layer or a global average Pooling layer implementation. Compared with using the global average pooling layer, the direct expansion of the fully connected layer is equivalent to each frame corresponding to a learnable position embedding, so that the position sequence information of each frame in the sentence can be recorded.

In some implementations, a detection algorithm for lip recognition using syllable-assisted learning is used. Currently, regardless of pitch, there are a total of 419 categories of pronunciation for all Chinese characters. These 419 categories of syllables can be divided into 100 categories according to mouth shape. Syllables with the same mouth shape are classified into the same category. The characteristics of length 100 (corresponding to the aforementioned The syllable classification feature in the embodiment) is placed before the fully connected layer of the final classification, and the output of this feature is used as the auxiliary supervision of the 100 classification. At this time, the output of the syllable feature layer 7031 represents which syllables are shared in the lip sequence, and the syllables Classifying the output results of feature layer 7031 can effectively reduce the learning difficulty of fully connected layer classification, thereby improving performance. Among them, the syllable feature layer 7031 can be implemented using a linear layer.

In the embodiment of the present disclosure, the monosyllable auxiliary strategy significantly improves performance; and these keywords used for matching can be stored in the form of a preset keyword library. When adding new keywords used for matching, they can be added in the The preset keyword library has been added accordingly to facilitate keyword updates.

It should be noted that during implementation, the above-mentioned coordinate difference value may correspond to the difference information of the position information in the previous embodiment, the video sequence may correspond to the image frame sequence in the previous embodiment, and the single-frame feature extraction network 701 may correspond to Spatial characteristics in the aforementioned embodiments Extraction sub-network, the inter-frame feature fusion network 702 may correspond to the temporal feature extraction sub-network in the previous embodiment, the syllable feature layer 7031 may correspond to the syllable classification feature extraction sub-network in the previous embodiment, and the first linear layer 7032 may correspond to The classification network in the previous embodiment.

In the field of human-computer interaction, the application of speech recognition still has certain limitations, such as situations where noise or music volume is loud, and it is inconvenient to speak. In this case, lip recognition can make up for the inconvenience caused by the limitations of speech recognition to a certain extent. . Lip recognition can detect the keywords corresponding to what the speaker said in that interval based on the speech interval detected by lip movement recognition. For example, in the car cabin, voice recognition is the main means of human-computer interaction, but when the car makes a lot of noise on the highway, or when the music is played loudly, the voice recognition cannot accurately recognize the user's voice; or, When someone is sleeping in the car, it is inconvenient for the user to use voice to interact. At this time, through lip recognition, the user only needs to use mouth shape to simulate speaking, and the car machine can detect the user's instructions, thereby completing human-computer interaction.

Compared with lip recognition technology in the related art, the embodiments of the present disclosure utilize key point recognition, which takes up less computing resources and can learn the inter-frame motion information of lips, making it easier to deploy, more efficient and more accurate. The image processing method provided by the embodiment of the present disclosure supports the recognition of 35 types of commonly used keywords when used for lip language recognition, and the recognition recall rate reaches 81% while controlling the false alarm rate to less than one thousandth.

Based on the foregoing embodiments, embodiments of the present disclosure also provide an image processing device, which includes various units and modules included in each unit, which can be implemented by a processor in a computer device; of course, it can also It can be realized through specific logic circuits; during the implementation process, the processor can be a central processing unit (Central Processing Unit, CPU), a microprocessor (Microprocessor Unit, MPU), or a digital signal processor (Digital Signal Processor, DSP) Or Field Programmable Gate Array (FPGA), etc.

Figure 8 is a schematic structural diagram of an image processing device provided by an embodiment of the present disclosure. As shown in Figure 8, the image processing device 800 includes: a first acquisition part 810, a first recognition part 820, a first determination part 830 and a A matching part 840, where:

The first acquisition part 810 is configured to acquire an image frame sequence containing a mouth object;

The first recognition part 820 is configured to extract mouth key point features for each image frame in the image frame sequence to obtain the mouth key point features of each image frame;

The first determining part 830 is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the mouth in the image frame sequence The syllable category corresponding to the subject's mouth shape;

The first matching part 840 is configured to determine keywords matching the syllable classification features in the preset keyword library.

In some embodiments, in the case where the image frame sequence includes at least two frames of images, the first identification part 820 includes: a first determining sub-part configured to determine at least two of the mouth objects. position information of the key point of the mouth in each image frame; a second determination sub-part configured to, for each image frame in the sequence of image frames, determine the position information of the key point of the mouth according to the phase of the image frame and the image frame; The position information of the mouth key points in adjacent frames determines the characteristics of the mouth key points corresponding to the image frame.

In some embodiments, the mouth key point features include inter-frame difference information and intra-frame difference information of each mouth key point; the second determination sub-part includes: a first determination unit, configured For each mouth key point, according to the position information of the mouth key point in the image frame and the position information of the mouth key point in adjacent image frames of the image frame, Determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point; the second determination unit is configured to, for each of the mouth key points, calculate a second height difference and/or a second width difference between the mouth key point in the image frame and other mouth key points of the same mouth object. , determine the intra-frame difference information of the mouth key point.

In some embodiments, the first determination part 830 includes: a first extraction sub-part configured to perform spatial feature extraction on mouth key point features of each image frame to obtain the mouth object. Spatial features in each image frame; the second extraction sub-part is configured to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object; The third extraction sub-part is configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.

In some embodiments, the first extraction sub-part includes: a first extraction unit configured to perform inter-frame difference information and intra-frame difference information on a plurality of mouth key points of the mouth object. Fusion to obtain the inter-frame difference features and intra-frame difference features of the mouth object in each image frame; the second extraction unit is configured to obtain the inter-frame difference features of the mouth object in multiple image frames Fusion with intra-frame difference features to obtain the spatial features of the mouth object in each image frame.

In some embodiments, the first determining part 830 includes: a third determining sub-part configured to use a trained syllable feature extraction network to determine mouth key point features of multiple image frames in the image frame sequence. Perform processing to obtain syllable classification features; the first matching part 840 includes: a first matching sub-part configured to use the trained classification network to determine in the preset keyword library that matches the syllable classification features. Key words.

In some embodiments, the first acquisition part 810, including the frame interpolation sub-part, is configured to: perform image interpolation on the acquired original image sequence containing the mouth object to obtain the image frame sequence; or, Based on the obtained mouth key points in the original image sequence containing the mouth object, frames are interpolated on the original image sequence to obtain the image frame sequence.

In some embodiments, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network; the third determination sub-part includes: a third extraction unit configured to utilize The spatial feature extraction subnetwork, Perform spatial feature extraction on the mouth key point features of each image frame respectively to obtain the spatial features of the mouth object in each image frame; the fourth extraction unit is configured to utilize the temporal feature extraction sub-network , perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatiotemporal features of the mouth object; the fifth extraction unit is configured to utilize the classification feature extraction sub-network, based on The spatiotemporal features of the mouth object are subjected to syllable classification feature extraction to obtain the syllable classification features of the mouth object.

The description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. In some embodiments, the functions or parts of the device provided by the embodiments of the present disclosure can be used to perform the methods described in the above method embodiments. For technical details not disclosed in the embodiments of the device of the present disclosure, please refer to the methods of the present disclosure. be understood from the description of the embodiments.

Based on the foregoing embodiments, embodiments of the present disclosure provide a device for generating a lip recognition model. The device includes each unit included and each part included in each unit, which can be implemented by a processor in a computer device; Of course, it can also be implemented through specific logic circuits; during the implementation process, the processor can be CPU, MPU, DSP or FPGA, etc.

Figure 9 is a schematic structural diagram of a device for generating a lip recognition model provided by an embodiment of the present disclosure. As shown in Figure 9, the device 900 includes: a second acquisition part 910, a second recognition part 920, and a second matching part. 930 and updated section 940, which:

The second acquisition part 910 is configured to acquire a sample image frame sequence containing a mouth object; wherein the sample image frame sequence is marked with a keyword tag;

The second identification part 920 is configured to extract mouth key point features for each sample image frame in the sample image frame sequence, and obtain the mouth key point features of each sample image frame;

The second matching part 930 is configured to use the model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and generate syllable classification features in the preset keyword library Determine keywords that match the syllable classification feature; wherein the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;

The update part 940 is configured to update the network parameters of the model at least once based on the determined keywords and the keyword tags to obtain a trained lip recognition model.

In some embodiments, the model includes a syllable feature extraction network and a classification network; the second matching part 930 includes: a fourth determining sub-part configured to use the feature extraction network to determine the sample image according to the sample image. The mouth key point features of multiple sample image frames in the frame sequence generate syllable classification features; the fifth determination sub-part is configured to use the classification network to determine the syllable classification in the preset keyword library Keywords for feature matching.

In some embodiments, the feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network; the fourth determination sub-part includes: a sixth extraction unit configured to utilize The spatial feature extraction sub-network performs spatial feature extraction on the key point features of the mouth in each sample image frame to obtain the spatial features of the mouth object in each sample image frame; the seventh extraction unit is Configured to use the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames to obtain the spatio-temporal features of the mouth object; the eighth extraction unit is It is configured to use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.

An embodiment of the present disclosure provides a vehicle, including:

The above description of the vehicle embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the vehicle embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

It should be noted that in the embodiments of the present disclosure, if the above method is implemented in the form of a software functional part and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure can be embodied in the form of software products that are essentially or contribute to related technologies. The software product is stored in a storage medium and includes a number of instructions to enable a A computer device (which may be a personal computer, a server, a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk and other media that can store program code. In this way, the embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination of hardware, software, and firmware.

An embodiment of the present disclosure provides a computer device, including a memory and a processor. The memory stores a computer program that can be run on the processor. When the processor executes the program, some or all of the steps in the above method are implemented.

Embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, some or all of the steps in the above method are implemented. The computer-readable storage medium may be transient or non-transitory.

Embodiments of the present disclosure provide a computer program, which includes computer readable code. When the computer readable code is run in a computer device, the processor in the computer device executes a part for implementing the above method or All steps.

Embodiments of the present disclosure provide a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, some of the above methods are implemented or All steps. The computer program product can be implemented specifically through hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium. In other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and so on.

It should be noted here that the above description of various embodiments tends to emphasize the differences between the various embodiments, and the similarities or similarities may be referred to each other. The description of the above embodiments of equipment, storage media, computer programs and computer program products is similar to the description of the above method embodiments, and has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the disclosed equipment, storage media, computer programs and computer program products, please refer to the description of the disclosed method embodiments for understanding.

It should be noted that Figure 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure. As shown in Figure 10, the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002 and a memory 1003, where:

Processor 1001 generally controls the overall operation of computer device 1000 .

The communication interface 1002 can enable the computer device to communicate with other terminals or servers through a network.

The memory 1003 is configured to store instructions and applications executable by the processor 1001, and can also cache data to be processed or processed by the processor 1001 and various parts of the computer device 1000 (for example, image data, audio data, voice communication data and video communication data). The memory 1003 can be implemented by flash memory (FLASH) or random access memory (Random Access Memory, RAM).

Data can be transmitted between the processor 1001, the communication interface 1002 and the memory 1003 through the bus 1004.

It will be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic associated with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that in various embodiments of the present disclosure, the size of the serial numbers of the above steps/processes does not mean the order of execution. The execution order of each step/process should be determined by its functions and internal logic, and should not be The implementation process of the embodiments of the present disclosure constitutes no limitations. The above serial numbers of the embodiments of the present disclosure are only for description and do not represent the advantages and disadvantages of the embodiments.

It should be noted that, in this document, the terms "comprising", "comprises" or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device that includes a series of elements not only includes those elements, It also includes other elements not expressly listed or inherent in the process, method, article or apparatus. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article or apparatus that includes that element.

In the several embodiments provided in this disclosure, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, such as: multiple units or components may be combined, or can be integrated into another system, or some features can be ignored, or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be electrical, mechanical, or other forms. of.

The units described above as separate components may or may not be physically separated; the components shown as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure can be all integrated into one processing unit, or each unit can be separately used as a unit, or two or more units can be integrated into one unit; the above-mentioned integration The unit can be implemented in the form of hardware or in the form of hardware plus software functional units.

If the disclosed technical solution involves personal information, the products applying the disclosed technical solution will clearly inform the personal information processing rules and obtain the individual's independent consent before processing personal information. If the disclosed technical solution involves sensitive personal information, the product applying the disclosed technical solution must obtain the individual's separate consent before processing the sensitive personal information, and at the same time meet the requirement of "express consent". For example, setting up clear and conspicuous signs on personal information collection devices such as cameras to inform them that they have entered the scope of personal information collection, and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed to have agreed to the collection of his or her personal information; or On personal information processing devices, when using obvious logos/information to inform personal information processing rules, obtain personal authorization through pop-up messages or asking individuals to upload their personal information; among them, personal information processing rules may include personal information processing rules. Information processors, purposes of personal information processing, processing methods, types of personal information processed, etc.

Those of ordinary skill in the art can understand that all or part of the steps to implement the above method embodiments can be completed through hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the execution includes: Implementation of the above method The aforementioned steps include: removable storage devices, read-only memory (Read Only Memory, ROM), magnetic disks or optical disks, and other media that can store program codes.

Alternatively, if the above-mentioned integrated units of the present disclosure are implemented in the form of software functional parts and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure can be embodied in the form of a software product in essence or that contributes to related technologies. The computer software product is stored in a storage medium and includes a number of instructions to enable a computer. A computer device (which may be a personal computer, a server, a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage media include: mobile storage devices, ROMs, magnetic disks or optical disks and other media that can store program codes.

The above are only embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present disclosure, and should are covered by the protection scope of this disclosure.

Industrial applicability

The present disclosure relates to an image processing method, a model generation method, a device, a vehicle, a storage medium and a computer program product. The image processing method includes: acquiring an image frame sequence including a mouth object; and processing each image in the image frame sequence. Extract mouth key point features from each frame to obtain the mouth key point features of each image frame; generate syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence; among them, the syllable classification features represent the image frames The syllable category corresponding to the mouth shape of the mouth object in the sequence; determine the keywords that match the syllable classification characteristics in the preset keyword library. The above solution can reduce the amount of calculation required in the image processing process of lip recognition, thereby reducing the hardware requirements for computer equipment; at the same time, it can achieve good recognition results for facial images with different face shapes, textures and other appearance information. This improves the generalization ability of lip language recognition; in addition, by expressing the syllable classification features corresponding to the image frame sequence, and determining the keywords of the words corresponding to the syllables based on the syllable categories represented by the syllable classification features, the keywords obtained by image processing can be More accurate, thereby improving the accuracy of lip recognition.

Claims

An image processing method including:

Get a sequence of image frames containing the mouth object;

Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame;

Generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence;

Keywords matching the syllable classification features are determined in the preset keyword library.
The method according to claim 1, wherein the extraction of mouth key point features for each image frame in the image frame sequence to obtain the mouth key point features of each image frame includes:

Determine the position information of at least two mouth key points of the mouth object in each image frame;

For each image frame in the image frame sequence, the mouth key point characteristics corresponding to the image frame are determined based on the position information of the mouth key points in the image frame and adjacent frames of the image frame.
The method according to claim 2, wherein the mouth key point features include inter-frame difference information and intra-frame difference information of each mouth key point;

Determining the characteristics of the mouth key points corresponding to the image frame based on the position information of the mouth key points in the image frame and adjacent frames of the image frame includes:

For each mouth key point, determine the position information of the mouth key point in the image frame and the position information of the mouth key point in adjacent frames of the image frame. The first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame is used as the inter-frame difference information of the mouth key point;

For each mouth key point, determine based on a second height difference and/or a second width difference between the mouth key point in the image frame and other mouth key points of the same mouth object. Intra-frame difference information of the mouth key points.
The method according to any one of claims 1 to 3, wherein generating syllable classification features based on mouth key point features of multiple image frames in the image frame sequence includes:

Perform spatial feature extraction on the mouth key point features of each image frame respectively to obtain the spatial features of the mouth object in each image frame;

Perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatiotemporal features of the mouth object;

Perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
The method according to claim 4, wherein the spatial feature extraction is performed on the mouth key point features of each image frame to obtain the spatial features of the mouth object in each image frame, including:

Fusion of inter-frame difference information and intra-frame difference information of multiple mouth key points of the mouth object to obtain inter-frame difference features and intra-frame difference features of the mouth object in each image frame;

The inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames are fused to obtain the spatial features of the mouth object in each image frame.
The method according to any one of claims 1 to 5, wherein,

Generating syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence includes: using a trained syllable feature extraction network to classify the mouth key points of multiple image frames in the image frame sequence. The key point features are processed to obtain syllable classification features;

Determining keywords matching the syllable classification features in the preset keyword database includes: using a trained classification network to determine keywords matching the syllable classification features in the preset keyword database.
The method according to any one of claims 1 to 6, wherein said obtaining an image frame sequence containing a mouth object includes:

Perform image interpolation on the acquired original image sequence containing the mouth object to obtain the image frame sequence; or,

Based on the obtained mouth key points in the original image sequence containing the mouth object, frames are interpolated on the original image sequence to obtain the image frame sequence.
A method for generating a lip recognition model, including:

Obtaining a sequence of sample image frames containing the mouth object; wherein the sequence of sample image frames is annotated with a keyword tag;

Perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain the mouth key point features of each sample image frame;

Using the model to be trained, syllable classification features are generated based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and syllable classification features are determined in a preset keyword library that match the syllable classification features. Keywords; wherein, the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;

Based on the determined keywords and the keyword tags, the network parameters of the model are updated at least once to obtain a trained lip recognition model.
The method according to claim 8, wherein the model includes a syllable feature extraction network and a classification network; using the model to be trained, according to the mouth of multiple sample image frames in the sample image frame sequence Key point features, generate syllable classification features, and determine keywords matching the syllable classification features in the preset keyword library, including:

Utilize the syllable feature extraction network to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence;

Using the classification network, keywords matching the syllable classification features are determined in the preset keyword library.
The method according to claim 9, wherein the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network;

The method of using the syllable feature extraction network to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence includes:

Using the spatial feature extraction sub-network, perform spatial feature extraction on the mouth key point features of each sample image frame, respectively, to obtain the spatial features of the mouth object in each sample image frame;

Using the temporal feature extraction sub-network, perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames to obtain the spatio-temporal features of the mouth object;

The syllable classification feature extraction sub-network is used to extract syllable classification features based on the spatiotemporal features of the mouth object, and the syllable classification features of the mouth object are obtained.
An image processing device, including:

a first acquisition part configured to acquire a sequence of image frames containing the mouth object;

The first recognition part is configured to extract mouth key point features for each image frame in the image frame sequence, and obtain the mouth key point features of each image frame;

The first determining part is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification feature represents the mouth object in the image frame sequence The syllable category corresponding to the mouth shape;

The first matching part is configured to determine keywords matching the syllable classification features in the preset keyword library.
The device of claim 11, wherein the first identification portion includes:

A first determining sub-section configured to determine position information of at least two mouth key points of the mouth object in each image frame;

The second determination sub-part is configured to, for each image frame in the image frame sequence, determine the image according to the position information of the mouth key point in the image frame and an adjacent frame of the image frame. The key point features of the mouth corresponding to the frame.
The device according to claim 12, wherein the mouth key point features include inter-frame difference information and intra-frame difference information of each mouth key point;

The second determining sub-part includes:

The first determination unit is configured to, for each mouth key point, determine the position information of the mouth key point in the image frame and the phase of the mouth key point in the image frame. The position information in , determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point;

The second determination unit is configured to determine, for each mouth key point, a sum of second height differences between the mouth key point in the image frame and other mouth key points of the same mouth object. /or the second width difference determines the intra-frame difference information of the mouth key point.
The device according to any one of claims 11 to 13, wherein the first determining part includes:

The first extraction sub-part is configured to perform spatial feature extraction on the key point features of the mouth in each image frame, and obtain the spatial features of the mouth object in each image frame;

The second extraction sub-part is configured to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object;

The third extraction sub-part is configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
The device according to claim 14, wherein the first extraction sub-part includes:

The first extraction unit is configured to fuse inter-frame difference information and intra-frame difference information of multiple mouth key points of the mouth object to obtain the inter-frame difference information of the mouth object in each image frame. Difference features and intra-frame difference features;

The second extraction unit is configured to fuse inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames to obtain spatial features of the mouth object in each image frame.
The device according to any one of claims 11 to 15, wherein the first determining part includes:

The third determination sub-part is configured to use the trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features;

The first matching part includes:

The first matching sub-part is configured to use the trained classification network to determine keywords matching the syllable classification features in the preset keyword library.
The device according to any one of claims 11 to 16, wherein the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network;

The third determining sub-part includes:

The third extraction unit is configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the key point features of the mouth in each image frame, and obtain the spatial features of the mouth object in each image frame. ;

The fourth extraction unit is configured to use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object;

The fifth extraction unit is configured to use the classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, and obtain the syllable classification features of the mouth object.
The device according to claim 11, wherein the first acquisition part includes:

The frame insertion sub-part is configured as:

Perform image interpolation on the acquired original image sequence containing the mouth object to obtain the image frame sequence; or,

Based on the obtained mouth key points in the original image sequence containing the mouth object, frames are interpolated on the original image sequence to obtain the image frame sequence.
A device for generating a lip language recognition model, including:

The second acquisition part is configured to acquire a sequence of sample image frames containing the mouth object; wherein the sequence of sample image frames is annotated with a keyword tag;

The second recognition part is configured to extract mouth key point features for each sample image frame in the sample image frame sequence, and obtain the mouth key point features of each sample image frame;

The second matching part is configured to use the model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine them in the preset keyword library Keywords matching the syllable classification feature; wherein the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;

The update part is configured to update the network parameters of the model at least once based on the determined keywords and keyword tags to obtain a trained lip recognition model.
The device according to claim 19, wherein the model includes a syllable feature extraction network and a classification network;

The second matching part includes:

The fourth determining sub-part is configured to utilize the feature extraction network to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence;

The fifth determination sub-part is configured to utilize the classification network to determine keywords matching the syllable classification features in the preset keyword library.
The device according to claim 20, wherein the feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network;

The fourth determining sub-part includes:

The sixth extraction unit is configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each sample image frame, and obtain the mouth object in each sample image frame. spatial characteristics;

A seventh extraction unit is configured to utilize the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames, to obtain the spatio-temporal features of the mouth object;

The eighth extraction unit is configured to use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, and obtain the syllable classification features of the mouth object.
A computer device, including a memory and a processor, the memory stores a computer program that can be run on the processor, wherein the method of any one of claims 1 to 10 is implemented when the processor executes the program. A step of.
A vehicle including:

a vehicle-mounted camera configured to capture a sequence of image frames containing a mouth object;

A vehicle machine, connected to the vehicle-mounted camera, is configured to obtain an image frame sequence containing a mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the Describe the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the image frame sequence The syllable category corresponding to the mouth shape of the middle mouth object; determine the keyword matching the syllable classification feature in the preset keyword library.
A computer-readable storage medium having a computer program stored thereon, wherein the steps of the method of any one of claims 1 to 10 are implemented when the computer program is executed by a processor.
A computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, it causes the computer to execute any of claims 1 to 10. The method described in one item.