WO2023208134A1 - Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product - Google Patents


Info

Publication number
WO2023208134A1
Authority
WO
WIPO (PCT)
Prior art keywords
mouth
features
image frame
key point
syllable
Prior art date
Application number
PCT/CN2023/091298
Other languages
French (fr)
Chinese (zh)
Inventor
康硕
李潇婕
王飞
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023208134A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • the present disclosure relates to but is not limited to the field of information technology, and in particular, to an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.
  • Lip recognition technology can use computer vision technology to identify faces from video images, extract the changing features of the mouth area of the face, and thereby identify the text content corresponding to the video.
  • embodiments of the present disclosure provide at least an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.
  • An embodiment of the present disclosure provides an image processing method.
  • the method includes: acquiring an image frame sequence containing a mouth object; performing mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determining, in a preset keyword library, a keyword matching the syllable classification features.
  • Embodiments of the present disclosure also provide a method for generating a lip recognition model.
  • the method includes: obtaining a sample image frame sequence containing a mouth object, wherein the sample image frame sequence is annotated with a keyword tag; performing mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain the mouth key point features of each sample image frame; using the model to be trained, generating syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determining, in a preset keyword library, a keyword matching the syllable classification features, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence; and updating the network parameters of the model at least once based on the determined keyword and the keyword tag, to obtain a trained lip recognition model.
  • An embodiment of the present disclosure also provides an image processing device, which includes:
  • a first acquisition part configured to acquire a sequence of image frames containing the mouth object
  • the first recognition part is configured to extract mouth key point features for each image frame in the image frame sequence, and obtain the mouth key point features of each image frame;
  • the first determining part is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence;
  • the first matching part is configured to determine keywords matching the syllable classification features in the preset keyword library.
  • An embodiment of the present disclosure also provides a device for generating a lip recognition model.
  • the device includes:
  • the second acquisition part is configured to acquire a sequence of sample image frames containing the mouth object; wherein the sequence of sample image frames is annotated with a keyword tag;
  • the second recognition part is configured to extract mouth key point features for each sample image frame in the sample image frame sequence, and obtain the mouth key point features of each sample image frame;
  • the second matching part is configured to use the model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and to determine, in the preset keyword library, keywords matching the syllable classification features; wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;
  • the update part is configured to update the network parameters of the model at least once based on the determined keywords and the keyword tags, to obtain a trained lip recognition model.
  • An embodiment of the present disclosure also provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can be run on the processor.
  • when the processor executes the program, some or all of the steps in the above method are implemented.
  • An embodiment of the present disclosure also provides a vehicle, including:
  • a vehicle-mounted camera configured to capture a sequence of image frames containing a mouth object
  • a vehicle machine connected to the vehicle-mounted camera and configured to: obtain the image frame sequence containing the mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determine the keyword matching the syllable classification features in the preset keyword library.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • Embodiments of the present disclosure also provide a computer program, including computer-readable code which, when executed by a processor in a computer device, implements some or all of the steps in the above method.
  • Embodiments of the present disclosure also provide a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps in the above method are implemented.
  • an image frame sequence whose image content includes a mouth object is obtained.
  • an image frame sequence that records the change process of the mouth object when the set object speaks can be obtained;
  • mouth key point feature extraction is performed on each image frame in the image frame sequence to obtain the mouth key point features of each of the multiple image frames in the image frame sequence.
  • Compared with performing lip recognition on a mouth region image sequence obtained by cropping the face image, using mouth key point features for lip recognition can reduce the amount of calculation required in the image processing process, thereby reducing the hardware requirements for the computer device that performs the image processing method; moreover, because lip recognition based on mouth key point features involves the extraction of mouth key points, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, thereby improving the generalization ability of lip recognition.
  • Further, syllable classification features are generated according to the mouth key point features of multiple image frames in the image frame sequence, and the syllable classification features represent the syllable categories corresponding to the mouth shapes of the mouth object in the image frame sequence.
  • Because the syllable classification features are extracted from the mouth key point features, they can represent at least one syllable corresponding to the mouth shape of the mouth object in the image frame sequence; using the syllable classification features to assist lip recognition can therefore improve the accuracy of lip recognition.
  • Finally, the matching keywords are determined by matching the syllable classification features in the preset keyword library. In this way, the keywords corresponding to the syllables are determined according to the syllable categories represented by the syllable classification features of the image frame sequence, thereby improving the accuracy of the keywords obtained by image processing.
  • In this way, mouth key point features are obtained by performing mouth key point feature extraction on the image frames in the image frame sequence, the mouth key point features are used to generate syllable classification features corresponding to the image frame sequence, and keywords are obtained by matching the syllable classification features in the preset keyword library.
  • Figure 1 is a schematic flow chart of an implementation of an image processing method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of facial key points provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure.
  • Figure 5 is a schematic flow diagram of another implementation of an image processing method provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic flowchart of the implementation of a method for generating a lip language recognition model provided by an embodiment of the present disclosure
  • Figure 7 is a schematic structural diagram of a lip language recognition model provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of an image processing device provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a device for generating a lip language recognition model provided by an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present disclosure.
  • The terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of objects. It is understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the disclosure described herein can be implemented in an order other than that illustrated or described herein.
  • lip recognition can make up for the limitations of speech recognition, thereby enhancing the robustness of human-computer interaction.
  • Embodiments of the present disclosure provide an image processing method, which can be executed by a processor of a computer device.
  • computer equipment can refer to cars, servers, laptops, tablets, desktop computers, smart TVs, set-top boxes, mobile devices (such as mobile phones, portable video players, personal digital assistants, dedicated messaging devices, portable gaming devices ) and other equipment with data processing capabilities.
  • Figure 1 is a schematic flow chart of an image processing method provided by an embodiment of the present disclosure. As shown in Figure 1, the method includes the following steps S101 to S104:
  • Step S101 Obtain an image frame sequence containing a mouth object.
  • the computer device acquires multiple image frames.
  • the multiple image frames can be obtained by capturing the set object during the speaking process with a collection component such as a camera.
  • the multiple image frames are sorted according to the time parameter corresponding to each image frame to obtain an original image frame sequence.
  • the multiple image frames in the image frame sequence at least include the mouth object of the same setting object.
  • the subjects are usually humans, but can also be other expressive animals, such as orangutans.
  • the image frame sequence at least covers the entire process of the set object saying a sentence. For example, multiple image frames in the image frame sequence at least cover the entire process of the set object saying "turn on the music".
  • the number of image frames included in the image frame sequence may not be fixed.
  • the number of frames in the image frame sequence may be 40 frames, 50 frames, or 100 frames.
  • the original image frame sequence can be directly used as the image frame sequence used for subsequent image processing; the original image sequence can also be further processed to obtain the image frame sequence used for subsequent image processing.
  • For example, frame interpolation can be performed on the original image sequence to obtain an image frame sequence with a set number of frames. Therefore, the image frames in the image frame sequence in various embodiments of the present disclosure may be actually collected using the acquisition component, or may be generated based on the actually collected image frames.
  • the computer device can obtain multiple image frames by calling a camera, or it can obtain them from other computer devices; for example, the computer device is a vehicle, and the vehicle can obtain it through a vehicle-mounted camera.
  • at least one image frame in the image frame sequence can be derived from a video, where one video can include multiple video frames and each video frame corresponds to an image frame; the image frames in the image frame sequence can be continuous video frames, or discontinuous video frames selected from the multiple video frames at fixed or non-fixed time intervals.
  • multiple image frames collected in advance can be obtained, or multiple image frames can be obtained by collecting images of the set object in real time, which is not limited here.
  • an image frame sequence can be obtained that records the change process of the mouth object when the set object speaks.
  • Step S102 Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame.
  • the position information of the mouth key points associated with the mouth object is extracted from the facial key points of each image frame, and a mouth key point feature corresponding to each image frame is determined based on the position information of the mouth key points in at least one image frame, thereby obtaining at least one mouth key point feature of the image frame sequence.
  • the mouth key point features are calculated from the position information of the mouth key points, and the position information of the mouth key points is related to the mouth shape of the mouth object contained in the image frame; that is, the position information of the same mouth key point in different image frames is related to the mouth shape of the mouth object in each of these image frames.
  • One way of determining the mouth key point feature corresponding to an image frame based on the position information of the mouth key points in that image frame is to sort the position information of each mouth key point in the image frame according to the key point serial number corresponding to each mouth key point, to obtain a position sequence, and then use the position sequence as the mouth key point feature.
  • For example, if each image frame includes 4 mouth key points whose coordinates are (x1, y1), (x2, y2), (x3, y3), (x4, y4), the mouth key point feature determined for the image frame is [(x1, y1), (x2, y2), (x3, y3), (x4, y4)].
  • the key point serial number corresponding to a mouth key point is the number assigned to that mouth key point among the numbers preset for the facial key points. For example, in the schematic diagram of facial key points shown in Figure 3, 106 key points are preset and numbered from 0 to 105, of which key points No. 84 to 103 are the mouth key points used to describe the mouth.
  • Another way of determining the mouth key point feature corresponding to an image frame based on the position information of the mouth key points is to calculate the difference information between the position information of each mouth key point in the image frame and in an adjacent frame of the image frame, sort the difference information of each mouth key point in the image frame according to the corresponding key point serial number, and use the sorted sequence as the mouth key point feature corresponding to the image frame; the adjacent frame can be the previous image frame and/or the subsequent image frame of the image frame in the image frame sequence.
  • the difference information of the position information includes at least one of the following: difference information between this image frame and the previous image frame; difference information between this image frame and the next image frame.
  • For example, each image frame includes 4 mouth key points; the coordinates of these mouth key points in the first image frame are (x1, y1), (x2, y2), (x3, y3), (x4, y4), and their coordinates in the second image frame are (x'1, y'1), (x'2, y'2), (x'3, y'3), (x'4, y'4). The mouth key point feature corresponding to the two image frames is then [(x'1-x1, y'1-y1), (x'2-x2, y'2-y2), (x'3-x3, y'3-y3), (x'4-x4, y'4-y4)].
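  • For illustration only, the following NumPy sketch shows the two feature layouts described above, assuming each frame supplies its mouth key point coordinates as a (K, 2) array; the function names and the 4-point example values are hypothetical, not part of the embodiment.

```python
import numpy as np

def position_feature(mouth_kpts):
    """Mouth key point feature as a position sequence.

    mouth_kpts: (K, 2) array of (x, y) coordinates sorted by key point serial number.
    Returns the flat vector [x1, y1, x2, y2, ...].
    """
    return mouth_kpts.reshape(-1)

def difference_feature(prev_kpts, curr_kpts):
    """Mouth key point feature as per-point differences between adjacent frames.

    Returns the flat vector [x'1-x1, y'1-y1, x'2-x2, y'2-y2, ...].
    """
    return (curr_kpts - prev_kpts).reshape(-1)

# Example with 4 mouth key points per frame, as in the text above.
frame1 = np.array([[10.0, 20.0], [12.0, 19.0], [14.0, 20.0], [12.0, 23.0]])
frame2 = np.array([[10.0, 21.0], [12.0, 18.0], [14.0, 21.0], [12.0, 25.0]])
print(position_feature(frame1))            # 8-dimensional position sequence
print(difference_feature(frame1, frame2))  # 8-dimensional inter-frame differences
```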
  • In this way, using mouth key point features for lip recognition can reduce the amount of calculation required in the image processing process, thereby reducing the hardware requirements for the computer device that performs the image processing method and making the image processing method universally applicable to various computer devices.
  • In addition, because lip recognition based on mouth key point features involves the extraction of mouth key points, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, improving the generalization ability and accuracy of lip recognition.
  • Step S103 Generate syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence.
  • feature extraction can be performed on the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features, where the syllable classification features represent at least one preset syllable category corresponding to the image frame sequence, and each preset The syllable category represents at least one syllable with the same or similar mouth shape, that is, the syllable classification feature may represent a syllable category corresponding to the mouth shape of the mouth object in the image frame sequence.
  • Each element in the syllable classification feature can be used to indicate whether there is a syllable type in the image frame sequence, thereby determining at least one syllable corresponding to the mouth shape contained in the image in the image frame sequence.
  • the syllable types can be divided into a set number of preset syllable categories in advance according to the similarity of the mouth shapes.
  • Each preset syllable category includes at least one syllable type with the same or similar mouth shape.
  • the set number can be set based on the language type; the degree of mouth shape similarity can be determined manually based on experience or through machine learning. Taking Chinese as an example, without considering tones, Chinese characters have a total of 419 syllable types.
  • These syllables can be divided into 100 categories according to the corresponding mouth shapes, so the length of the corresponding syllable classification feature is 100. For other languages, such as English, the syllable types can be divided into a set number of preset syllable categories by combining phonetic symbols, and the length of the syllable classification feature can be set based on the correspondence between syllables and mouth shapes.
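  • As a rough illustration of the syllable classification feature described above, the sketch below represents it as a multi-hot vector over 100 mouth-shape categories; the syllable-to-category mapping shown is invented for the example and is not the actual grouping used by the embodiments.

```python
import numpy as np

NUM_SYLLABLE_CATEGORIES = 100  # e.g. the 419 Chinese syllables grouped by mouth shape

# Invented grouping for illustration: syllables with the same or similar mouth shape
# share one category index.
syllable_to_category = {
    "da": 17, "ta": 17, "na": 17,
    "kai": 42, "gai": 42,
    "yin": 63,
    "yue": 85,
}

def syllable_classification_target(syllables):
    """Multi-hot vector: element i marks whether syllable category i occurs in the utterance."""
    target = np.zeros(NUM_SYLLABLE_CATEGORIES, dtype=np.float32)
    for syllable in syllables:
        target[syllable_to_category[syllable]] = 1.0
    return target

# "da kai yin yue" ("turn on the music") activates categories 17, 42, 63 and 85.
print(syllable_classification_target(["da", "kai", "yin", "yue"]).nonzero()[0])
```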
  • spatio-temporal features corresponding to each mouth key point feature can be obtained by performing spatio-temporal feature extraction on at least two mouth key point features of the image frame sequence, and syllable classification features can be determined based on the spatio-temporal features.
  • the temporal prediction network and/or the fully convolutional network can be used to extract spatiotemporal features to obtain the spatiotemporal features corresponding to each mouth key point feature.
  • a flatten layer or other methods can be used to splice at least two spatio-temporal features, and then the spliced spatio-temporal features can be classified to obtain syllable classification features.
  • syllable classification features are extracted from the mouth key point features.
  • the syllable classification features can represent at least one syllable corresponding to the mouth shape of the mouth object in the image frame sequence; using the syllable classification features to assist lip recognition can therefore improve the accuracy of lip recognition.
  • Step S104 Determine keywords matching the syllable classification features in the preset keyword database.
  • a certain number of keywords are preset in the keyword library, and each keyword can be matched against the syllable classification features, so that the image processing result of lip recognition can be obtained based on the matching results between the keywords and the syllable classification features. After the keyword is determined, the keyword itself can be output directly, or the serial number of the keyword in the keyword library can be output.
  • the preset keywords in the preset keyword library can be set according to the specific application scenario. For example, in a driving scenario, the preset keywords can be set to "turn on the audio", "open the left car window", etc. It should be noted that the preset keyword library merely represents the storage form of the keywords.
  • the matching keywords can be determined by combining the detection results obtained by speech detection and the recognition results obtained by lip recognition; for example, the weights of the detection results of speech detection and the recognition results of lip recognition can be set separately, and the weighted The calculation results are used as the basis for matching.
  • speaking detection may include, but is not limited to, performing at least one detection of whether the mouth object is in a speaking state, of the speaking interval while in the speaking state, and so on.
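  • The weighted combination mentioned above might look like the following sketch, under the assumption that both lip recognition and speech detection yield per-keyword confidence scores; the weights, score values and keyword strings are purely illustrative.

```python
def fuse_keyword_scores(lip_scores, speech_scores, lip_weight=0.6, speech_weight=0.4):
    """Weighted combination of lip recognition and speech detection results per keyword.

    lip_scores / speech_scores: dicts mapping keyword -> confidence in [0, 1].
    Returns the best-matching keyword and its fused score.
    """
    keywords = set(lip_scores) | set(speech_scores)
    fused = {
        kw: lip_weight * lip_scores.get(kw, 0.0) + speech_weight * speech_scores.get(kw, 0.0)
        for kw in keywords
    }
    best = max(fused, key=fused.get)
    return best, fused[best]

print(fuse_keyword_scores(
    {"turn on the audio": 0.72, "open the left car window": 0.10},
    {"turn on the audio": 0.55, "open the left car window": 0.20},
))
```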
  • the mouth key point features are obtained by extracting the mouth key point features of the image frames in the image frame sequence, and the mouth key point features are used to generate syllable classification features corresponding to the image frame sequence.
  • Keywords are obtained by matching in the preset keyword library.
  • Lip recognition results are obtained by feature extraction based on two-dimensional image frames, which can reduce the amount of calculation required for the image processing of lip recognition and reduce the hardware requirements for the computer device; at the same time, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, thereby improving the generalization ability of lip recognition; in addition, since the keywords corresponding to the syllables are determined based on the syllable categories represented by the syllable classification features corresponding to the image frame sequence, the keywords obtained by image processing are more accurate, thereby improving the accuracy of lip recognition.
  • the speaking interval of the set object in the video is detected through lip movement recognition processing, and an image frame sequence covering the speaking process of the set object is obtained. That is, the above step S101 can be implemented through the following steps S1011 and S1012:
  • Step S1011 Obtain a video in which the image frame includes the mouth object.
  • the computer device captures the set object through a collection component such as a camera, and obtains a video in which the image frame includes the mouth object.
  • Step S1012 Perform lip movement recognition on the mouth object, and determine multiple video frames in which the mouth object is in a speaking state as an image frame sequence.
  • lip motion recognition technology is used to crop the video to obtain a video recording the speaking process of the set object.
  • the images of this video contain the mouth object in a speaking state; then, multiple video frame images are selected from the cropped video as the image frame sequence.
  • the image frame sequence can at least cover the complete process of the set object speaking, and the video is cropped through lip movement recognition technology, which can reduce the image frames in the image frame sequence that are not related to the speaking process.
  • Performing image processing on the image frame sequence obtained through this scheme and obtaining keywords matching the image sequence can further improve the accuracy of lip recognition and reduce the amount of calculation required for the image processing of lip recognition.
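  • A minimal sketch of this frame-selection idea is given below, assuming a lip movement detector has already produced a per-frame speaking flag; the function name and inputs are hypothetical.

```python
def extract_speaking_interval(video_frames, is_speaking):
    """Crop a video to the interval where lip movement recognition reports a speaking state.

    video_frames: list of frames; is_speaking: list of per-frame booleans from the detector.
    Returns the sub-sequence spanning the first to the last speaking frame.
    """
    speaking_idx = [i for i, flag in enumerate(is_speaking) if flag]
    if not speaking_idx:
        return []
    return video_frames[speaking_idx[0]:speaking_idx[-1] + 1]

# Frames 2..5 are kept as the image frame sequence for lip recognition.
print(extract_speaking_interval(list("abcdefg"), [0, 0, 1, 1, 1, 1, 0]))
```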
  • the number of image frames included in the image frame sequence used for image processing may not be fixed.
  • frame interpolation processing can be performed on the original image sequence collected to obtain an image frame sequence including a preset number of image frames.
  • performing frame interpolation processing on the acquired original image sequence may include the following step S1013 or step S1014:
  • Step S1013 Perform image frame interpolation on the acquired original image sequence including the mouth object to obtain the image frame sequence.
  • One way of performing frame interpolation processing on the acquired original image sequence to obtain an image frame sequence including a preset number of image frames is to perform image interpolation based on the image frames in the original image sequence to generate new image frames, and then, based on the generated image frames and/or the collected image frames, obtain an image frame sequence including the preset number of image frames for subsequent mouth key point feature extraction.
  • Step S1014 Based on the obtained mouth key points in the original image sequence containing the mouth object, interpolate frames on the original image sequence to obtain the image frame sequence.
  • Another way of performing frame interpolation processing on the collected original image sequence to obtain an image frame sequence including a preset number of image frames is to generate newly inserted image frames based on the position information of the mouth key points in the original image sequence, where the position information of the mouth key points in the newly inserted image frames is predicted from the position information of the mouth key points in the original image sequence; this interpolates the original image sequence and yields the preset amount of key point information corresponding to the image frame sequence for subsequent mouth key point feature extraction.
  • the number of image frames can be preset based on experience.
  • the default frame number can be set to 60.
  • the image frame sequence after frame interpolation is used for lip recognition, and there is no requirement on the number of frames of the original image sequence collected, which can improve the robustness of the image recognition method for lip recognition.
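  • One possible realization of the key-point-based frame interpolation of step S1014 is sketched below, using linear interpolation along the time axis as a simple choice of prediction; the target frame count of 60 follows the example above, and the function name is hypothetical.

```python
import numpy as np

def interpolate_keypoint_sequence(kpt_seq, target_len=60):
    """Resample a mouth key point sequence to a preset number of frames.

    kpt_seq: (T, K, 2) array of key point coordinates over the T collected frames.
    Returns a (target_len, K, 2) array; the positions of inserted frames are linearly
    interpolated from the original positions (one simple way to predict them).
    """
    num_frames = kpt_seq.shape[0]
    src_t = np.linspace(0.0, 1.0, num_frames)
    dst_t = np.linspace(0.0, 1.0, target_len)
    flat = kpt_seq.reshape(num_frames, -1)                      # (T, K*2)
    resampled = np.stack(
        [np.interp(dst_t, src_t, flat[:, d]) for d in range(flat.shape[1])], axis=1
    )
    return resampled.reshape(target_len, *kpt_seq.shape[1:])

print(interpolate_keypoint_sequence(np.random.rand(40, 20, 2)).shape)  # (60, 20, 2)
```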
  • the position information of the mouth key points in each image frame and adjacent frames is used to determine the mouth key point characteristics of the image frame. That is, the above step S102 can be implemented by the steps shown in Figure 2 .
  • FIG. 2 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 2:
  • Step S201 Determine the position information of at least two mouth key points of the mouth object in each image frame.
  • the image frame sequence includes at least two image frames, and position information of mouth key points associated with the mouth object in each image frame is extracted.
  • the number of mouth key points is at least two, and they are distributed at least on the upper and lower lips in the image.
  • the number and distribution location of mouth key points are usually related to the key point identification algorithm.
  • the number of mouth key points is 16.
  • the position information of each mouth key point can be represented by a position parameter, for example, it can be represented by two-dimensional coordinates in the image coordinate system, which include width coordinates (abscissa) and height coordinates (ordinate).
  • the position information of the mouth key points is also related to the mouth shape of the mouth object in the image.
  • the position information of the same mouth key point in different images changes as the mouth shape changes.
  • the diagram includes a total of 106 key points numbered 0 to 105, which can describe the facial contour, eyebrows, eyes, nose, mouth and other features, among which key points No. 84 to 103 are the mouth key points used to describe the mouth.
  • For example, the positions of key point No. 93 differ between two frames of images corresponding to different speech contents. If the ordinate of key point No. 93 is smaller in an image, the mouth is opened to a greater degree, and the possibility that the frame corresponds to "ah" is higher.
  • Step S202 For each image frame in the image frame sequence, determine the mouth key point feature corresponding to the image frame based on the position information of the mouth key points in the image frame and in adjacent frames of the image frame.
  • the position information of the mouth key points in at least two image frames including the first image frame may be used to calculate the mouth key point feature of the first image frame.
  • mouth key point features may include inter-frame difference information and/or intra-frame difference information.
  • the first image frame may be any image frame in the image frame sequence.
  • the inter-frame difference information can represent the difference information of the position information of the same mouth key point in different image frames
  • the intra-frame difference information can represent the difference information between the position information of different mouth key points in the same image frame.
  • the position information of each mouth key point in the first image frame and the position information of that mouth key point in adjacent frames of the first image frame are used to calculate the inter-frame difference information of the mouth key point across different image frames; and/or the position information of at least two mouth key points, including the mouth key point, in the first image frame is used to calculate the intra-frame difference information of the mouth key point in the first image frame.
  • In this way, embodiments of the present disclosure use the position information of multiple mouth key points in multiple image frames to obtain mouth key point features, so that the mouth key point features can represent the changing process of the mouth key points during the speaking process corresponding to the image frame sequence and better capture the mouth shape changes of the set object while speaking; using such mouth key point features for lip recognition can therefore improve the accuracy of lip recognition.
  • the difference in position information of the mouth key points in adjacent frames and the difference in position information of the preset mouth key point pairs in the same image frame are used to determine the mouth key point features; that is, the above step S202 can be implemented through the following steps S2021 and S2022:
  • Step S2021 For each mouth key point, according to the position information of the mouth key point in the image frame and the position information of the mouth key point in adjacent image frames of the image frame, determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point.
  • When determining the mouth key point feature corresponding to each first image frame, for each mouth key point, the difference information between the position information of the mouth key point in the first image frame and its position information in each second image frame of at least one second image frame is calculated.
  • the second image frame is an image frame adjacent to the first image frame, that is, an adjacent frame of the first image frame;
  • the difference information may be the first height difference, the first width difference, or a combination of the first height difference and the first width difference;
  • the first width difference is the width difference of the mouth key point between the two image frames (the first image frame and the second image frame), that is, the difference in the abscissa of the mouth key point in the two image frames.
  • the first height difference is the height difference of the mouth key point in the two image frames (that is, the difference in the ordinate of the mouth key point in the two image frames).
  • When calculating the difference, it can be set as the position information of the subsequent image frame minus the position information of the previous image frame, or as the position information of the previous image frame minus the position information of the subsequent image frame. In this way, for each mouth key point, using the first image frame and each second image frame of the at least one second image frame, as many pieces of difference information as there are second image frames can be obtained, and these pieces of difference information are determined as the inter-frame difference information of the mouth key point in the first image frame.
  • For example, the coordinates of a mouth key point in three consecutive image frames are (x1, y1), (x'1, y'1), (x"1, y"1); the second image frame is taken as the first image frame, and the preceding and following image frames (the first and the third) are second image frames. The inter-frame difference information of this mouth key point in the first image frame is then (x'1-x1, y'1-y1, x"1-x'1, y"1-y'1).
  • Step S2022 For each mouth key point, determine the second height difference and/or the second width difference between the mouth key point and other mouth key points of the same mouth object in the image frame, and determine the intra-frame difference information of the mouth key point therefrom.
  • When determining the mouth key point feature corresponding to each first image frame, for each mouth key point, the second height difference and/or the second width difference between the mouth key point and other mouth key points of the same mouth object is calculated, and the second height difference and/or the second width difference is determined as the intra-frame difference information, in the first image frame, of each mouth key point in the corresponding preset mouth key point pair.
  • The other mouth key points can be fixed mouth key points, such as the mouth key point corresponding to the lip bead, for example key point No. 98 shown in Figure 3; they can also be mouth key points that satisfy a set positional relationship with the mouth key point in question.
  • the two mouth key points are used as a preset mouth key point pair.
  • When determining the preset mouth key point pairs, the position information of the mouth key points in the image can be considered; that is to say, the two mouth key points belonging to the same preset mouth key point pair satisfy a set positional relationship. For example, two mouth key points located on the upper and lower lips of the mouth object can be determined as a mouth key point pair; two mouth key points whose width difference in the image is less than a preset value can also be determined as a preset mouth key point pair. In this way, the second height difference of the preset mouth key point pair can better represent the mouth shape of the mouth object in the first image frame.
  • One mouth key point can form preset mouth key point pairs with two or more mouth key points respectively; that is to say, each mouth key point can belong to multiple mouth key point pairs.
  • In this case, the second height difference of each mouth key point pair to which the mouth key point belongs is determined respectively, and a weighted sum of the at least two second height differences is used to determine the intra-frame difference information of the mouth key point in the first image frame.
  • For example, key point No. 86 can form a preset mouth key point pair with key point No. 103 and with key point No. 94 respectively; that is to say, key point No. 86 belongs to two mouth key point pairs.
  • inter-frame difference information and intra-frame difference information of a mouth key point in the first image frame are obtained respectively, and the inter-frame difference information and intra-frame difference information can be spliced.
  • The spliced result is used as the element, corresponding to that mouth key point, of the mouth key point feature of the first image frame; thereby, based on the inter-frame difference information and intra-frame difference information of each mouth key point in the first image frame, the mouth key point feature element corresponding to each mouth key point is determined, and the mouth key point feature corresponding to the first image frame is determined from the feature elements corresponding to all the mouth key points.
  • In this way, the mouth key point features are obtained from the inter-frame difference information of each mouth key point's position across adjacent image frames and from the intra-frame difference information between the mouth key point and its preset paired mouth key point, so that the mouth key point features can represent the differences between mouth key points that satisfy the set relationship, improving the accuracy of mouth shape determination in each frame of image; moreover, the mouth key point features can also represent the frame-to-frame changing process of the mouth key points during speaking corresponding to the image frame sequence. The changing characteristics of the mouth shape during speaking can thus be better extracted, thereby improving the accuracy of lip recognition.
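  • The per-key-point feature described here can be pictured with the following sketch, which stacks four inter-frame differences (with respect to the previous and next frame) and one intra-frame height difference into a 5-dimensional element, matching the layout mentioned in the spatial-feature discussion below; the pairing of key points and all names are illustrative.

```python
import numpy as np

def keypoint_feature_elements(prev, curr, nxt, pair_idx):
    """5-dimensional feature element per mouth key point for one image frame.

    prev, curr, nxt: (K, 2) key point coordinates in the previous, current and next frame.
    pair_idx: for each key point, the index of its paired key point (e.g. the point on the
              opposite lip) used for the intra-frame height difference.
    Returns a (K, 5) array: 4 inter-frame differences followed by 1 intra-frame height difference.
    """
    inter = np.concatenate([curr - prev, nxt - curr], axis=1)   # (K, 4) inter-frame differences
    intra = (curr[:, 1] - curr[pair_idx, 1])[:, None]           # (K, 1) intra-frame height difference
    return np.concatenate([inter, intra], axis=1)

num_kpts = 20
prev, curr, nxt = np.random.rand(3, num_kpts, 2)
pairs = np.arange(num_kpts)[::-1]  # illustrative pairing of upper- and lower-lip points
print(keypoint_feature_elements(prev, curr, nxt, pairs).shape)  # (20, 5)
```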
  • Spatio-temporal features are extracted from the mouth key point features of the image frame sequence to obtain the spatio-temporal features corresponding to the mouth object in each image frame, and syllable feature classification is performed based on the spatio-temporal features to obtain the syllable classification features corresponding to the mouth object; that is, the above-mentioned step S103 can be implemented through the steps shown in Figure 4.
  • FIG. 4 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 4:
  • Step S401 Perform spatial feature extraction on the key point features of the mouth in each image frame to obtain the spatial features of the mouth object in each image frame.
  • each mouth key point feature of the image frame sequence can be obtained.
  • Each mouth key point feature is calculated from the position information of the mouth key point.
  • the position information of the mouth key points represents the position of the mouth object in an image frame.
  • each mouth key point feature corresponds to an image frame.
  • the spatial features of the mouth object in the corresponding image frame can be extracted from the mouth key point feature using any suitable feature extraction method. For example, convolutional neural networks, recurrent neural networks, etc. can be used for extraction to obtain spatial features.
  • The inter-frame difference information and intra-frame difference information of the mouth key points of the mouth object are fused to obtain the spatial features of the mouth object in each image frame. That is, the above step S401 can be implemented through the following steps S4011 and S4012:
  • Step S4011 Fuse the inter-frame difference information and intra-frame difference information of multiple mouth key points of the mouth object to obtain the inter-frame difference features and intra-frame difference features of the mouth object in each image frame.
  • each mouth key point feature is calculated from the position information of the mouth key point.
  • the position information of the mouth key point represents the position of the mouth object in an image frame.
  • Each mouth key point feature corresponds to an image frame.
  • the inter-frame difference information can represent the difference information of the position information of the same mouth key point in different frames
  • the intra-frame difference information can represent the difference information between the position information of different mouth key points in the same frame.
  • The inter-frame difference information of the multiple mouth key points in each image frame is fused, and the intra-frame difference information of the multiple mouth key points in each image frame is fused, to obtain the above-mentioned inter-frame difference features and intra-frame difference features of the mouth object in each image frame. The fusion can be implemented by using a convolutional neural network, a recurrent neural network, etc.; for example, a convolution kernel of a preset size is used to fuse the information of the multiple mouth key points, thereby fusing the inter-frame and/or intra-frame difference information of the multiple mouth key points.
  • a mouth key point corresponds to an element in the mouth key point feature
  • For example, the element corresponding to each mouth key point is a 5-dimensional feature: the first 4 dimensions are inter-frame difference information, and the fifth dimension is intra-frame difference information, that is, the height difference and/or width difference between the mouth key point and other mouth key points of the same mouth object in the same image frame.
  • Feature extraction is performed separately on each dimension of the 5-dimensional features across at least two mouth key points (that is, across the elements of the mouth key point feature); the first 4 dimensions of the resulting features are used as the inter-frame difference features of the mouth object in this image frame, and the fifth dimension is used as the intra-frame difference feature of the mouth object in this image frame.
  • Step S4012 Fusion of inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames to obtain spatial features of the mouth object in each image frame.
  • The fusion of the inter-frame difference features and intra-frame difference features of the multiple image frames can be implemented by using a convolutional neural network, a recurrent neural network, etc.; a convolution kernel of a preset size is used to fuse the mouth key point information of the multiple image frames, thereby fusing the inter-frame difference information and intra-frame difference information of each mouth key point and obtaining the spatial features of the mouth object in each image frame.
  • In this way, the inter-frame difference information and the intra-frame difference information of at least two mouth key points of the mouth object in each image frame are fused respectively, to obtain inter-frame difference features representing the inter-frame difference information of the mouth key points and intra-frame difference features representing the intra-frame difference information between the mouth key points; the inter-frame difference features and intra-frame difference features of the mouth key points in each image frame are then fused, which better extracts the spatial features of the mouth object in each image frame and improves the accuracy of determining the mouth shape in each frame of image.
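  • A simplified PyTorch sketch of this spatial fusion is given below: the inter-frame and intra-frame channels are each fused across the key point axis with a convolution and then combined per frame; the module name, channel sizes and layer choices are assumptions, not the embodiment's actual network.

```python
import torch
import torch.nn as nn

class SpatialFeature(nn.Module):
    """Per-frame spatial feature: fuse difference information across mouth key points.

    Input: (batch, T, K, 5) per-key-point features (4 inter-frame dims + 1 intra-frame dim).
    Output: (batch, T, out_dim) spatial feature of the mouth object in each image frame.
    """
    def __init__(self, num_kpts=20, out_dim=64):
        super().__init__()
        self.inter_fuse = nn.Conv1d(4, out_dim, kernel_size=num_kpts)  # inter-frame difference feature
        self.intra_fuse = nn.Conv1d(1, out_dim, kernel_size=num_kpts)  # intra-frame difference feature
        self.mix = nn.Linear(2 * out_dim, out_dim)                     # fuse the two features per frame

    def forward(self, x):
        b, t, k, _ = x.shape
        x = x.reshape(b * t, k, 5).transpose(1, 2)                  # (b*T, 5, K)
        inter = torch.relu(self.inter_fuse(x[:, :4])).squeeze(-1)   # (b*T, out_dim)
        intra = torch.relu(self.intra_fuse(x[:, 4:])).squeeze(-1)   # (b*T, out_dim)
        spatial = self.mix(torch.cat([inter, intra], dim=1))        # (b*T, out_dim)
        return spatial.reshape(b, t, -1)                            # (b, T, out_dim)

print(SpatialFeature()(torch.randn(2, 60, 20, 5)).shape)  # torch.Size([2, 60, 64])
```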
  • Step S402 Perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object.
  • The spatial features of the mouth object in at least two image frames including the third image frame can be used to perform feature extraction to obtain the spatio-temporal features of the mouth object in the third image frame.
  • the spatiotemporal features of the mouth object can be extracted from the spatial features using any suitable feature extraction method. For example, convolutional neural networks, recurrent neural networks, etc. can be used to extract temporal features to obtain spatiotemporal features.
  • temporal feature extraction of the spatial features of the mouth object in multiple image frames can be performed multiple times.
  • For example, a 1×5 convolution kernel is used for feature extraction.
  • One such convolution covers the spatial features of the two image frames before and the two image frames after the third image frame, so the extracted spatio-temporal features contain information from five image frames.
  • As the number of temporal feature extractions increases, the spatio-temporal features corresponding to each image frame can represent more information from surrounding image frames, allowing information between frames to be exchanged, so the corresponding receptive field becomes larger. This is conducive to learning words composed of multiple image frames and the timing between different words, which can improve the accuracy of lip recognition, but it requires more computing resources and affects hardware computing efficiency. Considering both accuracy and hardware computing efficiency, in practical applications the number of temporal feature extractions can be set to 5.
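  • The stacked 1×5 temporal convolutions can be sketched as follows, assuming the per-frame spatial features from the previous step; the module name, feature dimension and use of padding are illustrative choices.

```python
import torch
import torch.nn as nn

class TemporalFeature(nn.Module):
    """Stacked 1x5 convolutions along the frame axis.

    Each layer mixes a frame with the two frames before and after it, so stacking the
    layers five times progressively enlarges the receptive field over the sequence.
    """
    def __init__(self, dim=64, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=5, padding=2) for _ in range(num_layers)]
        )

    def forward(self, spatial):                  # spatial: (batch, T, dim)
        x = spatial.transpose(1, 2)              # (batch, dim, T): convolve along time
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x.transpose(1, 2)                 # (batch, T, dim) spatio-temporal features

print(TemporalFeature()(torch.randn(2, 60, 64)).shape)  # torch.Size([2, 60, 64])
```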
  • Step S403 Extract syllable classification features based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  • The syllable classification features of the mouth object are obtained by extracting syllable classification features from the spatio-temporal features corresponding to each image frame of at least two image frames; the syllable classification features can represent at least one syllable corresponding to the mouth shapes that appear during the speaking process of the mouth object, and each element in the syllable classification feature is used to determine whether a preset syllable category occurs during the speaking process, thereby determining at least one syllable corresponding to the mouth shapes contained in the image frames of the image frame sequence.
  • the syllable classification features of mouth objects can be extracted from spatio-temporal features using any suitable feature extraction method. For example, fully connected layers, global average pooling layers, and other methods can be used to extract syllable classification features from spatiotemporal features to obtain syllable classification features.
  • Embodiments of the present disclosure support the use of convolutional neural networks for spatiotemporal feature extraction; compared with using time series prediction networks such as recurrent neural networks (recursive neural networks) to extract spatiotemporal features, the amount of calculation required to extract spatiotemporal features through convolutional neural networks is less. It can reduce the consumption of computing resources and reduce the hardware requirements for computer equipment used to implement lip recognition.
  • In this way, the image processing method provided by the embodiments of the present disclosure can be implemented with more lightweight chips, allowing more hardware to support the image processing method in the lip recognition process of the embodiments of the present disclosure and improving the versatility of lip recognition; for example, computer devices such as vehicle machines can also realize lip recognition.
  • Embodiments of the present disclosure also provide an image processing method, which can be executed by a processor of a computer device. As shown in Figure 5, the method includes the following steps S501 to S504:
  • Step S501 Obtain an image frame sequence containing a mouth object.
  • step S501 corresponds to the aforementioned step S101, and during implementation, reference may be made to the specific implementation of the aforementioned step S101.
  • Step S502 Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame.
  • step S502 corresponds to the aforementioned step S102, and during implementation, reference may be made to the specific implementation of the aforementioned step S102.
  • Step S503 Use the trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features.
  • the syllable feature extraction network can be any suitable network for feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc.; those skilled in the art can select an appropriate network structure for the syllable feature extraction network based on the actual situation, which is not limited by the embodiments of this disclosure.
  • Step S504 Use the trained classification network to determine keywords matching the syllable classification features in the preset keyword library.
  • the classification network can be any suitable network for feature classification, it can be a global average pooling layer, a fully connected layer, etc. Those skilled in the art can select an appropriate network structure for the classification network according to the actual situation, which is not limited by the embodiments of the present disclosure.
  • a trained syllable feature extraction network is used to process the key point features of the mouth to obtain syllable classification features; the trained classification network is used to determine the key matching the syllable classification features in the preset keyword library word.
  • the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network; that is, the above step S503 can be implemented through the following steps S5031 to S5033:
  • Step S5031 Use the spatial feature extraction sub-network to perform spatial feature extraction on the key point features of the mouth in each image frame to obtain the spatial features of the mouth object in each image frame.
  • the spatial feature extraction sub-network can be any suitable network used for image feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc. Those skilled in the art can select an appropriate network structure based on the actual spatial feature extraction method for each mouth key point feature, which is not limited by the embodiments of the present disclosure.
  • Step S5032 Use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object.
  • the temporal feature extraction sub-network can be any suitable network used for image feature extraction, which can include but is not limited to convolutional neural networks, recurrent neural networks, etc.
  • Those skilled in the art can select an appropriate network structure based on the actual method of performing at least one temporal feature extraction on the spatial features of the mouth object in at least one image frame, which is not limited by the embodiments of the present disclosure.
  • Step S5033 Use the classification feature extraction sub-network to extract syllable classification features based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  • the classification feature extraction sub-network can be any suitable network for feature classification, it can be a global average pooling layer, a fully connected layer, etc. Those skilled in the art can select an appropriate network structure based on the actual classification feature extraction method for each spatio-temporal feature of the mouth object, which is not limited by the embodiments of the present disclosure.
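  • Putting the sub-networks together, a self-contained sketch of one possible wiring is shown below: a spatial feature extraction stage, stacked 1×5 temporal convolutions, global average pooling over frames, a syllable classification head and a keyword classification network; all dimensions, layer types and the class name are assumptions rather than the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class LipReadingModel(nn.Module):
    """Illustrative wiring of the sub-networks: spatial feature extraction, temporal feature
    extraction, syllable classification features, and keyword classification."""

    def __init__(self, num_kpts=20, dim=64, num_syllable_cats=100, num_keywords=30):
        super().__init__()
        self.spatial = nn.Conv1d(5, dim, kernel_size=num_kpts)      # fuse key points per frame
        self.temporal = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU())
            for _ in range(5)                                       # stacked 1x5 temporal convolutions
        ])
        self.syllable_head = nn.Linear(dim, num_syllable_cats)      # classification feature sub-network
        self.keyword_head = nn.Linear(num_syllable_cats, num_keywords)  # classification network

    def forward(self, kpt_feats):                                   # (batch, T, K, 5)
        b, t, k, d = kpt_feats.shape
        x = kpt_feats.reshape(b * t, k, d).transpose(1, 2)          # (b*T, 5, K)
        x = torch.relu(self.spatial(x)).squeeze(-1).reshape(b, t, -1)    # (b, T, dim) spatial features
        x = self.temporal(x.transpose(1, 2)).transpose(1, 2)             # (b, T, dim) spatio-temporal features
        pooled = x.mean(dim=1)                                           # global average pooling over frames
        syllable_logits = self.syllable_head(pooled)                     # syllable classification features
        return syllable_logits, self.keyword_head(syllable_logits)      # keyword scores

syllables, keywords = LipReadingModel()(torch.randn(2, 60, 20, 5))
print(syllables.shape, keywords.shape)  # torch.Size([2, 100]) torch.Size([2, 30])
```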
  • Embodiments of the present disclosure also provide a method of generating a lip recognition model, which method can be executed by a processor of a computer device. As shown in Figure 6, the method includes the following steps S601 to S604:
  • Step S601 Obtain a sample image frame sequence including a mouth object.
  • the computer device obtains a sequence of sample image frames that have been labeled with keyword tags.
  • the sequence of sample image frames includes multiple sample image frames.
  • the sample image frames in the sample image frame sequence are sorted according to the time parameter corresponding to each sample image frame.
  • the number of sample image frames included in the sample image frame sequence may not be fixed.
  • the number of sample image frames included in the sample image frame sequence may be 40 frames, 50 frames, or 100 frames.
  • Step S602 Perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain mouth key point features of each sample image frame.
  • the position information of the mouth key points associated with the mouth object is extracted from the facial key points of each sample image frame, and the mouth key point feature corresponding to each sample image frame is determined based on the position information of the mouth key points of at least one sample image frame, thereby obtaining at least one mouth key point feature of the sample image frame sequence.
  • the mouth key point features are calculated from the position information of the mouth key points, and the position information of the mouth key points is related to the mouth shape of the mouth object contained in the sample image frame; that is, the position information of the same mouth key point in different sample image frames is related to the mouth shape of the mouth object in each of those sample image frames.
  • one way of determining the mouth key point feature corresponding to a sample image frame based on the position information of its mouth key points is to sort the position information of each mouth key point in the sample image frame according to the key point serial number corresponding to each mouth key point to obtain a position sequence, and use the position sequence as the mouth key point feature.
  • another way of determining the mouth key point feature corresponding to a sample image frame based on the position information of its mouth key points is to calculate difference information between the position information of each mouth key point in the sample image frame and in the adjacent frames of that sample image frame, sort the difference information of the mouth key points in the sample image frame according to the corresponding key point serial numbers, and use the sorted sequence as the mouth key point feature corresponding to the image frame; the adjacent frames can be the previous sample image frame and/or the subsequent sample image frame. A sketch of this difference-based construction is given below.
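A minimal sketch of the difference-based construction, assuming each mouth key point is given as (x, y) coordinates and that differences are taken towards both the previous and the next frame; the exact composition of the difference information is not fixed by the present disclosure.

```python
import numpy as np

def mouth_keypoint_features(positions: np.ndarray) -> np.ndarray:
    """Difference-based mouth key point features: for every sample image frame,
    the feature is the per-key-point difference of position information relative
    to the adjacent frames, kept in key point serial-number order.

    positions: (frames, keypoints, 2) array of (x, y) coordinates.
    Returns: (frames, keypoints, 4) array (differences to previous and next frame).
    """
    prev_diff = np.zeros_like(positions)
    next_diff = np.zeros_like(positions)
    prev_diff[1:] = positions[1:] - positions[:-1]    # difference to the previous frame
    next_diff[:-1] = positions[:-1] - positions[1:]   # difference to the subsequent frame
    # Concatenate along the coordinate axis; key points stay sorted by serial number.
    return np.concatenate([prev_diff, next_diff], axis=-1)
```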
  • steps S601 to S602 respectively correspond to the aforementioned steps S101 to S102.
  • during implementation, reference may be made to the specific implementation of the aforementioned steps S101 to S102.
  • Step S603 Using the model to be trained, generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine keywords matching the syllable classification features in the preset keyword library.
  • the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence.
  • the model to be trained can be any suitable deep learning model, and is not limited here.
  • those skilled in the art can use an appropriate network structure to construct the model to be trained according to the actual situation.
  • the model to be trained is used to process the mouth key point features of multiple sample image frames in the sample image frame sequence to generate syllable classification features.
  • the syllable classification feature represents the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence.
  • the process of determining keywords matching the syllable classification features in the preset keyword library corresponds to the processing of the mouth key point features in steps S103 to S104 of the previous embodiment; during implementation, reference may be made to the specific implementation of steps S103 to S104.
  • syllable-assisted learning can effectively reduce the learning difficulty of keyword recognition and classification, thereby improving the accuracy of lip recognition.
  • Step S604 Update the network parameters of the model at least once based on the determined keywords and the keyword tags, to obtain a trained lip recognition model.
  • whether to update the network parameters of the model can be determined based on the determined keywords and the keyword tags.
  • when an update is needed, an appropriate parameter update algorithm is used to update the network parameters of the model, the model with updated parameters is used to re-determine the matching keywords, and whether to continue updating the network parameters of the model is determined based on the re-determined keywords and the keyword tags.
  • the finally updated model is determined to be the trained lip recognition model.
  • the loss value can be determined based on the determined keywords and the keyword tags, and the network parameters of the model are updated when the loss value does not meet the preset conditions; when the loss value meets the preset conditions, or when the number of updates to the network parameters of the model reaches a set threshold, the update of the network parameters of the model is stopped, and the finally updated model is determined as the trained lip recognition model.
  • the preset conditions may include, but are not limited to, at least one of the loss value being less than the set loss threshold, the change in the loss value converging, and the like. During implementation, the preset conditions may be set according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • the method of updating the network parameters of the model may be determined based on the actual situation, and may include but is not limited to at least one of the gradient descent method, Newton's momentum method, etc., which is not limited here.
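The following is a minimal sketch of the training flow of step S604. Assumptions not fixed by the text: the model returns keyword logits for a batch of mouth key point feature sequences, cross-entropy stands in for the loss between the determined keywords and the keyword tags, and plain SGD stands in for "gradient descent"; the stop conditions follow the text (loss below a set threshold, or a set number of parameter updates).

```python
import torch
from torch import nn

def train_lip_model(model, dataloader, max_updates=10000, loss_threshold=0.05, lr=1e-3):
    """Single-pass training sketch for step S604 (illustrative, not the disclosed procedure)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    updates = 0
    for features, keyword_labels in dataloader:
        keyword_logits = model(features)
        loss = criterion(keyword_logits, keyword_labels)
        # Preset conditions: loss small enough, or update count reached the set threshold.
        if loss.item() < loss_threshold or updates >= max_updates:
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()      # one update of the network parameters
        updates += 1
    return model              # the finally updated model is the trained lip recognition model
```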
  • syllable-assisted learning can effectively reduce the learning difficulty of keyword recognition and classification, thereby improving the accuracy of lip recognition by the trained lip recognition model.
  • since the syllable classification features are determined based on the mouth key point features, they can better reflect the syllables corresponding to the mouth shapes in the image frame sequence, and using the syllable classification features to assist lip language recognition makes the keywords obtained by image processing more precise, improving the accuracy of lip recognition.
  • using the mouth key point features for lip recognition can reduce the amount of calculation required in the image processing process, thereby reducing the hardware requirements of the computer equipment that performs the method; moreover, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, so that, based on the mouth key point features, the recognition ability for face shapes and textures not involved in the model training process can be improved, thereby improving the generalization ability of lip language recognition.
  • the model includes a syllable feature extraction network and a classification network; in this case, the above step S603 may include the following steps S6031 to S6032:
  • Step S6031 Use the syllable feature extraction network to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence.
  • Step S6032 Use the classification network to determine keywords matching the syllable classification features in the preset keyword library.
  • steps S6031 to S6032 respectively correspond to the aforementioned steps S503 to S504. During implementation, reference may be made to the specific implementation of the aforementioned steps S503 to S504.
  • the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network.
  • the above step S6031 may include the following steps S60311 to S60313:
  • Step S60311 Use the spatial feature extraction sub-network to perform spatial feature extraction on the key point features of the mouth in each sample image frame to obtain the spatial features of the mouth object in each sample image frame.
  • Step S60312 Use the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames to obtain the spatio-temporal features of the mouth object.
  • Step S60313 Use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  • steps S60311 to S60313 respectively correspond to the aforementioned steps S5031 to S5033.
  • during implementation, reference may be made to the specific implementation of the aforementioned steps S5031 to S5033.
  • FIG. 7 is a schematic structural diagram of a lip recognition model provided by an embodiment of the present disclosure.
  • the lip recognition model structure includes: a single-frame feature extraction network 701, an inter-frame feature fusion network 702, and a feature sequence classification network 703.
  • the single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012
  • the feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.
  • Embodiments of the present disclosure provide an image processing method that generates an image frame sequence of the subject speaking based on the lip movement recognition detection results, uses facial key point features as the input of the lip language recognition model, uses monosyllable assistance to detect the syllables in the speaking sequence, and uses the syllable feature layer to classify the speaking sequence.
  • the image processing method according to the embodiment of the present disclosure will be described below with reference to FIG. 7 .
  • Embodiments of the present disclosure provide an image processing method, which can be executed by a processor of a computer device.
  • the computer equipment may refer to equipment with data processing capabilities, such as a vehicle machine.
  • the image processing method may include the following steps one to four:
  • Step 1 input preprocessing.
  • the input video sequence obtained by the computer device has a non-fixed frame count, that is, the video sequence may include a non-fixed number of video frames.
  • the key point sequence corresponds to 106 facial key points in each image frame; the 20 key points of the mouth object are taken out, and an interpolation method (for example, bilinear interpolation) is then used to generate, from these 20 key points, a key point position sequence with a length of 60 image frames. The 20 mouth key points are used as feature dimensions, and each key point corresponds to a feature of length 5 in each image frame, thereby obtaining mouth key point features 704 corresponding to 60 frames; each mouth key point feature 704 corresponds to one image frame and 20 key points, and each key point in each image frame corresponds to a 5-dimensional feature.
  • the first four dimensions of the feature are obtained based on the coordinate difference between the current image frame and the previous and subsequent image frames
  • the fifth dimension of the feature is obtained based on the height difference between the preset key point pairs in the current frame.
  • the first 4 dimensions can reflect the mouth shape changes between the current image frame and the previous and subsequent image frames
  • the fifth dimension reflects the mouth shape in the current image frame.
  • the collected videos can be processed through methods such as lip movement recognition, so that each video can at least cover the process of the set object (usually a person) speaking a sentence, and each sentence corresponds to a keyword. In this way, there is a one-to-one relationship between video and keywords.
  • the interpolation method can be used to obtain a 60-frame position sequence.
  • the more frames in the position sequence, the lower the computational efficiency, but the better the performance of lip recognition.
  • the number of frames in the position sequence is set to 60 frames.
  • the performance can be the accuracy of lip recognition.
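The following is a minimal sketch of the input preprocessing step. Assumptions not given in the text: the indices of the 20 mouth key points within the 106 facial key points (placeholder values below), the specific upper/lower lip key point pairs used for the intra-frame height difference (hypothetical pairs below), and linear interpolation along the time axis standing in for the interpolation method mentioned above.

```python
import numpy as np

# Placeholder indices of the 20 mouth key points within the 106 facial key points.
MOUTH_IDX = list(range(84, 104))
# Hypothetical upper/lower lip key point pairs for the intra-frame height difference.
KEYPOINT_PAIRS = [(2, 18), (4, 16), (6, 14)]

def preprocess(face_keypoints: np.ndarray, target_len: int = 60) -> np.ndarray:
    """Step one sketch: take the 20 mouth key points, interpolate the sequence to
    60 frames, and build a 5-dimensional feature per key point per frame
    (4 inter-frame coordinate differences + 1 intra-frame height difference).

    face_keypoints: (num_frames, 106, 2) array of (x, y) coordinates.
    Returns: (60, 20, 5) mouth key point features."""
    mouth = face_keypoints[:, MOUTH_IDX, :]                    # (T, 20, 2)
    t_src = np.linspace(0.0, 1.0, num=mouth.shape[0])
    t_dst = np.linspace(0.0, 1.0, num=target_len)
    resampled = np.empty((target_len, 20, 2))
    for k in range(20):
        for c in range(2):                                     # interpolate x and y separately
            resampled[:, k, c] = np.interp(t_dst, t_src, mouth[:, k, c])

    feats = np.zeros((target_len, 20, 5))
    feats[1:, :, 0:2] = resampled[1:] - resampled[:-1]         # difference to the previous frame
    feats[:-1, :, 2:4] = resampled[:-1] - resampled[1:]        # difference to the subsequent frame
    for upper, lower in KEYPOINT_PAIRS:                        # intra-frame height difference
        gap = resampled[:, upper, 1] - resampled[:, lower, 1]
        feats[:, upper, 4] = gap
        feats[:, lower, 4] = -gap
    return feats
```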
  • Step 2 Single frame feature extraction.
  • the computer equipment implements single frame feature extraction through the single frame feature extraction network 701 in Figure 7.
  • the single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012.
  • in the spatial feature extraction network 7011, a 5×1 convolution kernel is first used to fuse the 5-dimensional features of each key point; the features 705 extracted through the two convolutions in the spatial feature extraction network 7011 for each image frame are then input into the spatial feature fusion network 7012, which uses a 1×1 convolution kernel to fuse the features between the 20 key points, obtaining the spatial features 706 of the image frame and completing the single frame feature extraction.
  • the convolution kernel may be a residual block kernel (Residual Block kernel).
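The following is a minimal sketch of the single-frame feature extraction step (networks 7011/7012). Assumptions beyond the text: the 5×1 kernel runs over the 5-dimensional per-key-point feature, the 1×1 kernel mixes the 20 key points treated as channels, the residual-block form of the convolutions is omitted for brevity, and the channel widths are illustrative.

```python
import torch
from torch import nn

class SingleFrameFeatureExtractor(nn.Module):
    """Sketch of single-frame feature extraction: per-key-point 5x1 fusion (7011)
    followed by a 1x1 convolution across the 20 key points (7012)."""

    def __init__(self, num_keypoints: int = 20, feat_dim: int = 5,
                 mid_channels: int = 16, out_dim: int = 64):
        super().__init__()
        # 7011: fuse the 5 feature values of each key point with a 5x1 kernel.
        self.per_point = nn.Sequential(
            nn.Conv1d(1, mid_channels, kernel_size=feat_dim),
            nn.ReLU(),
        )
        # 7012: 1x1 convolution mixing the 20 key points (treated as channels).
        self.across_points = nn.Sequential(
            nn.Conv1d(num_keypoints, out_dim, kernel_size=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, keypoints, feat_dim) mouth key point features
        b, t, k, d = x.shape
        h = x.reshape(b * t * k, 1, d)
        h = self.per_point(h)                # (b*t*k, mid_channels, 1)
        h = h.reshape(b * t, k, -1)          # (b*t, keypoints, mid_channels)
        h = self.across_points(h)            # (b*t, out_dim, mid_channels)
        h = h.mean(dim=-1)                   # per-frame spatial feature
        return h.reshape(b, t, -1)           # (batch, frames, out_dim)
```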
  • Step 3 Inter-frame feature fusion.
  • the computer device implements inter-frame feature fusion of adjacent image frames through the inter-frame feature fusion network 702 in Figure 7.
  • This step will occupy a certain amount of computing resources.
  • the convolution kernel size and the number of repetitions can be increased, which will accordingly affect the computing efficiency.
  • the number of extractions can be set to 5 times, and the convolution kernel size can be set to 5.
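The following is a minimal sketch of the inter-frame feature fusion step (network 702): temporal convolutions over the per-frame spatial features, with the kernel size set to 5 and the operation repeated 5 times as stated above; the residual connections and channel width are assumptions.

```python
import torch
from torch import nn

class InterFrameFeatureFusion(nn.Module):
    """Sketch of inter-frame feature fusion over adjacent image frames."""

    def __init__(self, dim: int = 64, kernel_size: int = 5, repeats: int = 5):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
            )
            for _ in range(repeats)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) spatial features; Conv1d expects (batch, dim, frames)
        h = x.transpose(1, 2)
        for block in self.blocks:
            h = h + block(h)          # residual connection (assumption)
        return h.transpose(1, 2)      # spatio-temporal features per frame
```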
  • Step 4 Feature sequence classification.
  • the computer device implements classification of the feature sequence through the feature sequence classification network 703 in Figure 7, and obtains the keyword sequence number corresponding to the video sequence.
  • the feature sequence includes spatiotemporal features of multiple image frames.
  • the feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.
  • the spatio-temporal features are input into the "flatten layer + second linear layer + nonlinear activation (ReLU) layer" in the syllable feature layer 7031 for processing.
  • the spatio-temporal features of all image frames are merged into a one-dimensional vector 707, realizing the fusion of the spatio-temporal features of the multiple image frames.
  • the one-dimensional vector 707 is input into the third linear layer in the syllable feature layer 7031 for 100-class single syllable auxiliary classification to obtain syllable classification features.
  • the syllable classification features are input into the first linear layer 7032 to output the keyword sequence number of the video sequence to be detected.
  • the third linear layer can use a normalized exponential function (Softmax function) and is trained with a binary cross-entropy loss (BCE loss) function as the loss function.
  • the first linear layer 7032 can be trained using the focal loss function as the loss function, and the softmax function can be used for prediction; in practical applications, the first linear layer 7032 can be a margin linear layer, implemented by a fully connected layer or a global average pooling layer. Compared with using the global average pooling layer, directly flattening into the fully connected layer is equivalent to each frame corresponding to a learnable position embedding, so that the position sequence information of each frame in the sentence can be recorded.
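The following is a minimal sketch of the feature sequence classification step (network 703): the syllable feature layer 7031 (flatten + linear + ReLU, then a linear layer for the 100-class monosyllable auxiliary output) followed by the first linear layer 7032 that outputs the keyword index. The 100 syllable classes and the 35 keywords come from the text; the hidden width and loss choices elsewhere in the pipeline are assumptions.

```python
import torch
from torch import nn

class FeatureSequenceClassifier(nn.Module):
    """Sketch of feature sequence classification (syllable feature layer 7031 + first linear layer 7032)."""

    def __init__(self, frames: int = 60, dim: int = 64,
                 num_syllable_classes: int = 100, num_keywords: int = 35):
        super().__init__()
        self.syllable_feature = nn.Sequential(       # 7031
            nn.Flatten(),                            # merge all frames into one vector (707)
            nn.Linear(frames * dim, 256),            # "second linear layer" (width is an assumption)
            nn.ReLU(),
        )
        self.syllable_head = nn.Linear(256, num_syllable_classes)           # "third linear layer"
        self.keyword_head = nn.Linear(num_syllable_classes, num_keywords)   # first linear layer 7032

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) spatio-temporal features
        h = self.syllable_feature(x)
        syllable_logits = self.syllable_head(h)      # 100-class monosyllable auxiliary output
        keyword_logits = self.keyword_head(syllable_logits)
        return keyword_logits, syllable_logits
```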
  • a detection algorithm for lip recognition using syllable-assisted learning is used.
  • for syllable-assisted learning, there are a total of 419 categories of pronunciation for all Chinese characters; these 419 categories of syllables can be divided into 100 categories according to mouth shape, and syllables with the same mouth shape are classified into the same category.
  • a feature of length 100 (corresponding to the syllable classification feature in the aforementioned embodiments) is placed before the fully connected layer used for the final classification, and the output of this feature is used as auxiliary supervision for the 100-class classification.
  • the output of the syllable feature layer 7031 represents which syllables are contained in the lip sequence, and classifying the output of the syllable feature layer 7031 can effectively reduce the learning difficulty of the fully connected layer classification, thereby improving performance.
  • the syllable feature layer 7031 can be implemented using a linear layer.
  • the monosyllable auxiliary strategy significantly improves performance; and these keywords used for matching can be stored in the form of a preset keyword library.
  • these keywords used for matching can be added to the preset keyword library accordingly, to facilitate keyword updates.
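A small sketch of how the 100-class auxiliary supervision could be built is shown below. The mapping from syllables to mouth-shape classes is hypothetical (the real grouping of the 419 syllables is not given in the text), and BCE is used as stated for the auxiliary output.

```python
import torch
from torch import nn

# Hypothetical mapping from pinyin syllables to the 100 mouth-shape classes.
SYLLABLE_TO_MOUTH_CLASS = {"da": 3, "kai": 17, "che": 42, "chuang": 42}

def syllable_target(keyword_syllables, num_classes: int = 100) -> torch.Tensor:
    """Multi-hot auxiliary target: which mouth-shape syllable classes occur in the keyword."""
    target = torch.zeros(num_classes)
    for syllable in keyword_syllables:
        target[SYLLABLE_TO_MOUTH_CLASS[syllable]] = 1.0
    return target

# Example: auxiliary BCE supervision on the syllable feature layer output.
aux_criterion = nn.BCEWithLogitsLoss()
target = syllable_target(["da", "kai", "che", "chuang"])   # e.g. "open the car window"
# aux_loss = aux_criterion(syllable_logits, target.unsqueeze(0))
```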
  • the above-mentioned coordinate difference value may correspond to the difference information of the position information in the previous embodiment
  • the video sequence may correspond to the image frame sequence in the previous embodiment
  • the single-frame feature extraction network 701 may correspond to the spatial feature extraction sub-network in the aforementioned embodiments
  • the inter-frame feature fusion network 702 may correspond to the temporal feature extraction sub-network in the previous embodiment
  • the syllable feature layer 7031 may correspond to the syllable classification feature extraction sub-network in the previous embodiment
  • the first linear layer 7032 may correspond to the classification network in the aforementioned embodiments.
  • lip recognition can make up for the inconvenience caused by the limitations of speech recognition to a certain extent.
  • Lip recognition can detect the keywords corresponding to what the speaker said in that interval based on the speech interval detected by lip movement recognition.
  • voice recognition is the main means of human-computer interaction, but when the car is noisy on the highway, or when music is played loudly, voice recognition cannot accurately recognize the user's voice; or, when someone is sleeping in the car, it is inconvenient for the user to interact by voice. In such cases, through lip recognition, the user only needs to mouth the words as if speaking, and the car machine can detect the user's instructions, thereby completing the human-computer interaction.
  • the embodiments of the present disclosure utilize key point recognition, which takes up less computing resources and can learn the inter-frame motion information of lips, making it easier to deploy, more efficient and more accurate.
  • the image processing method provided by the embodiment of the present disclosure supports the recognition of 35 types of commonly used keywords when used for lip language recognition, and the recognition recall rate reaches 81% while controlling the false alarm rate to less than one thousandth.
  • embodiments of the present disclosure also provide an image processing device, which includes various units and the modules included in each unit, and which can be implemented by a processor in a computer device; of course, it can also be realized through specific logic circuits; during implementation, the processor can be a central processing unit (Central Processing Unit, CPU), a microprocessor (Microprocessor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), etc.
  • FIG. 8 is a schematic structural diagram of an image processing device provided by an embodiment of the present disclosure.
  • the image processing device 800 includes: a first acquisition part 810, a first recognition part 820, a first determination part 830 and a first matching part 840, where:
  • the first acquisition part 810 is configured to acquire an image frame sequence containing a mouth object
  • the first recognition part 820 is configured to extract mouth key point features for each image frame in the image frame sequence to obtain the mouth key point features of each image frame;
  • the first determining part 830 is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence; wherein the syllable classification features represent the mouth in the image frame sequence The syllable category corresponding to the subject's mouth shape;
  • the first matching part 840 is configured to determine keywords matching the syllable classification features in the preset keyword library.
  • the first recognition part 820 includes: a first determining sub-part, configured to determine the position information of at least two mouth key points of the mouth object in each image frame; a second determining sub-part, configured to, for each image frame in the image frame sequence, determine the mouth key point feature corresponding to the image frame according to the position information of the mouth key points in the image frame and in the adjacent frames of the image frame.
  • the mouth key point features include inter-frame difference information and intra-frame difference information of each mouth key point;
  • the second determining sub-part includes: a first determining unit, configured to, for each mouth key point, determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point, according to the position information of the mouth key point in the image frame and the position information of the mouth key point in the adjacent frames of the image frame;
  • a second determining unit, configured to, for each mouth key point, determine the intra-frame difference information of the mouth key point according to a second height difference and/or a second width difference between the mouth key point and other mouth key points of the same mouth object in the image frame.
  • the first determination part 830 includes: a first extraction sub-part configured to perform spatial feature extraction on mouth key point features of each image frame to obtain the mouth object. Spatial features in each image frame; the second extraction sub-part is configured to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object; The third extraction sub-part is configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  • the first extraction sub-part includes: a first extraction unit, configured to fuse the inter-frame difference information and intra-frame difference information of a plurality of mouth key points of the mouth object to obtain the inter-frame difference features and intra-frame difference features of the mouth object in each image frame; a second extraction unit, configured to fuse the inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames to obtain the spatial features of the mouth object in each image frame.
  • the first determining part 830 includes: a third determining sub-part, configured to use a trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features; the first matching part 840 includes: a first matching sub-part, configured to use a trained classification network to determine, in the preset keyword library, keywords matching the syllable classification features.
  • the first acquisition part 810 includes a frame interpolation sub-part, configured to: perform image frame interpolation on the acquired original image sequence containing the mouth object to obtain the image frame sequence; or, based on the mouth key points in the acquired original image sequence containing the mouth object, interpolate frames into the original image sequence to obtain the image frame sequence.
  • the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network;
  • the third determining sub-part includes: a third extraction unit, configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each image frame respectively, to obtain the spatial features of the mouth object in each image frame;
  • a fourth extraction unit, configured to use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames, to obtain the spatio-temporal features of the mouth object;
  • a fifth extraction unit, configured to use the classification feature extraction sub-network to perform syllable classification feature extraction based on the spatio-temporal features of the mouth object, to obtain the syllable classification features of the mouth object.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment.
  • the functions or parts of the device provided by the embodiments of the present disclosure can be used to perform the methods described in the above method embodiments.
  • for technical details not disclosed in the device embodiments of the present disclosure, please refer to the descriptions of the method embodiments of the present disclosure for understanding.
  • embodiments of the present disclosure provide a device for generating a lip recognition model.
  • the device includes each unit included and each part included in each unit, which can be implemented by a processor in a computer device; Of course, it can also be implemented through specific logic circuits; during the implementation process, the processor can be CPU, MPU, DSP or FPGA, etc.
  • Figure 9 is a schematic structural diagram of a device for generating a lip recognition model provided by an embodiment of the present disclosure.
  • the device 900 includes: a second acquisition part 910, a second recognition part 920, a second matching part 930 and an update part 940, where:
  • the second acquisition part 910 is configured to acquire a sample image frame sequence containing a mouth object; wherein the sample image frame sequence is marked with a keyword tag;
  • the second identification part 920 is configured to extract mouth key point features for each sample image frame in the sample image frame sequence, and obtain the mouth key point features of each sample image frame;
  • the second matching part 930 is configured to use the model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and to determine, in the preset keyword library, keywords matching the syllable classification features; wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;
  • the update part 940 is configured to update the network parameters of the model at least once based on the determined keywords and the keyword tags to obtain a trained lip recognition model.
  • the model includes a syllable feature extraction network and a classification network;
  • the second matching part 930 includes: a fourth determining sub-part, configured to use the syllable feature extraction network to generate syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence;
  • a fifth determining sub-part, configured to use the classification network to determine, in the preset keyword library, keywords matching the syllable classification features.
  • the feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network;
  • the fourth determining sub-part includes: a sixth extraction unit, configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each sample image frame, to obtain the spatial features of the mouth object in each sample image frame;
  • a seventh extraction unit, configured to use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple sample image frames, to obtain the spatio-temporal features of the mouth object;
  • an eighth extraction unit, configured to use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatio-temporal features of the mouth object, to obtain the syllable classification features of the mouth object.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment.
  • the functions or parts of the device provided by the embodiments of the present disclosure can be used to perform the methods described in the above method embodiments.
  • for technical details not disclosed in the device embodiments of the present disclosure, please refer to the descriptions of the method embodiments of the present disclosure for understanding.
  • An embodiment of the present disclosure provides a vehicle, including:
  • a vehicle-mounted camera configured to capture a sequence of image frames containing a mouth object
  • a vehicle machine, connected to the vehicle-mounted camera, configured to obtain an image frame sequence containing a mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determine the keyword matching the syllable classification features in the preset keyword library.
  • if the above method is implemented in the form of a software functional part and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk and other media that can store program code.
  • the embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination of hardware, software, and firmware.
  • An embodiment of the present disclosure provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can be run on the processor.
  • when the processor executes the program, some or all of the steps in the above method are implemented.
  • Embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • the computer-readable storage medium may be transient or non-transitory.
  • Embodiments of the present disclosure provide a computer program, which includes computer readable code.
  • when the computer-readable code is run in a computer device, the processor in the computer device executes some or all of the steps for implementing the above method.
  • Embodiments of the present disclosure provide a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented.
  • the computer program product can be implemented specifically through hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium.
  • the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and so on.
  • Figure 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure.
  • the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002 and a memory 1003, where:
  • Processor 1001 generally controls the overall operation of computer device 1000 .
  • the communication interface 1002 can enable the computer device to communicate with other terminals or servers through a network.
  • the memory 1003 is configured to store instructions and applications executable by the processor 1001, and can also cache data to be processed or processed by the processor 1001 and various parts of the computer device 1000 (for example, image data, audio data, voice communication data and video communication data).
  • the memory 1003 can be implemented by flash memory (FLASH) or random access memory (Random Access Memory, RAM).
  • Data can be transmitted between the processor 1001, the communication interface 1002 and the memory 1003 through the bus 1004.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separated; the components shown as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure can be integrated into one processing unit, or each unit can be used separately as a unit, or two or more units can be integrated into one unit; the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the products applying the disclosed technical solution will clearly inform the personal information processing rules and obtain the individual's independent consent before processing personal information.
  • the product applying the disclosed technical solution must obtain the individual's separate consent before processing the sensitive personal information, and at the same time meet the requirement of "express consent”. For example, setting up clear and conspicuous signs on personal information collection devices such as cameras to inform them that they have entered the scope of personal information collection, and that personal information will be collected.
  • the personal information processing rules may include information such as the personal information processor, the purposes of personal information processing, the processing methods, and the types of personal information processed.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • when the program is executed, the steps including those of the above method embodiments are performed.
  • the aforementioned storage media include: removable storage devices, read-only memory (Read Only Memory, ROM), magnetic disks or optical disks, and other media that can store program codes.
  • the above-mentioned integrated units of the present disclosure are implemented in the form of software functional parts and sold or used as independent products, they can also be stored in a computer-readable storage medium.
  • the technical solution of the present disclosure, in essence or the part that contributes to the related technologies, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: mobile storage devices, ROMs, magnetic disks or optical disks and other media that can store program codes.
  • the present disclosure relates to an image processing method, a model generation method, a device, a vehicle, a storage medium and a computer program product.
  • the image processing method includes: acquiring an image frame sequence including a mouth object; extracting mouth key point features from each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence, where the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determining keywords matching the syllable classification features in the preset keyword library.
  • the above solution can reduce the amount of calculation required in the image processing process of lip recognition, thereby reducing the hardware requirements for computer equipment; at the same time, it can achieve good recognition results for facial images with different face shapes, textures and other appearance information.
  • This improves the generalization ability of lip language recognition; in addition, by expressing the syllable classification features corresponding to the image frame sequence, and determining the keywords of the words corresponding to the syllables based on the syllable categories represented by the syllable classification features, the keywords obtained by image processing can be More accurate, thereby improving the accuracy of lip recognition.

Abstract

Disclosed in embodiments of the present disclosure are an image processing method and apparatus, a model generation method and apparatus, a vehicle, a storage medium, and a computer program product. The image processing method comprises: obtaining an image frame sequence comprising a mouth object; extracting mouth key point features from each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features according to the mouth key point features of the plurality of image frames in the image frame sequence, wherein the syllable classification features each represent a syllable category corresponding to a mouth shape of the mouth object in the image frame sequence; and determining, in a preset keyword library, a keyword matched with the syllable classification features.

Description

Image processing method and model generation method, device, vehicle, storage medium and computer program product
Cross-reference to related applications
The embodiments of the present disclosure are based on the Chinese patent application with application number 202210476318.1, filed on April 29, 2022 and entitled "Image processing method and model generation method, device, vehicle, storage medium", and claim the priority of that Chinese patent application, the entire content of which is hereby incorporated into the present disclosure by reference.
Technical field
The present disclosure relates to but is not limited to the field of information technology, and in particular, to an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.
Background
Lip recognition technology can use computer vision technology to identify faces from video images, extract the changing features of the mouth area of the face, and thereby identify the text content corresponding to the video.
Summary of the invention
In view of this, embodiments of the present disclosure provide at least an image processing method and a model generation method, a device, a vehicle, a storage medium and a computer program product.
The technical solutions of the embodiments of the present disclosure are implemented as follows:
An embodiment of the present disclosure provides an image processing method. The method includes: acquiring an image frame sequence containing a mouth object; extracting mouth key point features for each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determining, in a preset keyword library, a keyword matching the syllable classification features.
Embodiments of the present disclosure also provide a method for generating a lip recognition model. The method includes: acquiring a sample image frame sequence containing a mouth object, wherein the sample image frame sequence is labeled with a keyword tag; extracting mouth key point features for each sample image frame in the sample image frame sequence to obtain the mouth key point features of each sample image frame; using a model to be trained, generating syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and determining keywords matching the syllable classification features in a preset keyword library, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence; and updating the network parameters of the model at least once based on the determined keywords and the keyword tags to obtain a trained lip recognition model.
An embodiment of the present disclosure also provides an image processing device, which includes:
a first acquisition part, configured to acquire an image frame sequence containing a mouth object;
a first recognition part, configured to extract mouth key point features for each image frame in the image frame sequence, to obtain the mouth key point features of each image frame;
a first determining part, configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence;
a first matching part, configured to determine keywords matching the syllable classification features in a preset keyword library.
An embodiment of the present disclosure also provides a device for generating a lip recognition model, the device including:
a second acquisition part, configured to acquire a sample image frame sequence containing a mouth object, wherein the sample image frame sequence is labeled with a keyword tag;
a second recognition part, configured to extract mouth key point features for each sample image frame in the sample image frame sequence, to obtain the mouth key point features of each sample image frame;
a second matching part, configured to use a model to be trained to generate syllable classification features based on the mouth key point features of multiple sample image frames in the sample image frame sequence, and to determine keywords matching the syllable classification features in a preset keyword library, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the sample image frame sequence;
an update part, configured to update the network parameters of the model at least once based on the determined keywords and the keyword tags, to obtain a trained lip recognition model.
An embodiment of the present disclosure also provides a computer device, including a memory and a processor, the memory storing a computer program that can be run on the processor, and the processor implementing some or all of the steps in the above method when executing the program.
An embodiment of the present disclosure also provides a vehicle, including:
a vehicle-mounted camera, configured to capture an image frame sequence containing a mouth object;
a vehicle machine, connected to the vehicle-mounted camera, configured to obtain the image frame sequence containing the mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determine the keyword matching the syllable classification features in a preset keyword library.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, some or all of the steps in the above method are implemented.
An embodiment of the present disclosure also provides a computer program, including computer-readable code; when the computer-readable code is run in a computer device, the processor in the computer device executes some or all of the steps for implementing the above method.
An embodiment of the present disclosure also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps in the above method are implemented.
In the embodiments of the present disclosure, first, an image frame sequence whose image content contains a mouth object is acquired, so that an image frame sequence recording the change process of the mouth object while the set object speaks can be obtained; secondly, mouth key point feature extraction is performed on each image frame in the image frame sequence to obtain the mouth key point features of each of the multiple image frames; compared with performing lip recognition on a mouth region image sequence cropped from facial images, using the mouth key point features for lip recognition can reduce the amount of calculation required in the image processing process, thereby reducing the hardware requirements of the computer equipment that performs the image processing method; moreover, since lip recognition based on mouth key point features involves extracting the mouth key point features, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, thereby improving the generalization ability of lip language recognition; thirdly, syllable classification features are generated according to the mouth key point features of multiple image frames in the image frame sequence, and the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; in this way, the syllable classification features extracted from the mouth key point features can represent at least one syllable corresponding to the mouth shape of the mouth object in the image frame sequence, and using the syllable classification features to assist lip recognition can improve the accuracy of lip recognition; finally, matching keywords are determined in the preset keyword library according to the syllable classification features, so that, by representing the syllable classification features corresponding to the image frame sequence and determining the keywords of the words corresponding to the syllables according to the syllable categories represented by the syllable classification features, the correctness of the keywords obtained by image processing is improved.
In the above solution, the mouth key point features are obtained by performing mouth key point feature extraction on the image frames in the image frame sequence, the mouth key point features are used to generate the syllable classification features corresponding to the image frame sequence, and keywords are obtained by matching in the preset keyword library according to the syllable classification features. In this way, the amount of calculation required in the image processing process of lip recognition can be reduced, thereby reducing the hardware requirements for computer equipment; at the same time, good recognition results can be achieved for facial images with different face shapes, textures and other appearance information, thereby improving the generalization ability of lip language recognition; in addition, by representing the syllable classification features corresponding to the image frame sequence and determining the keywords of the words corresponding to the syllables according to the syllable categories represented by the syllable classification features, the keywords obtained by image processing can be more precise, thereby improving the accuracy of lip recognition.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the technical solution of the present disclosure.
Description of the drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the technical solutions of the disclosure.
Figure 1 is a schematic flowchart of an implementation of an image processing method provided by an embodiment of the present disclosure;
Figure 2 is a schematic flowchart of another implementation of an image processing method provided by an embodiment of the present disclosure;
Figure 3 is a schematic diagram of facial key points provided by an embodiment of the present disclosure;
Figure 4 is a schematic flowchart of another implementation of an image processing method provided by an embodiment of the present disclosure;
Figure 5 is a schematic flowchart of another implementation of an image processing method provided by an embodiment of the present disclosure;
Figure 6 is a schematic flowchart of the implementation of a method for generating a lip recognition model provided by an embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of a lip recognition model provided by an embodiment of the present disclosure;
Figure 8 is a schematic structural diagram of an image processing device provided by an embodiment of the present disclosure;
Figure 9 is a schematic structural diagram of a device for generating a lip recognition model provided by an embodiment of the present disclosure;
Figure 10 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present disclosure.
Detailed Description
To make the purpose, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure are further elaborated below with reference to the accompanying drawings and embodiments. The described embodiments should not be regarded as limiting the present disclosure; all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises.
The terms "first/second/third" used herein merely distinguish similar objects and do not imply a particular ordering of the objects; it should be understood that, where permitted, the specific order or sequence implied by "first/second/third" may be interchanged, so that the embodiments of the present disclosure described here can be implemented in an order other than that illustrated or described herein.
除非另有定义,本文所使用的所有的技术和科学术语与属于本公开的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本公开的目的,不是旨在限制本公开。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the disclosure only and is not intended to be limiting of the disclosure.
在环境噪音过大或不方便发声的场景中,唇语识别可以弥补语音识别的局限性,从而能够增强人机交互的强健性。In scenes where the environmental noise is too loud or it is inconvenient to speak, lip recognition can make up for the limitations of speech recognition, thereby enhancing the robustness of human-computer interaction.
In the image processing for lip reading in the related art, face detection is first used to locate the face in an image, the mouth region is then cropped out to obtain an image sequence of mouth-region images, and finally this image sequence is fed into a three-dimensional convolutional neural network (3D CNN) for feature extraction, with the extracted features fed into a temporal prediction network for classification. However, the image sequence of mouth-region images is not sensitive to mouth motion information, so the accuracy of lip reading is not high; moreover, three-dimensional convolution consumes a large amount of computing resources and places high demands on hardware, which makes this lip-reading method difficult to apply on a large scale.
Embodiments of the present disclosure provide an image processing method, which may be executed by a processor of a computer device. Here, a computer device may be a vehicle head unit, a server, a laptop computer, a tablet computer, a desktop computer, a smart TV, a set-top box, a mobile device (for example, a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, or a portable gaming device), or any other device with data processing capability.
下面,将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述。Below, the technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present disclosure.
图1为本公开实施例提供的一种图像处理方法的实现流程示意图,如图1所示,该方法包括如下步骤S101至步骤S104:Figure 1 is a schematic flow chart of an image processing method provided by an embodiment of the present disclosure. As shown in Figure 1, the method includes the following steps S101 to S104:
步骤S101,获取包含嘴部对象的图像帧序列。Step S101: Obtain an image frame sequence containing a mouth object.
Here, the computer device acquires multiple image frames. The multiple image frames may be captured by an acquisition component such as a camera while a set object is speaking, and they are sorted according to the time parameter corresponding to each frame to obtain a raw image frame sequence. The pictures of the multiple image frames in the sequence contain at least the mouth object of the same set object. The set object is usually a human, but may also be another animal capable of expression, such as an orangutan. In some implementations, the image frame sequence covers at least the complete process of the set object saying one sentence; for example, the frames in the sequence cover at least the complete process of the set object saying "turn on the music". Furthermore, the number of frames in the image frame sequence need not be fixed; for example, it may be 40, 50 or 100 frames. The raw image frame sequence may be used directly as the image frame sequence for subsequent image processing, or it may be further processed to obtain that sequence, for example by frame interpolation to obtain an image frame sequence with a set number of frames. Therefore, an image frame in the image frame sequence in the embodiments of the present disclosure may be actually captured by the acquisition component, or may be generated from actually captured image frames.
In some implementations, the computer device may acquire the multiple image frames by invoking its own camera, or may acquire them from another computer device. For example, if the computer device is a vehicle, the vehicle may acquire images through an on-board camera, or may acquire images captured by a mobile terminal via wireless transmission with that terminal. It should be noted that at least one image frame in the image frame sequence may come from a video; a video may include multiple video frames, each corresponding to one image frame, and the image frames in the sequence may be consecutive video frames, or non-consecutive video frames selected from the video frames at fixed or variable time intervals. In implementation, the multiple image frames may be acquired in advance, or may be captured from the set object in real time; this is not limited here.
这样,能够得到记录设定对象说话时的嘴部对象变化过程的图像帧序列。In this way, an image frame sequence can be obtained that records the change process of the mouth object when the set object speaks.
步骤S102,对所述图像帧序列中的每一图像帧进行嘴部关键点特征提取,得到所述每一图像帧的嘴部关键点特征。Step S102: Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame.
Here, when mouth key point feature extraction is performed on at least one image frame of the image frame sequence, the position information of the mouth key points associated with the mouth object is extracted from the facial key points of the image frame, and one mouth key point feature is determined for each image frame based on the position information of the mouth key points of at least one image frame, so that at least one mouth key point feature of the image frame sequence is obtained. The mouth key point feature is computed from the position information of the mouth key points, and this position information is related to the mouth shape of the mouth object contained in the image frame; that is, the position information of the same mouth key point in different image frames is related to the mouth shape of the mouth object in each respective frame.
In some implementations, the mouth key point feature corresponding to an image frame may be determined from the position information of the mouth key points in that frame by sorting the position information of each mouth key point according to its key point index to obtain a position sequence, and taking this position sequence as the mouth key point feature. For example, if each image frame contains 4 mouth key points with coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4), the mouth key point feature determined for that frame is [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]. The key point index of a mouth key point is its number among the numbers preset for the facial key points; for example, in the schematic diagram of facial key points shown in Figure 3, 106 key points are preset and numbered 0 to 105, of which key points 84-103 are the mouth key points describing the mouth.
In some implementations, when the image frame sequence includes two or more image frames, the mouth key point feature corresponding to an image frame may be determined from the position information of the mouth key points by computing, for each frame, the difference information between the positions of its mouth key points and those of an adjacent frame, sorting the difference information of each mouth key point according to its key point index, and taking the sorted sequence as the mouth key point feature of that frame. The adjacent frame may be the previous frame and/or the next frame of that image frame in the sequence; in other words, the difference information of the position information includes at least one of: difference information between the frame and the previous frame, and difference information between the frame and the next frame. For example, when the mouth key point feature of a frame is determined from its differences with the previous frame, each frame contains 4 mouth key points, their coordinates in the first frame are (x1, y1), (x2, y2), (x3, y3), (x4, y4), and their coordinates in the second frame are (x'1, y'1), (x'2, y'2), (x'3, y'3), (x'4, y'4), then the mouth key point feature determined for the second frame is [(x'1-x1, y'1-y1), (x'2-x2, y'2-y2), (x'3-x3, y'3-y3), (x'4-x4, y'4-y4)].
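The following is a minimal illustrative sketch of the two per-frame feature constructions described above, assuming the mouth key points are available as NumPy arrays ordered by their landmark index (for example indices 84-103 of a 106-point scheme); the function names and array shapes are assumptions for illustration, not the patent's API.

```python
import numpy as np

def keypoint_feature_from_positions(frame_kpts: np.ndarray) -> np.ndarray:
    """frame_kpts: (K, 2) mouth key points of one frame, ordered by landmark index.
    Returns a flat (2K,) feature made of the ordered coordinates."""
    return frame_kpts.reshape(-1)

def keypoint_feature_from_diff(prev_kpts: np.ndarray, cur_kpts: np.ndarray) -> np.ndarray:
    """Per-key-point (dx, dy) between the current frame and its previous frame,
    kept in key point index order, as described for sequences of two or more frames."""
    return (cur_kpts - prev_kpts).reshape(-1)
```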
In this way, compared with lip reading based on a mouth-region image sequence, lip reading based on mouth key point features reduces the amount of computation required by the image processing, which lowers the hardware requirements on the computer device executing the image processing method and makes the method broadly applicable to various computer devices. Moreover, because lip reading with mouth key point features involves extracting the key points themselves, good recognition results can be achieved for face images with different face shapes, textures and other appearance information, improving both the generalization ability and the accuracy of lip reading.
步骤S103,根据所述图像帧序列中多个图像帧的所述嘴部关键点特征,生成音节分类特征。Step S103: Generate syllable classification features based on the mouth key point features of multiple image frames in the image frame sequence.
Here, feature extraction may be performed on the mouth key point features of the multiple image frames in the image frame sequence to obtain syllable classification features. The syllable classification features represent at least one preset syllable category corresponding to the image frame sequence, and each preset syllable category represents at least one kind of syllable with the same or a similar mouth shape; that is, the syllable classification features can represent the syllable categories corresponding to the mouth shapes of the mouth object in the sequence. Each element of the syllable classification feature may indicate whether one syllable type occurs in the image frame sequence, thereby determining at least one syllable corresponding to the mouth shapes contained in the images of the sequence. The syllable types may be divided in advance, according to mouth-shape similarity, into a set number of preset syllable categories, each category containing at least one syllable type with the same or a similar mouth shape; the set number may be chosen according to the type of language, and mouth-shape similarity may be judged manually from experience or by machine learning. Taking Chinese as an example, ignoring tones there are 419 syllable types for Chinese characters; these 419 syllables can be divided into 100 categories according to the corresponding mouth shapes, so the length of the corresponding syllable classification feature is 100. For another language such as English, the syllable types may be divided into a set number of preset categories with reference to phonetic symbols, and the length of the syllable classification feature set according to the correspondence between syllables and mouth shapes.
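A hedged sketch of this encoding follows: the grouping of the 419 toneless Mandarin syllables into 100 mouth-shape classes is assumed to be given as a lookup table, and the tiny table below is an illustrative stand-in, not the real class assignment.

```python
import numpy as np

NUM_SYLLABLE_CLASSES = 100                      # preset number of mouth-shape classes
SYLLABLE_TO_CLASS = {"a": 0, "o": 0, "bo": 1}   # hypothetical entries for illustration only

def syllable_classification_target(syllables: list[str]) -> np.ndarray:
    """Multi-hot vector: element i is 1 if the utterance contains any syllable
    whose mouth shape falls into preset class i."""
    target = np.zeros(NUM_SYLLABLE_CLASSES, dtype=np.float32)
    for s in syllables:
        target[SYLLABLE_TO_CLASS[s]] = 1.0
    return target
```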
在一些实施方式中,可以通过对图像帧序列的至少两个嘴部关键点特征进行时空特征提取,得到每一嘴部关键点特征对应的时空特征,并根据时空特征确定音节分类特征。这里,可以利用时序预测网络和/或全卷积网络进行时空特征提取,得到每一嘴部关键点特征对应的时空特征。在一些实现方式中,还可以利用平坦(Flatten)层或其他方式拼接至少两个时空特征,再对拼接的时空特征进行分类,得到音节分类特征。In some embodiments, spatio-temporal features corresponding to each mouth key point feature can be obtained by performing spatio-temporal feature extraction on at least two mouth key point features of the image frame sequence, and syllable classification features can be determined based on the spatio-temporal features. Here, the temporal prediction network and/or the fully convolutional network can be used to extract spatiotemporal features to obtain the spatiotemporal features corresponding to each mouth key point feature. In some implementations, a flatten layer or other methods can be used to splice at least two spatio-temporal features, and then the spliced spatio-temporal features can be classified to obtain syllable classification features.
In this way, syllable classification features are extracted from the mouth key point features; they can represent at least one syllable corresponding to the mouth shapes of the mouth object in the image frame sequence, and using the syllable classification features to assist lip reading improves its accuracy.
步骤S104,在预设关键词库中确定与所述音节分类特征匹配的关键词。Step S104: Determine keywords matching the syllable classification features in the preset keyword database.
In some implementations, a certain number of keywords are set in the keyword library in advance, and each keyword can be matched against particular syllable classification features, so that the image processing result of lip reading is obtained from the matching result between the keywords and the syllable classification features. After the keyword is determined, the keyword itself may be output directly, or the index of the keyword in the keyword library may be output.
In some implementations, the preset keywords in the preset keyword library may be set according to the specific application scenario; for example, in a driving scenario the preset keywords may be set to "turn on the audio", "open the left window", and so on. It should be noted that the preset keyword library refers to the form in which the keywords are stored.
In some implementations, the matching keyword may be determined by combining the detection result of speaking detection with the recognition result of lip reading; for example, separate weights may be set for the speaking-detection result and the lip-reading result, and the weighted result used as the basis for matching. Speaking detection may include, but is not limited to, detecting at least one of whether the mouth object is in a speaking state, the speaking interval during which it is in that state, and the like.
这样,通过确定图像帧序列对应的音节分类特征,并根据音节分类特征表征的音节类别确定与音节对应字词的关键词,提升了图像处理得到的关键词的准确度。In this way, by determining the syllable classification features corresponding to the image frame sequence, and determining the keywords of the words corresponding to the syllables based on the syllable categories represented by the syllable classification features, the accuracy of the keywords obtained by image processing is improved.
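The sketch below illustrates one way such matching could look. It assumes each preset keyword has a reference syllable-class vector, and the cosine similarity, the fusion weight and the speaking-detection score are illustrative choices rather than anything specified by the embodiment.

```python
import numpy as np

def match_keyword(syllable_feat, keyword_feats, lip_weight=0.8, speak_score=1.0):
    """keyword_feats: dict mapping keyword -> reference syllable-class vector.
    Returns the best-matching keyword after weighting the lip-reading similarity
    by an optional speaking-detection score."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = {k: lip_weight * cosine(syllable_feat, v) + (1 - lip_weight) * speak_score
              for k, v in keyword_feats.items()}
    return max(scores, key=scores.get)
```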
In the embodiments of the present disclosure, mouth key point features are obtained by performing mouth key point feature extraction on the image frames of the image frame sequence, syllable classification features corresponding to the sequence are generated from those features, and a keyword is obtained by matching in the preset keyword library according to the syllable classification features. In this way, the lip-reading result is obtained by extracting features from two-dimensional image frames, which reduces the amount of computation required by the image processing of lip reading and lowers the hardware requirements on the computer device; at the same time, good recognition results can be achieved for face images with different face shapes, textures and other appearance information, improving the generalization ability of lip reading. In addition, by determining the syllable classification features corresponding to the image frame sequence and determining the keyword of the words corresponding to the syllables from the syllable categories those features represent, the keywords obtained by the image processing are made more precise, which improves the accuracy of lip reading.
在一些实现方式中,通过唇动识别处理检测视频中设定对象的说话区间,得到覆盖设定对象说话过程的图像帧序列,即上述步骤S101可以通过以下步骤S1011和S1012实现:In some implementations, the speaking interval of the set object in the video is detected through lip movement recognition processing, and an image frame sequence covering the speaking process of the set object is obtained. That is, the above step S101 can be implemented through the following steps S1011 and S1012:
步骤S1011,获取图像画面包含所述嘴部对象的视频。Step S1011: Obtain a video in which the image frame includes the mouth object.
这里,计算机设备通过摄像头等采集组件对设定对象进行拍摄,得到图像画面包含嘴部对象的视频。Here, the computer device captures the set object through a collection component such as a camera, and obtains a video in which the image frame includes the mouth object.
步骤S1012,对所述嘴部对象进行唇动识别,将所述嘴部对象处于说话状态的多个视频帧确定为图像帧序列。Step S1012: Perform lip movement recognition on the mouth object, and determine multiple video frames in which the mouth object is in a speaking state as an image frame sequence.
Here, lip-movement recognition is first used to crop the video, obtaining a video that records the speaking process of the set object, in which the mouth object contained in the pictures is in a speaking state; then, multiple video frames are selected from the cropped video as the image frame sequence.
In the above solution, the image frame sequence can cover at least the complete process of the set object speaking, and cropping the video with lip-movement recognition reduces the number of image frames in the sequence that are unrelated to the speaking process. Performing image processing on an image frame sequence obtained in this way and obtaining the keyword matching it can further improve the accuracy of lip reading and reduce the amount of computation required by its image processing, as the sketch below illustrates.
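The following is only an illustrative stand-in for the lip-movement recognition relied on here: it treats frames whose mouth opening (vertical distance between an upper-lip and a lower-lip key point) exceeds a threshold as "speaking" and keeps the longest such run. The key point indices and the threshold are assumptions, and a real lip-movement detector would be considerably more involved.

```python
import numpy as np

def speaking_interval(kpts_seq: np.ndarray, upper=98, lower=102, thresh=3.0):
    """kpts_seq: (T, 106, 2) landmark tracks. Returns (start, end) frame indices
    of the longest contiguous run where the mouth is judged to be open/moving."""
    opening = np.abs(kpts_seq[:, lower, 1] - kpts_seq[:, upper, 1])
    speaking = opening > thresh
    best, best_len, cur_start = (0, 0), 0, None
    for t, flag in enumerate(np.append(speaking, False)):   # sentinel closes the last run
        if flag and cur_start is None:
            cur_start = t
        elif not flag and cur_start is not None:
            if t - cur_start > best_len:
                best, best_len = (cur_start, t), t - cur_start
            cur_start = None
    return best
```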
前文提及,用于图像处理的图像帧序列包括的图像帧的帧数可以是不固定的。在一些实现方式中,可以对采集得到的原始图像序列进行插帧处理,得到包括预设数量的图像帧的图像帧序列。As mentioned above, the number of image frames included in the image frame sequence used for image processing may not be fixed. In some implementations, frame interpolation processing can be performed on the original image sequence collected to obtain an image frame sequence including a preset number of image frames.
在一些实施方式中,对采集得到的原始图像序列进行插帧处理可以包括以下步骤S1013或步骤S1014:In some implementations, performing frame interpolation processing on the acquired original image sequence may include the following step S1013 or step S1014:
步骤S1013,对获取到的包含嘴部对象的原始图像序列进行图像插帧,得到所述图像帧序列。Step S1013: Perform image frame interpolation on the acquired original image sequence including the mouth object to obtain the image frame sequence.
One way of interpolating the captured raw image sequence to obtain an image frame sequence with a preset number of frames is to perform image frame interpolation based on the image frames of the raw sequence to generate a preset number of image frames, and to obtain the image frame sequence used for subsequent mouth key point feature extraction from the generated image frames and/or the captured image frames.
步骤S1014,基于获取到的包含嘴部对象的原始图像序列中的嘴部关键点,对所述原始图像序列进行插帧,得到所述图像帧序列。Step S1014: Based on the obtained mouth key points in the original image sequence containing the mouth object, interpolate frames on the original image sequence to obtain the image frame sequence.
Another way of interpolating the captured raw image sequence to obtain an image frame sequence with a preset number of frames is to generate the newly inserted image frames based on the position information of the mouth key points in the raw sequence, where the position information of the mouth key points in a newly inserted frame is predicted from the position information of the mouth key points in the raw sequence. In this way the raw sequence is interpolated and the preset amount of key point information corresponding to the image frame sequence is obtained, enabling the subsequent mouth key point feature extraction.
The number of image frames may be preset from experience: the larger the preset number of frames, the higher the recognition accuracy, but the more computing resources are consumed, which affects hardware efficiency. In some implementations, weighing accuracy, hardware efficiency and the number of characters in the keywords, the preset number of frames may be set to 60 in practice.
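A minimal sketch of the key-point-based interpolation of step S1014 follows, assuming linear interpolation of each key point track along the time axis; the patent does not prescribe the interpolation function, only that the inserted frames' key points are predicted from the original sequence, and the target length of 60 frames is the example value given above.

```python
import numpy as np

def resample_keypoint_sequence(kpts_seq: np.ndarray, target_len: int = 60) -> np.ndarray:
    """kpts_seq: (T, K, 2) mouth key point tracks. Returns (target_len, K, 2)."""
    T = kpts_seq.shape[0]
    src_t = np.linspace(0.0, 1.0, T)
    dst_t = np.linspace(0.0, 1.0, target_len)
    flat = kpts_seq.reshape(T, -1)
    # interpolate every coordinate track independently along time
    out = np.stack([np.interp(dst_t, src_t, flat[:, i]) for i in range(flat.shape[1])], axis=1)
    return out.reshape(target_len, *kpts_seq.shape[1:])
```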
这样,利用插帧处理后的图像帧序列进行唇语识别,并且对采集得到的原始图像序列的帧数不作要求,可以提升用于唇语识别的图像识别方法的强健性。In this way, the image frame sequence after frame interpolation is used for lip recognition, and there is no requirement on the number of frames of the original image sequence collected, which can improve the robustness of the image recognition method for lip recognition.
在一些实现方式中,利用嘴部关键点在每一图像帧和相邻帧的位置信息,确定该图像帧的嘴部关键点特征,即上述步骤S102可以通过图2所示的步骤实现。In some implementations, the position information of the mouth key points in each image frame and adjacent frames is used to determine the mouth key point characteristics of the image frame. That is, the above step S102 can be implemented by the steps shown in Figure 2 .
图2为本公开实施例提供的图像处理方法的又一实现流程示意图,结合图2所示的步骤进行以下说明:Figure 2 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 2:
步骤S201,确定所述嘴部对象的至少两个嘴部关键点在所述每一图像帧中的位置信息。Step S201: Determine the position information of at least two mouth key points of the mouth object in each image frame.
The image frame sequence includes at least two image frames, and the position information of the mouth key points associated with the mouth object is extracted for each frame. There are at least two mouth key points, distributed at least over the upper and lower lips in the image. The number and distribution of mouth key points usually depend on the key point recognition algorithm; for example, in a 68-point landmark detection algorithm the number of mouth key points is 16. The position information of each mouth key point can be expressed by a position parameter, for example two-dimensional coordinates in the image coordinate system, comprising a width coordinate (abscissa) and a height coordinate (ordinate). The position information of a mouth key point is also related to the mouth shape of the mouth object in the image; the position of the same key point in different images changes as the mouth shape changes. Taking the 106-point facial key point diagram shown in Figure 3 as an example, it contains 106 key points numbered 0 to 105, describing features such as the face contour, eyebrows, eyes, nose and mouth, of which key points 84-103 are the mouth key points describing the mouth. Key point 93, for instance, occupies different positions in two frames corresponding to different speech content: when the ordinate of key point 93 in the image is smaller, the mouth is opened wider, and between the syllables "ah" and "oh" the mouth shape is then more likely to correspond to "ah".
步骤S202,针对所述图像帧序列中的每一图像帧,根据所述图像帧和所述图像帧的相邻帧中的嘴部关键点的位置信息,确定所述图像帧对应的嘴部关键点特征。Step S202: For each image frame in the image frame sequence, determine the mouth key corresponding to the image frame based on the position information of the mouth key point in the image frame and adjacent frames of the image frame. point features.
For each first image frame in the image frame sequence, the position information of the mouth key points in at least two image frames including the first image frame may be used to compute the mouth key point feature of the first image frame; the mouth key point feature may include inter-frame difference information and/or intra-frame difference information. The first image frame may be any image frame in the sequence. Inter-frame difference information represents the difference between the positions of the same mouth key point in different image frames, and intra-frame difference information represents the difference between the positions of different mouth key points within the same image frame. Here, for each mouth key point, its inter-frame difference information is computed from its position in the first image frame and its position in the adjacent frames of the first image frame; and/or its intra-frame difference information in the first image frame is computed from the positions, in the first image frame, of at least two mouth key points including that key point.
Compared with lip reading based on a mouth-region image sequence, the embodiments of the present disclosure obtain mouth key point features from the positions of multiple mouth key points across multiple image frames, so that the features can represent how the mouth key points change during the speaking process corresponding to the image frame sequence and thus better capture the mouth-shape changes of the set object while speaking; performing lip reading with such mouth key point features improves its accuracy.
In some implementations, the mouth key point features are determined from the differences in the positions of the mouth key points between adjacent frames and the differences in the positions of preset mouth key point pairs within the same image frame; that is, step S202 above may be implemented by the following steps S2021 and S2022:
步骤S2021,针对每一所述嘴部关键点,根据所述嘴部关键点在所述图像帧中的位置信息,以及所述嘴部关键点在所述图像帧的相邻图像帧中的位置信息,确定所述嘴部关键点在所述图像帧和相邻帧之间的第一高度差和/或第一宽度差,作为所述嘴部关键点的帧间差异信息。Step S2021, for each mouth key point, according to the position information of the mouth key point in the image frame, and the position of the mouth key point in adjacent image frames of the image frame Information, determine the first height difference and/or the first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point.
In some implementations, when computing the mouth key point feature of each first image frame, for each mouth key point the difference information between its position in the first image frame and its position in each of at least one second image frame is computed from the position information of that key point in the first image frame and in each second image frame. A second image frame is an image frame adjacent to the first image frame, that is, an adjacent frame of the first image frame. The difference information may be a first height difference, a first width difference, or a combination of the two; the first width difference is the difference between the widths (abscissas) of the mouth key point in the two image frames (the first image frame and a second image frame), and the first height difference is the difference between its heights (ordinates) in the two frames. In some implementations, the difference may be defined as the position in the later frame minus the position in the earlier frame, or as the position in the earlier frame minus the position in the later frame. In this way, for each mouth key point, using the first image frame and each second image frame, as many pieces of difference information as there are second image frames are obtained, and these are determined as the inter-frame difference information of that mouth key point in the first image frame.
For example, if the coordinates of one mouth key point in three consecutive image frames are (x1, y1), (x'1, y'1), (x"1, y"1), the second frame is taken as the first image frame, and the frames before and after it are taken as second image frames, then computing the first height differences and first width differences gives the inter-frame difference information of this key point in the first image frame as (x'1-x1, y'1-y1, x"1-x'1, y"1-y'1).
步骤S2022,针对每一所述嘴部关键点,根据所述图像帧中的所述嘴部关键点与同一嘴部对象的其他嘴部关键点之间的第二高度差和/或第二宽度差,确定所述嘴部关键点的帧内差异信息。Step S2022: For each mouth key point, determine the second height difference and/or the second width between the mouth key point in the image frame and other mouth key points of the same mouth object. Difference, determine the intra-frame difference information of the mouth key point.
In some implementations, when determining the mouth key point feature of each first image frame, for each mouth key point a second height difference and/or a second width difference between that key point and other mouth key points of the same mouth object is computed, and the second height difference and/or second width difference is determined as the intra-frame difference information, in the first image frame, of each mouth key point in the corresponding preset mouth key point pair. The other mouth key point may be a fixed key point, for example the key point of the lip bead, such as key point 98 shown in Figure 3, or it may be a key point that satisfies a set positional relationship with the given key point. Here, two mouth key points form one preset mouth key point pair. When setting preset pairs, the positions of the key points in the image may be taken into account; that is, the two key points belonging to the same preset pair satisfy a set positional relationship. For example, two mouth key points located respectively on the upper and lower lips of the mouth object may be determined as one pair, or two mouth key points whose width difference in the image is smaller than a preset value may be determined as a preset pair. In this way, the second height difference of a preset mouth key point pair can better represent the mouth shape of the mouth object in the first image frame.
In some implementations, one mouth key point may form preset mouth key point pairs with two or more other key points; that is, each mouth key point may belong to multiple pairs. In that case, the second height difference of each pair to which the key point belongs is determined separately, and a weighted sum of the at least two second height differences is used as the intra-frame difference information of that key point in the first image frame. Taking the 106-point facial key point diagram of Figure 3 as an example, key point 86 may form preset pairs with key point 103 and with key point 94, so key point 86 belongs to two pairs. When computing the intra-frame difference information of key point 86, the second height difference of each pair containing key point 86 is first computed, and the two second height differences are then combined by a weighted sum to determine the intra-frame difference information of key point 86 in the first image frame. Placing one mouth key point in at least two pairs when computing its intra-frame difference information mitigates the deviation in the mouth key point feature caused by a recognition error on a single key point, and performing lip reading on such features improves its accuracy.
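A hedged sketch of steps S2021/S2022 for a single key point in a single frame follows: four inter-frame differences (width/height against the previous and next frame) plus one intra-frame value (a weighted height difference to its paired key points). The pairings and weights are illustrative assumptions.

```python
import numpy as np

def keypoint_element(prev, cur, nxt, cur_frame, pairs, weights):
    """prev/cur/nxt: (2,) positions of one key point in the previous, current and next frame;
    cur_frame: (K, 2) all key points of the current frame;
    pairs: indices of the key points this one is paired with (e.g. a point on the opposite lip);
    weights: one weight per pair, assumed to sum to 1."""
    inter = [cur[0] - prev[0], cur[1] - prev[1],        # width/height diff to previous frame
             nxt[0] - cur[0], nxt[1] - cur[1]]          # width/height diff to next frame
    intra = sum(w * (cur[1] - cur_frame[p, 1]) for p, w in zip(pairs, weights))
    return np.array(inter + [intra], dtype=np.float32)  # 5-dimensional feature element
```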
In some implementations, after the inter-frame difference information and intra-frame difference information of a mouth key point in the first image frame are obtained through steps S2021 and S2022 respectively, the two kinds of difference information may be concatenated to obtain one element of the mouth key point feature corresponding to that key point in the first image frame. Thus, based on the inter-frame and intra-frame difference information of every mouth key point in the first image frame, the feature element corresponding to each key point is determined, and the mouth key point feature of the first image frame is determined from the feature elements of all the mouth key points.
In the embodiments of the present disclosure, the mouth key point features are obtained from the inter-frame difference information of each mouth key point's positions in adjacent image frames and the intra-frame difference information between that key point and its preset paired key points. The mouth key point features can therefore represent the differences between mouth key points that satisfy the set relationship, improving the accuracy with which the mouth shape of each frame is determined, and they can also represent how the mouth key points change between frames during the speaking process corresponding to the image frame sequence. In this way, the mouth-shape changes during speaking are better captured, which further improves the accuracy of lip reading.
In some implementations, spatio-temporal feature extraction is performed on the mouth key points of the image frame sequence to obtain the spatio-temporal feature of the mouth object for each image frame, and syllable feature classification is performed on these spatio-temporal features to obtain the syllable classification features of the mouth object; that is, step S103 above may be implemented by the steps shown in Figure 4.
图4为本公开实施例提供的图像处理方法的又一实现流程示意图,结合图4所示的步骤进行以下说明:Figure 4 is a schematic flow diagram of yet another implementation of the image processing method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 4:
步骤S401,分别对每一所述图像帧的嘴部关键点特征进行空间特征提取,得到所述嘴部对象在每一图像帧的空间特征。Step S401: Perform spatial feature extraction on the key point features of the mouth in each image frame to obtain the spatial features of the mouth object in each image frame.
As mentioned above, at least one mouth key point feature of the image frame sequence can be obtained; each mouth key point feature is computed from the position information of the mouth key points, this position information represents the position of the mouth object in one image frame, and each mouth key point feature corresponds to one image frame. For each mouth key point feature, the spatial feature of the mouth object in the corresponding image frame may be extracted from that key point feature by any suitable feature extraction method, for example a convolutional neural network or a recurrent neural network.
In some implementations, the spatial feature of the mouth object in each image frame is obtained by fusing the inter-frame difference information and intra-frame difference information of the mouth key points; that is, step S401 above may be implemented by the following steps S4011 and S4012:
Step S4011: fuse the inter-frame difference information and intra-frame difference information of the multiple mouth key points of the mouth object, to obtain the inter-frame difference feature and intra-frame difference feature of the mouth object for each image frame.
As mentioned above, each mouth key point feature is computed from the position information of the mouth key points, the position information represents the position of the mouth object in one image frame, and each mouth key point feature corresponds to one image frame. Inter-frame difference information represents the difference between the positions of the same mouth key point in different frames, and intra-frame difference information represents the difference between the positions of different mouth key points within the same frame. In some implementations, the inter-frame difference information of the multiple mouth key points of each image frame is fused, and the intra-frame difference information of the multiple mouth key points of each image frame is fused, to obtain the inter-frame difference feature and the intra-frame difference feature of the mouth object for each image frame. The fusion of the inter-frame and/or intra-frame difference information may be performed with a convolutional neural network, a recurrent neural network or the like, using convolution kernels of a preset size to fuse the information of the multiple mouth key points, thereby fusing the inter-frame and/or intra-frame difference information across key points.
For example, one mouth key point corresponds to one element of the mouth key point feature, and that key point has a 5-dimensional feature: the first 4 dimensions are inter-frame difference information, namely the width difference of the key point between the first image frame and the previous frame, its height difference between the first image frame and the previous frame, its width difference between the first image frame and the next frame, and its height difference between the first image frame and the next frame; the 5th dimension is intra-frame difference information, that is, the height difference and/or width difference between this key point and other mouth key points of the same mouth object within the same frame. When fusing the inter-frame and/or intra-frame difference information of the multiple mouth key points of a particular image frame, feature extraction is performed on each of the 5 dimensions across at least two mouth key points (that is, across the elements of the mouth key point feature); the first 4 dimensions of the resulting feature are taken as the inter-frame difference feature of the mouth object in that frame, and the 5th dimension as its intra-frame difference feature.
步骤S4012,对所述嘴部对象在多个所述图像帧的帧间差异特征和帧内差异特征进行融合,得到所述嘴部对象在每一图像帧的空间特征。Step S4012: Fusion of inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames to obtain spatial features of the mouth object in each image frame.
In some implementations, the fusion of the inter-frame difference features and intra-frame difference features over multiple image frames may be implemented with a convolutional neural network, a recurrent neural network or the like, using convolution kernels of a preset size to fuse the information of the multiple mouth key points, so that the inter-frame difference information and intra-frame difference information of each mouth key point are fused with each other, yielding the spatial feature of the mouth object in each image frame.
步骤S402,对所述嘴部对象在多个所述图像帧的空间特征进行时间特征提取,得到所述嘴部对象的时空特征。Step S402: Perform temporal feature extraction on the spatial features of the mouth object in multiple image frames to obtain the spatio-temporal features of the mouth object.
In some implementations, for each third image frame of the at least one image frame, feature extraction may be performed on the spatial features of the mouth object in at least two image frames including the third image frame, to obtain the spatio-temporal feature of the mouth object corresponding to the third image frame. The spatio-temporal feature of the mouth object may be extracted from the spatial features by any suitable feature extraction method; for example, temporal features may be extracted with a convolutional neural network or a recurrent neural network to obtain the spatio-temporal features.
In some implementations, the temporal feature extraction over the spatial features of the mouth object in the multiple image frames may be performed several times. Taking one round of temporal feature extraction as an example, a 1×5 convolution kernel is used, so each convolution covers the spatial features of the two image frames before and the two after the third image frame, and the extracted spatio-temporal feature contains information from five image frames.
The more rounds of temporal feature extraction are performed and the larger the convolution kernel used, the more image frames the spatio-temporal feature of each frame can represent, allowing information to be exchanged between frames; the corresponding receptive field is larger, which helps the model learn words formed across multiple frames and the temporal order between different words, and thus improves the accuracy of lip reading, but it also consumes more computing resources and affects hardware efficiency. Weighing accuracy against hardware efficiency, the number of temporal feature extraction rounds may be set to 5 in practice.
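The following is a hedged PyTorch-style sketch of such a spatio-temporal extractor: the per-frame key point features are laid out as a (batch, channels=5, frames=T, keypoints=K) map, fused across key points per frame ("spatial"), then passed through five temporal convolutions with kernels of size 5 along the frame axis, mirroring the 1×5 kernels and 5 rounds described above. Channel widths, activation choices and layer counts are assumptions, not the patent's prescribed architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, num_keypoints: int = 20, hidden: int = 64):
        super().__init__()
        # fuse the K key points of each frame into one spatial feature per frame
        self.spatial = nn.Conv2d(5, hidden, kernel_size=(1, num_keypoints))
        # five temporal layers, each mixing information over 5 neighbouring frames
        self.temporal = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(hidden, hidden, kernel_size=(5, 1), padding=(2, 0)),
                          nn.ReLU())
            for _ in range(5)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 5, T, K) -> spatial fusion -> (B, hidden, T, 1) -> temporal stack, same shape
        return self.temporal(self.spatial(x))
```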
步骤S403,基于所述嘴部对象的时空特征进行音节分类特征提取,得到所述嘴部对象的音节分类特征。Step S403: Extract syllable classification features based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
In some implementations, syllable classification feature extraction is performed on the spatio-temporal features of the mouth object corresponding to each of the at least two image frames, to obtain the syllable classification features of the mouth object. The syllable classification features can represent at least one syllable corresponding to the mouth shapes the mouth object makes while speaking; each element of the syllable classification feature is used to determine whether a preset syllable type occurs during the speaking process, thereby determining at least one syllable corresponding to the mouth shapes contained in the image frames of the sequence. The syllable classification features may be extracted from the spatio-temporal features by any suitable feature extraction method, for example using a fully connected layer or a global average pooling layer.
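A minimal sketch of such a classification head, assuming the global-average-pooling plus fully-connected option named above and the output shape of the extractor sketched earlier; the hidden width and class count are assumed values.

```python
import torch
import torch.nn as nn

class SyllableHead(nn.Module):
    def __init__(self, hidden: int = 64, num_classes: int = 100):
        super().__init__()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, hidden, T, 1) -> global average pooling over the frame axis -> (B, hidden)
        pooled = feats.mean(dim=(2, 3))
        return self.fc(pooled)          # (B, num_classes) syllable-class scores
```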
Embodiments of the present disclosure support using a convolutional neural network for spatiotemporal feature extraction. Compared with extracting spatiotemporal features with a sequence prediction network such as a recurrent neural network, extracting spatiotemporal features with a convolutional neural network requires less computation, which reduces the consumption of computing resources and lowers the hardware requirements on the computer device used to implement lip recognition. In particular, because a convolutional neural network places lower demands on chip computing power, the image processing method provided by the embodiments of the present disclosure can be implemented on more lightweight chips, so that more hardware can support the image processing method in the lip recognition process of the embodiments of the present disclosure, improving the generality of lip recognition; for example, computer devices such as vehicle head units can also perform lip recognition.
Embodiments of the present disclosure further provide an image processing method, which can be executed by a processor of a computer device. As shown in Figure 5, the method includes the following steps S501 to S504:

Step S501: Acquire an image frame sequence containing a mouth object.

Here, step S501 corresponds to the aforementioned step S101; for implementation, reference may be made to the specific implementation of step S101.

Step S502: Perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame.

Here, step S502 corresponds to the aforementioned step S102; for implementation, reference may be made to the specific implementation of step S102.
Step S503: Use a trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence to obtain syllable classification features.

In implementation, the syllable feature extraction network can be any suitable network for feature extraction, including but not limited to a convolutional neural network, a recurrent neural network, or the like; those skilled in the art can select a suitable network structure for the syllable feature extraction network according to the actual situation, which is not limited by the embodiments of the present disclosure.

Step S504: Use a trained classification network to determine, in a preset keyword library, the keyword matching the syllable classification features.

In implementation, the classification network can be any suitable network for feature classification, such as a global average pooling layer or a fully connected layer. Those skilled in the art can select a suitable network structure for the classification network according to the actual situation, which is not limited by the embodiments of the present disclosure.

In the embodiments of the present disclosure, a trained syllable feature extraction network is used to process the mouth key point features and obtain syllable classification features, and a trained classification network is used to determine, in a preset keyword library, the keyword matching the syllable classification features. Because each neural network in the deep learning model can be learned and optimized, the accuracy of the extracted syllable classification features and of the keywords matched to them can be improved, making the keywords obtained by the image processing more precise and improving the accuracy of lip recognition.
In some implementations, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network, and the above step S503 can be implemented through the following steps S5031 to S5033:

Step S5031: Use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each image frame, obtaining the spatial features of the mouth object in each image frame.

In implementation, the spatial feature extraction sub-network can be any suitable network for image feature extraction, including but not limited to a convolutional neural network, a recurrent neural network, or the like. Those skilled in the art can select a suitable network structure according to the actual manner of performing spatial feature extraction on each mouth key point feature, which is not limited by the embodiments of the present disclosure.

Step S5032: Use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames, obtaining the spatiotemporal features of the mouth object.

Here, the temporal feature extraction sub-network can be any suitable network for image feature extraction, including but not limited to a convolutional neural network, a recurrent neural network, or the like. Those skilled in the art can select a suitable network structure according to the actual manner of performing at least one temporal feature extraction on the spatial features of the mouth object in at least one image frame, which is not limited by the embodiments of the present disclosure.

Step S5033: Use the classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, obtaining the syllable classification features of the mouth object.

Here, the classification feature extraction sub-network can be any suitable network for feature classification, such as a global average pooling layer or a fully connected layer. Those skilled in the art can select a suitable network structure according to the actual manner of performing classification feature extraction on each spatiotemporal feature of the mouth object, which is not limited by the embodiments of the present disclosure.
Embodiments of the present disclosure further provide a method of generating a lip recognition model, which can be executed by a processor of a computer device. As shown in Figure 6, the method includes the following steps S601 to S604:

Step S601: Acquire a sample image frame sequence containing a mouth object.

In some embodiments, the computer device acquires a sample image frame sequence labeled with a keyword tag. The sample image frame sequence includes multiple sample image frames, and the sample images in the sequence are ordered by the time parameter corresponding to each sample image frame. The number of sample image frames included in the sequence need not be fixed; for example, it may be 40, 50 or 100 frames.

In this way, a sample image frame sequence covering at least the complete process of the target subject speaking a sentence can be obtained.
Step S602: Perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain the mouth key point features of each sample image frame.

Here, when mouth key point extraction is performed on at least one sample image frame in the sample image frame sequence, the position information of the mouth key points associated with the mouth object is extracted from the facial key points of the sample image frame, and based on the position information of the mouth key points of the at least one sample image frame, one mouth key point feature corresponding to each sample image frame is determined, thereby obtaining at least one mouth key point feature of the sample image frame sequence. The mouth key point features are computed from the position information of the mouth key points, and the position information of a mouth key point is related to the mouth shape of the mouth object contained in the sample image frame; that is, the position information of the same mouth key point in different sample image frames is related to the mouth shape of the mouth object in each of those frames.

In some embodiments, the mouth key point feature corresponding to a sample image frame can be determined from the position information of its mouth key points by sorting the position information of every mouth key point in the sample image frame according to the serial number of each key point, obtaining a position sequence, and using the position sequence as the mouth key point feature.

In some embodiments, when the sample image frame sequence includes two or more sample image frames, the mouth key point feature corresponding to a sample image frame can be determined from the position information of its mouth key points by computing, for each sample image frame, the difference information between the position information of the mouth key points in that frame and in its adjacent frames, sorting the difference information of every mouth key point in the frame according to the corresponding key point serial number, and using the sorted sequence as the mouth key point feature corresponding to that frame; the adjacent frames can be the previous sample image frame and/or the next sample image frame.
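The NumPy sketch below illustrates one way to build such a difference-based key point feature; the array shapes and the handling of the first and last frames are assumptions for illustration, not the disclosure's exact formulation.

```python
# A minimal sketch: build a per-frame mouth key point feature from position
# differences against the previous and next frames. Shapes are assumptions.
import numpy as np

def mouth_keypoint_features(positions: np.ndarray) -> np.ndarray:
    """positions: (T, K, 2) array of (x, y) for K mouth key points over T frames.
    Returns (T, K, 4): per key point, the (dx, dy) to the previous frame and
    the (dx, dy) to the next frame, ordered by key point index."""
    prev_diff = positions - np.roll(positions, 1, axis=0)   # frame t minus frame t-1
    next_diff = np.roll(positions, -1, axis=0) - positions  # frame t+1 minus frame t
    prev_diff[0] = 0.0    # the first frame has no previous frame
    next_diff[-1] = 0.0   # the last frame has no next frame
    return np.concatenate([prev_diff, next_diff], axis=-1)

feats = mouth_keypoint_features(np.random.rand(60, 20, 2))
print(feats.shape)  # (60, 20, 4)
```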
Here, steps S601 to S602 correspond respectively to the aforementioned steps S101 to S102; for implementation, reference may be made to the specific implementations of steps S101 to S102.

Step S603: Use the model to be trained to generate syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine, in a preset keyword library, the keyword matching the syllable classification features.

The syllable classification features represent the syllable categories corresponding to the mouth shapes of the mouth object in the sample image frame sequence.

Here, the model to be trained can be any suitable deep learning model, which is not limited here. In implementation, those skilled in the art can construct the model to be trained with a suitable network structure according to the actual situation.

The process of using the model to be trained to process the mouth key point features of multiple sample image frames in the sample image frame sequence, generate syllable classification features representing the syllable categories corresponding to the mouth shapes of the mouth object in the sample image frame sequence, and determine, in the preset keyword library, the keyword matching the syllable classification features, corresponds to the processing of the mouth key point features in steps S103 to S104 of the foregoing embodiments; for implementation, reference may be made to the specific implementations of steps S103 to S104.

In this way, syllable-assisted learning can effectively reduce the difficulty of learning keyword recognition and classification, thereby improving the accuracy of lip recognition.
Step S604: Update the network parameters of the model at least once based on the determined keyword and the keyword tag, obtaining a trained lip recognition model.

Here, whether to update the network parameters of the model can be decided based on the determined keyword and the keyword tag. When it is decided to update the network parameters of the model, a suitable parameter update algorithm is used to update them, and the model with updated parameters is used to re-determine the matching keyword, so that whether to continue updating the network parameters is decided based on the re-determined keyword and the keyword tag. When it is decided not to continue updating the network parameters, the finally updated model is taken as the trained lip recognition model.

In some embodiments, a loss value can be determined based on the determined keyword and the keyword tag; when the loss value does not satisfy a preset condition, the network parameters of the model are updated, and when the loss value satisfies the preset condition or the number of updates to the network parameters reaches a set threshold, updating stops and the finally updated model is taken as the trained lip recognition model. The preset condition can include, but is not limited to, at least one of the loss value being smaller than a set loss threshold and the change of the loss value having converged. In implementation, the preset condition can be set according to the actual situation, which is not limited by the embodiments of the present disclosure.

The manner of updating the network parameters of the model can be determined according to the actual situation and can include, but is not limited to, at least one of gradient descent, Newton's momentum method and the like, which is not limited here.
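A minimal PyTorch training-loop sketch of the stopping rule described above is given below; the optimizer, the loss function and the threshold values are assumptions for illustration, not the disclosure's implementation.

```python
# A minimal training-loop sketch: update the network parameters until the loss
# satisfies a preset condition or the number of updates reaches a set threshold.
import torch

def train_lip_model(model, data_loader, loss_fn,
                    loss_threshold: float = 0.05, max_updates: int = 10_000):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    updates = 0
    while updates < max_updates:
        for keypoint_feats, keyword_labels in data_loader:
            logits = model(keypoint_feats)           # predicted keyword scores
            loss = loss_fn(logits, keyword_labels)   # compare with the keyword tags
            if loss.item() < loss_threshold:
                return model                         # preset condition satisfied
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                         # one update of the network parameters
            updates += 1
            if updates >= max_updates:
                break
    return model                                     # the trained lip recognition model
```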
In the embodiments of the present disclosure, syllable-assisted learning during model training can effectively reduce the difficulty of learning keyword recognition and classification, which improves the accuracy of lip recognition performed by the trained lip recognition model. Moreover, since the syllable classification features are determined based on the mouth key point features, they can better reflect the syllables corresponding to the mouth shapes in the image frame sequence, and using them to assist lip recognition makes the keywords obtained by the image processing more precise, improving the accuracy of lip recognition. In addition, compared with performing lip recognition on a mouth region image sequence cropped from face images, performing lip recognition with mouth key point features reduces the amount of computation required for image processing, thereby lowering the hardware requirements on the computer device executing the image processing method. Good recognition results can also be achieved for face images with different face shapes, textures and other appearance information, so relying on mouth key point features improves the ability to recognize image categories whose face shapes and textures were not involved in model training, which in turn improves the generalization ability of lip recognition.
In some embodiments, the model includes a syllable feature extraction network and a classification network, and the above step S603 can include the following steps S6031 to S6032:

Step S6031: Use the syllable feature extraction network to generate syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence.

Step S6032: Use the classification network to determine, in the preset keyword library, the keyword matching the syllable classification features.

Here, steps S6031 to S6032 correspond respectively to the aforementioned steps S503 to S504; for implementation, reference may be made to the specific implementations of steps S503 to S504.
In some implementations, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network, and the above step S6031 can include the following steps S60311 to S60313:

Step S60311: Use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each sample image frame, obtaining the spatial features of the mouth object in each sample image frame.

Step S60312: Use the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames, obtaining the spatiotemporal features of the mouth object.

Step S60313: Use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, obtaining the syllable classification features of the mouth object.

Here, steps S60311 to S60313 correspond respectively to the aforementioned steps S5031 to S5033; for implementation, reference may be made to the specific implementations of steps S5031 to S5033.
The application of the image processing method provided by the embodiments of the present disclosure in an actual scenario is described below, taking image processing used for lip recognition of Chinese as an example.

Figure 7 is a schematic structural diagram of a lip recognition model provided by an embodiment of the present disclosure. As shown in Figure 7, the lip recognition model includes: a single-frame feature extraction network 701, an inter-frame feature fusion network 702 and a feature sequence classification network 703. The single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012, and the feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.

Embodiments of the present disclosure provide an image processing method that generates an image frame sequence of the subject speaking according to the lip movement recognition results, takes the facial key point features as the input of the lip recognition model, uses monosyllables to assist in detecting the syllables in the speaking sequence, and uses the syllable feature layer to classify the speaking sequence. The image processing method of the embodiments of the present disclosure is described below with reference to Figure 7.

Embodiments of the present disclosure provide an image processing method, which can be executed by a processor of a computer device; the computer device may be a device with data processing capability, such as a vehicle head unit. The image processing method may include the following steps one to four:
Step 1: Input preprocessing.

The input video sequence obtained by the computer device has a variable number of frames. The key point sequence contains 106 key points per image frame; the 20 key points of the mouth object are taken out, and an interpolation method (for example, bilinear interpolation) is then used to generate from these 20 key points a position sequence of key points with a length of 60 image frames. The 20 mouth key points serve as the feature dimension, and each key point in the position sequence corresponds to a feature of length 5 in each image frame, yielding mouth key point features 704 corresponding to 60 frames; the mouth key point features 704 of each frame correspond to one image frame, and each of the 20 key points has a 5-dimensional feature in each image frame.

In some implementations, the first 4 dimensions of the feature are obtained from the coordinate differences between the current image frame and the preceding and following image frames, and the 5th dimension is obtained from the height difference between preset key point pairs in the current frame. The first 4 dimensions reflect the mouth shape changes between the current image frame and the preceding and following frames, and the 5th dimension reflects the mouth shape in the current frame. Here, the collected videos can be processed by means such as lip movement recognition so that each video covers at least the process of the target subject (usually a person) speaking one sentence, with each sentence corresponding to one keyword; the video and the keyword are thus in a one-to-one relationship. Moreover, regardless of how many frames the acquired speaking sequence contains, interpolation can be used to obtain a 60-frame position sequence.

Here, the more frames the position sequence contains, the lower the computational efficiency but the better the lip recognition performance; considering recognition performance, computational efficiency and the character-count distribution of the keywords to be detected, the number of frames in the position sequence is set to 60. The performance here can be the accuracy of lip recognition.
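A minimal NumPy sketch of resampling a variable-length mouth key point sequence to a fixed 60-frame position sequence is shown below; the linear interpolation along the time axis and the array shapes are illustrative assumptions rather than the disclosure's exact preprocessing.

```python
# A minimal sketch: resample a variable-length (T, 20, 2) mouth key point
# sequence to a fixed 60-frame position sequence by interpolating along time.
import numpy as np

def resample_positions(positions: np.ndarray, target_len: int = 60) -> np.ndarray:
    t = positions.shape[0]
    src = np.linspace(0.0, t - 1, num=target_len)    # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    w = (src - lo)[:, None, None]                    # interpolation weights
    return (1.0 - w) * positions[lo] + w * positions[hi]

resampled = resample_positions(np.random.rand(47, 20, 2))
print(resampled.shape)  # (60, 20, 2)
```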
Step 2: Single-frame feature extraction.

The computer device performs single-frame feature extraction through the single-frame feature extraction network 701 in Figure 7. The single-frame feature extraction network 701 includes a spatial feature extraction network 7011 and a spatial feature fusion network 7012.

The mouth key point features 704 are input into the lip recognition model, and the spatial feature extraction network 7011 independently performs feature extraction on the mouth key point features 704 of each image frame with a 1×1 convolution kernel; this convolution is repeated twice, and the features extracted by the two convolutions are input into the spatial feature fusion network 7012. In the spatial feature fusion network 7012, a 5×1 convolution kernel is first used to fuse the 5-dimensional features of each key point, obtaining the spatial feature of each image frame; then, using the feature 705 extracted for each image frame by the spatial feature extraction network 7011, a 1×1 convolution kernel is used to fuse the features across the 20 key points, obtaining the spatial feature 706 of the image frame and completing the single-frame feature extraction.

In some implementations, the convolution kernel can be a residual block (Residual Block) kernel.
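A minimal PyTorch sketch of this single-frame feature extraction is given below. The channel widths and the exact tensor layout are assumptions; only the kernel choices (two 1×1 convolutions, a convolution over the 5-dimensional key point feature, then a 1×1 fusion treating the per-key-point features as channels) follow the description above, and plain convolutions are used in place of residual blocks.

```python
# A minimal sketch of single-frame feature extraction over mouth key point
# features; layer widths and tensor layout are illustrative assumptions.
import torch
import torch.nn as nn

class SingleFrameFeatures(nn.Module):
    def __init__(self, keypoints: int = 20, dims: int = 5, out_dim: int = 64):
        super().__init__()
        # Per-frame input is laid out as (batch*T, 1, keypoints, dims).
        self.point_conv = nn.Sequential(               # two 1x1 convolutions per frame
            nn.Conv2d(1, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=1), nn.ReLU(),
        )
        self.dim_fuse = nn.Conv2d(16, 16, kernel_size=(1, dims))             # fuse the 5 dims of each key point
        self.point_fuse = nn.Conv1d(16 * keypoints, out_dim, kernel_size=1)  # fuse across the 20 key points

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, keypoints, dims) mouth key point features
        b, t, k, d = x.shape
        x = x.reshape(b * t, 1, k, d)
        x = self.point_conv(x)                        # (b*t, 16, k, d)
        x = torch.relu(self.dim_fuse(x))              # (b*t, 16, k, 1)
        x = x.reshape(b * t, 16 * k, 1)               # key point features stacked as channels
        x = self.point_fuse(x).reshape(b, t, -1)      # (batch, T, out_dim) per-frame spatial features
        return x

print(SingleFrameFeatures()(torch.randn(2, 60, 20, 5)).shape)  # torch.Size([2, 60, 64])
```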
Step 3: Inter-frame feature fusion.

The computer device performs inter-frame feature fusion of adjacent image frames through the inter-frame feature fusion network 702 in Figure 7.

The spatial feature 706 of each image frame is input into the inter-frame feature fusion network 702, where a 1×5 convolution kernel is used to convolve along the sequence length dimension, fusing the spatial feature 706 of each image frame with those of the two preceding and two following image frames; this convolution is repeated 5 times to enlarge the receptive field, so that information is exchanged between frames and the association between adjacent frames is strengthened, which helps learn keywords formed across multiple frames and the temporal order between Chinese characters.

This step consumes a certain amount of computing resources. To improve lip recognition performance, the convolution kernel size can be increased and the number of repetitions increased, which correspondingly affects computational efficiency. Weighing accuracy against hardware efficiency, in practical applications the number of extractions can be set to 5 and the convolution kernel size to 5.
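A minimal PyTorch sketch of the inter-frame feature fusion is shown below; the feature width is an assumption, while the kernel size of 5 and the 5 repetitions follow the values discussed above.

```python
# A minimal sketch of inter-frame feature fusion: five stacked 1x5 convolutions
# along the frame axis, each mixing a frame's spatial feature with its two
# neighbours on each side. The channel width is an illustrative assumption.
import torch
import torch.nn as nn

class InterFrameFusion(nn.Module):
    def __init__(self, feat_dim: int = 64, repeats: int = 5):
        super().__init__()
        layers = []
        for _ in range(repeats):
            layers += [nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU()]
        self.fuse = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) per-frame spatial features
        x = x.transpose(1, 2)          # (batch, feat_dim, T) for temporal convolution
        x = self.fuse(x)               # receptive field grows to 21 frames after 5 layers
        return x.transpose(1, 2)       # (batch, T, feat_dim) spatiotemporal features

print(InterFrameFusion()(torch.randn(2, 60, 64)).shape)  # torch.Size([2, 60, 64])
```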
Step 4: Feature sequence classification.

The computer device classifies the feature sequence through the feature sequence classification network 703 in Figure 7 and obtains the keyword index corresponding to the video sequence. The feature sequence includes the spatiotemporal features of multiple image frames. The feature sequence classification network 703 includes a syllable feature layer 7031 and a first linear layer 7032.

The spatiotemporal features are input into the flatten layer, second linear layer and non-linear activation (ReLU) layer in the syllable feature layer 7031 for processing, and the spatiotemporal features of all image frames are fused into a one-dimensional vector 707, achieving feature fusion of the spatiotemporal features of multiple image frames. The one-dimensional vector 707 is input into the third linear layer in the syllable feature layer 7031 for 100-class monosyllable auxiliary classification, obtaining the syllable classification features; the syllable classification features are input into the first linear layer 7032, which outputs the keyword index of the video sequence to be detected. The third linear layer can use a normalized exponential function (Softmax) and be trained with a binary cross-entropy loss (BCE loss) as the loss function. The first linear layer 7032 can be trained with a focal loss as the loss function and use Softmax for prediction; in practical applications, the first linear layer 7032 can be a margin linear (MarginLinear) layer, implemented with a fully connected layer or a global average pooling layer. Compared with using a global average pooling layer, directly flattening with a fully connected layer is equivalent to giving each frame a learnable position embedding, so that the positional order of each frame within the sentence can be recorded.

In some implementations, a detection algorithm for lip recognition with syllable-assisted learning is used. At present, ignoring tone, the pronunciations of all Chinese characters fall into 419 classes; according to mouth shape, these 419 syllable classes can be grouped into 100 classes, with syllables sharing the same mouth shape assigned to the same class. A feature of length 100 (corresponding to the syllable classification feature in the foregoing embodiments) is placed before the final fully connected classification layer, and the output of this feature is used as auxiliary supervision for the 100-class classification. The output of the syllable feature layer 7031 then represents which syllables occur in the lip sequence, and classifying this output can effectively reduce the learning difficulty of the fully connected classification layer, thereby improving performance. The syllable feature layer 7031 can be implemented with a linear layer.
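The PyTorch sketch below illustrates this classification head with monosyllable auxiliary supervision. The fused feature width, the 35 keyword classes, the use of plain cross entropy in place of focal loss and of a plain linear layer in place of a margin linear layer are assumptions for illustration, not the disclosure's exact implementation.

```python
# A minimal sketch of feature sequence classification with syllable-assisted
# learning: spatiotemporal features are flattened into one vector, mapped to a
# 100-way syllable feature supervised with BCE loss, and that feature feeds the
# final keyword classifier.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, frames: int = 60, feat_dim: int = 64,
                 num_syllables: int = 100, num_keywords: int = 35):
        super().__init__()
        self.fuse = nn.Sequential(                        # flatten + linear + ReLU
            nn.Flatten(), nn.Linear(frames * feat_dim, 512), nn.ReLU())
        self.syllable = nn.Linear(512, num_syllables)     # syllable feature layer
        self.keyword = nn.Linear(num_syllables, num_keywords)  # first linear layer

    def forward(self, x: torch.Tensor):
        # x: (batch, T, feat_dim) spatiotemporal features
        v = self.fuse(x)                                  # one-dimensional fused vector
        syllable_logits = self.syllable(v)                # auxiliary 100-class supervision
        keyword_logits = self.keyword(syllable_logits)    # keyword index prediction
        return syllable_logits, keyword_logits

model = SequenceClassifier()
syll, kw = model(torch.randn(2, 60, 64))
aux_loss = nn.BCEWithLogitsLoss()(syll, torch.zeros_like(syll))   # syllable-presence targets
main_loss = nn.CrossEntropyLoss()(kw, torch.tensor([0, 3]))       # keyword label targets
```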
In the embodiments of the present disclosure, the monosyllable auxiliary strategy noticeably improves performance. Moreover, the keywords used for matching can be stored in the form of a preset keyword library; when new keywords for matching are added, they can be added to the preset keyword library accordingly, which facilitates keyword updates.

It should be noted that, in implementation, the above coordinate differences can correspond to the difference information of the position information in the foregoing embodiments, the video sequence can correspond to the image frame sequence in the foregoing embodiments, the single-frame feature extraction network 701 can correspond to the spatial feature extraction sub-network in the foregoing embodiments, the inter-frame feature fusion network 702 can correspond to the temporal feature extraction sub-network in the foregoing embodiments, the syllable feature layer 7031 can correspond to the syllable classification feature extraction sub-network in the foregoing embodiments, and the first linear layer 7032 can correspond to the classification network in the foregoing embodiments.
In the field of human-computer interaction, the application of speech recognition still has certain limitations, for example when noise or music is loud or when it is inconvenient to speak; in such cases lip recognition can, to some extent, compensate for the inconvenience caused by these limitations. Based on the speaking interval detected by lip movement recognition, lip recognition can detect the keywords corresponding to what the speaker says within that interval. For example, in a vehicle cabin, speech recognition is the main means of human-computer interaction, but when the vehicle is noisy on the highway or music is played loudly, speech recognition cannot accurately recognize the user's speech; likewise, when someone is sleeping in the vehicle, it is inconvenient for the user to interact by voice. With lip recognition, the user only needs to mouth the words, and the vehicle head unit can detect the user's instruction, thereby completing the human-computer interaction.

Compared with lip recognition techniques in the related art, the embodiments of the present disclosure use key point recognition, which occupies fewer computing resources and can learn the inter-frame motion information of the lips, making it easier to deploy, more efficient and more accurate. When used for lip recognition, the image processing method provided by the embodiments of the present disclosure supports the recognition of 35 classes of commonly used keywords, with a recall rate of 81% while keeping the false alarm rate below one in a thousand.
Based on the foregoing embodiments, embodiments of the present disclosure further provide an image processing apparatus. The units included in the apparatus, and the modules included in each unit, can be implemented by a processor in a computer device, or of course by specific logic circuits; in implementation, the processor can be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.

Figure 8 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present disclosure. As shown in Figure 8, the image processing apparatus 800 includes: a first acquisition part 810, a first recognition part 820, a first determination part 830 and a first matching part 840, wherein:

the first acquisition part 810 is configured to acquire an image frame sequence containing a mouth object;

the first recognition part 820 is configured to perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame;

the first determination part 830 is configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, the syllable classification features representing the syllable categories corresponding to the mouth shapes of the mouth object in the image frame sequence;

the first matching part 840 is configured to determine, in a preset keyword library, the keyword matching the syllable classification features.
In some embodiments, when the image frame sequence includes at least two image frames, the first recognition part 820 includes: a first determination sub-part configured to determine the position information of at least two mouth key points of the mouth object in each image frame; and a second determination sub-part configured to, for each image frame in the image frame sequence, determine the mouth key point feature corresponding to the image frame according to the position information of the mouth key points in the image frame and in its adjacent frames.

In some embodiments, the mouth key point features include inter-frame difference information and intra-frame difference information of each mouth key point, and the second determination sub-part includes: a first determination unit configured to, for each mouth key point, determine a first height difference and/or a first width difference of the mouth key point between the image frame and an adjacent frame, according to the position information of the mouth key point in the image frame and its position information in the adjacent image frame, as the inter-frame difference information of the mouth key point; and a second determination unit configured to, for each mouth key point, determine the intra-frame difference information of the mouth key point according to a second height difference and/or a second width difference between the mouth key point and other mouth key points of the same mouth object in the image frame.

In some embodiments, the first determination part 830 includes: a first extraction sub-part configured to perform spatial feature extraction on the mouth key point features of each image frame, obtaining the spatial features of the mouth object in each image frame; a second extraction sub-part configured to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames, obtaining the spatiotemporal features of the mouth object; and a third extraction sub-part configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, obtaining the syllable classification features of the mouth object.

In some embodiments, the first extraction sub-part includes: a first extraction unit configured to fuse the inter-frame difference information and intra-frame difference information of multiple mouth key points of the mouth object, obtaining the inter-frame difference features and intra-frame difference features of the mouth object in each image frame; and a second extraction unit configured to fuse the inter-frame difference features and intra-frame difference features of the mouth object in multiple image frames, obtaining the spatial features of the mouth object in each image frame.

In some embodiments, the first determination part 830 includes a third determination sub-part configured to use a trained syllable feature extraction network to process the mouth key point features of multiple image frames in the image frame sequence, obtaining syllable classification features; and the first matching part 840 includes a first matching sub-part configured to use a trained classification network to determine, in a preset keyword library, the keyword matching the syllable classification features.

In some embodiments, the first acquisition part 810 includes a frame interpolation sub-part configured to: perform image frame interpolation on an acquired original image sequence containing a mouth object to obtain the image frame sequence; or, based on the mouth key points in the acquired original image sequence containing the mouth object, perform frame interpolation on the original image sequence to obtain the image frame sequence.

In some embodiments, the syllable feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a classification feature extraction sub-network, and the third determination sub-part includes: a third extraction unit configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each image frame, obtaining the spatial features of the mouth object in each image frame; a fourth extraction unit configured to use the temporal feature extraction sub-network to perform temporal feature extraction on the spatial features of the mouth object in multiple image frames, obtaining the spatiotemporal features of the mouth object; and a fifth extraction unit configured to use the classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, obtaining the syllable classification features of the mouth object.
The description of the above apparatus embodiments is similar to that of the above method embodiments and has similar beneficial effects. In some embodiments, the functions of, or the parts included in, the apparatus provided by the embodiments of the present disclosure can be used to perform the methods described in the above method embodiments; for technical details not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.

Based on the foregoing embodiments, embodiments of the present disclosure provide an apparatus for generating a lip recognition model. The units included in the apparatus, and the parts included in each unit, can be implemented by a processor in a computer device, or of course by specific logic circuits; in implementation, the processor can be a CPU, an MPU, a DSP, an FPGA, or the like.
Figure 9 is a schematic structural diagram of an apparatus for generating a lip recognition model provided by an embodiment of the present disclosure. As shown in Figure 9, the apparatus 900 includes: a second acquisition part 910, a second recognition part 920, a second matching part 930 and an update part 940, wherein:

the second acquisition part 910 is configured to acquire a sample image frame sequence containing a mouth object, the sample image frame sequence being labeled with a keyword tag;

the second recognition part 920 is configured to perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain the mouth key point features of each sample image frame;

the second matching part 930 is configured to use the model to be trained to generate syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence, and to determine, in a preset keyword library, the keyword matching the syllable classification features, the syllable classification features representing the syllable categories corresponding to the mouth shapes of the mouth object in the sample image frame sequence;

the update part 940 is configured to update the network parameters of the model at least once based on the determined keyword and the keyword tag, obtaining a trained lip recognition model.
In some embodiments, the model includes a syllable feature extraction network and a classification network, and the second matching part 930 includes: a fourth determination sub-part configured to use the feature extraction network to generate syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence; and a fifth determination sub-part configured to use the classification network to determine, in the preset keyword library, the keyword matching the syllable classification features.

In some embodiments, the feature extraction network includes a spatial feature extraction sub-network, a temporal feature extraction sub-network and a syllable classification feature extraction sub-network, and the fourth determination sub-part includes: a sixth extraction unit configured to use the spatial feature extraction sub-network to perform spatial feature extraction on the mouth key point features of each sample image frame, obtaining the spatial features of the mouth object in each sample image frame; a seventh extraction unit configured to use the temporal feature extraction sub-network to perform sample temporal feature extraction on the spatial features of the mouth object in multiple sample image frames, obtaining the spatiotemporal features of the mouth object; and an eighth extraction unit configured to use the syllable classification feature extraction sub-network to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object, obtaining the syllable classification features of the mouth object.

The description of the above apparatus embodiments is similar to that of the above method embodiments and has similar beneficial effects. In some embodiments, the functions of, or the parts included in, the apparatus provided by the embodiments of the present disclosure can be used to perform the methods described in the above method embodiments; for technical details not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
An embodiment of the present disclosure provides a vehicle, including:

a vehicle-mounted camera configured to capture an image frame sequence containing a mouth object;

a vehicle head unit connected to the vehicle-mounted camera and configured to: acquire the image frame sequence containing the mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, the syllable classification features representing the syllable categories corresponding to the mouth shapes of the mouth object in the image frame sequence; and determine, in a preset keyword library, the keyword matching the syllable classification features.

The description of the above vehicle embodiment is similar to that of the above method embodiments and has similar beneficial effects. For technical details not disclosed in the vehicle embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the above methods are implemented in the form of software functional parts and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part that contributes to the related art, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc. In this way, the embodiments of the present disclosure are not limited to any specific hardware, software or firmware, or to any combination of the three.
An embodiment of the present disclosure provides a computer device, including a memory and a processor. The memory stores a computer program executable on the processor, and when the processor executes the program, some or all of the steps of the above methods are implemented.

An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, some or all of the steps of the above methods are implemented. The computer-readable storage medium may be transitory or non-transitory.

An embodiment of the present disclosure provides a computer program, including computer-readable code; when the computer-readable code runs in a computer device, a processor in the computer device executes steps for implementing some or all of the above methods.

An embodiment of the present disclosure provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the above methods are implemented. The computer program product can be implemented by hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium; in other embodiments, it is embodied as a software product, such as a software development kit (SDK).

It should be pointed out here that the above descriptions of the various embodiments tend to emphasize the differences between them, and their similarities may be referred to each other. The descriptions of the above device, storage medium, computer program and computer program product embodiments are similar to those of the above method embodiments and have similar beneficial effects. For technical details not disclosed in the device, storage medium, computer program and computer program product embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
It should be noted that Figure 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure. As shown in Figure 10, the hardware entity of the computer device 1000 includes a processor 1001, a communication interface 1002, and a memory 1003, where:
The processor 1001 generally controls the overall operation of the computer device 1000.
The communication interface 1002 enables the computer device to communicate with other terminals or servers through a network.
The memory 1003 is configured to store instructions and applications executable by the processor 1001, and may also cache data to be processed or already processed by the processor 1001 and by the components of the computer device 1000 (for example, image data, audio data, voice communication data, and video communication data). The memory 1003 may be implemented by a flash memory (FLASH) or a Random Access Memory (RAM).
Data may be transmitted between the processor 1001, the communication interface 1002, and the memory 1003 through a bus 1004.
It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic associated with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present disclosure, the magnitude of the sequence numbers of the above steps/processes does not imply an order of execution; the execution order of the steps/processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present disclosure. The above sequence numbers of the embodiments of the present disclosure are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, in this document, the terms "comprise", "include", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division of logical functions; in actual implementation there may be other ways of division, for example, multiple units or components may be combined, may be integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may all be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
If the technical solution of the present disclosure involves personal information, a product applying the technical solution of the present disclosure clearly informs users of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution of the present disclosure involves sensitive personal information, a product applying the technical solution of the present disclosure obtains the individual's separate consent before processing the sensitive personal information and, at the same time, meets the requirement of "express consent". For example, at a personal information collection device such as a camera, a clear and conspicuous sign is set up to inform individuals that they have entered the scope of personal information collection and that personal information will be collected; if an individual voluntarily enters the collection scope, this is deemed consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, when the personal information processing rules are communicated through an obvious sign or message, personal authorization is obtained through a pop-up message or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information processed.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be implemented by program instructions and related hardware. The aforementioned program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a Read-Only Memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present disclosure is implemented in the form of a software functional part and is sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are only embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present disclosure.
Industrial applicability
The present disclosure relates to an image processing method, a model generation method, an apparatus, a vehicle, a storage medium, and a computer program product. The image processing method includes: acquiring an image frame sequence containing a mouth object; performing mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame; generating syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, where the syllable classification features represent the syllable category corresponding to the mouth shape of the mouth object in the image frame sequence; and determining, in a preset keyword library, a keyword matching the syllable classification features. This solution reduces the amount of computation required for the image processing involved in lip reading, thereby lowering the hardware requirements on computer devices. At the same time, good recognition results can be obtained for face images with different face shapes, textures, and other appearance information, which improves the generalization ability of lip reading. In addition, by representing the syllable classification features corresponding to the image frame sequence and determining the keyword of the word corresponding to the syllable category represented by those features, the keyword obtained through image processing is more precise, which improves the accuracy of lip reading.
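For illustration only, the overall flow can be sketched as follows. The function and variable names are assumptions rather than part of the claimed embodiments: a key-point extractor, a syllable feature network, and a keyword bank stand in for the trained components described in the claims below, and a simple nearest-prototype match stands in for the trained classification network.

```python
from typing import Callable, Sequence
import numpy as np

def recognize_keyword(
    frames: Sequence[np.ndarray],
    extract_keypoints: Callable[[np.ndarray], np.ndarray],    # frame -> (K, 2) mouth key points
    syllable_feature_net: Callable[[np.ndarray], np.ndarray],  # (T, D) key-point features -> syllable feature
    keyword_bank: dict[str, np.ndarray],                       # keyword -> reference syllable feature
) -> str:
    # 1) per-frame mouth key-point features
    keypoints = np.stack([extract_keypoints(f) for f in frames])   # (T, K, 2)
    feats = keypoints.reshape(len(frames), -1)                     # flatten to (T, K*2)
    # 2) sequence-level syllable classification feature
    syllable_feat = syllable_feature_net(feats)                    # (C,)
    # 3) closest keyword in the preset keyword bank (cosine similarity)
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(keyword_bank, key=lambda k: cos(syllable_feat, keyword_bank[k]))
```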

Claims (25)

  1. An image processing method, comprising:
    acquiring an image frame sequence containing a mouth object;
    performing mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame;
    generating syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent a syllable category corresponding to a mouth shape of the mouth object in the image frame sequence; and
    determining, in a preset keyword library, a keyword matching the syllable classification features.
  2. The method according to claim 1, wherein performing mouth key point feature extraction on each image frame in the image frame sequence to obtain the mouth key point features of each image frame comprises:
    determining position information of at least two mouth key points of the mouth object in each image frame; and
    for each image frame in the image frame sequence, determining the mouth key point features corresponding to the image frame according to the position information of the mouth key points in the image frame and in adjacent frames of the image frame.
  3. The method according to claim 2, wherein the mouth key point features comprise inter-frame difference information and intra-frame difference information of each mouth key point; and
    determining the mouth key point features corresponding to the image frame according to the position information of the mouth key points in the image frame and in adjacent frames of the image frame comprises:
    for each mouth key point, determining, according to the position information of the mouth key point in the image frame and the position information of the mouth key point in an adjacent frame of the image frame, a first height difference and/or a first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point; and
    for each mouth key point, determining the intra-frame difference information of the mouth key point according to a second height difference and/or a second width difference between the mouth key point and other mouth key points of the same mouth object in the image frame.
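As a minimal sketch of the height/width differences in claim 3 (assumptions: key points are (x, y) arrays, a single preceding frame serves as the adjacent frame, and all key-point pairs are used for the intra-frame term; the claim leaves these choices open):

```python
import numpy as np

def keypoint_difference_features(prev_pts: np.ndarray, cur_pts: np.ndarray) -> np.ndarray:
    """prev_pts, cur_pts: (K, 2) arrays of (x, y) mouth key points for adjacent frames."""
    # inter-frame differences: width (x) and height (y) change of each key point
    # between the current frame and the adjacent frame (first width/height differences)
    inter = cur_pts - prev_pts                                # (K, 2)
    # intra-frame differences: width/height offsets of each key point relative to the
    # other key points of the same mouth in the current frame (second width/height differences)
    intra = cur_pts[:, None, :] - cur_pts[None, :, :]         # (K, K, 2)
    return np.concatenate([inter.reshape(-1), intra.reshape(-1)])
```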
  4. The method according to any one of claims 1 to 3, wherein generating syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence comprises:
    performing spatial feature extraction on the mouth key point features of each image frame respectively to obtain spatial features of the mouth object in each image frame;
    performing temporal feature extraction on the spatial features of the mouth object in the multiple image frames to obtain spatiotemporal features of the mouth object; and
    performing syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  5. The method according to claim 4, wherein performing spatial feature extraction on the mouth key point features of each image frame respectively to obtain the spatial features of the mouth object in each image frame comprises:
    fusing the inter-frame difference information and the intra-frame difference information of the multiple mouth key points of the mouth object to obtain an inter-frame difference feature and an intra-frame difference feature of the mouth object in each image frame; and
    fusing the inter-frame difference features and the intra-frame difference features of the mouth object in the multiple image frames to obtain the spatial features of the mouth object in each image frame.
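A hypothetical stand-in for the two fusion steps of claim 5, written with PyTorch layers; the layer types (linear fusion per frame, a 1-D convolution across frames) and all dimensions are assumptions, not the claimed network.

```python
import torch
import torch.nn as nn

class SpatialFusion(nn.Module):
    # Toy fusion: per-frame fusion of key-point difference information,
    # then cross-frame fusion into spatial features.
    def __init__(self, inter_dim: int, intra_dim: int, hidden: int = 64):
        super().__init__()
        self.inter_fuse = nn.Linear(inter_dim, hidden)   # fuse inter-frame info of all key points
        self.intra_fuse = nn.Linear(intra_dim, hidden)   # fuse intra-frame info of all key points
        self.frame_fuse = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)  # mix neighbouring frames

    def forward(self, inter: torch.Tensor, intra: torch.Tensor) -> torch.Tensor:
        # inter: (T, inter_dim), intra: (T, intra_dim)  ->  spatial features (T, hidden)
        per_frame = torch.cat([self.inter_fuse(inter), self.intra_fuse(intra)], dim=-1)  # (T, 2*hidden)
        return self.frame_fuse(per_frame.T.unsqueeze(0)).squeeze(0).T                    # (T, hidden)
```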
  6. The method according to any one of claims 1 to 5, wherein:
    generating syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence comprises: processing the mouth key point features of the multiple image frames in the image frame sequence by using a trained syllable feature extraction network to obtain the syllable classification features; and
    determining, in the preset keyword library, the keyword matching the syllable classification features comprises: determining, in the preset keyword library, the keyword matching the syllable classification features by using a trained classification network.
  7. The method according to any one of claims 1 to 6, wherein acquiring the image frame sequence containing the mouth object comprises:
    performing image frame interpolation on an acquired original image sequence containing the mouth object to obtain the image frame sequence; or
    performing frame interpolation on the original image sequence based on mouth key points in the acquired original image sequence containing the mouth object to obtain the image frame sequence.
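In the simplest case, the key-point-based frame interpolation of claim 7 could be a linear blend of adjacent key-point sets, as in the sketch below; the interpolation factor and the linear blend are assumptions, since the claim does not fix a particular scheme.

```python
import numpy as np

def interpolate_keypoints(seq: np.ndarray, factor: int = 2) -> np.ndarray:
    """seq: (T, K, 2) mouth key points of the original sequence.
    Returns a denser sequence with `factor - 1` linearly interpolated frames
    inserted between each pair of original frames."""
    out = []
    for a, b in zip(seq[:-1], seq[1:]):
        for i in range(factor):
            t = i / factor
            out.append((1.0 - t) * a + t * b)   # linear blend of adjacent key-point sets
    out.append(seq[-1])
    return np.stack(out)
```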
  8. A method for generating a lip reading recognition model, comprising:
    acquiring a sample image frame sequence containing a mouth object, wherein the sample image frame sequence is annotated with a keyword label;
    performing mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain mouth key point features of each sample image frame;
    generating, by using a model to be trained, syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence, and determining, in a preset keyword library, a keyword matching the syllable classification features, wherein the syllable classification features represent a syllable category corresponding to a mouth shape of the mouth object in the sample image frame sequence; and
    updating network parameters of the model at least once based on the determined keyword and the keyword label to obtain a trained lip reading recognition model.
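A hedged sketch of one training update in the sense of claim 8; the cross-entropy loss, the optimizer, and the tensor shapes are assumptions. Any model mapping key-point feature sequences to keyword scores (such as the sketch after claim 10) could be plugged in.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               keypoint_feats: torch.Tensor, keyword_label: torch.Tensor) -> float:
    # keypoint_feats: (B, T, D) mouth key-point features of sample image frame sequences
    # keyword_label:  (B,) index of the labelled keyword in the preset keyword library
    logits = model(keypoint_feats)                        # (B, num_keywords)
    loss = nn.functional.cross_entropy(logits, keyword_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # one update of the network parameters
    return loss.item()
```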
  9. The method according to claim 8, wherein the model comprises a syllable feature extraction network and a classification network; and generating, by using the model to be trained, the syllable classification features according to the mouth key point features of the multiple sample image frames in the sample image frame sequence, and determining, in the preset keyword library, the keyword matching the syllable classification features comprises:
    generating the syllable classification features according to the mouth key point features of the multiple sample image frames in the sample image frame sequence by using the syllable feature extraction network; and
    determining, in the preset keyword library, the keyword matching the syllable classification features by using the classification network.
  10. The method according to claim 9, wherein the syllable feature extraction network comprises a spatial feature extraction sub-network, a temporal feature extraction sub-network, and a syllable classification feature extraction sub-network; and
    generating the syllable classification features according to the mouth key point features of the multiple sample image frames in the sample image frame sequence by using the syllable feature extraction network comprises:
    performing spatial feature extraction on the mouth key point features of each sample image frame respectively by using the spatial feature extraction sub-network to obtain spatial features of the mouth object in each sample image frame;
    performing sample temporal feature extraction on the spatial features of the mouth object in the multiple sample image frames by using the temporal feature extraction sub-network to obtain spatiotemporal features of the mouth object; and
    performing syllable classification feature extraction based on the spatiotemporal features of the mouth object by using the syllable classification feature extraction sub-network to obtain the syllable classification features of the mouth object.
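For illustration only, the sub-network structure of claims 9-10 could be arranged as below; the specific layer choices (a per-frame MLP, a GRU, linear heads) and all dimensions are assumptions rather than the claimed architecture.

```python
import torch
import torch.nn as nn

class LipKeywordModel(nn.Module):
    # Rough sketch: spatial sub-network -> temporal sub-network ->
    # syllable-classification sub-network, followed by a classification head
    # over the preset keyword library.
    def __init__(self, in_dim: int, hidden: int, num_syllable_feat: int, num_keywords: int):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # per-frame spatial features
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)            # spatiotemporal features
        self.syllable = nn.Linear(hidden, num_syllable_feat)                # syllable classification features
        self.classifier = nn.Linear(num_syllable_feat, num_keywords)        # keyword matching

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, in_dim) per-frame mouth key-point features
        s = self.spatial(x)                    # (B, T, hidden)
        _, h = self.temporal(s)                # h: (1, B, hidden) last hidden state
        syl = self.syllable(h.squeeze(0))      # (B, num_syllable_feat)
        return self.classifier(syl)            # (B, num_keywords)
```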
  11. An image processing apparatus, comprising:
    a first acquisition part configured to acquire an image frame sequence containing a mouth object;
    a first recognition part configured to perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame;
    a first determination part configured to generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent a syllable category corresponding to a mouth shape of the mouth object in the image frame sequence; and
    a first matching part configured to determine, in a preset keyword library, a keyword matching the syllable classification features.
  12. The apparatus according to claim 11, wherein the first recognition part comprises:
    a first determination sub-part configured to determine position information of at least two mouth key points of the mouth object in each image frame; and
    a second determination sub-part configured to, for each image frame in the image frame sequence, determine the mouth key point features corresponding to the image frame according to the position information of the mouth key points in the image frame and in adjacent frames of the image frame.
  13. The apparatus according to claim 12, wherein the mouth key point features comprise inter-frame difference information and intra-frame difference information of each mouth key point; and
    the second determination sub-part comprises:
    a first determination unit configured to, for each mouth key point, determine, according to the position information of the mouth key point in the image frame and the position information of the mouth key point in an adjacent frame of the image frame, a first height difference and/or a first width difference of the mouth key point between the image frame and the adjacent frame as the inter-frame difference information of the mouth key point; and
    a second determination unit configured to, for each mouth key point, determine the intra-frame difference information of the mouth key point according to a second height difference and/or a second width difference between the mouth key point and other mouth key points of the same mouth object in the image frame.
  14. The apparatus according to any one of claims 11 to 13, wherein the first determination part comprises:
    a first extraction sub-part configured to perform spatial feature extraction on the mouth key point features of each image frame respectively to obtain spatial features of the mouth object in each image frame;
    a second extraction sub-part configured to perform temporal feature extraction on the spatial features of the mouth object in the multiple image frames to obtain spatiotemporal features of the mouth object; and
    a third extraction sub-part configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object to obtain the syllable classification features of the mouth object.
  15. The apparatus according to claim 14, wherein the first extraction sub-part comprises:
    a first extraction unit configured to fuse the inter-frame difference information and the intra-frame difference information of the multiple mouth key points of the mouth object to obtain an inter-frame difference feature and an intra-frame difference feature of the mouth object in each image frame; and
    a second extraction unit configured to fuse the inter-frame difference features and the intra-frame difference features of the mouth object in the multiple image frames to obtain the spatial features of the mouth object in each image frame.
  16. The apparatus according to any one of claims 11 to 15, wherein the first determination part comprises:
    a third determination sub-part configured to process the mouth key point features of the multiple image frames in the image frame sequence by using a trained syllable feature extraction network to obtain the syllable classification features; and
    the first matching part comprises:
    a first matching sub-part configured to determine, in the preset keyword library, the keyword matching the syllable classification features by using a trained classification network.
  17. The apparatus according to any one of claims 11 to 16, wherein the syllable feature extraction network comprises a spatial feature extraction sub-network, a temporal feature extraction sub-network, and a classification feature extraction sub-network; and
    the third determination sub-part comprises:
    a third extraction unit configured to perform spatial feature extraction on the mouth key point features of each image frame respectively by using the spatial feature extraction sub-network to obtain spatial features of the mouth object in each image frame;
    a fourth extraction unit configured to perform temporal feature extraction on the spatial features of the mouth object in the multiple image frames by using the temporal feature extraction sub-network to obtain spatiotemporal features of the mouth object; and
    a fifth extraction unit configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object by using the classification feature extraction sub-network to obtain the syllable classification features of the mouth object.
  18. The apparatus according to claim 11, wherein the first acquisition part comprises a frame interpolation sub-part configured to:
    perform image frame interpolation on an acquired original image sequence containing the mouth object to obtain the image frame sequence; or
    perform frame interpolation on the original image sequence based on mouth key points in the acquired original image sequence containing the mouth object to obtain the image frame sequence.
  19. An apparatus for generating a lip reading recognition model, comprising:
    a second acquisition part configured to acquire a sample image frame sequence containing a mouth object, wherein the sample image frame sequence is annotated with a keyword label;
    a second recognition part configured to perform mouth key point feature extraction on each sample image frame in the sample image frame sequence to obtain mouth key point features of each sample image frame;
    a second matching part configured to generate, by using a model to be trained, syllable classification features according to the mouth key point features of multiple sample image frames in the sample image frame sequence, and determine, in a preset keyword library, a keyword matching the syllable classification features, wherein the syllable classification features represent a syllable category corresponding to a mouth shape of the mouth object in the sample image frame sequence; and
    an updating part configured to update network parameters of the model at least once based on the determined keyword and the keyword label to obtain a trained lip reading recognition model.
  20. The apparatus according to claim 19, wherein the model comprises a syllable feature extraction network and a classification network; and
    the second matching part comprises:
    a fourth determination sub-part configured to generate the syllable classification features according to the mouth key point features of the multiple sample image frames in the sample image frame sequence by using the syllable feature extraction network; and
    a fifth determination sub-part configured to determine, in the preset keyword library, the keyword matching the syllable classification features by using the classification network.
  21. The apparatus according to claim 20, wherein the syllable feature extraction network comprises a spatial feature extraction sub-network, a temporal feature extraction sub-network, and a syllable classification feature extraction sub-network; and
    the fourth determination sub-part comprises:
    a sixth extraction unit configured to perform spatial feature extraction on the mouth key point features of each sample image frame respectively by using the spatial feature extraction sub-network to obtain spatial features of the mouth object in each sample image frame;
    a seventh extraction unit configured to perform sample temporal feature extraction on the spatial features of the mouth object in the multiple sample image frames by using the temporal feature extraction sub-network to obtain spatiotemporal features of the mouth object; and
    an eighth extraction unit configured to perform syllable classification feature extraction based on the spatiotemporal features of the mouth object by using the syllable classification feature extraction sub-network to obtain the syllable classification features of the mouth object.
  22. A computer device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor implements the steps of the method according to any one of claims 1 to 10 when executing the program.
  23. A vehicle, comprising:
    a vehicle-mounted camera configured to capture an image frame sequence containing a mouth object; and
    a vehicle head unit connected to the vehicle-mounted camera and configured to: acquire the image frame sequence containing the mouth object from the vehicle-mounted camera; perform mouth key point feature extraction on each image frame in the image frame sequence to obtain mouth key point features of each image frame; generate syllable classification features according to the mouth key point features of multiple image frames in the image frame sequence, wherein the syllable classification features represent a syllable category corresponding to a mouth shape of the mouth object in the image frame sequence; and determine, in a preset keyword library, a keyword matching the syllable classification features.
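A hypothetical sketch of how the head unit of claim 23 might drive the pipeline over frames pulled from the vehicle-mounted camera; `read_frames` and `spot_keyword` are placeholder callables, and the fixed window length is an assumption.

```python
from typing import Callable, Iterable, Sequence
import numpy as np

def head_unit_loop(read_frames: Iterable[np.ndarray],
                   spot_keyword: Callable[[Sequence[np.ndarray]], str],
                   window: int = 16) -> list[str]:
    keywords, buffer = [], []
    for frame in read_frames:           # frames streamed from the vehicle-mounted camera
        buffer.append(frame)
        if len(buffer) == window:       # one image frame sequence per window
            keywords.append(spot_keyword(buffer))
            buffer.clear()
    return keywords
```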
  24. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.
  25. A computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when read and executed by a computer, causes the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2023/091298 2022-04-29 2023-04-27 Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product WO2023208134A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210476318.1A CN114821794A (en) 2022-04-29 2022-04-29 Image processing method, model generation method, image processing apparatus, vehicle, and storage medium
CN202210476318.1 2022-04-29

Publications (1)

Publication Number Publication Date
WO2023208134A1 true WO2023208134A1 (en) 2023-11-02

Family

ID=82510607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091298 WO2023208134A1 (en) 2022-04-29 2023-04-27 Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN114821794A (en)
WO (1) WO2023208134A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821794A (en) * 2022-04-29 2022-07-29 上海商汤临港智能科技有限公司 Image processing method, model generation method, image processing apparatus, vehicle, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
CN112784696A (en) * 2020-12-31 2021-05-11 平安科技(深圳)有限公司 Lip language identification method, device, equipment and storage medium based on image identification
CN114821794A (en) * 2022-04-29 2022-07-29 上海商汤临港智能科技有限公司 Image processing method, model generation method, image processing apparatus, vehicle, and storage medium

Also Published As

Publication number Publication date
CN114821794A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
US11093734B2 (en) Method and apparatus with emotion recognition
WO2020182121A1 (en) Expression recognition method and related device
KR101617649B1 (en) Recommendation system and method for video interesting section
CN110765294B (en) Image searching method and device, terminal equipment and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
WO2023284182A1 (en) Training method for recognizing moving target, method and device for recognizing moving target
CN112804558B (en) Video splitting method, device and equipment
WO2023208134A1 (en) Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product
WO2023207541A1 (en) Speech processing method and related device
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113590876A (en) Video label setting method and device, computer equipment and storage medium
CN113569607A (en) Motion recognition method, motion recognition device, motion recognition equipment and storage medium
CN109902155B (en) Multi-modal dialog state processing method, device, medium and computing equipment
WO2024001539A1 (en) Speaking state recognition method and apparatus, model training method and apparatus, vehicle, medium, computer program and computer program product
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN114463552A (en) Transfer learning and pedestrian re-identification method and related equipment
CN114140718A (en) Target tracking method, device, equipment and storage medium
CN111782762A (en) Method and device for determining similar questions in question answering application and electronic equipment
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN111797303A (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN113196279A (en) Face attribute identification method and electronic equipment
CN114912502B (en) Double-mode deep semi-supervised emotion classification method based on expressions and voices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23795566

Country of ref document: EP

Kind code of ref document: A1