WO2024001539A1 - Speaking state recognition method and model training method, apparatus, vehicle, medium, computer program and computer program product - Google Patents

Speaking state recognition method and model training method, apparatus, vehicle, medium, computer program and computer program product

Info

Publication number
WO2024001539A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
facial image
target object
sequence
mouth
Prior art date
Application number
PCT/CN2023/093495
Other languages
English (en)
French (fr)
Inventor
范栋轶
李潇婕
王飞
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2024001539A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation

Definitions

  • the present disclosure relates to but is not limited to the field of information technology, and in particular, to a speaking state recognition method and a model training method, a device, a vehicle, a medium, a computer program and a computer program product.
  • Lip movement detection technology can use computer vision technology to identify faces from video images, extract the changing characteristics of the mouth area of the face, and thereby identify the movement status of the mouth area.
  • embodiments of the present disclosure provide a speaking state recognition method and a model training method, a device, a vehicle, a medium, a computer program and a computer program product.
  • An embodiment of the present disclosure provides a speaking state recognition method.
  • the method is executed by an electronic device.
  • the method includes: obtaining a facial image frame sequence of the target object; obtaining the mouth key point information of each image frame in the facial image frame sequence; determining, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the facial image frame sequence, where the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence; and determining the recognition result of the speaking state of the target object based on the displacement characteristics.
  • Embodiments of the present disclosure provide a model training method, which is executed by an electronic device.
  • the method includes: obtaining a sample facial image frame sequence of the target object, wherein the sample facial image frame sequence is annotated with a sample label characterizing the speaking state of the target object; obtaining the mouth key point information of each sample image frame in the sample facial image frame sequence; determining, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the sample facial image frame sequence, where the displacement characteristics represent the position changes of the mouth key points between multiple sample image frames in the sample facial image frame sequence; using the recognition result generation network in the model to be trained to determine the recognition result of the speaking state of the target object according to the displacement characteristics; and updating the network parameters of the model at least once based on the recognition result and the sample label to obtain the trained model.
  • An embodiment of the present disclosure provides a speaking state recognition device, which includes:
  • a first acquisition part configured to acquire a sequence of facial image frames of the target object
  • the second acquisition part is configured to acquire the mouth key point information of each image frame in the sequence of facial image frames
  • the first determination part is configured to determine, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the facial image frame sequence, where the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence;
  • the second determination part is configured to determine the recognition result of the speaking state of the target object according to the displacement characteristic.
  • An embodiment of the present disclosure provides a model training device, including:
  • the third acquisition part is configured to acquire a sequence of sample facial image frames of the target object, wherein the sequence of sample facial image frames is annotated with a sample label characterizing the speaking state of the target object;
  • the fourth acquisition part is configured to acquire the mouth key point information of each sample image frame in the sample facial image frame sequence
  • the third determination part is configured to determine, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the sample facial image frame sequence, where the displacement characteristics represent the position changes of the mouth key points between multiple sample image frames in the sample facial image frame sequence;
  • the fourth determination part is configured to use the recognition result generation network in the model to be trained to determine the recognition result of the speaking state of the target object according to the displacement characteristics;
  • the update part is configured to update the network parameters of the model at least once based on the recognition result and the sample label to obtain the trained model.
  • An embodiment of the present disclosure provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can be run on the processor; when the processor executes the program, some or all of the steps in the above method are implemented.
  • An embodiment of the present disclosure provides a vehicle, including:
  • a vehicle-mounted camera that captures a sequence of facial image frames containing the target object;
  • an in-vehicle head unit connected to the vehicle-mounted camera, configured to: obtain the facial image frame sequence of the target object from the vehicle-mounted camera; obtain the mouth key point information of each image frame in the facial image frame sequence; determine, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the facial image frame sequence, where the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence; and determine the recognition result of the speaking state of the target object according to the displacement characteristics.
  • Embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • Embodiments of the present disclosure provide a computer program, which includes computer-readable code. When the computer-readable code runs in a computer device, the processor in the computer device executes some or all of the steps for implementing the above method.
  • Embodiments of the present disclosure provide a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, some or all of the steps of the above method are implemented.
  • In the embodiments of the present disclosure, a facial image frame sequence of the target object is obtained, and the mouth key point information of each image frame in the facial image frame sequence is obtained; in this way, the mouth key point information of the target object in each image frame of the facial image frame sequence can be obtained.
  • Then, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the facial image frame sequence are determined, where the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence; in this way, the displacement characteristics can represent the position change process of the target object's mouth key points in the facial image frame sequence. Finally, the speaking state of the target object is determined based on the displacement characteristics.
  • Since the displacement characteristics of the mouth key points corresponding to the facial image frame sequence can represent the position change process of the target object's mouth key points in the facial image frame sequence, determining the speaking state of the target object based on the displacement characteristics yields a recognition result that can accurately identify the speaking state of the target object, thereby improving the accuracy of speaking state recognition.
  • In addition, the above solution uses the displacement characteristics of the mouth key points, which reduces the amount of calculation required for speaking state recognition, thereby reducing the hardware requirements of the computer device that performs the speaking state recognition method.
  • good recognition results can be achieved for facial image frames with different face shapes, textures and other appearance information, thus improving the generalization ability of speaking state recognition.
  • Figure 1 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of facial key points provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic flowchart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure
  • Figure 5 is a schematic flowchart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure
  • Figure 6 is a schematic flowchart of the implementation of a speaking state identification method provided by an embodiment of the present disclosure
  • Figure 7 is a schematic flowchart of the implementation of a model training method provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of a speaking state recognition model provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a speaking state recognition device provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a model training device provided by an embodiment of the present disclosure.
  • Figure 11 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present disclosure.
  • The terms "first/second/third" involved are only used to distinguish similar objects and do not represent a specific ordering of objects. It should be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein.
  • Cabin intelligence includes multi-mode interaction, personalized services, safety perception and other aspects of intelligence, and is an important direction for the current development of the automotive industry.
  • multi-mode interaction in the cabin is intended to provide passengers with a comfortable interactive experience.
  • Multi-mode interaction methods include but are not limited to voice recognition, gesture recognition, etc.
  • the accuracy of speech recognition is not high when there are sound interferences such as wind outside the window and chatting in the car. Therefore, the introduction of lip movement detection using computer vision features will help identify more accurate speaking state intervals, thereby improving speech recognition accuracy.
  • However, the lip movement detection scheme in the related art has limitations: on the one hand, the scheme uses an image sequence of the mouth area as the model input; it finds the position of the face in the image through face detection, crops the mouth area from the image to obtain a sequence of mouth-area images, inputs this image sequence into a convolutional neural network for feature extraction, and inputs the features into a time-series prediction network for classification. Since the image sequence of mouth-area images is insensitive to mouth movement information, the accuracy of speaking state recognition is not high; moreover, three-dimensional convolution consumes a lot of computing resources and has high hardware requirements, making it difficult to apply at scale.
  • the solution is used to determine whether the person is in a speaking state based on the judgment results.
  • Embodiments of the present disclosure provide a speaking state recognition method, which can be executed by a processor of a computer device.
  • computer equipment can refer to cars, servers, laptops, tablets, desktop computers, smart TVs, set-top boxes, mobile devices (such as mobile phones, portable video players, personal digital assistants, dedicated messaging devices, portable game devices ) and other equipment with data processing capabilities.
  • Figure 1 is a schematic flowchart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure. As shown in Figure 1, the method includes the following steps S101 to S104:
  • Step S101 Obtain a facial image frame sequence of the target object.
  • the computer device acquires multiple image frames.
  • The multiple image frames are captured of the target object by a camera or other acquisition component. The multiple image frames are sorted according to the acquisition time corresponding to each image frame, or the image frames collected in real time are added to the facial image sequence of the target object in their acquisition order, so as to obtain the facial image frame sequence of the target object.
  • the length of the facial image frame sequence may not be fixed. In implementation, the length of the facial image frame sequence may be 40 frames, 50 frames, or 100 frames.
  • The computer device can acquire the multiple image frames by calling a camera, or acquire them from another computer device; for example, when the computer device is a vehicle, the images can be acquired through a vehicle-mounted camera, or the images collected by a mobile terminal can be obtained through wireless transmission from the terminal. It should be noted that at least one image frame of the facial image frame sequence may originate from a video stream; one video stream may include multiple video frames, and each video frame corresponds to an image frame.
  • At least one facial image frame sequence corresponding to each target facial image frame may be obtained from the video according to preset rules.
  • For example, the preset rule can be a sliding window method, which takes out a facial image frame sequence from the sliding window multiple times; that is, using a preset sliding step, a preset number of consecutive image frames are selected each time from multiple consecutive facial image frames as one facial image frame sequence. After the processing of a facial image frame sequence is completed (that is, speaking state recognition based on that facial image frame sequence is completed), the sliding window is slid along the preset direction according to the sliding step, and the facial image frames then inside the window form a new facial image frame sequence. Image frames can also be selected at fixed or non-fixed intervals to form a facial image frame sequence.
  • the image frame of the target facial image frame may include part or all of the target object's face, and at least include the mouth; the target object is usually a human being, but may also be other animals with expressive abilities, such as orangutans.
  • the target facial image frame can be understood as the image frame of the speaking state to be recognized.
  • Step S102 Obtain the mouth key point information of each image frame in the facial image frame sequence.
  • The facial image frame sequence includes at least one image frame. Key point detection can be performed on the at least one image frame in the facial image frame sequence to obtain the mouth key point information of the at least one image frame included in the facial image frame sequence. The mouth key point information contains the position information of each mouth key point.
  • In some embodiments, obtaining the mouth key point information of each image frame in the facial image frame sequence includes: performing facial key point detection on each facial image frame in the facial image frame sequence, and obtaining the mouth key point information in each facial image frame from the detected facial key points.
  • the mouth key point information in each facial image frame can be obtained in any suitable way.
  • For example, a trained key point detection model, implemented by a convolutional neural network, a recurrent neural network, or the like, can be used to perform facial key point detection on the facial image frames, and the mouth key point information of each image frame is obtained through the key point detection.
  • the position information can be represented by position parameters, for example, represented by two-dimensional coordinates in the image coordinate system.
  • The two-dimensional coordinates include width (abscissa) and height (ordinate). The displacement feature can represent the motion characteristics of the facial key points across the image frame sequence. The position information of the key points is related to the shape of the mouth; the position information of the same key point in different image frames changes as the shape of the mouth changes.
  • Step S103 based on the mouth key point information, determine the displacement characteristics of the mouth key points corresponding to the facial image frame sequence.
  • the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence.
  • a displacement feature that can characterize the position change of the mouth key point between multiple image frames in the facial image frame sequence is determined.
  • In some embodiments, for each image frame, the difference information of the position information of a mouth key point between that image frame and a first set number of image frames adjacent to it in the facial image frame sequence may be calculated, and the displacement feature is obtained from the difference information. For example, the difference information can be sorted in a set order, and the result is used as the displacement feature.
  • The first set number may be one, or may be two or more, and the first set number of image frames adjacent to the image frame may be consecutive image frames before and/or after the image frame.
  • For example, the displacement feature may include at least one of the following: the difference information of the position information between the image frame and the previous image frame; the difference information of the position information between the image frame and the next image frame.
  • Take the case where each image frame includes 4 mouth key points as an example.
  • The position information of the mouth key points in the image frame is (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), and the position information of the mouth key points in the previous image frame is (x'_1, y'_1), (x'_2, y'_2), (x'_3, y'_3), (x'_4, y'_4); the obtained displacement feature is [(x'_1 - x_1, y'_1 - y_1), (x'_2 - x_2, y'_2 - y_2), (x'_3 - x_3, y'_3 - y_3), (x'_4 - x_4, y'_4 - y_4)].
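  • For illustration only, a minimal Python sketch of this kind of per-key-point difference computation is given below; the array shapes, function name and example coordinate values are assumptions for illustration, not values from the disclosure.

```python
import numpy as np

def frame_displacements(curr_pts: np.ndarray, prev_pts: np.ndarray) -> np.ndarray:
    """Per-key-point (dx, dy) between the previous and the current frame.

    curr_pts, prev_pts: arrays of shape (K, 2) holding the (x, y) image
    coordinates of the K mouth key points in two adjacent frames.
    Returns an array of shape (K, 2) with the coordinate differences.
    """
    return prev_pts - curr_pts  # matches the (x'_i - x_i, y'_i - y_i) example above

# Example with 4 mouth key points (illustrative values only).
curr = np.array([[10.0, 20.0], [12.0, 20.5], [14.0, 20.0], [12.0, 22.0]])
prev = np.array([[10.2, 20.1], [12.1, 20.4], [14.3, 20.2], [12.0, 22.5]])
print(frame_displacements(curr, prev))  # 4 rows of (dx, dy)
```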
  • Step S104 Determine the recognition result of the speaking state of the target object based on the displacement characteristics.
  • the displacement characteristics corresponding to the facial image frame sequence are used to identify the speaking state of the target object, and a recognition result is obtained.
  • The recognition result indicates whether the target object is in a speaking state at a set image frame of the facial image frame sequence.
  • the speaking state of the target object can be obtained using any suitable recognition method.
  • it can be obtained by classifying the displacement features using a neural network for classification; for another example, it can be obtained by matching the displacement features by setting rules in advance.
  • The recognition result of the target object's speaking state can indicate whether the target object is in a speaking state at the set image frame.
  • the set image frame may be an image frame with a set number in the image frame sequence, including but not limited to the first frame, the second frame, the last frame or the penultimate frame.
  • the recognition results include any suitable information that can describe whether the target object is in a speaking state, for example, it can be information that directly describes whether the target object is in a speaking state, or it can include information that indirectly describes whether the target object is in a speaking state, such as confidence.
  • the target object is in a speaking state, which means that the corresponding target image frame is an image frame taken of the target object who is talking; the target object is in a non-speaking state, which means that the corresponding target image frame is an image frame taken of the target object who is not speaking.
  • Since the displacement characteristics of the mouth key points corresponding to the facial image frame sequence can represent the position change process of the target object's mouth key points in the facial image frame sequence, determining the speaking state of the target object based on the displacement characteristics yields a recognition result that can accurately identify the speaking state of the target object, thereby improving the accuracy of speaking state recognition.
  • In addition, using the displacement characteristics of the mouth key points to identify the speaking state can reduce the amount of calculation required for speaking state recognition, thereby reducing the hardware requirements of the computer device that performs the speaking state recognition method.
  • good recognition results can be achieved for facial image frames with different face shapes, textures and other appearance information, thus improving the generalization ability of speaking state recognition.
  • In some embodiments, the image frame sequences in which the target object is in the speaking state can be extracted, according to the recognition results, from the video stream from which the facial image frame sequence is derived. In this way, the accuracy of selecting from the video stream the image frame sequences in which the target object is speaking can be improved. Moreover, when lip reading is performed on the image frame sequences selected from the video stream according to the recognition results, the accuracy of lip reading can also be improved and the amount of calculation required for its image processing can be reduced.
  • step S103 may be implemented by the steps shown in FIG. 2 .
  • Figure 2 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 2:
  • Step S1031: For each facial image frame, perform the following steps: determine the inter-frame displacement information of each mouth key point based on the mouth key point information of that mouth key point in the facial image frame and in the adjacent frames of the facial image frame; determine the intra-frame difference information of multiple mouth key points in the facial image frame based on the mouth key point information corresponding to the multiple mouth key points in the facial image frame; and determine the displacement characteristics of the mouth key points corresponding to the facial image frame based on the inter-frame displacement information and intra-frame difference information of the multiple mouth key points.
  • In some embodiments, for each mouth key point, the difference information of the position information between the facial image frame and a second set number of facial image frames adjacent to it determines the inter-frame displacement information of the mouth key point.
  • The second set number may be one, or may be two or more, and the second set number of image frames adjacent to the facial image frame may be consecutive facial image frames before and/or after the facial image frame. Taking the second set number as two and the sequence number of the facial image frame in the facial image frame sequence as 20 as an example, the second set number of image frames adjacent to the facial image frame may be the image frames with sequence numbers 18, 19, 21, and 22 in the facial image frame sequence.
  • The difference information of the position information between facial image frames includes but is not limited to at least one of a first width difference, a first height difference, etc.; the first width difference is the width difference of the mouth key point between image frames, and the first height difference is the height difference of the mouth key point between image frames.
  • When calculating the difference, the position information of the subsequent image frame in the facial image frame sequence can be used as the minuend and the position information of the previous image frame as the subtrahend, or the position information of the previous image frame can be used as the minuend and the position information of the subsequent image frame as the subtrahend.
  • the preset key point pair includes two key points.
  • When determining a preset key point pair, the position relationship of the key points in the image is usually considered; that is, the two key points belonging to the same preset key point pair satisfy a set position relationship. For example, two key points located on the upper lip and the lower lip respectively can be regarded as a key point pair.
  • two key points whose width difference information in the image is smaller than the preset value can be determined as a preset key point pair.
  • one mouth key point can form a preset key point pair with two or more key points respectively. That is to say, each mouth key point can belong to at least one key point pair.
  • In some embodiments, the second height difference of each key point pair to which a mouth key point belongs is determined respectively, and the intra-frame difference information of the mouth key point can be obtained by a weighted calculation over, or by taking the maximum of, the at least two second height differences.
  • Figure 3 is a schematic diagram of facial key points provided by an embodiment of the present disclosure. Taking the schematic diagram of 106 facial key points shown in Figure 3 as an example, it includes a total of 106 key points numbered 0-105, which together describe a human face.
  • Key points No. 84 to 103 are the mouth key points used to describe the mouth.
  • For example, key point No. 86 can form a preset key point pair with key point No. 103 and with key point No. 94 respectively; that is, key point No. 86 can belong to two preset key point pairs, and two second height differences are calculated respectively, which are then combined by weighted summation to determine the intra-frame difference information of key point No. 86 in the facial image frame. In this way, the deviation in displacement feature calculation caused by key point detection errors can be mitigated, and speaking state recognition based on such displacement features can be more accurate.
  • For each facial image frame, based on the intra-frame difference information and inter-frame displacement information of each mouth key point in the facial image frame, the displacement characteristics of the facial image frame are determined through sequential concatenation or weighted calculation. In this way, the displacement characteristics of the facial image frame can be determined from the inter-frame displacement information and intra-frame difference information of all key points in the facial image frame.
  • For example, each mouth key point corresponds to a 5-dimensional feature in the displacement characteristics. The first 4 dimensions of the 5-dimensional feature are inter-frame displacement information, namely the width difference and the height difference between this image frame and the previous image frame, and the width difference and the height difference between this image frame and the next image frame. The fifth dimension is the intra-frame difference information, namely the second height difference of the preset key point pair to which the key point belongs in this image frame.
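  • The following is a hedged Python sketch of how such a 5-dimensional per-key-point feature could be assembled from the previous, current and next frames; the function name, the pairing array and the use of an absolute height difference are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np

def five_dim_feature(prev_pts, curr_pts, next_pts, pair_index):
    """Build the 5-D feature described above for each mouth key point.

    prev_pts, curr_pts, next_pts: (K, 2) arrays of (x, y) key-point positions
    in the previous, current and next frame.
    pair_index: length-K integer array; pair_index[i] is the index of the key
    point paired with key point i (illustrative pairing, e.g. an upper-lip
    point paired with the lower-lip point below it).
    Returns a (K, 5) array: [dw_prev, dh_prev, dw_next, dh_next, pair_dh].
    """
    d_prev = curr_pts - prev_pts   # width/height difference to the previous frame
    d_next = curr_pts - next_pts   # width/height difference to the next frame
    pair_dh = np.abs(curr_pts[:, 1] - curr_pts[pair_index, 1])  # intra-frame second height difference
    return np.concatenate([d_prev, d_next, pair_dh[:, None]], axis=1)
```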
  • Step S1032 Determine the displacement characteristics of the mouth key points corresponding to the facial image frame sequence based on the displacement characteristics of the mouth key points corresponding to multiple facial image frames in the facial image frame sequence.
  • the displacement features of the mouth key points corresponding to multiple facial image frames can be sorted according to a set order to obtain the displacement features of the mouth key points corresponding to the facial image frame sequence.
  • In the embodiments of the present disclosure, the intra-frame difference information can represent the differences between mouth key points that satisfy the set relationship, improving the accuracy of mouth shape recognition in each facial image frame, and the inter-frame displacement information can represent the inter-frame change process of the mouth key points during speaking that corresponds to the image frame sequence. In this way, the intra-frame difference information and the inter-frame displacement information in each facial image frame can better capture the change characteristics of the mouth shape during speaking, thereby improving the accuracy of speaking state recognition.
  • step S1031 may include the following steps S10311 to S10314:
  • Step S10311 Determine the eye-mouth distance of the target object in each image frame in the facial image frame sequence.
  • Eye-mouth distance represents the distance between the eyes and mouth of the target object in the image frame.
  • In some embodiments, the average coordinate of the key points of the two eyes in the image frame is used as a first coordinate, the average coordinate of the mouth key points is used as a second coordinate, and the distance between the first coordinate and the second coordinate is calculated as the eye-mouth distance of the target object in the image frame.
  • The eye-mouth distance may be the lateral distance between the first coordinate and the second coordinate, the longitudinal distance between the first coordinate and the second coordinate, or the two-dimensional distance between the first coordinate and the second coordinate.
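  • A minimal sketch of the eye-mouth distance computation in step S10311, assuming the key points are already available as coordinate arrays; the two-dimensional (Euclidean) variant is chosen here only as an example.

```python
import numpy as np

def eye_mouth_distance(eye_pts: np.ndarray, mouth_pts: np.ndarray) -> float:
    """Distance between the mean eye coordinate and the mean mouth coordinate.

    eye_pts: (E, 2) key points of both eyes; mouth_pts: (M, 2) mouth key points.
    The lateral or longitudinal distance could be used instead, as noted above.
    """
    first = eye_pts.mean(axis=0)     # first coordinate: average of the eye key points
    second = mouth_pts.mean(axis=0)  # second coordinate: average of the mouth key points
    return float(np.linalg.norm(first - second))
```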
  • Step S10312 Determine the reference distance based on the eye-mouth distance of the target object in each image frame in the facial image frame sequence.
  • one of the maximum, minimum, mean, or median values among multiple eye-mouth distances corresponding to the facial image frame sequence may be used as the reference distance.
  • For example, the maximum eye-mouth distance can be determined from the eye-mouth distances corresponding to the multiple image frames in the facial image frame sequence, and this maximum eye-mouth distance can be used as the reference distance.
  • Step S10313: Using the reference distance as the normalization denominator, normalize the inter-frame displacement information and the intra-frame difference information of the multiple mouth key points respectively, to obtain processed inter-frame displacement information and processed intra-frame difference information.
  • In some embodiments, using the reference distance as the normalization denominator and the inter-frame displacement information of each mouth key point as the normalization numerator, the processed inter-frame displacement information of the mouth key point is obtained; likewise, using the reference distance as the normalization denominator and the intra-frame difference information of each mouth key point as the normalization numerator, the processed intra-frame difference information of the mouth key point is obtained.
  • Step S10314 Determine the displacement characteristics of the mouth key points corresponding to the facial image frame based on the processed inter-frame displacement information and the processed intra-frame difference information of the multiple mouth key points.
  • In some embodiments, the displacement characteristics of the facial image frame are determined from the processed information through sequential concatenation or weighted calculation.
  • In the embodiments of the present disclosure, the eye-mouth distance of the target object in each image frame in the facial image frame sequence is used to determine the normalization denominator of the inter-frame displacement information and the intra-frame difference information. The displacement characteristics obtained through this normalization are more standardized, which improves the accuracy of the recognition result of the determined speaking state of the target object. Moreover, when a model is used to recognize the speaking state of the target object, the convergence speed of the model during training can be improved.
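  • A short sketch of the normalization in steps S10312 and S10313, assuming the per-frame features are stacked into a single array and the maximum eye-mouth distance is used as the reference distance (the mean or median would be used the same way).

```python
import numpy as np

def normalize_features(features: np.ndarray, eye_mouth_dists: np.ndarray) -> np.ndarray:
    """Normalize per-frame displacement features by a reference distance.

    features: (T, K, 5) displacement features for T frames and K key points
    (inter-frame displacement in the first 4 dims, intra-frame difference in the 5th).
    eye_mouth_dists: (T,) eye-mouth distances of the frames in the sequence.
    """
    reference = eye_mouth_dists.max()  # reference distance used as the normalization denominator
    return features / reference
```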
  • Figure 4 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 4:
  • Step S201 sequentially extract a sequence of image frames of a preset length from the video stream containing facial information of the target object in a sliding window manner as a sequence of facial image frames of the target object.
  • In some embodiments, a video stream containing facial information of the target object is obtained; the video stream is processed with a sliding window of a preset window size and a preset sliding step, and multiple image frame sequences of a preset length equal to the window size are sequentially extracted from the video stream, each of which is used as a facial image frame sequence of the target object.
  • The sliding step of the sliding window is not less than 1 and not greater than the preset length. Therefore, each facial image frame sequence taken out as the sliding window slides has at least one frame that does not overlap with the previously taken facial image frame sequence, and at least one frame that overlaps with it.
  • For example, the window size can be set to 22 image frames, and the sliding step can be set to any integer from 1 to 22, so that multiple image frame sequences with a length of 22 can be obtained.
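  • A minimal sketch of the sliding-window extraction described above; the window size of 22 and step of 1 follow the example in the text, and the downstream recognize_speaking_state call is purely hypothetical.

```python
def sliding_window_sequences(frames, window_size=22, step=1):
    """Yield fixed-length facial image frame sequences from a list of frames.

    frames: list of image frames decoded from the video stream.
    window_size / step follow the example above (window of 22 frames,
    step anywhere from 1 to 22); both values are illustrative.
    """
    for start in range(0, len(frames) - window_size + 1, step):
        yield frames[start:start + window_size]

# Usage: each yielded slice is treated as one facial image frame sequence.
# for seq in sliding_window_sequences(video_frames, window_size=22, step=1):
#     result = recognize_speaking_state(seq)   # hypothetical downstream call
```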
  • Step S202 Obtain the mouth key point information of each image frame in the facial image frame sequence.
  • Step S203 based on the mouth key point information, determine the displacement characteristics of the mouth key points corresponding to the facial image frame sequence.
  • the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence.
  • In some embodiments, the facial image frame sequence includes multiple facial image frames, and determining the displacement characteristics of the mouth key points corresponding to the facial image frame sequence based on the mouth key point information includes: for each facial image frame, performing the following steps: determining the inter-frame displacement information of each mouth key point based on the mouth key point information in the facial image frame and in the adjacent frames of the facial image frame; determining the intra-frame difference information of multiple mouth key points in the facial image frame based on the mouth key point information corresponding to the multiple mouth key points in the facial image frame; and determining the displacement characteristics of the mouth key points corresponding to the facial image frame based on the inter-frame displacement information and intra-frame difference information of the multiple mouth key points; and then determining the displacement characteristics of the mouth key points corresponding to the facial image frame sequence according to the displacement characteristics of the mouth key points corresponding to the multiple facial image frames in the facial image frame sequence.
  • In some embodiments, determining the displacement characteristics of the mouth key points corresponding to the facial image frame based on the inter-frame displacement information and the intra-frame difference information of the multiple mouth key points includes: determining the eye-mouth distance of the target object in each image frame in the facial image frame sequence; determining the reference distance based on the eye-mouth distance of the target object in each image frame in the facial image frame sequence; using the reference distance as the normalization denominator, normalizing the inter-frame displacement information and the intra-frame difference information of the multiple mouth key points respectively to obtain processed inter-frame displacement information and processed intra-frame difference information; and determining the displacement characteristics of the mouth key points corresponding to the facial image frame based on the processed inter-frame displacement information and the processed intra-frame difference information of the multiple mouth key points.
  • Step S204 Determine the recognition result of the speaking state of the target object based on the displacement characteristics.
  • the above steps S202 to S204 respectively correspond to the above steps S102 to S104, and the implementation of the above steps S102 to S104 may be referred to during implementation.
  • In the embodiments of the present disclosure, a sliding window is used to sequentially extract multiple facial image frame sequences of a preset length from the video stream, and these sequences are used to determine whether the target object is speaking at the set image frames in the video stream, thereby obtaining recognition results for multiple image frames in the video stream.
  • The speaking state is recognized from the facial image frame sequences obtained by multiple slides of the window, which can reflect the position change process of the target object's mouth key points in each of these sequences; and since there are at least some overlapping frames between the multiple facial image frame sequences, the speaking state of the target object at any set image frame among the consecutive image frames can be accurately identified. This improves the accuracy of the recognition result of the target object's speaking state, and thus the accuracy of selecting, from the video stream, the image frame sequences in which the target object is speaking.
  • Figure 5 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 5:
  • Step S301 Obtain a facial image frame sequence of the target object.
  • Step S302 Obtain the mouth key point information of each image frame in the facial image frame sequence.
  • Step S303 based on the mouth key point information, determine the displacement characteristics of the mouth key points corresponding to the facial image frame sequence.
  • the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence.
  • The above steps S301 to S303 respectively correspond to the foregoing steps S101 to S103, and the implementation of the foregoing steps S101 to S103 may be referred to during implementation.
  • Step S304 Use the trained key point feature extraction network to process the displacement features to obtain the spatial features of the facial image frame sequence.
  • In some embodiments, features can be extracted separately from the inter-frame displacement information and the intra-frame difference information in the displacement feature to obtain the inter-frame displacement feature and the intra-frame difference feature of the mouth key points, and spatial features are then extracted between the inter-frame displacement feature and the intra-frame difference feature to obtain the spatial features of the image frame. The spatial features of the facial image frame sequence are obtained from the spatial features of each image frame in the sequence.
  • each key point corresponds to a 5-dimensional feature in the displacement feature.
  • The first 4 dimensions of the 5-dimensional feature are inter-frame displacement information, namely the width difference and the height difference between the image frame and the previous image frame, and the width difference and the height difference between the image frame and the next image frame; the fifth dimension is the intra-frame difference information.
  • Features are first extracted across different key points for each dimension of the 5-dimensional feature; in the resulting feature, the first 4 dimensions are the inter-frame displacement features of the mouth key points in the image frame, and the fifth dimension is the intra-frame difference feature of the mouth key points in the image frame. Spatial feature extraction is then performed across these five dimensions to obtain the spatial features of the image frame.
  • the trained key point feature extraction network is trained with a preset sample set and can be implemented by any suitable network architecture, including but not limited to at least one of a convolutional neural network, a recurrent neural network, etc.
  • Step S305 Use the trained temporal feature extraction network to process the spatial features to obtain the spatiotemporal features of the facial image frame sequence.
  • In some embodiments, at least one temporal feature extraction is performed on the spatial features of multiple image frames in the facial image frame sequence to obtain the spatio-temporal features corresponding to each image frame; the spatio-temporal features of the facial image frame sequence are obtained from the spatio-temporal features of each image frame in the sequence.
  • Spatio-temporal features can be extracted from the spatial features using any suitable feature extraction method. For example, for a single temporal feature extraction, a 1×5 convolution kernel is used, so each convolution aggregates the spatial features of the two image frames before and the two image frames after the current image frame; the extracted spatio-temporal features thus cover information from five image frames.
  • the trained temporal feature extraction network is trained with a preset sample set and can be implemented by any suitable network architecture, including but not limited to at least one of a convolutional neural network, a recurrent neural network, etc.
  • As the number of temporal feature extractions increases, the spatio-temporal features of each image frame can represent more information about the image frame, and the correspondingly larger receptive field helps improve the accuracy of speaking state recognition, but more computing resources are consumed, which affects hardware computing efficiency. For example, the number of temporal feature extractions can be set to 5 during implementation.
  • the key point feature extraction network and the temporal feature extraction network are trained based on a training sample set, where the training sample set includes a sequence of continuous video frames that have been labeled with the speaking state of the object in each included video frame.
  • That is, the key point feature extraction network and the temporal feature extraction network are trained with continuous video frame sequences in which the speaking state of the object in each video frame has been annotated, so as to obtain the trained key point feature extraction network and the trained temporal feature extraction network.
  • Step S306 Determine the recognition result of the speaking state of the target object based on spatiotemporal features.
  • the spatiotemporal characteristics of the image frames in the facial image frame sequence are used to identify the speaking state of the target object, and the recognition result is obtained.
  • The recognition result indicates whether the target object is in a speaking state at the set image frame in the facial image frame sequence.
  • the speaking state of the target object can be obtained by any suitable recognition method.
  • For example, the recognition result can be obtained by classifying the features with a classification network, such as a global average pooling (GAP) layer and a fully connected layer; for another example, it can be obtained by matching the features against preset rules.
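  • For illustration, the following PyTorch sketch strings together the kind of components described in steps S304 to S306 (per-frame key point feature extraction, stacked 1×5 temporal convolutions, global average pooling and a classification head); the layer types, channel sizes and number of key points are assumptions and do not reproduce the patent's disclosed network.

```python
import torch
import torch.nn as nn

class SpeakingStateNet(nn.Module):
    """Sketch: spatial feature extraction over key points, then temporal 1x5 convs,
    then global average pooling and a linear classifier. Sizes are illustrative."""

    def __init__(self, num_keypoints=20, feat_dim=5, hidden=64, num_classes=2):
        super().__init__()
        # Key point feature extraction: mix information across key points and
        # the 5 displacement dimensions for each frame independently.
        self.spatial = nn.Sequential(
            nn.Linear(num_keypoints * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Temporal feature extraction: five stacked 1x5 convolutions along time,
        # following the repetition count mentioned in the text.
        self.temporal = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
            for _ in range(5)
        ])
        self.pool = nn.AdaptiveAvgPool1d(1)       # global average pooling over time
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, T, num_keypoints, feat_dim) normalized displacement features
        b, t = x.shape[0], x.shape[1]
        h = self.spatial(x.reshape(b, t, -1))     # (b, T, hidden) per-frame spatial features
        h = self.temporal(h.transpose(1, 2))      # (b, hidden, T) spatio-temporal features
        h = self.pool(h).squeeze(-1)              # (b, hidden)
        return self.classifier(h)                 # (b, num_classes) speaking / not-speaking logits

# Example: a batch of 8 sequences of 22 frames, 20 mouth key points, 5-D features.
logits = SpeakingStateNet()(torch.randn(8, 22, 20, 5))
```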
  • In this way, the embodiments of the present disclosure support the use of convolutional neural networks for spatio-temporal feature extraction. Compared with using time-series prediction networks such as recurrent neural networks to extract spatio-temporal features, extracting spatio-temporal features through convolutional neural networks requires less computation, which can reduce the consumption of computing resources and lower the hardware requirements of the computer device used for speaking state recognition.
  • Moreover, convolutional neural networks reduce the requirements on chip computing capability, so that the speaking state recognition method provided by the embodiments of the present disclosure can be implemented on more lightweight chips; more hardware can thus support the speaking state recognition of the embodiments of the present disclosure, improving the versatility of speaking state recognition.
  • For example, computer devices such as vehicles and personal computers can also realize speaking state recognition.
  • Figure 6 is a schematic flow chart of the implementation of a speaking state recognition method provided by an embodiment of the present disclosure. The following description will be made in conjunction with the steps shown in Figure 6:
  • Step S401 Obtain a facial image frame sequence of the target object.
  • Step S402 Obtain the mouth key point information of each image frame in the facial image frame sequence.
  • Step S403 based on the mouth key point information, determine the displacement characteristics of the mouth key points corresponding to the facial image frame sequence.
  • the displacement characteristics represent the position changes of the mouth key points between multiple image frames in the facial image frame sequence.
  • the above steps S401 to S403 respectively correspond to the above steps S101 to S103, and the implementation of the above steps S101 to S103 may be referred to during implementation.
  • Step S404 Use the trained key point feature extraction network to process the displacement features to obtain the spatial features of the facial image frame sequence.
  • Step S405 Use the trained temporal feature extraction network to process the spatial features to obtain the spatiotemporal features of the facial image frame sequence.
  • The above steps S404 to S405 respectively correspond to the foregoing steps S304 to S305, and the implementation of the foregoing steps S304 to S305 may be referred to during implementation.
  • Step S406 Determine the recognition result of the speaking state of the target object corresponding to the facial image frame sequence according to the spatiotemporal characteristics, as the recognition result of the speaking state of the target object in the last image frame in the facial image frame sequence.
  • the spatiotemporal characteristics of the image frames in the facial image frame sequence are used to identify the target object's speaking state, and the recognition result is obtained.
  • The recognition result represents whether the target object is in a speaking state at the moment corresponding to the last image frame in the facial image frame sequence.
  • Step S407 Determine the start frame and end frame of the target object's speech based on the recognition result of the target object's speaking state in the last image frame in the facial image frame sequence taken out from multiple sliding windows.
  • By determining whether the target object is speaking in multiple image frames that satisfy a set position relationship in the video stream, the starting frame at which the target object starts speaking in the video stream and the end frame at which the target object stops speaking in the video stream are determined.
  • the set position relationship is related to the step size of the sliding window. For example, if the step size is 1, it can be determined whether the target object is speaking in multiple consecutive image frames.
  • In some embodiments, a video stream containing the facial information of the target object is obtained; the video stream is processed with a sliding window of a preset window size and a preset sliding step, and multiple image frame sequences of a preset length equal to the window size are sequentially extracted from the video stream, each of which is used as a facial image frame sequence of the target object.
  • the sliding step size of the sliding window is not less than 1, and the sliding step size of the sliding window is not greater than the preset length.
  • Each image frame in the facial image frame sequence can be used as an image frame to be determined, and it is determined whether the image frame to be determined is the starting frame or the end frame of speaking.
  • In some embodiments, the recognition result of the speaking state includes a first confidence level indicating that the target object is in a first state of speaking. When the first confidence level corresponding to the image frame to be judged is greater than or equal to a first preset threshold and the first confidence level corresponding to the previous image frame of the image frame to be judged in the facial image frame sequence is less than the first preset threshold, the image frame to be judged is used as the starting frame of the target object's speech; when the first confidence level corresponding to the image frame to be judged is greater than or equal to the first preset threshold and the first confidence level corresponding to the subsequent image frame of the image frame to be judged in the facial image frame sequence is less than the first preset threshold, the image frame to be judged is used as the end frame of the target object's speech.
  • In some embodiments, the recognition result of the speaking state includes a second confidence level indicating that the target object is in a second state of not speaking. When the second confidence level corresponding to the image frame to be judged is less than a second preset threshold and the second confidence level corresponding to the previous image frame of the image frame to be judged in the facial image frame sequence is greater than or equal to the second preset threshold, the image frame to be judged is used as the starting frame of the target object's speech; when the second confidence level corresponding to the image frame to be judged is less than the second preset threshold and the second confidence level corresponding to the subsequent image frame of the image frame to be judged in the facial image frame sequence is greater than or equal to the second preset threshold, the image frame to be judged is used as the end frame of the target object's speech.
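  • A small sketch of how the start and end frames could be derived from per-frame first confidence levels when the sliding step is 1, as described above; the threshold value and the input list are illustrative.

```python
def speech_intervals(confidences, threshold=0.5):
    """Find (start, end) frame indices of speech from per-frame confidence values.

    confidences: list of first confidence levels, one per judged frame (e.g. the
    last frame of each window); threshold plays the role of the first preset threshold.
    """
    intervals, start = [], None
    for i, c in enumerate(confidences):
        speaking = c >= threshold
        currently_open = start is not None
        if speaking and not currently_open:
            start = i                          # previous frame was below threshold: start frame
        elif not speaking and currently_open:
            intervals.append((start, i - 1))   # next frame falls below threshold: end frame
            start = None
    if start is not None:
        intervals.append((start, len(confidences) - 1))
    return intervals

# Example: frames 2..5 are recognized as speaking.
print(speech_intervals([0.1, 0.2, 0.8, 0.9, 0.9, 0.7, 0.2]))  # [(2, 5)]
```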
  • In the embodiments of the present disclosure, the start frame and the end frame of the target object's speech in the video stream are determined, which improves the accuracy of selecting from the video stream the image frame sequences in which the target object is speaking.
  • Moreover, when lip reading is performed on the image frame sequences selected in this way, the accuracy of lip reading can also be improved and the amount of calculation required for its image processing can be reduced.
  • Embodiments of the present disclosure provide a model training method, which can be executed by a processor of a computer device. As shown in Figure 7, the method includes the following steps S501 to S505:
  • Step S501 Obtain a sequence of sample facial image frames of the target object.
  • the sample facial image frame sequence is annotated with a sample label that represents the speaking state of the target object.
  • the computer device obtains a sample facial image frame sequence that has been labeled with a sample label.
  • the sample facial image frame sequence includes a sample image frame.
  • the sample image frame contains part or all of the face of the set target object, and at least includes the mouth.
  • The sample label can describe the speaking state of the target object in the sample image frames.
  • For example, a sample facial image frame sequence in which the target object is in a speaking state in all sample image frames can be labeled with sample label 1, and a sample facial image frame sequence in which the target object is in a non-speaking state in all sample image frames can be labeled with sample label 0.
  • The sample facial image frame sequences can be taken out from the video stream in sequence in a sliding-window manner, using a preset window size and sliding step size. A sketch of this sampling and labeling step is given below.
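  • The sketch below samples fixed-length windows from an annotated frame stream and assigns label 1 to all-speaking windows and label 0 to all-non-speaking windows, skipping mixed windows as described later in the training section. The array shapes, function name and window/step defaults are assumptions.

```python
import numpy as np

def make_training_windows(frames: np.ndarray, speaking: np.ndarray,
                          window: int = 22, step: int = 1):
    """Sliding-window sampling of sample facial image frame sequences.

    `frames` is a stack of facial image frames, `speaking` a per-frame 0/1
    annotation of the speaking state.
    """
    samples, labels = [], []
    for start in range(0, len(frames) - window + 1, step):
        flags = speaking[start:start + window]
        if flags.all():
            samples.append(frames[start:start + window]); labels.append(1)
        elif not flags.any():
            samples.append(frames[start:start + window]); labels.append(0)
        # windows containing only part of a speech segment are not used for now
    if not samples:
        return np.empty((0, window) + frames.shape[1:]), np.empty((0,), dtype=int)
    return np.stack(samples), np.asarray(labels)
```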
  • Step S502 Obtain the mouth key point information of each sample image frame in the sample facial image frame sequence.
  • Step S503 based on the mouth key point information, determine the displacement characteristics of the mouth key points corresponding to the sample facial image frame sequence.
  • The displacement features represent the position changes of the mouth key points between multiple sample image frames in the sample facial image frame sequence.
  • Step S504: Use the recognition result generation network in the model to be trained to determine the recognition result of the speaking state of the target object based on the displacement features.
  • the model to be trained can be any suitable deep learning model, and is not limited here.
  • those skilled in the art can use an appropriate network structure to construct the model to be trained according to the actual situation.
  • The model to be trained may also include the above-mentioned key point feature extraction network and temporal feature extraction network; in that case, the displacement features may be input to the key point feature extraction network, the temporal feature extraction network may be used to further process the output data of the key point feature extraction network, and the recognition result generation network may then process the spatio-temporal features output by the temporal feature extraction network to obtain the recognition result of the speaking state.
  • The model to be trained is trained end to end on the classification score to obtain the recognition result.
  • The advantage of end-to-end training is that, by reducing manual pre-processing and post-processing, the model maps from the original input to the final output as directly as possible, which gives the model more room to adapt automatically to the data and improves its fit.
  • the above steps S501 to S504 respectively correspond to the above steps S101 to S104, and the implementation of the above steps S101 to S104 may be referred to during implementation.
  • Step S505 Based on the recognition results and sample labels, update the network parameters of the model at least once to obtain a trained model.
  • Based on the recognition result and the sample label, it can be determined whether to update the network parameters of the model.
  • When it is determined that the network parameters should be updated, a suitable parameter update algorithm is used to update them, and the model with the updated parameters is used to re-determine the recognition result, so as to decide, based on the re-determined recognition result and the sample label, whether to continue updating the network parameters.
  • When it is determined not to continue updating, the finally updated model is taken as the trained model.
  • A loss value can be computed from the recognition result and the sample label; the network parameters are updated while the loss value does not meet the preset condition, and the updating stops when the loss value meets the preset condition or the number of parameter updates reaches a set threshold, at which point the finally updated model is taken as the trained model.
  • the preset conditions may include, but are not limited to, at least one of the loss value being less than the set loss threshold, the change in the loss value converging, and the like. During implementation, the preset conditions may be set according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • The method of updating the network parameters of the model may be determined according to the actual situation, and may include but is not limited to at least one of gradient descent, Newton's method with momentum, etc., which is not limited here. A minimal training-loop sketch is given below.
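  • The following is a minimal training-loop sketch for steps S501 to S505. It uses plain cross entropy as a stand-in loss and stops when the loss drops below a threshold or a maximum number of updates is reached, mirroring the stopping rule above; the optimizer, learning rate and all thresholds are assumptions.

```python
import torch
from torch import nn

def train_model(model: nn.Module, loader, epochs: int = 10,
                loss_threshold: float = 0.05, max_updates: int = 100_000) -> nn.Module:
    """Update the network parameters at least once based on recognition results and sample labels."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()  # stand-in for the loss described in the text
    updates = 0
    for _ in range(epochs):
        for features, labels in loader:     # displacement features and sample labels
            logits = model(features)        # recognition result (per-sequence scores)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            updates += 1
            if loss.item() < loss_threshold or updates >= max_updates:
                return model                # the finally updated model is the trained model
    return model
```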
  • the following describes the application of the speaking state recognition method provided by the embodiment of the present disclosure in actual scenarios. Taking the speaking state recognition of a video stream containing people talking as an example, the speaking state recognition method of the embodiment of the present disclosure is explained.
  • Embodiments of the present disclosure provide a speaking state recognition method, which can be executed by a processor of a computer device.
  • The computer device may refer to a device with data processing capabilities such as a vehicle head unit (车机).
  • the speaking state recognition method may include at least the following two steps:
  • Step 1 Temporal feature construction.
  • The input video stream can be represented as [N, 720, 1280, 3], where N in the first dimension is the length of the video stream, 720 in the second dimension is the height of each image frame, 1280 in the third dimension is the width of each image frame, and 3 in the fourth dimension is the number of image channels.
  • N is the number of frames in the video stream. Sliding a window of 22 image frames over the video stream with a step of 1 yields multiple facial image frame sequences, which can be represented as [N-21, 22, 106, 2], where N-21 in the first dimension is the number of facial image frame sequences, 22 in the second dimension is the length of each facial image frame sequence, 106 in the third dimension is the number of face key points, and 2 in the fourth dimension is the two-dimensional coordinate of each key point. A sketch of this sliding-window construction follows below.
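  • The sketch below builds the [N-21, 22, 106, 2] tensor from per-frame key points with a step-1 sliding window; the function name and the random stand-in data are assumptions.

```python
import numpy as np

def window_keypoints(keypoints: np.ndarray, window: int = 22, step: int = 1) -> np.ndarray:
    """Stack sliding windows of per-frame face key points.

    `keypoints` holds the 106 two-dimensional face key points of every frame of
    the video stream, shape [N, 106, 2].
    """
    n = keypoints.shape[0]
    windows = [keypoints[i:i + window] for i in range(0, n - window + 1, step)]
    return np.stack(windows)                      # [N - window + 1, window, 106, 2]

if __name__ == "__main__":
    kps = np.random.rand(100, 106, 2).astype(np.float32)  # stand-in key points
    print(window_keypoints(kps).shape)                     # (79, 22, 106, 2)
```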
  • The displacement difference of each key point can be expressed as [x_pre_diff, y_pre_diff, x_next_diff, y_next_diff], where the first dimension is the abscissa displacement between the current image frame and the previous image frame, the second dimension is the ordinate displacement between the current image frame and the previous image frame, the third dimension is the abscissa displacement between the current image frame and the next image frame, and the fourth dimension is the ordinate displacement between the current image frame and the next image frame.
  • The resulting key point displacement features can be represented as [N-21, 20, 20, 5], where N-21 in the first dimension is the number of facial image frame sequences, 20 in the second dimension is the length of each input sequence, 20 in the third dimension is the number of mouth key points, and 5 in the fourth dimension is the feature dimension. A sketch of this feature construction follows below.
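  • The sketch below derives the [N-21, 20, 20, 5] features from the windowed key points of the previous sketch: previous/next coordinate differences for the 20 mouth key points plus one intra-frame lip-distance value, normalized by a per-window reference distance. The upper/lower-lip pairing used here is a placeholder assumption; only the general recipe follows the description.

```python
import numpy as np

MOUTH_IDX = np.arange(84, 104)  # the 20 mouth key points in the 106-point layout

def build_displacement_features(kp_windows: np.ndarray, ref_dist: np.ndarray,
                                lip_pairs=None) -> np.ndarray:
    """kp_windows: [W, 22, 106, 2]; ref_dist: [W] normalization denominators."""
    if lip_pairs is None:
        # hypothetical pairing: i-th mouth point paired with its opposite point
        lip_pairs = {i: len(MOUTH_IDX) - 1 - i for i in range(len(MOUTH_IDX))}
    mouth = kp_windows[:, :, MOUTH_IDX, :]                    # [W, 22, 20, 2]
    cur, prev, nxt = mouth[:, 1:-1], mouth[:, :-2], mouth[:, 2:]
    pre_diff = cur - prev                                     # diffs vs. previous frame
    next_diff = cur - nxt                                     # diffs vs. next frame
    opp = cur[:, :, [lip_pairs[i] for i in range(len(MOUTH_IDX))], :]
    lip_dist = np.abs(cur[..., 1] - opp[..., 1])[..., None]   # vertical lip distance
    feats = np.concatenate([pre_diff, next_diff, lip_dist], axis=-1)  # [W, 20, 20, 5]
    return feats / ref_dist[:, None, None, None]
```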
  • Step 2 Feature extraction model processing.
  • FIG. 8 is a schematic structural diagram of a speaking state recognition model provided by an embodiment of the present disclosure.
  • The speaking state recognition model structure includes two parts: a key point feature extraction backbone network (backbone) 81 and a temporal feature extraction branch 82.
  • The two parts are connected in series: the model input 831 is the input of the key point feature extraction backbone network 81, the backbone network output 832 of the key point feature extraction backbone network 81 is the input of the temporal feature extraction branch 82, and the output of the temporal feature extraction branch 82 is the model output speaking score 833.
  • the model input 831 can be [N-21, 20, 20, 5], the same as the output of step 1;
  • the backbone network output 832 can be [N-21, 64, 20, 1], where N-21 in the first dimension is the number of facial image frame sequences, 64 in the second dimension is the dimension of the spatio-temporal features, 20 in the third dimension is the number of mouth key points, and 1 in the fourth dimension is the feature dimension after intra-frame feature fusion;
  • the model output speaking score 833 can be [N-21, 2], where N-21 in the first dimension is the number of facial image frame sequences, and 2 in the second dimension corresponds to the first confidence representing the first (speaking) state and the second confidence representing the second (not-speaking) state.
  • The key point feature extraction backbone network 81 includes 4 convolution modules.
  • Each convolution module includes a convolution with a kernel of (1, 1) or (5, 1), batch normalization (BN), a rectified linear unit (ReLU) and residual connections (ResNets), and is used to learn the co-occurrence features of the 20 mouth key points within each image frame of the facial image frame sequence.
  • The co-occurrence features include but are not limited to mouth shape and lip distance. A sketch of this backbone is given below.
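  • The sketch below is an illustrative backbone with four conv + BN + ReLU modules and residual connections, using (1, 1) and (5, 1) kernels over the mouth key points within each frame. The channel widths and tensor layout ([B, 5, 20, T], with the 5 displacement features as input channels) are assumptions, since Figure 8 does not fully specify them here.

```python
import torch
from torch import nn

class ConvBNReLU(nn.Module):
    """Conv + BN + ReLU block with a residual connection (a sketch, not the exact layers of backbone 81)."""
    def __init__(self, cin: int, cout: int, kernel):
        super().__init__()
        pad = (kernel[0] // 2, kernel[1] // 2)
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, kernel, padding=pad),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )
        self.skip = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

class KeypointBackbone(nn.Module):
    """Four modules with (1, 1) or (5, 1) kernels over the 20 mouth key points."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBNReLU(5, 32, (1, 1)),
            ConvBNReLU(32, 32, (5, 1)),
            ConvBNReLU(32, 64, (1, 1)),
            ConvBNReLU(64, 64, (5, 1)),
        )

    def forward(self, x):           # x: [B, 5, 20, T]
        return self.blocks(x)       # [B, 64, 20, T]
```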
  • The temporal feature extraction branch 82 includes 5 convolution modules, global average pooling (GAP), a fully connected (FC) layer, a reshape layer, a dropout layer, a window classification layer (Cls) and softmax.
  • Each convolution module includes a convolution with a kernel of (1, 5), BN and ReLU.
  • The entire temporal feature extraction branch is used to learn inter-frame features and the global motion displacement information of the key points over the whole facial image frame sequence, and finally outputs the prediction score of whether the facial image frame sequence corresponds to speaking, that is, the predicted model output speaking score 833.
  • The model output speaking score 833 of the facial image frame sequence is used as the score of a specific image frame in that sequence, and the comparison between the model output speaking score 833 and a preset threshold can be used to determine whether that specific image frame is in a speaking state. For example, image frames whose model output speaking score 833 is greater than or equal to the preset threshold are determined to be speaking image frames, and image frames whose score is less than the preset threshold are determined to be non-speaking image frames. In practical applications, according to the requirements of detection accuracy, the preset threshold can be set to 0.7. Moreover, since the multiple facial image frame sequences are obtained from the video stream with a sliding window of step size 1, the corresponding specific image frames are adjacent; when predicting the speech start image frame and the speech end image frame in the video stream, the score change trend of adjacent image frames can also be used.
  • In the temporal feature extraction branch 82, the (1, 5) convolution kernel can be used to convolve along the length dimension of the facial image frame sequence, so that the spatial features of each image frame are fused with the spatial features of the two image frames before and after it; the above convolution is repeated 5 times to enlarge the receptive field, complete inter-frame feature fusion, and obtain the spatio-temporal features of each image frame.
  • Since this step occupies a certain amount of computing resources, the convolution kernel size and the number of repetitions can be increased to improve accuracy, at a corresponding cost in efficiency; considering both accuracy and hardware efficiency, the number of extractions can be set to 5 and the convolution kernel size to 5. A sketch of this branch is given below.
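  • The sketch below is an illustrative temporal branch: five (1, 5) conv + BN + ReLU modules, global average pooling, dropout and a 2-way classifier with softmax. Layer sizes and the dropout rate are assumptions; only the overall layout follows the text.

```python
import torch
from torch import nn

class TemporalBranch(nn.Module):
    """Inter-frame feature fusion and window classification (illustrative)."""
    def __init__(self, channels: int = 64, num_classes: int = 2):
        super().__init__()
        blocks = []
        for _ in range(5):  # repeated 5 times to enlarge the receptive field
            blocks += [nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*blocks)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.drop = nn.Dropout(0.5)
        self.cls = nn.Linear(channels, num_classes)

    def forward(self, x):                       # x: [B, 64, 20, T] from the backbone
        x = self.convs(x)                       # fuse features across neighbouring frames
        x = self.gap(x).flatten(1)              # [B, 64]
        x = self.drop(x)
        return torch.softmax(self.cls(x), dim=1)  # [B, 2] speaking / not-speaking scores
```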
  • the training of the speaking state recognition model shown in Figure 8 can be implemented in the following ways:
  • A first sample image frame sequence annotated with the speech start image frame and the speech end image frame is obtained; the first sample image frame sequence consists of continuous video frames. A sliding window with step size S and window size L is used to obtain sample facial image frame sequences. If all frames in a sample facial image frame sequence are in the speaking state, the label of that sample facial image frame sequence is determined to be 1; if none of its frames is in the speaking state, its label is determined to be 0. Samples containing only some speaking frames are not added to the training for now.
  • The entire model is trained end to end on the classification score, and the loss function is the margin softmax loss (Margin Softmax Loss). A minimal sketch of one common margin softmax variant is given below.
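  • The disclosure only names "Margin Softmax Loss"; the sketch below shows one common form, an additive margin subtracted from the target-class score before a scaled cross entropy. The specific variant, margin and scale values are assumptions, and the margin is only meaningful if the scores are in a bounded range.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MarginSoftmaxLoss(nn.Module):
    """Additive-margin softmax loss (illustrative variant)."""
    def __init__(self, margin: float = 0.35, scale: float = 30.0):
        super().__init__()
        self.m, self.s = margin, scale

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(target, num_classes=logits.size(1)).to(logits.dtype)
        adjusted = self.s * (logits - self.m * one_hot)  # penalise the target class by m
        return F.cross_entropy(adjusted, target)

# usage: criterion = MarginSoftmaxLoss(); loss = criterion(model(features), labels)
```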
  • the continuous first sample image frame sequence can be divided into a speaking interval and a non-speaking interval, and sample facial image frame sequences are selected from the two intervals respectively.
  • The detection box and key points corresponding to each face image are first obtained through face detection and key point positioning, and the frames are then processed in a sliding-window manner to obtain facial image frame sequences of length L. Motion features of each facial image frame sequence are constructed from the mouth key points; after the features are input into the model, a score predicting whether the facial image frame sequence corresponds to speaking is obtained.
  • The score of the facial image frame sequence is used as the score of a specific image frame (usually the 21st frame). If the score of this frame is higher than the preset threshold, it is determined that the target object is speaking, and the time points at which speaking starts and ends in the video stream are thereby determined. A sketch of mapping window scores back to frame indices is given below.
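  • The sketch below maps per-window speaking scores back to video frame indices for a step-1 sliding window, assigning each window's score to its 21st frame and thresholding it at 0.7; the function name and defaults are assumptions.

```python
from typing import List

def window_scores_to_frames(scores: List[float], window: int = 22,
                            score_frame: int = 21, thr: float = 0.7) -> List[int]:
    """Return indices of frames classified as speaking.

    With a sliding step of 1, window i covers frames [i, i + window - 1] and its
    score is assigned to the window's 21st frame (index i + score_frame - 1).
    """
    speaking_frames = []
    for i, s in enumerate(scores):
        frame_idx = i + score_frame - 1  # the specific frame the score stands for
        if s >= thr:
            speaking_frames.append(frame_idx)
    return speaking_frames
```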
  • In this way, only the mouth key points in the video stream are used as input: the video stream is processed with a sliding window and key point motion features are constructed for model prediction, so that the start frame and end frame of a person speaking in the video stream can be predicted in real time with a small amount of model computation and resource occupation, with good recognition accuracy for various complex non-speaking mouth movements.
  • In particular, when a user uses voice interaction in a smart cockpit and the wind outside the window, chatter inside the car or loud music reduces the accuracy of speech recognition, the speaking state recognition method provided by the embodiments of the present disclosure can be combined with speech for multi-modal recognition; using visual features can effectively reduce sound interference, provide a more accurate speaking interval, improve speech recognition accuracy, and reduce missed detections and false alarms.
  • the above-mentioned model output speaking score 833 may correspond to the recognition result in the foregoing embodiments;
  • the motion feature may correspond to the displacement feature in the foregoing embodiments;
  • the displacement difference may correspond to the inter-frame displacement information in the foregoing embodiments;
  • the upper and lower lip distance feature may correspond to the intra-frame difference information in the foregoing embodiments;
  • the sample video frame sequence may correspond to the sample facial image frame sequence in the foregoing embodiments.
  • The writing order of the steps does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • embodiments of the present disclosure provide a speaking state recognition device.
  • The device includes the units it comprises and the parts included in each unit, and can be implemented by a processor in a computer device; of course, it can also be implemented by logic circuits in some embodiments. During implementation, the processor may be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP) or a field programmable gate array (FPGA), etc.
  • Figure 9 is a schematic structural diagram of a speaking state recognition device provided by an embodiment of the present disclosure.
  • The speaking state recognition device 900 includes: a first acquisition part 910, a second acquisition part 920, a first determination part 930 and a second determination part 940, wherein:
  • the first acquisition part 910 is configured to acquire a facial image frame sequence of the target object
  • the second acquisition part 920 is configured to acquire the mouth key point information of each image frame in the facial image frame sequence
  • the first determining part 930 is configured to determine, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the facial image frame sequence, the displacement characteristics representing the position of the mouth key points on the face. Position changes between multiple image frames in a sequence of image frames;
  • the second determination part 940 is configured to determine the recognition result of the speaking state of the target object according to the displacement feature.
  • The second acquisition part 920 includes: a first detection sub-part configured to perform face key point detection on each facial image frame in the facial image frame sequence to obtain the mouth key point information in each facial image frame.
  • The first acquisition part 910 includes: a first acquisition sub-part configured to sequentially take out, in a sliding-window manner, image frame sequences of a preset length from a video stream containing the facial information of the target object as facial image frame sequences of the target object, wherein the sliding step size of the sliding window is not less than 1 and not greater than the preset length.
  • In some embodiments, the facial image frame sequence includes a plurality of the facial image frames, and the first determination part 930 includes: a first execution sub-part configured to perform the following steps for each facial image frame: determining the inter-frame displacement information of each mouth key point according to the mouth key point information of that key point in the facial image frame and in the adjacent frames of the facial image frame; determining the intra-frame difference information of the plurality of mouth key points in the facial image frame according to the mouth key point information corresponding to the plurality of mouth key points in the frame; and determining the displacement features of the mouth key points corresponding to the facial image frame based on the inter-frame displacement information and the intra-frame difference information of the plurality of mouth key points; and a first determination sub-part configured to determine the displacement features of the mouth key points corresponding to the facial image frame sequence according to the displacement features of the mouth key points corresponding to the plurality of facial image frames in the sequence.
  • The first determination sub-part includes: a first determination unit configured to determine the eye-mouth distance of the target object in each image frame in the facial image frame sequence; a second determination unit configured to determine a reference distance according to the eye-mouth distance of the target object in each image frame in the facial image frame sequence; a first processing unit configured to use the reference distance as a normalization denominator to normalize the inter-frame displacement information and the intra-frame difference information of each of the plurality of mouth key points, obtaining processed inter-frame displacement information and processed intra-frame difference information; and a third determination unit configured to determine the displacement features of the mouth key points corresponding to the facial image frame based on the processed inter-frame displacement information and processed intra-frame difference information of each of the plurality of mouth key points. A sketch of this normalization is given below.
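  • The sketch below computes a per-window reference distance as the maximum distance between the mean eye key point and the mean mouth key point, and divides the features by it. The eye index range is a placeholder assumption, since the exact eye indices of the 106-point layout are not given in this excerpt.

```python
import numpy as np

# hypothetical index groups in the 106-point layout; the eye indices are placeholders
EYE_IDX = np.arange(52, 74)
MOUTH_IDX = np.arange(84, 104)

def reference_distance(kp_window: np.ndarray) -> float:
    """kp_window: [T, 106, 2]; reference = max eye-mouth distance over the window."""
    eyes = kp_window[:, EYE_IDX].mean(axis=1)     # [T, 2]
    mouth = kp_window[:, MOUTH_IDX].mean(axis=1)  # [T, 2]
    return float(np.linalg.norm(eyes - mouth, axis=1).max())

def normalise(features: np.ndarray, kp_window: np.ndarray) -> np.ndarray:
    """Divide inter-frame displacement and intra-frame difference features by the reference distance."""
    return features / reference_distance(kp_window)
```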
  • The second determination part 940 includes: a first processing sub-part configured to process the displacement features with a trained key point feature extraction network to obtain the spatial features of the facial image frame sequence; a second processing sub-part configured to process the spatial features with a trained temporal feature extraction network to obtain the spatio-temporal features of the facial image frame sequence; and a first recognition sub-part configured to determine the recognition result of the speaking state of the target object based on the spatio-temporal features.
  • The first recognition sub-part includes: a first recognition unit configured to determine, according to the spatio-temporal features, the recognition result of the speaking state of the target object corresponding to the facial image frame sequence as the recognition result of the speaking state of the target object in the last image frame of the facial image frame sequence. The device further includes: a fifth determination part configured to determine the start frame and the end frame of the target object speaking according to the recognition results of the speaking state in the last image frames of the facial image frame sequences taken out with multiple sliding windows.
  • The recognition result of the speaking state includes a first confidence that the target object is in a first state representing that it is speaking, or a second confidence that the target object is in a second state representing that it is not speaking; the fifth determination part includes: a second execution sub-part configured to treat each image frame in the facial image frame sequence as an image frame to be determined and to perform one of the following steps for that image frame:
  • when the first confidence corresponding to the image frame to be determined is greater than or equal to a first preset threshold and the first confidence corresponding to its previous image frame in the facial image frame sequence is less than the first preset threshold, the image frame to be determined is used as the start frame of the target object speaking;
  • when the first confidence corresponding to the image frame to be determined is greater than or equal to the first preset threshold and the first confidence corresponding to its subsequent image frame in the facial image frame sequence is less than the first preset threshold, the image frame to be determined is used as the end frame of the target object speaking;
  • when the second confidence corresponding to the image frame to be determined is less than a second preset threshold and the second confidence corresponding to its previous image frame in the facial image frame sequence is greater than or equal to the second preset threshold, the image frame to be determined is used as the start frame of the target object speaking;
  • when the second confidence corresponding to the image frame to be determined is less than the first preset threshold and the second confidence corresponding to its subsequent image frame in the facial image frame sequence is greater than or equal to the second preset threshold, the image frame to be determined is used as the end frame of the target object speaking.
  • the apparatus further includes: a first training part configured to train the key point feature extraction network and the temporal feature extraction network based on a training sample set, wherein the training sample set includes A sequence of consecutive video frames that has been annotated with the speaking state of an object in each contained video frame.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment.
  • the functions or included parts of the device provided by the embodiments of the present disclosure can be configured to perform the methods described in the above method embodiments.
  • For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
  • embodiments of the present disclosure provide a model training device.
  • The device includes the units it comprises and the parts included in each unit, and can be implemented by a processor in a computer device; of course, it can also be implemented by logic circuits in some embodiments. During implementation, the processor may be a CPU, MPU, DSP or FPGA, etc.
  • Figure 10 is a schematic structural diagram of a model training device provided by an embodiment of the present disclosure.
  • The model training device 1000 includes: a third acquisition part 1010, a fourth acquisition part 1020, a third determination part 1030, a fourth determination part 1040 and an update part 1050, wherein:
  • the third acquisition part 1010 is configured to acquire a sequence of sample facial image frames of the target object, wherein the sequence of sample facial image frames is annotated with a sample label characterizing the speaking state of the target object;
  • the fourth acquisition part 1020 is configured to acquire the mouth key point information of each sample image frame in the sample facial image frame sequence
  • the third determination part 1030 is configured to determine, based on the mouth key point information, the displacement characteristics of the mouth key points corresponding to the sample facial image frame sequence, the displacement characteristics representing the position of the mouth key points in the Position changes between multiple sample image frames in the sample facial image frame sequence;
  • the fourth determination part 1040 is configured to use the recognition result generation network in the model to be trained to determine the recognition result of the speaking state of the target object according to the displacement feature;
  • the update part 1050 is configured to update the network parameters of the model at least once based on the recognition result and the sample label to obtain the trained model.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment.
  • the functions or included parts of the device provided by the embodiments of the present disclosure can be configured to perform the methods described in the above method embodiments.
  • For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
  • An embodiment of the present disclosure provides a vehicle, including:
  • a vehicle-mounted camera configured to capture a facial image frame sequence containing the target object; and
  • a vehicle head unit connected to the vehicle-mounted camera and configured to: obtain the facial image frame sequence of the target object from the vehicle-mounted camera; obtain the mouth key point information of each image frame in the facial image frame sequence; determine, based on the mouth key point information, the displacement features of the mouth key points corresponding to the facial image frame sequence, the displacement features representing the position changes of the mouth key points between multiple image frames in the facial image frame sequence; and determine the recognition result of the speaking state of the target object according to the displacement features. An end-to-end sketch of this pipeline is given below.
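  • The sketch below ties the steps together on the vehicle side: camera frames to face key points, then windowed displacement features, then per-window speaking scores. The callables `detect_keypoints`, `build_features` and `model` are stand-ins for the key point detector, feature construction and trained model; all names and defaults are assumptions.

```python
import numpy as np

def recognize_speaking_frames(frames, detect_keypoints, build_features, model,
                              window: int = 22, thr: float = 0.7):
    """Return indices of frames in which the target object is recognised as speaking.

    detect_keypoints: frame -> [106, 2] key points
    build_features:   [window, 106, 2] key point slice -> model input
    model:            features -> speaking confidence in [0, 1]
    """
    keypoints = np.stack([detect_keypoints(f) for f in frames])   # [N, 106, 2]
    speaking = []
    for i in range(len(frames) - window + 1):                     # sliding window, step 1
        score = float(model(build_features(keypoints[i:i + window])))
        if score >= thr:                                          # threshold, e.g. 0.7
            speaking.append(i + window - 1)                       # score stands for the last frame
    return speaking
```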
  • In the embodiments of the present disclosure, a "part" may be part of a circuit, part of a processor, part of a program or software, etc.; of course, it may also be a unit, and it may be modular or non-modular.
  • If the above method is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • The software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc and other media that can store program code.
  • the embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination of hardware, software, and firmware.
  • An embodiment of the present disclosure provides a computer device, including a memory and a processor.
  • the memory stores a computer program that can be run on the processor.
  • When the processor executes the program, some or all of the steps in the above method are implemented.
  • Embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, some or all of the steps in the above method are implemented.
  • the computer-readable storage medium may be transient or non-transitory.
  • Embodiments of the present disclosure provide a computer program, which includes computer-readable code; when the computer-readable code runs in a computer device, the processor in the computer device executes some or all of the steps for implementing the above method.
  • Embodiments of the present disclosure provide a computer program product.
  • The computer program product includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the above method are implemented.
  • the computer program product may be implemented by hardware, software or a combination thereof in some embodiments.
  • the computer program product is embodied as, for example, a computer storage medium.
  • the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) or the like.
  • Figure 11 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure.
  • the hardware entity of the computer device 1100 includes: a processor 1101, a communication interface 1102 and a memory 1103, where:
  • Processor 1101 generally controls the overall operation of computer device 1100 .
  • Communication interface 1102 may enable the computer device to communicate with other terminals or servers over a network.
  • The memory 1103 is configured to store instructions and applications executable by the processor 1101, and can also cache data to be processed or already processed by the processor 1101 and by various parts of the computer device 1100 (for example, image data, audio data, voice communication data and video communication data); it can be implemented by flash memory (FLASH) or random access memory (RAM).
  • Data can be transmitted between the processor 1101, the communication interface 1102 and the memory 1103 through the bus 1104.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units described above is only a division by logical function.
  • The coupling, direct coupling or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection of the devices or units may be electrical, mechanical or in other forms.
  • The units described above as separate components may or may not be physically separated; the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • All functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve as a unit separately, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • Products applying the technical solution of this application clearly inform users of the personal information processing rules and obtain the individual's separate consent before processing personal information.
  • Before processing sensitive personal information, products applying the technical solution of this application obtain the individual's separate consent and meet the requirement of "explicit consent", for example by setting clear and conspicuous signs on personal information collection devices such as cameras to inform users that they have entered the scope of personal information collection and that personal information will be collected.
  • The personal information processing rules may include the personal information processor, the purposes of personal information processing, the processing methods, the types of personal information processed, etc.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • When the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes: a removable storage device, a read-only memory (ROM), a magnetic disk, an optical disc and other media that can store program code.
  • the integrated units mentioned above in this application are implemented in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence or in the part contributing to the related technology, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of this application.
  • the aforementioned storage media include: mobile storage devices, ROMs, magnetic disks or optical disks and other media that can store program codes.
  • Embodiments of the present disclosure disclose a speaking state recognition method and a model training method, a device, a vehicle, a medium, a computer program and a computer program product, wherein the speaking state recognition method includes: obtaining a facial image frame sequence of a target object; obtaining the mouth key point information of each image frame in the facial image frame sequence; determining, based on the mouth key point information, the displacement features of the mouth key points corresponding to the facial image frame sequence, the displacement features representing the position changes of the mouth key points between multiple image frames in the facial image frame sequence; and determining the recognition result of the speaking state of the target object based on the displacement features.
  • Since the displacement features of the mouth key points corresponding to the facial image frame sequence can represent the position change process of the target object's mouth key points in the sequence, determining the recognition result of the speaking state based on the displacement features allows the speaking state of the target object to be identified accurately, thereby improving the accuracy of speaking state recognition.
  • In addition, by using the displacement features of the mouth key points, the above solution reduces the amount of computation required for speaking state recognition, thereby lowering the hardware requirements of the computer device that performs the speaking state recognition method.
  • Good recognition results can be achieved for facial image frames with different face shapes, textures and other appearance information, thereby improving the generalization ability of speaking state recognition.


Abstract

A speaking state recognition method and a model training method, a device, a vehicle, a medium, a computer program and a computer program product. The speaking state recognition method includes: obtaining a facial image frame sequence of a target object (S101); obtaining mouth key point information of each image frame in the facial image frame sequence (S102); determining, based on the mouth key point information, displacement features of the mouth key points corresponding to the facial image frame sequence, the displacement features representing position changes of the mouth key points between multiple image frames in the facial image frame sequence (S103); and determining a recognition result of the speaking state of the target object according to the displacement features (S104).

Description

说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品
相关申请的交叉引用
本公开实施例基于申请号为202210772934.1、申请日为2022年06月30日、申请名称为“说话状态识别方法及模型训练方法、装置、车辆、介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。
技术领域
本公开涉及但不限于信息技术领域,尤其涉及一种说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品。
背景技术
唇动检测技术,可以利用计算机视觉技术从视频图像中识别人脸,提取人脸的嘴部区域的变化特征,从而识别嘴部区域运动状态。
发明内容
有鉴于此,本公开实施例提供一种说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品。
本申请实施例的技术方案是这样实现的:
本公开实施例提供一种说话状态识别方法,所述方法由电子设备执行,所述方法包括:获取目标对象的面部图像帧序列;获取所述面部图像帧序列中各图像帧的嘴部关键点信息;基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;根据所述位移特征确定所述目标对象的说话状态的识别结果。
本公开实施例提供一种模型训练方法,所述方法由电子设备执行,所述方法包括:
获取目标对象的样本面部图像帧序列,其中,所述样本面部图像帧序列标注有表征所述目标对象的说话状态的样本标签;
获取所述样本面部图像帧序列中各样本图像帧的嘴部关键点信息;
基于所述嘴部关键点信息,确定所述样本面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述样本面部图像帧序列中的多个样本图像帧之间的位置变化;
利用待训练的模型中的识别结果生成网络,根据所述位移特征确定所述目标对象的说话状态的识别结果;
基于所述识别结果和所述样本标签,对所述模型的网络参数进行至少一次更新,得到训练后的所述模型。
本公开实施例提供一种说话状态识别装置,所述装置包括:
第一获取部分,被配置为获取目标对象的面部图像帧序列;
第二获取部分,被配置为获取所述面部图像帧序列中各图像帧的嘴部关键点信息;
第一确定部分,被配置为基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;
第二确定部分,被配置为根据所述位移特征确定所述目标对象的说话状态的识别结果。
本公开实施例提供一种模型训练装置,包括:
第三获取部分,被配置为获取目标对象的样本面部图像帧序列,其中,所述样本面部图像帧序列标注有表征所述目标对象的说话状态的样本标签;
第四获取部分,被配置为获取所述样本面部图像帧序列中各样本图像帧的嘴部关键点信息;
第三确定模块,被配置为基于所述嘴部关键点信息,确定所述样本面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述样本面部图像帧序列中的多个样本图像帧之间的位置变化;
第四确定部分,被配置为利用待训练的模型中的识别结果生成网络,根据所述位移特征确定所述目 标对象的说话状态的识别结果;
更新部分,被配置为基于所述识别结果和所述样本标签,对所述模型的网络参数进行至少一次更新,得到训练后的所述模型。
本公开实施例提供一种计算机设备,包括存储器和处理器,所述存储器存储有可在处理器上运行的计算机程序,所述处理器执行所述程序时实现上述方法中的部分或全部步骤。
本公开实施例提供一种车辆,包括:
车载相机,用于拍摄包含目标对象的面部图像帧序列;
车机,与所述车载相机连接,用于从所述车载相机获取所述目标对象的面部图像帧序列;获取所述面部图像帧序列中各图像帧的嘴部关键点信息;基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;根据所述位移特征确定所述目标对象的说话状态的识别结果。
本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现上述方法中的部分或全部步骤。
本公开实施例提供一种计算机程序,包括计算机可读代码,在所述计算机可读代码在计算机设备中运行的情况下,所述计算机设备中的处理器执行用于实现上述方法中的部分或全部步骤。
本公开实施例提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序被计算机读取并执行时,实现上述方法中的部分或全部步骤。
本公开实施例中,首先,获取目标对象的面部图像帧序列,获取面部图像帧序列中各图像帧的嘴部关键点信息;这样,能够获取目标对象在面部图像帧序列中各图像帧的嘴部关键点信息;其次,基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化;这样,面部图像帧序列对应的嘴部关键点的位移特征,能够表示目标对象在面部图像帧序列中嘴部关键点的位置变化过程;最后,根据位移特征确定目标对象的说话状态的识别结果;这样,能够提升确定出的目标对象的说话状态的识别结果的精确度。在本公开实施例中,由于面部图像帧序列对应的嘴部关键点的位移特征,能够表示目标对象在面部图像帧序列中嘴部关键点的位置变化过程,根据位移特征确定目标对象的说话状态的识别结果,能够精确识别目标对象的说话状态,从而能够提升说话状态的识别的精确度。并且,相较于利用面部图像帧裁剪得到的嘴部区域图像序列进行说话状态识别,上述方案利用嘴部关键点的位移特征,能够降低说话状态识别所需的计算量,从而降低执行说话状态识别方法的计算机设备的硬件要求。此外,利用嘴部关键点的位移特征,对不同脸型、纹理等外观信息的面部图像帧都能取得良好的识别效果,从而提高了说话状态识别的泛化能力。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开的技术方案。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1为本公开实施例提供的一种说话状态识别方法的实现流程示意图;
图2为本公开实施例提供的一种说话状态识别方法的实现流程示意图;
图3为本公开实施例提供的一种脸部关键点示意图;
图4为本公开实施例提供的一种说话状态识别方法的实现流程示意图;
图5为本公开实施例提供的一种说话状态识别方法的实现流程示意图;
图6为本公开实施例提供的一种说话状态识别方法的实现流程示意图;
图7为本公开实施例提供的一种模型训练方法的实现流程示意图;
图8为本公开实施例提供的一种说话状态识别模型的组成结构示意图;
图9为本公开实施例提供的一种说话状态识别装置的组成结构示意图;
图10为本公开实施例提供的一种模型训练装置的组成结构示意图;
图11为本公开实施例提供的一种计算机设备的硬件实体示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面结合附图和实施例对本申请的技术方案进一 步详细阐述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
所涉及的术语“第一/第二/第三”仅仅是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一/第二/第三”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请的目的,不是旨在限制本申请。
车舱智能化包括多模交互、个性化服务、安全感知等方面的智能化,是当前汽车行业发展的重要方向。其中,车舱多模交互意在为乘客提供舒适的交互体验,多模交互的方式包括但不限于语音识别、手势识别等。然而,在车舱实际应用中,例如存在窗外风声、车内闲聊等声音干扰的情况下,语音识别的准确度不高。因此,引入利用计算机视觉特征的唇动检测,有利于识别出更精确的说话状态的区间,从而提升语音识别精度。但是,本公开实施例的发明人发现相关技术的唇动检测方案存在局限:一方面,将嘴部区域的图像序列作为模型输入的方案,通过人脸检测找出图像中人脸对应的位置,把图像中嘴部区域切割出来,得到嘴部区域图像的图像序列,将该图像序列输入卷积神经网络进行特征提取,并将特征输入时序预测网络进行分类。由于嘴部区域图像的图像序列对嘴部运动信息不敏感,使得说话状态识别的精准度不高,且三维卷积需要消耗大量计算资源,对硬件要求也很高,难以大范围应用。另一方面,根据上下嘴唇点的距离与阈值进行判断,根据判断结果确定是否处于说话状态的方案,一些张嘴但不说话动作容易引起误报,说话状态识别的精准度不高。
本公开实施例提供一种说话状态识别方法,该方法可以由计算机设备的处理器执行。其中,计算机设备指的可以是车机、服务器、笔记本电脑、平板电脑、台式计算机、智能电视、机顶盒、移动设备(例如移动电话、便携式视频播放器、个人数字助理、专用消息设备、便携式游戏设备)等具备数据处理能力的设备。图1为本公开实施例提供的一种说话状态识别方法的实现流程示意图,如图1所示,该方法包括如下步骤S101至步骤S104:
步骤S101,获取目标对象的面部图像帧序列。
计算机设备获取到多个图像帧,多个图像帧由摄像头等采集组件对目标对象拍摄得到,按照每一图像帧对应的采集时间进行排序,或者根据图像帧的采集顺序,实时地将采集到的图像帧加入至目标对象的面部图像序列。得到目标对象的面部图像帧序列。面部图像帧序列的长度可以是不固定的。在实施时,面部图像帧序列的长度可以为40帧、50帧或100帧。计算机设备获取多个图像帧的方式,可以是由本计算机设备通过调用摄像头获取的,也可以是从其他计算机设备获取的;例如本计算机设备为车辆,可以通过车载相机获取图像,也可以利用与移动终端的无线传输等方式,获取移动终端采集的图像。需要说明的是,面部图像帧序列的至少一个图像帧可以来源于视频流,一个视频流可以包括多个视频帧,每一视频帧对应一个图像帧。
在一些实施方式中,可以根据预设规则从视频中获取与每一目标面部图像帧对应的至少一个面部图像帧序列。其中,预设规则可以是滑动窗口法,从滑动窗口中多次取出面部图像帧序列,也就是利用预设的滑动步长,每一次从连续的多个面部图像帧中选取连续的预设数量个图像帧为面部图像帧序列,在完成一个面部图像帧序列的处理(即完成基于该面部图像帧序列的说话状态识别)之后,将滑动窗口沿预设方向、按照滑动步长滑动,取出滑动窗口内的面部图像帧,形成新的面部图像帧序列;可以是以固定间隔或不固定间隔选取图像帧为面部图像帧序列。目标面部图像帧的图像画面可以包含目标对象的部分或全部的面部,且至少包括嘴部;目标对象通常是人类,但也可以是其他具有表达能力的动物,例如猩猩。并且,目标面部图像帧可以理解为待识别说话状态的图像帧。
这样,能够获取目标对象的面部图像帧序列。
步骤S102,获取面部图像帧序列中各图像帧的嘴部关键点信息。
针对至少一个面部图像帧序列中的每一面部图像帧序列,面部图像帧序列包括至少一个图像帧,可以对该面部图像帧序列中的至少一个图像帧进行关键点检测,得到至少包括在图像帧中各嘴部关键点的位置信息的嘴部关键点信息。
在一些实施方式中,获取面部图像帧序列中各图像帧的嘴部关键点信息,包括:针对面部图像帧序列中的每一面部图像帧进行人脸关键点检测,得到每一面部图像帧中的嘴部关键点信息。
每一面部图像帧中的嘴部关键点信息,可以采用任意合适的方式得到。例如,可以利用已训练的关键点检测模型对面部图像帧进行人脸关键点检测。在实施时,卷积神经网络、循环神经网络等对面部图 像帧进行关键点检测得到。
在一些实施方式中,位置信息可以通过位置参数表示,例如以图像坐标系中的二维坐标表示,二维坐标包括宽度(横坐标)和高度(纵坐标);位移特征可以表示关键点在面部图像帧序列的运动特征。关键点的位置信息与嘴部形状相关,同一关键点在不同图像帧的位置信息随嘴部形状变化而变化。
步骤S103,基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化。
根据嘴部关键点信息,确定能够表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化的位移特征。
在一些实施方式中,在面部图像帧序列包括至少两个图像帧的情况下,针对每一图像帧,可以计算嘴部关键点在该图像帧和在面部图像帧序列中与该图像帧相邻的第一设定数量的图像帧之间的位置信息的差异信息,并根据图像帧的嘴部关键点信息得到位移特征,例如,可以根据设定顺序对差异信息进行排序,将得到的结果作为位移特征。其中,第一设定数量可以为一个,也可以为两个或以上,与该图像帧相邻的第一设定数量的图像帧可以是在该图像帧之前和/或在该图像帧之后的连续的图像帧。
例如,第一设定数量为一个,位移特征可以包括以下至少之一:该图像帧与前一图像帧之间的位置信息的差异信息;该图像帧与后一图像帧之间的位置信息的差异信息。以位移特征为该图像帧与前一图像帧之间的位置信息的差异信息,每一图像帧包括4个嘴部关键点为例,嘴部关键点在该图像帧的位置信息分别为(x1,y1)、(x2,y2)、(x3,y3)、(x4,y4),嘴部关键点在前一图像帧的位置信息分别为(x'1,y'1)、(x'2,y'2)、(x'3,y'3)、(x'4,y'4),得到的位移特征为[(x'1-x1,y'1-y1),(x'2-x2,y'2-y2),(x'3-x3,y'3-y3),(x'4-x4,y'4-y4)]。
步骤S104,根据位移特征确定目标对象的说话状态的识别结果。
利用面部图像帧序列对应的位移特征,对目标对象的说话状态进行识别,得到识别结果,识别结果表征目标对象在面部图像帧序列的设定图像帧时是否处于正在说话状态。
目标对象的说话状态可以采用任意合适的识别方式得到,例如,可以采用用于分类的神经网络对位移特征进行分类得到;又例如,可以通过预先设置规则对位移特征进行匹配得到。
目标对象的说话状态的识别结果,可以表示在设定图像帧时,目标对象是否处于正在说话的状态。其中,设定图像帧可以是图像帧序列中设定序号的图像帧,包括但不限于第一帧、第二帧、最后一帧或倒数第二帧。
识别结果包括任意合适的能够描述目标对象是否处于说话状态的信息,例如,可以是直接描述目标对象是否处于正在说话状态的信息,也可以包括间接描述目标对象是否处于正在说话状态的信息,例如置信度。这里,目标对象处于说话状态,表示对应的目标图像帧是对正在说话的目标对象拍摄得到的图像帧;目标对象处于未在说话状态,表示对应的目标图像帧是对未在说话的目标对象拍摄得到的图像帧。
这样,能够提升确定出的目标对象的说话状态的识别结果的精确度。
在本公开实施例中,由于面部图像帧序列对应的嘴部关键点的位移特征,能够表示目标对象在面部图像帧序列中嘴部关键点的位置变化过程,根据位移特征确定目标对象的说话状态的识别结果,能够精确识别目标对象的说话状态,从而能够提升说话状态的识别的精确度。并且,相较于利用嘴部区域图像序列进行说话状态识别,利用嘴部关键点的位移特征进行说话状态识别,能够降低说话状态识别所需的计算量,从而降低执行说话状态识别方法的计算机设备的硬件要求。此外,利用嘴部关键点的位移特征,对不同脸型、纹理等外观信息的面部图像帧都能取得良好的识别效果,从而提高了说话状态识别的泛化能力。
在一些实施例中,在根据位移特征确定目标对象的说话状态的识别结果之后,可以根据识别结果从面部图像帧序列来源的视频流中取出目标对象处于说话状态的图像帧序列。这样,能够提升从视频流中选取目标对象处于正在说话状态的图像帧序列的精准度。并且,在利用识别结果从视频流中选取的图像帧序列进行唇语识别时,还能够提升唇语识别的准确度,降低唇语识别的图像处理过程所需的计算量。
在一些实施方式中,在面部图像帧序列包括多个面部图像帧的情况下,上述步骤S103可以通过图2所示的步骤实现。图2为本公开实施例提供的一种说话状态识别方法的实现流程示意图,结合图2所示的步骤进行以下说明:
步骤S1031,针对每一面部图像帧,执行以下步骤:根据每一嘴部关键点在面部图像帧和面部图像帧的相邻帧中的嘴部关键点信息,确定每一嘴部关键点的帧间位移信息;根据面部图像帧中的多个嘴部关键点对应的嘴部关键点信息,确定面部图像帧中的多个嘴部关键点的帧内差异信息;基于多个嘴部关键点各自的帧间位移信息以及帧内差异信息,确定面部图像帧对应的嘴部关键点的位移特征。
在一些实施例中,对于每一面部图像帧,根据嘴部关键点在该面部图像帧中和在面部图像帧序列中 与该面部图像帧相邻的第二设定数量的面部图像帧之间的位置信息的差异信息,确定该嘴部关键点的帧间位移信息。其中,第二设定数量可以为一个,也可以为两个或以上,与该面部图像帧相邻的第二设定数量的图像帧可以是在该面部图像帧之前和/或在该面部图像帧之后的连续的面部图像帧。以第二设定数量为两个,该面部图像帧在面部图像帧序列中的序号为20为例进行说明,与该面部图像帧相邻的第二设定数量的图像帧可以是面部图像帧序列中序号为18、19、21、22的图像帧。
在一些实施方式中,面部图像帧之间的位置信息的差异信息包括但不限于:第一高度差、第一宽度差等中的至少之一;第一宽度差为该嘴部关键点在图像帧帧间的宽度差值,第一高度差为该嘴部关键点在图像帧帧间的高度差值。在实施时,可以将面部图像帧序列中在后的图像帧的位置信息作为被减数,在前的图像帧的位置信息作为减数;也可以将面部图像帧序列中在前的图像帧的位置信息作为被减数,在后的图像帧的位置信息作为减数。
在一些实施例中,针对每一面部图像帧,计算嘴部关键点所属的预设关键点对在该面部图像帧的第二高度差、第二宽度差等中的至少之一,得到该预设关键点对中的每一嘴部关键点在面部图像帧的帧内差异信息。其中,预设关键点对包括两个关键点,在设置预设关键点对时通常考虑关键点在图像中的位置信息,也就是说,属于同一预设关键点对的两个关键点之间满足设定位置关系;例如,将分别位于上下嘴唇的两个关键点作为一个关键点对。实际应用中,可以将图像中宽度的差异信息小于预设值的两个关键点确定为预设关键点对。
在一些实施方式中,一个嘴部关键点可以分别与两个或以上的关键点构成预设关键点对,也就是说,每一嘴部关键点可以属于至少一个关键点对。此时,分别确定该嘴部关键点所属每一关键点对的第二高度差,并可以通过至少两个第二高度差加权计算或取最值的方式,得到该嘴部关键点在该面部图像帧的帧内差异信息。图3为本公开实施例提供的一种脸部关键点示意图,以图3示出的106点脸部关键点示意图为例,包括0-105号共106个关键点,可以描述人脸的脸部轮廓、眉毛、眼睛、鼻子、嘴巴等特征,其中的84至103号关键点是用于描述嘴巴的嘴部关键点。在实施时,86号关键点可以分别与103号关键点和94号关键点构成预设关键点对,也就是说,86号关键点可以属于两个预设关键点对,分别计算得到两个第二高度差,再通过加权求和确定86号关键点在该面部图像帧的帧内差异信息。这样,可以改善因关键点检测误差导致的位移特征计算偏差,基于位移特征进行说话状态识别,能够提升说话状态识别的精准度。
对于每一面部图像帧,基于面部图像帧中的每一嘴部关键点的帧内差异信息和帧间位移信息,通过顺序拼接或加权计算的方式确定该面部图像帧的位移特征。这样,基于所有关键点在该面部图像帧的帧间位移信息和帧内差异信息,可以确定该面部图像帧的位移特征。例如,每一嘴部关键点在位移特征中对应一个5维特征,5维特征中的前4维为帧间位移信息,分别是该图像帧和前一图像帧的宽度差、该图像帧和前一图像帧的高度差、该图像帧和后一图像帧的宽度差、该图像帧和后一图像帧的高度差,第5维为帧内差异信息,是预设关键点对在该图像帧的第二高度差。
步骤S1032,根据面部图像帧序列中的多个面部图像帧分别对应的嘴部关键点的位移特征,确定面部图像帧序列对应的嘴部关键点的位移特征。
在一些实施方式中,可以根据设定顺序对多个面部图像帧分别对应的嘴部关键点的位移特征进行排序,得到面部图像帧序列对应的嘴部关键点的位移特征。
在本公开实施例中,帧内差异信息可以表示满足设定关系的嘴部关键点之间的差异,提升每一面部图像帧中的嘴部形状识别的准确度;帧间位移信息可以表示在图像帧序列对应的说话过程中嘴部关键点的帧间变化过程;这样,利用每一面部图像帧中的帧内差异信息和帧间位移信息,可以更好地提取说话过程中嘴部形状的变化特征,进而能够提升说话状态识别的精确度。
在一些实施方式中,步骤S1031,可以包括如下步骤S10311至步骤S10314:
步骤S10311:确定面部图像帧序列中各图像帧中目标对象的眼嘴距离。
眼嘴距离表示图像帧中目标对象的眼睛与嘴部之间的距离。在一些实施方式中,针对面部图像帧序列内每一图像帧,将该图像帧中两眼关键点坐标均值作为第一坐标,以及将嘴部关键点坐标均值作为第二坐标,计算第一坐标和第二坐标的距离得到该图像帧中目标对象的眼嘴距离。其中,眼嘴距离可以是第一坐标和第二坐标之间的横向距离,可以是第一坐标和第二坐标之间的纵向距离,还可以是第一坐标和第二坐标之间的二维距离。
步骤S10312:根据面部图像帧序列中各图像帧中目标对象的眼嘴距离,确定参考距离。
在一些实施方式中,可以将面部图像帧序列对应的多个眼嘴距离中的最大值、最小值、均值或中位数值等中的之一作为参考距离。
在一些实施方式中,在存在多个面部图像帧序列的情况下,可以从多个面部图像帧序列对应的眼嘴距离中确定出最大的眼嘴距离,将这个最大的眼嘴距离作为参考距离。
步骤S10313:将参考距离作为归一化分母,分别对多个嘴部关键点各自的帧间位移信息和帧内差异 信息进行归一化处理,得到处理后的帧间位移信息和处理后的帧内差异信息。
将参考距离作为归一化分母,各嘴部关键点的帧间位移信息作为归一化分子,得到该嘴部关键点的处理后的帧间位移信息;将参考距离作为归一化分母,各嘴部关键点的帧内差异信息作为归一化分子,得到该嘴部关键点的处理后的帧内差异信息。
步骤S10314:基于多个嘴部关键点各自的处理后的帧间位移信息以及处理后的帧内差异信息,确定面部图像帧对应的嘴部关键点的位移特征。
对于每一面部图像帧,基于面部图像帧中的多个嘴部关键点各自的处理后的帧内差异信息,以及处理后的帧间位移信息,通过顺序拼接或加权计算的方式确定该面部图像帧的位移特征。
在本公开实施例中,以面部图像帧序列中各图像帧中目标对象的眼嘴距离,确定帧间位移信息和帧内差异信息的归一化分母,根据归一化处理得到的位移特征,这样,能够使得位移特征更加规范,从而提升确定出的目标对象的说话状态的识别结果的精确度。并且,在使用模型实现目标对象的说话状态的识别的情况下,可以提升该模型在训练过程中的收敛速度。
图4为本公开实施例提供的一种说话状态识别方法的实现流程示意图,结合图4所示的步骤进行以下说明:
步骤S201,以滑动窗口的方式从包含目标对象的面部信息的视频流中,依次取出预设长度的图像帧序列,作为目标对象的面部图像帧序列。
获取包含目标对象的面部信息的视频流,以预设的窗口大小的滑动窗口、预设的滑动步长对该视频流进行处理,从该视频流中依次取出多个与窗口大小相同的预设长度的图像帧序列,将取出的多个图像帧序列中的每个图像帧序列分别作为目标对象的面部图像帧序列。其中,滑动窗口的滑动步长不小于1,且滑动窗口的滑动步长不大于预设长度,由此滑动窗口每滑动一次所取出的面部图像帧序列与上一次取出的面部图像帧序列中至少具有一个非重叠帧,同时至少具有一个重叠帧。
在实施时,考虑说话状态的识别精度等因素,可以将窗口大小设置为22个图像帧,滑动步长设置为1至22中的任一整数,这样能够得到多个长度为22的图像帧序列。
步骤S202,获取面部图像帧序列中各图像帧的嘴部关键点信息。
步骤S203,基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化。
在一些实施方式中,面部图像帧序列包括多个面部图像帧;基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,包括:针对每一面部图像帧,执行以下步骤:根据每一嘴部关键点在面部图像帧和面部图像帧的相邻帧中的嘴部关键点信息,确定每一嘴部关键点的帧间位移信息;根据面部图像帧中的多个嘴部关键点对应的嘴部关键点信息,确定面部图像帧中的多个嘴部关键点的帧内差异信息;基于多个嘴部关键点各自的帧间位移信息以及帧内差异信息,确定面部图像帧对应的嘴部关键点的位移特征;根据面部图像帧序列中的多个面部图像帧分别对应的嘴部关键点的位移特征,确定面部图像帧序列对应的嘴部关键点的位移特征。
在一些实施方式中,基于多个嘴部关键点各自的帧间位移信息以及帧内差异信息,确定面部图像帧对应的嘴部关键点的位移特征,包括:确定面部图像帧序列中各图像帧中目标对象的眼嘴距离;根据面部图像帧序列中各图像帧中目标对象的眼嘴距离,确定参考距离;将参考距离作为归一化分母,分别对多个嘴部关键点各自的帧间位移信息和帧内差异信息进行归一化处理,得到处理后的帧间位移信息和处理后的帧内差异信息;基于多个嘴部关键点各自的处理后的帧间位移信息以及处理后的帧内差异信息,确定面部图像帧对应的嘴部关键点的位移特征。
步骤S204,根据位移特征确定目标对象的说话状态的识别结果。
这里,上述步骤S202至步骤S204分别对应于前述步骤S102至步骤S104,在实施时可以参照前述步骤S102至步骤S104的实施方式。
在本公开实施例中,利用滑动窗口从视频流中依次取出多个预设长度的面部图像帧序列,以这些预设长度的面部图像帧序列确定在视频流中设定图像帧时,目标对象是否处于正在说话的状态的识别结果,得到视频流中多个图像帧的识别结果。可以通过滑动窗口多次获取的面部图像帧序列进行说话状态识别,能够反映目标对象在滑动窗口取出的多个面部图像帧序列中嘴部关键点的位置变化过程,且多个面部图像帧序列之间至少有部分重叠帧,从而可以精确识别目标对象在连续的图像帧中的任意设定图像帧的说话状态,提升目标对象的说话状态的识别结果的精确度,进而可以提升从视频流中选取目标对象处于正在说话状态的图像帧序列的精准度。
图5为本公开实施例提供的一种说话状态识别方法的实现流程示意图,结合图5所示的步骤进行以下说明:
步骤S301,获取目标对象的面部图像帧序列。
步骤S302,获取面部图像帧序列中各图像帧的嘴部关键点信息。
步骤S303,基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化。
这里,上述步骤S301至步骤S303分别对应于前述步骤S101至步骤S103,在实施时可以参照前述步骤S101至步骤S103的实施方式。
步骤S304,采用经过训练的关键点特征提取网络对位移特征进行处理,得到面部图像帧序列的空间特征。
在一些实施方式中,可以先对位移特征中帧间位移信息、帧内差异信息分别进行特征提取,得到嘴部关键点的帧间位移特征和帧内差异特征,再在帧间位移特征和帧内差异特征之间进行空间特征提取,得到该图像帧的空间特征,根据面部图像帧序列中的各图像帧的空间特征,得到面部图像帧序列的空间特征。例如,每一关键点在位移特征中对应一个5维特征,5维特征中的前4维是帧间位移信息,分别是图像帧和前一图像帧的宽度差、图像帧和前一图像帧的高度差、图像帧和后一图像帧的宽度差、图像帧和图像帧的高度差,第5维是帧内差异信息。分别对5维特征中的每一维在不同关键点之间进行特征提取得到特征,在该特征中前4维是嘴部关键点在该图像帧的帧间位移特征,第5维是嘴部关键点在该图像帧的帧内差异特征。再对这5维之间进行进行空间特征提取,得到该图像帧的空间特征。
经过训练的关键点特征提取网络经过预设样本集训练得到,可以由任意合适的网络架构实现,包括但不限于卷积神经网络、循环神经网络等中的至少之一。
步骤S305,采用经过训练的时序特征提取网络对空间特征进行处理,得到面部图像帧序列的时空特征。
在一些实施方式中,对面部图像帧序列中多个图像帧的空间特征进行至少一次时间特征提取,得到该图像帧对应的时空特征,根据面部图像帧序列中的各图像帧的时空特征,得到面部图像帧序列的时空特征。时空特征可以是采用任意合适的特征提取方式从空间特征中提取得到的。例如,以一次时间特征提取为例,利用1×5的卷积核进行特征提取,每次卷积对该图像帧前后各两个图像帧的空间特征进行提取,提取得到的时空特征包括五个图像帧的信息。
经过训练的时序特征提取网络经过预设样本集训练得到,可以由任意合适的网络架构实现,包括但不限于卷积神经网络、循环神经网络等中的至少之一。
由于时间特征提取的次数越多、使用的卷积核越大,每一图像帧的时空特征能表示更多图像帧的信息,对应的感受野越大,利于提升说话状态识别的精确度,但需要消耗的计算资源更大,影响硬件运算效率;综合考虑精确度和硬件运算效率等因素,在实施时可以将时间特征提取的次数设置为5次。
在一些实施方式中,基于训练样本集对关键点特征提取网络和时序特征提取网络进行训练,其中,训练样本集包括已标注所包含的各视频帧中的对象说话状态的连续视频帧序列。
这里,以包括已标注所包含的各视频帧中的对象说话状态的连续视频帧序列,对关键点特征提取网络和时序特征提取网络进行训练,得到经过训练的关键点特征提取网络和经过训练的时序特征提取网络。
步骤S306,基于时空特征确定目标对象的说话状态的识别结果。
利用面部图像帧序列中图像帧的时空特征,对目标对象的说话状态进行识别,得到识别结果,识别结果表征目标对象在面部图像帧序列中设定图像帧时,是否处于正在说话状态。
目标对象的说话状态可以采用任意合适的识别方式得到,例如,可以采用分类网络对位移特征识别得到,例如全局平均池化层(Global Average Pooling,GAP),或者全连接层;又例如,可以通过预先设置规则对位移特征进行匹配得到。
在本公开实施例中,由于各网络是可学习的,通过学习能够精确识别目标对象的说话状态,从而提升说话状态的识别的精确度。并且,本公开实施例支持使用卷积神经网络进行时空特征提取;相较于采用循环神经网络(例如,递归神经网路)等时序预测网络提取时空特征,通过卷积神经网络提取时空特征的计算量较少,能够降低计算资源的消耗,降低说话状态识别的计算机设备的硬件要求。并且,对于采用卷积神经网络能够降低对芯片计算能力的要求,从而本公开实施例提供的说话状态识别方法能够通过更多轻量化的芯片实现,更多硬件支持本公开实施例的说话状态识别方法,使得更多的硬件支持说话状态识别,提升了说话状态识别的通用性,例如车机等计算机设备也可以实现说话状态识别。
图6为本公开实施例提供的一种说话状态识别方法的实现流程示意图,结合图6所示的步骤进行以下说明:
步骤S401,获取目标对象的面部图像帧序列。
步骤S402,获取面部图像帧序列中各图像帧的嘴部关键点信息。
步骤S403,基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化。
这里,上述步骤S401至步骤S403分别对应于前述步骤S101至步骤S103,在实施时可以参照前述步骤S101至步骤S103的实施方式。
步骤S404,采用经过训练的关键点特征提取网络对位移特征进行处理,得到面部图像帧序列的空间特征。
步骤S405,采用经过训练的时序特征提取网络对空间特征进行处理,得到面部图像帧序列的时空特征。
这里,上述步骤S404至步骤S405分别对应于前述步骤S304至步骤S305,在实施时可以参照前述步骤S304至步骤S305的实施方式。
步骤S406,根据时空特征确定目标对象与面部图像帧序列对应的说话状态的识别结果,作为目标对象在面部图像帧序列中的最后一个图像帧中的说话状态的识别结果。
利用面部图像帧序列中图像帧的时空特征,对目标对象的说话状态进行识别,得到识别结果,识别结果表征目标对象在面部图像帧序列中最后一个图像帧的对应时刻时,是否处于正在说话状态。
步骤S407,根据目标对象在多个滑动窗口中分别取出的面部图像帧序列中的最后一个图像帧中的说话状态的识别结果,确定目标对象说话的起始帧和结束帧。
对于以多个滑动窗口分别从视频流中取出的对应面部图像帧序列,根据每一面部图像帧序列中的最后一个图像帧中的说话状态的识别结果,获知目标对象在该最后一个图像帧中是否处于正在说话状态,确定目标对象在视频流中满足设定位置关系的多个图像帧中是否处于正在说话状态,从而确定目标对象在视频流中开始说话的起始帧,以及目标对象在视频流中结束说话的结束帧。其中,设定位置关系与滑动窗口的步长相关,例如,步长为1,能够确定目标对象在连续的多个图像帧中是否处于正在说话状态。
在一些实施方式中,获取包含目标对象的面部信息的视频流,以预设的窗口大小的滑动窗口、预设的滑动步长对该视频流进行处理,从该视频流中依次取出多个与窗口大小相同的预设长度的图像帧序列,将取出的多个图像帧序列中的每个图像帧序列分别作为目标对象的面部图像帧序列。其中,滑动窗口的滑动步长不小于1,且滑动窗口的滑动步长不大于预设长度。
可以将面部图像帧序列中的每一图像帧作为待判断图像帧,确定待判断图像帧是否为说话的起始帧或结束帧。在一些实施方式中,说话状态的识别结果包括目标对象处于表征正在说话的第一状态的第一置信度;在待判断图像帧对应的第一置信度大于或等于第一预设阈值,且待判断图像帧在面部图像帧序列中的前一图像帧对应的第一置信度小于第一预设阈值的情况下,将待判断图像帧作为目标对象说话的起始帧;在待判断图像帧对应的第一置信度大于或等于第一预设阈值,且待判断图像帧在面部图像帧序列中的后一图像帧对应的第一置信度小于第一预设阈值的情况下,将待判断图像帧作为目标对象说话的结束帧。
在一些实施方式中,说话状态的识别结果包括目标对象处于表征未在说话的第二状态的第二置信度;在待判断图像帧对应的第二置信度小于第二预设阈值,且待判断图像帧在面部图像帧序列中的前一图像帧对应的第二置信度大于或等于第二预设阈值的情况下,将待判断图像帧作为目标对象说话的起始帧;在待判断图像帧对应的第二置信度小于第一预设阈值,且待判断图像帧在面部图像帧序列中的后一图像帧对应的第二置信度大于或等于第二预设阈值的情况下,将待判断图像帧作为目标对象说话的结束帧。
在本公开实施例中,根据从视频流中滑动窗口取出的多个面部图像帧序列中最后一个图像帧的识别结果,确定目标对象在该视频流中说话的起始帧和结束帧,这样能够提升从视频流中选取目标对象处于正在说话状态的图像帧序列的精准度。并且,在利用识别结果从视频流中选取的图像帧序列进行唇语识别时,还能够提升唇语识别的准确度,降低唇语识别的图像处理过程所需的计算量。
本公开实施例提供一种模型训练方法,该方法可以由计算机设备的处理器执行。如图7所示,该方法包括如下步骤S501至步骤S505:
步骤S501,获取目标对象的样本面部图像帧序列。
其中,样本面部图像帧序列标注有表征目标对象的说话状态的样本标签。
计算机设备获取已标注样本标签的样本面部图像帧序列,样本面部图像帧序列包括样本图像帧,样本图像帧包含设定的目标对象的部分或全部的面部,且至少包括嘴部,样本标签能够描述目标对象在样本图像帧中的说话状态。
在一些实施例中,可以将所有样本图像帧中的目标对象均处于正在说话状态的样本面部图像帧序列标注为样本标签1,将所有样本图像帧中的目标对象均处于未在说话状态的样本面部图像帧序列标注为样本标签0。
在一些实施方式中,样本面部图像帧序列可以利用预先设置的窗口大小和滑动步长,以滑动窗口的方式从视频流中依次取出。
步骤S502,获取样本面部图像帧序列中各样本图像帧的嘴部关键点信息。
步骤S503,基于嘴部关键点信息,确定样本面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在样本面部图像帧序列中的多个样本图像帧之间的位置变化。
步骤S504,利用待训练的模型中的识别结果生成网络,根据位移特征确定目标对象的说话状态的识 别结果。
这里,待训练的模型可以是任意合适的深度学习模型,这里并不限定。在实施时,本领域技术人员可以根据实际情况采用合适的网络结构构建待训练的模型。
在一些实施方式中,待训练的模型还可以包括上述关键点特征提取网络和时序特征提取网络,则在步骤S503中,可以将位移特征输入至关键点特征提取网络,并利用时序特征提取网络进一步处理关键点特征提取网络的输出数据,之后利用识别结果生成网络处理时序特征提取网络输出的时空特征,得到说话状态的识别结果。
在一些实施方式中,待训练的模型采用端到端的方式训练分类得分,得到识别结果。端到端的优势在于,通过缩减人工预处理和后续处理,尽可能使模型从原始输入到最终输出,给模型更多可以根据数据自动调节的空间,增加模型的拟合程度。
这里,上述步骤S501至步骤S504分别对应于前述步骤S101至步骤S104,在实施时可以参照前述步骤S101至步骤S104的实施方式。
步骤S505,基于识别结果和样本标签,对模型的网络参数进行至少一次更新,得到训练后的模型。
这里,可以基于识别结果和样本标签,确定是否对模型的网络参数进行更新,在确定对模型的网络参数进行更新的情况下,采用合适的参数学习难度更新算法对模型的网络参数进行更新,并利用参数更新后的模型重新确定识别结果,以基于重新确定的识别结果和样本标签,确定是否对模型的网络参数进行继续更新。在确定不对模型的网络参数进行继续更新的情况下,将最终更新后的模型确定为训练后的模型。
在一些实施例中,可以基于识别结果和样本标签确定损失值,并在该损失值不满足预设条件的情况下,对模型的网络参数进行更新,在损失值满足预设条件或对模型的网络参数进行更新的次数达到设定阈值的情况下,停止对模型的网络参数进行更新,并将最终更新后的模型确定为训练后的模型。预设条件可以包括但不限于损失值小于设定的损失阈值、损失值的变化收敛等至少之一。在实施时,预设条件可以根据实际情况设定,本公开实施例对此并不限定。
对模型的网络参数进行更新的方式可以是根据实际情况确定的,可以包括但不限于梯度下降法、牛顿动量法等中的至少一种,这里并不限定。
下面说明本公开实施例提供的说话状态识别方法在实际场景中的应用,以一段包含人物说话的视频流的说话状态识别为例,对本公开实施例的说话状态识别方法进行说明。
本公开实施例提供一种说话状态识别方法,该方法可以由计算机设备的处理器执行。其中,计算机设备指的可以是车机等具备数据处理能力的设备。说话状态识别方法可以至少包括以下两个步骤:
步骤一,时序特征构造。
对输入的视频流进行处理,得到的每一帧图像。例如,输入的视频流可以表示为[N,720,1280,3]。其中,第一维的N为视频流的长度,第二维的720为每个图像帧的高度,第三维的1280为每个图像帧的宽度,第四维的3为图像通道数。
对每一帧图像进行人脸检测,得到每个人脸对应的检测框,利用检测框辅助关键点检测和定位。这里,以图3示出的106点脸部关键点示意图为例进行说明,其中的84至103号关键点是嘴部关键点,共20个。
考虑识别精度等因素,技术人员根据经验设置窗口大小为22个图像帧,对视频流的所有视频帧以滑动步长为1进行滑动,得到多个面部图像帧序列,这些面部图像帧序列可以表示为[N-21,22,106,2]。其中,N为视频流的帧数,第一维的N-21为面部图像帧序列的数量,第二维的22为每个面部图像帧序列的长度,第三维的106为关键点的数量,第四维的2为每个关键点的二维坐标。
针对每个面部图像帧序列,对其中的第2帧至第21帧的每一图像帧,计算20个嘴部关键点中的每个关键点在当前图像帧与前后图像帧之间的位移差量,每个关键点的位移差量可以表示为[xpre_diff,ypre_diff,xnext_diff,ynext_diff]。其中,第一维为当前图像帧与前一图像帧之间的横坐标的位移差量,第二维为当前图像帧与前一图像帧之间的纵坐标的位移差量,第三维为当前图像帧与后一图像帧之间的横坐标的位移差量,第四维为当前图像帧与后一图像帧之间的纵坐标的位移差量。
计算预设关键点对之间的高度差的绝对值,将计算结果作为这些点的上下嘴唇距离特征。例如,85至89分别对应于95至91,97至99分别对应于103至101。
针对每个面部图像帧序列,计算面部图像帧序列内所有图像帧中的眼部关键点的平均坐标与嘴部关键点的平均坐标之间的距离,将距离的最大值确定为归一化分母,对得到的上下嘴唇距离特征值进行归一化,得到每一面部图像帧序列的关键点位移特征,输出可以表示为[N-21,20,20,5]。其中,第一维的N-21为面部图像帧序列的数量,第二维的20为每个输入序列的长度,通过每个面部图像帧序列的长 度(22帧)确定,第三维的20为嘴部关键点个数,第四维的5为特征维数。
步骤二,特征提取模型处理。
利用本公开实施例提供说话状态识别模型,以步骤一的输出[N-21,20,20,5]为说话状态识别模型的输入,预测视频流中人物说话的开始和结束时间点。图8为本公开实施例提供的一种说话状态识别模型的组成结构示意图。如图8所示,该说话状态识别模型结构包括两个部分:关键点特征提取主干网络(backbone)81和时序特征提取分支82。两个部分为串联方式,即模型输入831为关键点特征提取主干网络81的输入,关键点特征提取主干网络81的主干网络输出832为时序特征提取分支82的输入,时序特征提取分支82的输出为模型输出说话得分833。
实际应用中,模型输入831可以为[N-21,20,20,5],同步骤一的输出;主干网络输出832可以为[N-21,64,20,1],其中,第一维的N-21为面部图像帧序列的数量,第二维的64为时空特征的维度,第三维的20为嘴部关键点的数量,第四维的1为帧内特征融合后的特征维数;模型输出说话得分833可以为[N-21,2],其中,第一维的N-21为面部图像帧序列的数量,第二维的2分别为表征正在说话的第一状态的第一置信度和表征未在说话的第二状态的第二置信度。
关键点特征提取主干网络81包括4个卷积模块,每个卷积模块包含卷积核(kernel)为(1,1)或(5,1)的卷积、批量归一化(Batch Normalization,BN)、线性整流函数(Linear rectification function,ReLU)和残差网络(Residual Networks,ResNets),用于学习面部图像帧序列中的每个图像帧内的嘴部20个关键点的共现特征(Co-occurrence Feature),共现特征包括但不限于嘴部形状、唇距。
时序特征提取分支82包括5个卷积模块、GAP、全连接层(Fully Connected layer,FC)、矩阵变换(Reshape)层、丢弃(dropout)层、窗口分类层(Cls)、softmax,每个卷积模块包含卷积核为(1,5)的卷积、BN、ReLU,整个时序特征提取分支用于学习图像帧间特征,和关键点在整个面部图像帧序列中的全局运动位移信息,从而最终输出该面部图像帧序列是否为说话的预测得分,也就是预测的模型输出说话得分833。将面部图像帧序列的模型输出说话得分833作为面部图像帧序列中的特定的图像帧的得分,利用模型输出说话得分833与预设阈值的比较结果,可以判断特定的图像帧是否处于说话状态。例如,将模型输出说话得分833大于或等于预设阈值的图像帧确定为正在说话的图像帧,将模型输出说话得分833小于预设阈值的图像帧确定为未在说话的图像帧。实际应用中,根据检测精度的要求,预设阈值可以设置为0.7。并且,多个面部图像帧序列是对视频流以滑动步长为1的滑动窗口方式得到,对应的多个特定的图像帧也是相邻的,在预测视频流中说话开始图像帧和说话结束图像帧时,还可以利用相邻的图像帧的得分变化趋势。
实际应用中,在时序特征提取分支82中,可以利用(1,5)的卷积核在面部图像帧序列的长度维度上卷积,将面部图像帧序列中的每一图像帧的空间特征和前后各两个图像帧的空间特征融合,并重复5次上述卷积以提升感受野,完成帧间特征融合,得到每一图像帧的时空特征。这样,使得帧间的信息得到交流,加强相邻帧间关联。由于该步骤将占用一定的计算资源,为提高性能可以将卷积核尺寸增大,并将重复次数增多,相应地影响效率。综合考虑准确度和硬件运算效率,实际应用中可以将提取次数设置为5次,卷积核尺寸设置为5。
对图8示出的说话状态识别模型的训练,可以采用以下方式实现:
获取一段标注说话开始图像帧和说话结束图像帧的第一样本图像帧序列,第一样本图像帧序列是连续的视频帧,以步长为S、窗口大小为L的滑动窗口得到样本面部图像帧序列。若每个样本面部图像帧序列中全部帧均处于说话状态,确定样本面部图像帧序列的标签为1;若每个样本面部图像帧序列中全部帧均未处于说话状态全部帧,确定样本面部图像帧序列的标签为0。这里,包含部分说话帧的样本暂不加入训练。整个模型采用端到端的方式训练分类得分,损失函数为裕量Softmax损失函数(Margin Softmax Loss)。
这里,利用标注说话开始图像帧和说话结束图像帧的标签,可以将连续的第一样本图像帧序列划分为说话区间和不说话区间,分别从两个区间选取样本面部图像帧序列。
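下面给出依据标注的说话开始图像帧和说话结束图像帧,对连续帧序列做滑动窗口切分并赋标签的示意代码(窗口大小、步长为假设取值):全说话帧的窗口标签为1,全非说话帧的窗口标签为0,含部分说话帧的窗口暂不加入训练:

```python
import numpy as np

def make_training_windows(num_frames, speak_start, speak_end, window=22, stride=1):
    """返回 (窗口起始帧索引, 标签) 列表;speak_start/speak_end 为标注的说话开始/结束帧索引。"""
    samples = []
    speaking = np.zeros(num_frames, dtype=bool)
    speaking[speak_start:speak_end + 1] = True
    for s in range(0, num_frames - window + 1, stride):
        win = speaking[s:s + window]
        if win.all():
            samples.append((s, 1))      # 全部帧均处于说话状态
        elif not win.any():
            samples.append((s, 0))      # 全部帧均未处于说话状态
        # 含部分说话帧的窗口暂不加入训练
    return samples

print(make_training_windows(num_frames=100, speak_start=30, speak_end=70)[:3])
```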
本公开实施例中,首先通过人脸检测、关键点定位的方式,得到每个人脸图像对应的检测框和关键点,然后以滑动窗口的形式逐帧处理得到长度为L的面部图像帧序列。根据嘴部关键点构造面部图像帧序列的运动特征,将特征输入模型后,得到用于预测该面部图像帧序列是否为说话的得分,以面部图像帧序列的得分作为特定的一个图像帧(通常为第21帧)的得分,若该帧得分高于预设阈值则判断为正在说话,从而确定视频流的开始说话和结束说话的时间点结果。
这样,仅利用视频流中的嘴部关键点作为输入,对视频流进行滑动窗口处理,构造关键点运动特征进行模型预测,可以使用较小的模型计算量和资源占用实现对视频流中人物说话的起始帧和结束帧的实时预测,并对各类复杂的不说话嘴部动作有较好的识别精度。尤其,在用户于智能座舱内使用语音交互时,若车窗外风声、车内闲聊声或音乐外放声过大,仅依靠语音的识别准确度不高;采用本公开实施例提供的说话状态识别方法,结合语音进行多模态识别,利用视觉特征可有效减少声音干扰,提供更准确的说话区间,提升语音识别精度,减少漏报误报。
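下面给出根据逐帧说话得分与预设阈值确定说话起始帧和结束帧的示意代码,其中 scores 为假设的各特定图像帧对应的说话得分序列,阈值以0.7为例:

```python
import numpy as np

def find_speaking_segments(scores, threshold=0.7):
    """scores: 各特定图像帧的说话得分序列,返回 [(起始帧索引, 结束帧索引), ...]。"""
    speaking = np.asarray(scores) >= threshold
    segments, start = [], None
    for i, flag in enumerate(speaking):
        if flag and start is None:
            start = i                        # 前一帧未说话、当前帧说话:起始帧
        elif not flag and start is not None:
            segments.append((start, i - 1))  # 前一帧说话、当前帧未说话:结束帧为前一帧
            start = None
    if start is not None:
        segments.append((start, len(speaking) - 1))
    return segments

print(find_speaking_segments([0.1, 0.2, 0.8, 0.9, 0.95, 0.6, 0.3]))  # [(2, 4)]
```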
需要说明的是,在实施时,上述模型输出说话得分833可以对应于前述实施例中的识别结果,运动特征可以对应于前述实施例中的位移特征,位移差量可以对应于前述实施例中的帧间位移信息,上下嘴唇距离特征可以对应于前述实施例中的帧内差异信息,样本视频帧序列可以对应于前述实施例中的样本面部图像帧序列。
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
基于前述的实施例,本公开实施例提供一种说话状态识别装置,该装置包括的各单元、以及各单元所包括的各部分,可以通过计算机设备中的处理器来实现;当然,在一些实施例中也可通过逻辑电路实现;在实施的过程中,处理器可以为中央处理器(Central Processing Unit,CPU)、微处理器(Microprocessor Unit,MPU)、数字信号处理器(Digital Signal Processor,DSP)或现场可编程门阵列(Field Programmable Gate Array,FPGA)等。
图9为本公开实施例提供的一种说话状态识别装置的组成结构示意图,如图9所示,说话状态识别装置900包括:第一获取部分910、第二获取部分920、第一确定部分930和第二确定部分940,其中:
第一获取部分910,被配置为获取目标对象的面部图像帧序列;
第二获取部分920,被配置为获取所述面部图像帧序列中各图像帧的嘴部关键点信息;
第一确定部分930,被配置为基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;
第二确定部分940,被配置为根据所述位移特征确定所述目标对象的说话状态的识别结果。
在一些实施例中,所述第二获取部分920,包括:第一检测子部分,被配置为针对所述面部图像帧序列中的每一面部图像帧进行人脸关键点检测,得到所述每一面部图像帧中的嘴部关键点信息。
在一些实施例中,所述第一获取部分910,包括:第一获取子部分,被配置为以滑动窗口的方式从包含所述目标对象的面部信息的视频流中,依次取出预设长度的图像帧序列,作为所述目标对象的面部图像帧序列,其中,所述滑动窗口的滑动步长不小于1,且所述滑动窗口的滑动步长不大于所述预设长度。
在一些实施例中,所述面部图像帧序列包括多个所述面部图像帧;所述第一确定部分930,包括:第一执行子部分,被配置为针对每一面部图像帧,执行以下步骤:根据每一嘴部关键点在所述面部图像帧和所述面部图像帧的相邻帧中的嘴部关键点信息,确定每一嘴部关键点的帧间位移信息;根据所述面部图像帧中的多个所述嘴部关键点对应的嘴部关键点信息,确定所述面部图像帧中的多个嘴部关键点的帧内差异信息;基于所述多个嘴部关键点各自的帧间位移信息以及所述帧内差异信息,确定所述面部图像帧对应的嘴部关键点的位移特征;第一确定子部分,被配置为根据所述面部图像帧序列中的多个所述面部图像帧分别对应的嘴部关键点的位移特征,确定所述面部图像帧序列对应的嘴部关键点的位移特征。
在一些实施例中,所述第一确定子部分,包括:第一确定单元,被配置为确定所述面部图像帧序列中各图像帧中目标对象的眼嘴距离;第二确定单元,被配置为根据所述面部图像帧序列中各图像帧中目标对象的眼嘴距离,确定参考距离;第一处理单元,被配置为将所述参考距离作为归一化分母,分别对所述多个嘴部关键点各自的所述帧间位移信息和所述帧内差异信息进行归一化处理,得到处理后的帧间位移信息和处理后的帧内差异信息;第三确定单元,被配置为基于所述多个嘴部关键点各自的处理后的帧间位移信息以及处理后的帧内差异信息,确定所述面部图像帧对应的嘴部关键点的位移特征。
在一些实施例中,所述第二确定部分940,包括:第一处理子部分,被配置为采用经过训练的关键点特征提取网络对所述位移特征进行处理,得到所述面部图像帧序列的空间特征;第二处理子部分,被配置为采用经过训练的时序特征提取网络对所述空间特征进行处理,得到所述面部图像帧序列的时空特征;第一识别子部分,被配置为基于所述时空特征确定所述目标对象的说话状态的识别结果。
在一些实施例中,所述第一识别子部分,包括:第一识别单元,被配置为根据所述时空特征确定所述目标对象与所述面部图像帧序列对应的说话状态的识别结果,作为所述目标对象在所述面部图像帧序列中的最后一个图像帧中的说话状态的识别结果;所述装置还包括:第五确定部分,被配置为根据所述目标对象在多个所述滑动窗口中分别取出的面部图像帧序列中的最后一个图像帧中的说话状态的识别结果,确定所述目标对象说话的起始帧和结束帧。
在一些实施例中,所述说话状态的识别结果包括所述目标对象处于表征正在说话的第一状态的第一置信度、或者所述目标对象处于表征未在说话的第二状态的第二置信度;所述第五确定部分,包括:第二执行子部分,被配置为将所述面部图像帧序列中的每一所述图像帧作为待判断图像帧,针对待判断图像帧执行以下步骤之一:在所述待判断图像帧对应的所述第一置信度大于或等于第一预设阈值,且所述待判断图像帧在所述面部图像帧序列中的前一图像帧对应的所述第一置信度小于第一预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的起始帧;在所述待判断图像帧对应的所述第一置信度大于或等于第一预设阈值,且所述待判断图像帧在所述面部图像帧序列中的后一图像帧对应的所述第一置信度小于第一预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的结束帧;在所述待判断图像帧对应的所述第二置信度小于第二预设阈值,且所述待判断图像帧在所述面部图像帧序列中的前一图像帧对应的所述第二置信度大于或等于第二预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的起始帧;在所述待判断图像帧对应的所述第二置信度小于第二预设阈值,且所述待判断图像帧在所述面部图像帧序列中的后一图像帧对应的所述第二置信度大于或等于第二预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的结束帧。
在一些实施例中,所述装置还包括:第一训练部分,被配置为基于训练样本集对所述关键点特征提取网络和所述时序特征提取网络进行训练,其中,所述训练样本集包括已标注所包含的各视频帧中的对象说话状态的连续视频帧序列。
以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。在一些实施例中,本公开实施例提供的装置具有的功能或包含的部分可以被配置为执行上述方法实施例描述的方法,对于本公开装置实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。
基于前述的实施例,本公开实施例提供一种模型训练装置,该装置包括的各单元、以及各单元所包括的各部分,可以通过计算机设备中的处理器来实现;当然,在一些实施例中也可通过逻辑电路实现;在实施的过程中,处理器可以为CPU、MPU、DSP或FPGA等。
图10为本公开实施例提供的模型训练装置的组成结构示意图,如图10所示,模型训练装置1000包括:第三获取部分1010、第四获取部分1020、第三确定部分1030、第四确定部分1040和更新部分1050,其中:
第三获取部分1010,被配置为获取目标对象的样本面部图像帧序列,其中,所述样本面部图像帧序列标注有表征所述目标对象的说话状态的样本标签;
第四获取部分1020,被配置为获取所述样本面部图像帧序列中各样本图像帧的嘴部关键点信息;
第三确定部分1030,被配置为基于所述嘴部关键点信息,确定所述样本面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述样本面部图像帧序列中的多个样本图像帧之间的位置变化;
第四确定部分1040,被配置为利用待训练的模型中的识别结果生成网络,根据所述位移特征确定所述目标对象的说话状态的识别结果;
更新部分1050,被配置为基于所述识别结果和所述样本标签,对所述模型的网络参数进行至少一次更新,得到训练后的所述模型。
以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。在一些实施例中,本公开实施例提供的装置具有的功能或包含的部分可以被配置为执行上述方法实施例描述的方法,对于本公开装置实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。
本公开实施例提供一种车辆,包括:
车载相机,用于拍摄包含目标对象的面部图像帧序列;
车机,与所述车载相机连接,用于从所述车载相机获取所述目标对象的面部图像帧序列;获取所述面部图像帧序列中各图像帧的嘴部关键点信息;基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;根据所述位移特征确定所述目标对象的说话状态的识别结果。
以上车辆实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开车辆实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。
在本公开实施例以及其他的实施例中,“部分”可以是部分电路、部分处理器、部分程序或软件等等,当然也可以是单元,还可以是模块也可以是非模块化的。
需要说明的是,本公开实施例中,如果以软件功能模块的形式实现上述方法,并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本公开实施例不限制于任何特定的硬件、软件或固件,或者硬件、软件、固件三者之间的任意结合。
本公开实施例提供一种计算机设备,包括存储器和处理器,所述存储器存储有可在处理器上运行的计算机程序,所述处理器执行所述程序时实现上述方法中的部分或全部步骤。
本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现上述方法中的部分或全部步骤。所述计算机可读存储介质可以是瞬时性的,也可以是非瞬时性的。
本公开实施例提供一种计算机程序,包括计算机可读代码,在所述计算机可读代码在计算机设备中运行的情况下,所述计算机设备中的处理器执行用于实现上述方法中的部分或全部步骤。
本公开实施例提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序被计算机读取并执行时,实现上述方法中的部分或全部步骤。该计算机程序产品在一些实施例中可以通过硬件、软件或其结合的方式实现。在一些实施例中,所述计算机程序产品体现为例如计算机存储介质,在另一些实施例中,计算机程序产品体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
这里需要指出的是:上文对各个实施例的描述倾向于强调各个实施例之间的不同之处,其相同或相似之处可以互相参考。以上设备、存储介质、计算机程序及计算机程序产品实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开设备、存储介质、计算机程序及计算机程序产品实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。
需要说明的是,图11为本公开实施例中计算机设备的一种硬件实体示意图,如图11所示,该计算机设备1100的硬件实体包括:处理器1101、通信接口1102和存储器1103,其中:
处理器1101通常控制计算机设备1100的总体操作。
通信接口1102可以使计算机设备通过网络与其他终端或服务器通信。
存储器1103配置为存储由处理器1101可执行的指令和应用,还可以缓存待处理器1101以及计算机设备1100中各部分待处理或已经处理的数据(例如,图像数据、音频数据、语音通信数据和视频通信数据),可以通过闪存(FLASH)或随机访问存储器(Random Access Memory,RAM)实现。处理器1101、通信接口1102和存储器1103之间可以通过总线1104进行数据传输。
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。应理解,在本申请的各种实施例中,上述各步骤/过程的序号的大小并不意味着执行顺序的先后,各步骤/过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
在本申请所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
若本申请技术方案涉及个人信息,应用本申请技术方案的产品在处理个人信息前,已明确告知个人信息处理规则,并取得个人自主同意。若本申请技术方案涉及敏感个人信息,应用本申请技术方案的产品在处理敏感个人信息前,已取得个人单独同意,并且同时满足“明示同意”的要求。例如,在摄像头等个人信息采集装置处,设置明确显著的标识告知已进入个人信息采集范围,将会对个人信息进行采集,若个人自愿进入采集范围即视为同意对其个人信息进行采集;或者在个人信息处理的装置上,利用明显的标识/信息告知个人信息处理规则的情况下,通过弹窗信息或请个人自行上传其个人信息等方式获得个人授权;其中,个人信息处理规则可包括个人信息处理者、个人信息处理目的、处理方式、处理的个人信息种类等信息。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本申请上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本申请各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。
工业实用性
本公开实施例公开了一种说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品,其中,说话状态识别方法包括:获取目标对象的面部图像帧序列;获取面部图像帧序列中各图像帧的嘴部关键点信息;基于嘴部关键点信息,确定面部图像帧序列对应的嘴部关键点的位移特征,位移特征表征嘴部关键点在面部图像帧序列中的多个图像帧之间的位置变化;根据位移特征确定目标对象的说话状态的识别结果。本公开实施例中,位移特征能够表示目标对象在面部图像帧序列中嘴部关键点的位置变化过程,根据位移特征确定目标对象的说话状态的识别结果,能够精确识别目标对象的说话状态,从而能够提升说话状态识别的精确度。并且,相较于利用面部图像帧裁剪得到的嘴部区域图像序列进行说话状态识别,上述方案利用嘴部关键点的位移特征,能够降低说话状态识别所需的计算量,从而降低执行说话状态识别方法的计算机设备的硬件要求。此外,利用嘴部关键点的位移特征,对不同脸型、纹理等外观信息的面部图像帧都能取得良好的识别效果,从而提高了说话状态识别的泛化能力。

Claims (17)

  1. 一种说话状态识别方法,包括:
    获取目标对象的面部图像帧序列;
    获取所述面部图像帧序列中各图像帧的嘴部关键点信息;
    基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;
    根据所述位移特征确定所述目标对象的说话状态的识别结果。
  2. 根据权利要求1所述的方法,其中,所述获取所述面部图像帧序列中各图像帧的嘴部关键点信息,包括:
    针对所述面部图像帧序列中的每一面部图像帧进行人脸关键点检测,得到所述每一面部图像帧中的嘴部关键点信息。
  3. 根据权利要求1或2所述的方法,其中,所述获取目标对象的面部图像帧序列,包括:
    以滑动窗口的方式从包含所述目标对象的面部信息的视频流中,依次取出预设长度的图像帧序列,作为所述目标对象的面部图像帧序列,其中,所述滑动窗口的滑动步长不小于1,且所述滑动窗口的滑动步长不大于所述预设长度。
  4. 根据权利要求3所述的方法,其中,所述面部图像帧序列包括多个所述面部图像帧;
    所述基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,包括:
    针对每一面部图像帧,执行以下步骤:根据每一嘴部关键点在所述面部图像帧和所述面部图像帧的相邻帧中的嘴部关键点信息,确定每一嘴部关键点的帧间位移信息;根据所述面部图像帧中的多个所述嘴部关键点对应的嘴部关键点信息,确定所述面部图像帧中的多个嘴部关键点的帧内差异信息;基于所述多个嘴部关键点各自的帧间位移信息以及所述帧内差异信息,确定所述面部图像帧对应的嘴部关键点的位移特征;
    根据所述面部图像帧序列中的多个所述面部图像帧分别对应的嘴部关键点的位移特征,确定所述面部图像帧序列对应的嘴部关键点的位移特征。
  5. 根据权利要求4所述的方法,其中,所述基于所述多个嘴部关键点各自的帧间位移信息以及所述帧内差异信息,确定所述面部图像帧对应的嘴部关键点的位移特征,包括:
    确定所述面部图像帧序列中各图像帧中目标对象的眼嘴距离;
    根据所述面部图像帧序列中各图像帧中目标对象的眼嘴距离,确定参考距离;
    将所述参考距离作为归一化分母,分别对所述多个嘴部关键点各自的所述帧间位移信息和所述帧内差异信息进行归一化处理,得到处理后的帧间位移信息和处理后的帧内差异信息;
    基于所述多个嘴部关键点各自的处理后的帧间位移信息以及处理后的帧内差异信息,确定所述面部图像帧对应的嘴部关键点的位移特征。
  6. 根据权利要求4或5所述的方法,其中,所述根据所述位移特征确定所述目标对象的说话状态的识别结果,包括:
    采用经过训练的关键点特征提取网络对所述位移特征进行处理,得到所述面部图像帧序列的空间特征;
    采用经过训练的时序特征提取网络对所述空间特征进行处理,得到所述面部图像帧序列的时空特征;
    基于所述时空特征确定所述目标对象的说话状态的识别结果。
  7. 根据权利要求6所述的方法,其中,所述基于所述时空特征确定所述目标对象的说话状态的识别结果,包括:
    根据所述时空特征确定所述目标对象与所述面部图像帧序列对应的说话状态的识别结果,作为所述目标对象在所述面部图像帧序列中的最后一个图像帧中的说话状态的识别结果;
    所述方法还包括:
    根据所述目标对象在多个所述滑动窗口中分别取出的面部图像帧序列中的最后一个图像帧中的说话状态的识别结果,确定所述目标对象说话的起始帧和结束帧。
  8. 根据权利要求7所述的方法,其中,所述说话状态的识别结果包括所述目标对象处于表征正在说话的第一状态的第一置信度、或者所述目标对象处于表征未在说话的第二状态的第二置信度;所述根据所述目标对象在多个所述滑动窗口中分别取出的面部图像帧序列中的最后一个图像帧中的说话状态的识别结果,确定所述目标对象说话的起始帧和结束帧,包括:
    将所述面部图像帧序列中的每一所述图像帧作为待判断图像帧,针对所述待判断图像帧执行以下步骤之一:
    在所述待判断图像帧对应的所述第一置信度大于或等于第一预设阈值,且所述待判断图像帧在所述面部图像帧序列中的前一图像帧对应的所述第一置信度小于第一预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的起始帧;
    在所述待判断图像帧对应的所述第一置信度大于或等于第一预设阈值,且所述待判断图像帧在所述面部图像帧序列中的后一图像帧对应的所述第一置信度小于第一预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的结束帧;
    在所述待判断图像帧对应的所述第二置信度小于第二预设阈值,且所述待判断图像帧在所述面部图像帧序列中的前一图像帧对应的所述第二置信度大于或等于第二预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的起始帧;
    在所述待判断图像帧对应的所述第二置信度小于第二预设阈值,且所述待判断图像帧在所述面部图像帧序列中的后一图像帧对应的所述第二置信度大于或等于第二预设阈值的情况下,将所述待判断图像帧作为所述目标对象说话的结束帧。
  9. 根据权利要求6至8中任一项所述的方法,其中,所述方法还包括:
    基于训练样本集对所述关键点特征提取网络和所述时序特征提取网络进行训练,其中,所述训练样本集包括已标注所包含的各视频帧中的对象说话状态的连续视频帧序列。
  10. 一种模型训练方法,所述方法包括:
    获取目标对象的样本面部图像帧序列,其中,所述样本面部图像帧序列标注有表征所述目标对象的说话状态的样本标签;
    获取所述样本面部图像帧序列中各样本图像帧的嘴部关键点信息;
    基于所述嘴部关键点信息,确定所述样本面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述样本面部图像帧序列中的多个样本图像帧之间的位置变化;
    利用待训练的模型中的识别结果生成网络,根据所述位移特征确定所述目标对象的说话状态的识别结果;
    基于所述识别结果和所述样本标签,对所述模型的网络参数进行至少一次更新,得到训练后的所述模型。
  11. 一种说话状态识别装置,包括:
    第一获取部分,被配置为获取目标对象的面部图像帧序列;
    第二获取部分,被配置为获取所述面部图像帧序列中各图像帧的嘴部关键点信息;
    第一确定部分,被配置为基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;
    第二确定部分,被配置为根据所述位移特征确定所述目标对象的说话状态的识别结果。
  12. 一种模型训练装置,包括:
    第三获取部分,被配置为获取目标对象的样本面部图像帧序列,其中,所述样本面部图像帧序列标注有表征所述目标对象的说话状态的样本标签;
    第四获取部分,被配置为获取所述样本面部图像帧序列中各样本图像帧的嘴部关键点信息;
    第三确定部分,被配置为基于所述嘴部关键点信息,确定所述样本面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述样本面部图像帧序列中的多个样本图像帧之间的位置变化;
    第四确定部分,被配置为利用待训练的模型中的识别结果生成网络,根据所述位移特征确定所述目标对象的说话状态的识别结果;
    更新部分,被配置为基于所述识别结果和所述样本标签,对所述模型的网络参数进行至少一次更新,得到训练后的所述模型。
  13. 一种计算机设备,包括存储器和处理器,所述存储器存储有可在处理器上运行的计算机程序,所述处理器执行所述程序时实现权利要求1至10任一项所述方法中的步骤。
  14. 一种车辆,包括:
    车载相机,用于拍摄包含目标对象的面部图像帧序列;
    车机,与所述车载相机连接,用于从所述车载相机获取所述目标对象的面部图像帧序列;获取所述面部图像帧序列中各图像帧的嘴部关键点信息;基于所述嘴部关键点信息,确定所述面部图像帧序列对应的嘴部关键点的位移特征,所述位移特征表征所述嘴部关键点在所述面部图像帧序列中的多个图像帧之间的位置变化;根据所述位移特征确定所述目标对象的说话状态的识别结果。
  15. 一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现权利要求1至10任一项所述方法中的步骤。
  16. 一种计算机程序,包括计算机可读代码,在所述计算机可读代码在计算机设备中运行的情况下,所述计算机设备中的处理器执行用于实现权利要求1至10中任一所述的方法中的步骤。
  17. 一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序被计算机读取并执行时,实现权利要求1至10任一项所述方法中的步骤。
PCT/CN2023/093495 2022-06-30 2023-05-11 说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品 WO2024001539A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210772934.1A CN115063867A (zh) 2022-06-30 2022-06-30 说话状态识别方法及模型训练方法、装置、车辆、介质
CN202210772934.1 2022-06-30

Publications (1)

Publication Number Publication Date
WO2024001539A1 true WO2024001539A1 (zh) 2024-01-04

Family

ID=83203985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093495 WO2024001539A1 (zh) 2022-06-30 2023-05-11 说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品

Country Status (2)

Country Link
CN (1) CN115063867A (zh)
WO (1) WO2024001539A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063867A (zh) * 2022-06-30 2022-09-16 上海商汤临港智能科技有限公司 说话状态识别方法及模型训练方法、装置、车辆、介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140723A1 (zh) * 2018-12-30 2020-07-09 广州市百果园信息技术有限公司 人脸动态表情的检测方法、装置、设备及存储介质
CN111428672A (zh) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
CN111666820A (zh) * 2020-05-11 2020-09-15 北京中广上洋科技股份有限公司 一种讲话状态识别方法、装置、存储介质及终端
WO2020253051A1 (zh) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 唇语的识别方法及其装置
CN112633208A (zh) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 一种唇语识别方法、服务设备及存储介质
CN113486760A (zh) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 对象说话检测方法及装置、电子设备和存储介质
CN113873195A (zh) * 2021-08-18 2021-12-31 荣耀终端有限公司 视频会议控制方法、装置和存储介质
CN115063867A (zh) * 2022-06-30 2022-09-16 上海商汤临港智能科技有限公司 说话状态识别方法及模型训练方法、装置、车辆、介质


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015386A (zh) * 2024-04-08 2024-05-10 腾讯科技(深圳)有限公司 图像识别方法和装置、存储介质及电子设备
CN118015386B (zh) * 2024-04-08 2024-06-11 腾讯科技(深圳)有限公司 图像识别方法和装置、存储介质及电子设备

Also Published As

Publication number Publication date
CN115063867A (zh) 2022-09-16

Similar Documents

Publication Publication Date Title
WO2024001539A1 (zh) 说话状态识别方法及模型训练方法、装置、车辆、介质、计算机程序及计算机程序产品
WO2021017606A1 (zh) 视频处理方法、装置、电子设备及存储介质
CN108197589B (zh) 动态人体姿态的语义理解方法、装置、设备和存储介质
CN109145766B (zh) 模型训练方法、装置、识别方法、电子设备及存储介质
WO2020182121A1 (zh) 表情识别方法及相关装置
WO2019033573A1 (zh) 面部情绪识别方法、装置及存储介质
WO2020253051A1 (zh) 唇语的识别方法及其装置
CN110956060A (zh) 动作识别、驾驶动作分析方法和装置及电子设备
CN111696176B (zh) 图像处理方法、装置、电子设备及计算机可读介质
WO2023098128A1 (zh) 活体检测方法及装置、活体检测系统的训练方法及装置
CN110765294B (zh) 图像搜索方法、装置、终端设备及存储介质
CN111428666A (zh) 基于快速人脸检测的智能家庭陪伴机器人系统及方法
CN113298018A (zh) 基于光流场和脸部肌肉运动的假脸视频检测方法及装置
CN114639150A (zh) 情绪识别方法、装置、计算机设备和存储介质
CN112200110A (zh) 一种基于深度干扰分离学习的人脸表情识别方法
WO2023208134A1 (zh) 图像处理方法及模型生成方法、装置、车辆、存储介质及计算机程序产品
CN111506183A (zh) 一种智能终端及用户交互方法
CN107103269A (zh) 一种表情反馈方法及智能机器人
CN116721449A (zh) 视频识别模型的训练方法、视频识别方法、装置以及设备
CN113269068B (zh) 一种基于多模态特征调节与嵌入表示增强的手势识别方法
WO2022110059A1 (zh) 视频处理、景别识别方法、终端设备和拍摄系统
CN114140718A (zh) 一种目标跟踪方法、装置、设备及存储介质
CN111796663B (zh) 场景识别模型更新方法、装置、存储介质及电子设备
CN109325521B (zh) 用于虚拟人物的检测方法及装置
WO2020124390A1 (zh) 一种面部属性的识别方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829725

Country of ref document: EP

Kind code of ref document: A1