WO2024000867A1 - Emotion recognition method, apparatus, device and storage medium - Google Patents

Emotion recognition method, apparatus, device and storage medium

Info

Publication number
WO2024000867A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
text
data
tested
emotion
Prior art date
Application number
PCT/CN2022/121852
Other languages
English (en)
French (fr)
Inventor
张润泽
李仁刚
赵雅倩
郭振华
范宝余
李晓川
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2024000867A1 publication Critical patent/WO2024000867A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • This application relates to the field of neural network technology, and in particular to emotion recognition methods, devices, electronic equipment and non-volatile readable storage media.
  • With the maturity of face recognition technology, finding the faces of people of interest in pictures or videos is now a relatively mature technique. Therefore, current research on emotion recognition focuses on facial emotion recognition.
  • the purpose of this application is to provide an emotion recognition method, device, electronic device and non-volatile readable storage medium to improve the emotion recognition accuracy and model versatility.
  • an emotion recognition model training method including:
  • the emotion label corresponding to the maximum similarity data to be tested is determined as the emotion recognition result corresponding to the video to be tested.
  • splicing each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label includes:
  • the template vector is spliced with each label vector to obtain the text data to be tested.
  • the initial model includes a text encoder, an image encoder and an audio encoder, and also includes a pooling network module and a temporal recursive network module.
  • the output of the text encoder is the input of the pooling network module, and the output of the image encoder is the input of the temporal recursive network module.
  • inputting the training video frames, training text data and training audio into the initial model to obtain training text encoded data and training non-text encoded data includes:
  • the intermediate image encoding and the initial audio encoding are spliced to obtain training non-text encoding data.
  • the text encoder and image encoder belong to the language-image comparison learning pre-training model, and the audio encoder has been pre-trained.
  • parameter adjustments are made to the initial model based on the loss value, including:
  • the parameters of the pooling network module and the time recursive network module in the initial model are adjusted based on the loss value.
  • using the emotion labels to generate training text data includes:
  • the template vector and label vector are spliced to obtain training text data.
  • detecting that the training completion conditions are met includes:
  • using test data to test the accuracy of the parameter-adjusted initial model to obtain a test result;
  • if the test result is greater than a preset threshold, determining that the training completion conditions are met.
  • test data includes multiple sets of test sub-data, including target test sub-data, and the target test sub-data includes target test video, target test audio, and target test label.
  • using the test data to test the accuracy of the parameter-adjusted initial model to obtain the test result includes:
  • inputting the target test video frames, target test text data and target test audio into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data;
  • counting the test sub-results corresponding to the test data to obtain the test result.
  • training completion conditions including:
  • This application also provides an emotion recognition device, including:
  • the test acquisition module is used to obtain the video and audio to be tested
  • the data processing module to be tested is used to determine multiple video frames to be tested in the video to be tested, and use each emotion label in the label set to splice with the text template to be tested to generate text data to be tested corresponding to each emotion label;
  • the input module to be tested is used to input the video frame to be tested, the text data to be tested, and the audio to be tested into the emotion recognition model to obtain the non-text encoding data to be tested and each text encoding data to be tested corresponding to each text data to be tested;
  • the similarity to be tested generation module is used to generate the similarity data to be tested using the non-text encoding data to be tested and the text encoding data to be tested respectively;
  • the recognition result determination module is used to determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
  • the application also provides an electronic device, including a memory and a processor, wherein:
  • a memory, used to store a computer program;
  • a processor is used to execute a computer program to implement the above emotion recognition method.
  • This application also provides a non-volatile readable storage medium for storing a computer program, wherein the computer program implements the above emotion recognition method when executed by the processor.
  • In the emotion recognition method, a video to be tested and audio to be tested are obtained; multiple video frames to be tested are determined in the video to be tested, and each emotion label in the label set is spliced with the text template to be tested to generate the text data to be tested corresponding to each emotion label; the video frames to be tested, the text data to be tested and the audio to be tested are input into the emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested; similarity data to be tested is generated from the non-text encoded data to be tested and each piece of text encoded data to be tested; and the emotion label corresponding to the maximum similarity data to be tested is determined as the emotion recognition result corresponding to the video to be tested.
  • This method converts the emotion recognition process from the original probability prediction problem into a similarity matching problem and introduces the semantic information contained in the labels themselves, which not only improves accuracy but also gives the model a certain zero-shot learning transfer ability. Specifically, when recognizing emotions, this application uses the various emotion labels together with the same text template to generate multiple pieces of text data to be tested; the emotion recognition model has been trained so that it can learn the semantic information carried by the emotion labels.
  • The similarity between the non-text encoded data to be tested of the video and the text encoded data to be tested corresponding to each emotion label is used to select the maximum similarity data to be tested and determine the most similar emotion label, improving the accuracy of emotion recognition.
  • Even if an emotion label not involved in training is added at application time, the emotion recognition model can distinguish it from other emotion labels based on the semantic information of that label, giving it a certain zero-shot learning capability and improving model versatility.
  • this application also provides devices, electronic equipment and non-volatile readable storage media, which also have the above beneficial effects.
  • Figure 1 is a flow chart of an emotion recognition model training method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of an emotion recognition method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an identification terminal provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of an emotion recognition device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the main solution for dynamic facial emotion recognition is to use multi-modal fusion information of vision and sound to achieve emotion recognition. That is, the visual images and sound audio in the video are extracted using feature extractors respectively, and then fused using a feature fusion network to finally predict a fixed set of predefined emotion categories.
  • this scheme completely ignores the semantic information contained in the emotion label itself, but directly maps the emotion label to a fixed number of category indexes (numbers).
  • This approach not only limits the versatility of the model, leaving it without the transfer/prediction ability of zero-shot learning and requiring additional training data before the model can be applied to new scenarios, but also leads to low emotion recognition accuracy.
  • In this application, we draw on the way humans recognize emotions.
  • When watching a video, people associate the visual features in the video (whether seen before or not) with the features of natural language in their minds, rather than with numbers/indexes. Therefore, this application adopts an unconventional training method that mines the semantic information of the label text during training and associates it with the corresponding video features. This not only enhances the semantics of the video representation and improves recognition accuracy, but also gives the model a certain zero-shot learning transfer ability.
  • Figure 1 is a flow chart of an emotion recognition model training method provided by an embodiment of the present application.
  • the method includes:
  • S101 Obtain training videos, training audios and emotion labels.
  • each step in this application can be completed by a designated electronic device.
  • the electronic device for execution can be in any form such as a server or a computer.
  • the number of electronic devices can be one or more; that is, all steps can be executed by a single electronic device, or multiple electronic devices can each execute some of the steps and cooperate to complete the process of model training and/or emotion recognition.
  • Training videos, training audios, and emotion labels correspond to each other.
  • Training videos refer to videos that record changes in facial expressions.
  • Training audio refers to audio corresponding to a training video, which usually records sounds corresponding to the facial emotion changes recorded in the training video, such as crying or laughter.
  • Emotion tags refer to text names corresponding to the emotions expressed in training videos and training audios, such as happy, angry, sad, fear and other texts.
  • S102 Determine multiple training video frames in the training video, and use emotion labels to generate training text data.
  • the training video frame can be any video frame in the training video, and the number of training video frames can be multiple, for example, M, where M is a fixed positive number. Using multiple training video frames, the emotional changes of faces in the training videos can be characterized in the temporal direction.
  • the method of determining the training video frames is not limited.
  • the training video frames can be extracted from the first frame of the training video according to a preset time interval.
  • Alternatively, the number of training video frames can be determined first, and the training video is then sampled at evenly spaced intervals based on that number to obtain the training video frames.
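  • As an illustration, a minimal sketch of such evenly spaced frame sampling is shown below; the helper name and the use of OpenCV are assumptions for illustration, not part of this application.

```python
import cv2  # assumed video reader; any decoder that yields frames by index works
import numpy as np

def sample_training_frames(video_path: str, m: int = 8):
    """Uniformly sample M frames across the whole video (evenly spaced indices)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices from the first frame to the last frame.
    indices = np.linspace(0, max(total - 1, 0), num=m, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```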
  • Training text data refers to data used to represent the semantic information of emotion labels. Its specific form is not limited. For example, it can be in text form or in vector form.
  • the emotion labels can be directly used as training text data, or the emotion labels can be mapped from text to vector to obtain the corresponding label vector, and the label vector can be determined as the training text data.
  • a preset text template can be obtained, and the text template and emotion labels are used to jointly generate training text data to further provide more semantic information.
  • the specific content of the text template is not limited; for example, it can be "The person seems to express the feeling of the [CLASS]" or "From this video, we can see that the person is [CLASS]", where the [CLASS] position is used to insert the emotion label.
  • multiple text templates can be preset to form a preset template library.
  • a target text template can be selected from the preset template library, which can be selected randomly or in sequence.
  • the specific vector mapping method is not limited. After the mapping is completed, the template vector and label vector are spliced to obtain the training text data. This method enables the model to adapt to various prompt sentence patterns.
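  • A minimal sketch of this template-and-label splicing is given below; the tokenizer and the embedding table are stand-ins introduced for illustration (in practice the text encoder's own tokenizer and embedding would be used), and the template strings are the examples quoted above.

```python
import random
import torch
import torch.nn as nn

# Assumed preset template library; "[CLASS]" marks where the emotion label is inserted.
TEMPLATES = [
    "The person seems to express the feeling of the [CLASS]",
    "From this video, we can see that the person is [CLASS]",
]

def build_training_text(label: str, tokenizer, embedding: nn.Embedding) -> torch.Tensor:
    """Select a target template, vector-map template and label, then splice the two vectors."""
    template = random.choice(TEMPLATES)                 # random selection from the template library
    prompt_part = template.replace("[CLASS]", "").strip()
    prompt_ids = torch.tensor(tokenizer(prompt_part))   # tokenizer: assumed to return a list of token ids
    label_ids = torch.tensor(tokenizer(label))
    template_vector = embedding(prompt_ids)             # template vector
    label_vector = embedding(label_ids)                 # label vector
    return torch.cat([template_vector, label_vector], dim=0)  # generalized text vector (training text data)
```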
  • S103 Input the training video frames, training text data and training audio into the initial model to obtain training text encoding data and training non-text encoding data.
  • After the training video frames and the training text data are obtained, they are input into the initial model together with the training audio, and the initial model encodes them to obtain training text encoded data that represents text features and training non-text encoded data that represents non-text features.
  • the training text encoding data is obtained based on the training text data, which can represent the emotional semantic characteristics of the emotional label.
  • Non-text features are obtained based on training video frames and training audio, which can characterize the emotional characteristics of images and sounds.
  • the initial model refers to the emotion recognition model that has not yet been trained. After iterative training and parameter adjustment, it improves its ability to extract features and then transforms into an emotion recognition model.
  • the specific type of the initial model is not limited, and any feasible neural network architecture can be used.
  • the initial model includes a text encoder, an image encoder and an audio encoder.
  • the text encoder is used to process training text data to obtain training text encoding data.
  • the image encoder and the audio encoder are used to process the training video frames and the training audio respectively, and their outputs are combined to obtain the training non-text encoded data.
  • a pooling network module and a temporal recursive network module can also be included in the initial model.
  • the output of the text encoder is the input of the pooling network module
  • the output of the image encoder is the input of the temporal recursive network module.
  • the time recursive network module can specifically be an LSTM (Long Short-Term Memory, long short-term memory network) network
  • the pooling network module is specifically used to perform temporal pooling operations on the output of the text encoder.
  • This embodiment does not limit the way in which the initial model obtains training text-encoded data and training non-text-encoded data.
  • the specific generation method is related to the model structure of the initial model.
  • the initial model is the above-mentioned structure including a text encoder, an image encoder, an audio encoder, a pooling network module and a temporal recursive network module
  • the training text can be input into the text encoder to obtain multiple initial text encodings, the number of initial text encodings is the same as the number of training video frames. Then multiple initial text encodings are input into the pooling network module to obtain training text encoding data.
  • the training video frame can be input into the image encoder to obtain multiple initial image encodings
  • the training audio can be input into the audio encoder to obtain the initial audio encoding
  • the multiple initial image encodings can be input into the temporal recursive network module to obtain the intermediate image encoding
  • the intermediate image encoding and the initial audio encoding are spliced to obtain training non-text encoding data.
  • the specific method of splicing is not limited, the initial audio encoding can be first, or the intermediate image encoding can be first.
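  • For illustration, a minimal sketch of such an initial model is given below. The three encoders are simple stand-ins (in the application they would be the CLIP text/image encoders and a pretrained audio encoder such as YAMNET), and the dimensions are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class InitialModel(nn.Module):
    """Sketch: text encoder -> temporal pooling; image encoder -> LSTM; audio encoder; splice."""

    def __init__(self, txt_dim=512, img_dim=512, aud_dim=1024, hid_dim=512):
        super().__init__()
        self.text_encoder = nn.Linear(txt_dim, hid_dim)    # stand-in for the CLIP text encoder
        self.image_encoder = nn.Linear(img_dim, hid_dim)   # stand-in for the CLIP image encoder
        self.audio_encoder = nn.Linear(aud_dim, hid_dim)   # stand-in for the pretrained audio encoder
        self.temporal_rnn = nn.LSTM(hid_dim, hid_dim, batch_first=True)  # temporal recursive network module

    def forward(self, text_feats, frame_feats, audio_feats):
        # text_feats, frame_feats: [B, M, *]; audio_feats: [B, aud_dim]
        initial_text = self.text_encoder(text_feats)            # M initial text encodings
        final_t = initial_text.mean(dim=1)                      # pooling network module (temporal pooling)
        initial_img = self.image_encoder(frame_feats)           # M initial image encodings
        _, (h_n, _) = self.temporal_rnn(initial_img)
        final_img = h_n[-1]                                     # intermediate image encoding (last LSTM state)
        audio_code = self.audio_encoder(audio_feats)            # initial audio encoding
        final_vid = torch.cat([final_img, audio_code], dim=-1)  # training non-text encoded data (spliced)
        # Note: a projection may be needed so that final_vid and final_t share one dimension
        # before computing cosine similarity; that detail is omitted in this sketch.
        return final_t, final_vid
```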
  • S104 Generate similarity data using training text-encoded data and training non-text-encoded data.
  • S105 Use the similarity data to generate a loss value, and adjust the parameters of the initial model based on the loss value.
  • This application converts the emotion recognition process from the original probability prediction problem into a similarity matching problem. Therefore, during training, similarity data is generated from the training text encoded data and the training non-text encoded data, and the similarity data is used to characterize the gap between them.
  • Since the emotion label, the training video and the training audio represent the same emotion, this gap can characterize the defects of the initial model in feature extraction, that is, the loss value. The parameters of the initial model can then be adjusted based on the loss value, so that the initial model learns how to accurately extract text-type emotional features and non-text-type emotional features.
  • the calculation method of the similarity data can be set as needed.
  • the training text-encoded data and the training non-text-encoded data are both in vector form.
  • the cosine similarity can be calculated as the similarity data.
  • the specific type of the loss value is not limited, for example, it can be a cross-entropy loss value.
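  • A minimal sketch of this cosine-similarity-plus-cross-entropy loss is shown below; the batching convention (one row per training sample, one column per emotion label) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def similarity_ce_loss(final_vid: torch.Tensor, final_t: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between non-text and text encodings, followed by cross-entropy.

    final_vid: [B, D] training non-text encoded data for each sample
    final_t:   [C, D] training text encoded data, one row per emotion label
    labels:    [B]    index of the ground-truth emotion label for each sample
    """
    sims = F.cosine_similarity(final_vid.unsqueeze(1), final_t.unsqueeze(0), dim=-1)  # [B, C]
    return F.cross_entropy(sims, labels)
```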
  • the audio encoder (or sound encoder) can use the YAMNET model, an audio event classifier trained on the AudioSet dataset (a large audio and video dataset).
  • the overall network architecture of YAMNET adopts MobileNet v1 (a depthwise separable convolution architecture), and the extracted audio feature dimension is 1024.
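  • For illustration, a sketch of extracting such a 1024-dimensional audio feature with the publicly released YAMNet model follows; the TensorFlow Hub URL and the averaging of per-patch embeddings into one clip-level feature are assumptions, not details stated in this application.

```python
import numpy as np
import tensorflow_hub as hub

# Load the released YAMNet model (MobileNet v1 backbone trained on AudioSet).
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def audio_embedding(waveform_16k: np.ndarray) -> np.ndarray:
    """Return a 1024-dimensional audio feature for a mono 16 kHz waveform scaled to [-1, 1]."""
    scores, embeddings, log_mel = yamnet(waveform_16k.astype(np.float32))
    # YAMNet yields one 1024-d embedding per ~0.96 s patch; average them into a clip-level feature.
    return embeddings.numpy().mean(axis=0)
```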
  • After the parameters are adjusted, it can be checked whether the training completion conditions are met. This check can be performed periodically, for example once after every several rounds of iterative training. If the training completion conditions are not met, step S101 is executed again and training continues; otherwise, step S106 is executed.
  • The training completion condition is a condition indicating that the training of the initial model can end; its number and content are not limited. For example, it can be a condition limiting the training duration, a condition limiting the number of training rounds, or a condition limiting the detection accuracy of the initial model. When one, some or all of the training completion conditions are met, the parameter-adjusted initial model can be determined as the emotion recognition model, indicating that training is complete.
  • Depending on the content of the training completion condition, the way of checking whether it is met differs. For example, when the training completion condition limits the training duration, it can be determined that the condition is met when the training duration is detected to reach a preset duration limit; when the training completion condition limits the number of training rounds, it can be determined that the condition is met when the number of training rounds is detected to reach a preset number of training times; when the training completion condition is an accuracy condition, test data can be used to test the accuracy of the parameter-adjusted initial model to obtain a test result, and if the test result is greater than a preset threshold, it is determined that the training completion condition is met.
  • the test data may include multiple sets of test sub-data, including target test sub-data.
  • the target test sub-data may be any set of test sub-data.
  • the target test sub-data includes target test video, target test audio, and target test tags.
  • target test video frames are determined in the target test video, and multiple target test text data are generated using each emotion label in the label set.
  • the target test text data corresponds to at least one text template. That is, when the number of text templates is multiple, each emotion tag can be used to cooperate with each text template to generate corresponding target test text data.
  • The target test video frames, target test text data and target test audio are input into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data, where each piece of target text encoded data corresponds one-to-one with each piece of target test text data. The test similarity data between the target non-text encoded data and each piece of target text encoded data is then calculated.
  • test similarity data is used to determine at least one maximum similarity data corresponding to at least one text template.
  • Each maximum similarity data represents the most reliable prediction obtained when that text template is used for emotion recognition.
  • The emotion labels corresponding to the at least one maximum similarity data are determined as the initial prediction results corresponding to the target test video, and maximum-count screening is performed on the initial prediction results to obtain the prediction result; that is, among the initial prediction results corresponding to the multiple text templates, the result that appears most often is taken as the prediction result.
  • the test sub-result corresponding to the target test sub-data is determined based on the prediction result and the target test label. If the two are the same, the test sub-result indicates that the prediction is correct, otherwise it is wrong.
  • the test results can be obtained by counting all test sub-results corresponding to the test data.
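  • A compact sketch of this accuracy test (per-prompt nearest-label selection, majority voting across prompts, then a correctness count) is shown below; the model.similarities helper is an assumed interface used purely for illustration.

```python
from collections import Counter

def evaluate(model, test_set, prompts, labels):
    """Accuracy over test sub-data: vote across prompts, compare with the target test label."""
    correct = 0
    for video_frames, audio, gt_label in test_set:
        per_prompt_preds = []
        for p in prompts:
            # Assumed helper: returns one similarity score per emotion label for this prompt.
            sims = model.similarities(video_frames, audio, p, labels)
            best = max(range(len(sims)), key=sims.__getitem__)   # label with maximum similarity
            per_prompt_preds.append(labels[best])
        prediction = Counter(per_prompt_preds).most_common(1)[0][0]  # maximum-count screening
        correct += int(prediction == gt_label)
    return correct / len(test_set)
```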
  • FIG. 2 is a flow chart of an emotion recognition method provided by an embodiment of the present application, including:
  • S202 Determine multiple video frames to be tested in the video to be tested, and use each emotion label in the label set to splice with the text template to be tested to generate text data to be tested corresponding to each emotion label.
  • S203 Input the video frame to be tested, the text data to be tested, and the audio to be tested into the emotion recognition model, and obtain the non-text encoding data to be tested and each text encoding data to be tested corresponding to each text data to be tested.
  • S204 Use the non-text encoding data to be tested and each text encoding data to be tested to generate similarity data to be tested.
  • S205 Determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
  • the emotion recognition model is obtained based on any of the above emotion recognition model training methods.
  • the label set includes various emotion labels, which may include some or all of the emotion labels used during the training process, and may also include emotion labels that have not been used during the training process. Since it is not possible to determine the specific emotion represented by the video to be tested when performing emotion recognition, each emotion label can be used to generate a corresponding text data to be tested. Wherein, if a text template is used to generate text data to be tested, each text data to be tested may use the same or different text template.
  • the process of generating text data to be tested can be as follows: selecting a text template to be tested from a preset template library; performing vector mapping processing on the text template to be tested and each emotion label respectively, to obtain a template vector to be tested and each label vector; The template vector is spliced with each label vector to obtain the text data to be tested.
  • the specific generation process is similar to the training process and will not be described in detail here.
  • the non-text encoding data to be tested corresponding to the video frame to be tested and the audio to be tested can be obtained, as well as the text encoding data to be tested corresponding to each text data to be tested.
  • the non-text encoding data to be tested and each text encoding data to be tested are used to generate the similarity data to be tested.
  • The multiple pieces of similarity data to be tested respectively characterize the similarity between the features of the video to be tested and each emotion label. The closest one, that is, the maximum similarity data to be tested, is selected, and its corresponding emotion label is taken as the emotion recognition result corresponding to the video to be tested.
  • Figure 3 is a specific data processing flow chart provided by an embodiment of the present application.
  • the target text template and emotion label are obtained, mapped into prompt embedding vectors and label embedding vectors respectively through text preprocessing, and vector splicing is used to generate generalized text vectors, that is, training text data.
  • the video is extracted to obtain training video frames, which are then input into the visual encoder.
  • the training audio is input into the sound encoder, and the data vectors of the visual encoder and the sound encoder are spliced to obtain training non-text encoding data.
  • y can be used to represent the label set of emotion labels
  • x can be used to represent the training video or the video to be tested.
  • the emotion label corresponding to the maximum similarity data to be tested can be expressed as y_pred, specifically:
  • y_pred = argmax_{i ∈ {1,…,C}} sim(f_vid(E_1(x)), f_txt([E_T(p); E_T(y_i)]))
  • argmax denotes taking the index of the maximum value, and sim(·,·) denotes the similarity (here the cosine similarity);
  • p represents the target text template;
  • f_vid represents the encoder on the video side; here the sound encoder, the visual encoder and the LSTM temporal module are combined as the video-side encoder;
  • f_vid(E_1(x)) represents the non-text encoded data to be tested;
  • f_txt represents the text encoder, so f_txt([E_T(p); E_T(y_i)]) represents the text encoded data to be tested corresponding to emotion label y_i;
  • C represents the number of emotion categories in the label set;
  • E_1 and E_T respectively represent video preprocessing (i.e., frame extraction) and text preprocessing (i.e., vector mapping).
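  • A minimal sketch of this argmax-over-similarities prediction is shown below; the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_emotion(final_vid: torch.Tensor, final_t: torch.Tensor, label_set: list) -> str:
    """y_pred = argmax_i sim(f_vid(E_1(x)), f_txt([E_T(p); E_T(y_i)])).

    final_vid: [D]    non-text encoded data of the video/audio to be tested
    final_t:   [C, D] text encoded data, one row per emotion label in the label set
    """
    sims = F.cosine_similarity(final_vid.unsqueeze(0), final_t, dim=-1)  # [C]
    return label_set[int(sims.argmax())]
```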
  • during training, a cross-entropy loss can be used, expressed as Loss; in the softmax-over-similarities form implied by the description, it is Loss = -log( exp(sim(f_vid(E_1(x)), f_txt([E_T(p); E_T(y)]))) / Σ_{i=1..C} exp(sim(f_vid(E_1(x)), f_txt([E_T(p); E_T(y_i)]))) ), where y is the emotion label corresponding to the training video.
  • the entire training process includes the following steps:
  • the label vector y (specifically referring to the vector of the emotion label corresponding to the training video) and the vector p are respectively subjected to text preprocessing, and then the text embedding vector t is synthesized through vector splicing.
  • the sound features output the sound encoding vector through the sound encoder, and are vector spliced with the final_img obtained in step f to obtain the final video encoding vector final_vid.
  • a. Input the face video.
  • the video is preprocessed and M frames of pictures are fixedly selected.
  • the vectors corresponding to each emotion label in the label vector set y are separately preprocessed with the vector p, and then the text embedding vector t is synthesized through vector splicing.
  • f_vid(E_1(x)) represents final_vid;
  • f_txt([E_T(p); E_T(y_i)]) represents final_t.
  • a. Input the face video.
  • the video is preprocessed and M frames of pictures are fixedly selected.
  • the vectors corresponding to each emotion label in the label vector set y are separately preprocessed with the vector p0, and then the text embedding vector t0 is synthesized through vector splicing.
  • f_vid(E_1(x)) represents final_vid;
  • f_txt([E_T(p); E_T(y_i)]) represents final_t0.
  • the emotion recognition process is converted from the original probability prediction problem to a similarity matching problem.
  • the semantic information contained in the label itself is introduced, which improves accuracy and also gives the model a certain zero-shot learning transfer capability.
  • this application uses emotion labels to generate training text data, and uses it to train the initial model, so that the initial model can learn the semantic information carried by the emotion labels.
  • the loss value is calculated through the similarity data and the parameters are adjusted so that the encoding process of the initial model focuses on reflecting the degree of similarity between text and non-text.
  • the similarity between the non-text encoding data of the video to be tested and the text encoding data to be tested corresponding to each emotion label is also used to determine the most similar emotion label and improve the accuracy of emotion recognition.
  • the emotion recognition model can distinguish it from other emotion labels based on the semantic information of the emotion label, giving it a certain zero-shot learning capability and improving model versatility.
  • the trained emotion recognition model can be applied to the recognition terminal.
  • the identification terminal may include a processor, a detection component and a display screen, and of course may also include an input component.
  • the processor is connected to the detection component, the input component and the display screen respectively.
  • the processor can obtain the video to be tested and the audio to be tested; determine multiple video frames to be tested in the video to be tested, and splice each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label; input the video frames to be tested, the text data to be tested and the audio to be tested into the emotion recognition model to obtain the non-text encoded data to be tested and the text encoded data to be tested corresponding to each piece of text data to be tested; generate the similarity data to be tested from the non-text encoded data to be tested and each piece of text encoded data to be tested; and determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested. After the emotion recognition result is obtained, it can be displayed on the display screen.
  • the detection components may include detection interfaces and collection components (such as cameras and microphones).
  • the input component may include an input interface and an input keyboard.
  • the input keyboard may facilitate the user to input relevant instructions or data to the identification terminal.
  • a wireless transmission module can also be set on the identification terminal.
  • the wireless transmission module can be a Bluetooth module or a wifi module, etc.
  • FIG. 4 is a schematic structural diagram of an identification terminal provided by an embodiment of the present application.
  • the identification terminal may include a processor, a display screen 41, an input interface 42, an input keyboard 43, a detection interface 44, a camera 45, a microphone 46, and a wireless transmission module 47 .
  • the input keyboard 43 may be a soft keyboard presented on the display screen 41 .
  • the input interface 42 can be used to connect external devices. There may be multiple input interfaces; in FIG. 4, one input interface is taken as an example.
  • the detection interface 44 is connected to the collection components (the camera 45 and the microphone 46).
  • the processor is embedded inside the identification terminal and is therefore not shown in FIG. 4.
  • the identification terminal can be a smart phone, a tablet computer, a notebook computer or a desktop computer.
  • the form of the identification terminal is not limited.
  • the input interface 42 can be connected to an external device through a data cable, and the input keyboard 43 can be a soft keyboard presented on the display interface.
  • the input interface 42 may be a USB interface for connecting external devices such as a USB flash drive, and the input keyboard 43 may be a hard keyboard.
  • the user can import the video and audio to be tested into a USB flash drive, and insert the USB flash drive into the input interface 42 of the identification terminal.
  • the identification terminal then determines multiple video frames to be tested in the video to be tested, and splices each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label;
  • inputs the video frames to be tested, the text data to be tested and the audio to be tested into the emotion recognition model to obtain the non-text encoded data to be tested and the text encoded data to be tested corresponding to each piece of text data to be tested;
  • generates the similarity data to be tested from the non-text encoded data to be tested and each piece of text encoded data to be tested, determines the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested, and displays the recognition result on the display screen 41.
  • the functional modules included in the identification terminal in Figure 4, such as the display screen 41, the input interface 42, the input keyboard 43, the detection interface 44, the camera 45, the microphone 46 and the wireless transmission module 47, are only examples. In actual applications, based on actual needs, the identification terminal may contain more or fewer functional modules, and this is not limited.
  • the emotion recognition method provided by the embodiments of this application can be deployed in software platforms based on FPGA (Field Programmable Gate Array) neural network acceleration applications or AI (Artificial Intelligence) acceleration chips. It should be noted that the method in the embodiments of this application can also be applied to LSTM (Long Short-Term Memory) based time-series data processing, such as multi-target tracking and other scenarios.
  • the emotion recognition device provided by the embodiment of the present application is introduced below.
  • the emotion recognition device described below and the emotion recognition model training method described above can be mutually referenced.
  • Figure 5 is a schematic structural diagram of an emotion recognition device provided by an embodiment of the present application, including:
  • the test acquisition module 51 is used to obtain the video to be tested and the audio to be tested;
  • the data processing module 52 to be tested is used to determine multiple video frames to be tested in the video to be tested, and use each emotion label in the label set to splice with the text template to be tested to generate text data to be tested corresponding to each emotion label;
  • the input module 53 to be tested is used to input the video frame to be tested, the text data to be tested and the audio to be tested into the emotion recognition model to obtain the non-text coded data to be tested and each text coded data to be tested corresponding to each text data to be tested;
  • the similarity to be tested generation module 54 is used to generate the similarity data to be tested using the non-text encoding data to be tested and the text encoding data to be tested respectively;
  • the recognition result determination module 55 is used to determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
  • the data processing module 52 to be tested includes:
  • the test template determination unit is used to select the text template to be tested from the preset template library
  • the vector mapping unit to be tested is used to perform vector mapping processing on the text template to be tested and each emotion label respectively, to obtain a template vector to be tested and each label vector;
  • the splicing unit to be tested is used to splice the template vector with each label vector to obtain the text data to be tested.
  • Training acquisition module used to acquire training videos, training audios and emotion labels
  • a training data processing module used to determine multiple training video frames in the training video and generate training text data using emotion labels
  • the training input module is used to input training video frames, training text data and training audio into the initial model to obtain training text encoding data and training non-text encoding data;
  • the training similarity generation module is used to generate similarity data using training text-encoded data and training non-text-encoded data;
  • the parameter adjustment module is used to generate loss values using similarity data, and adjust parameters of the initial model based on the loss values;
  • the model determination module is used to determine the initial model after parameter adjustment as the emotion recognition model if it is detected that the training completion conditions are met.
  • the initial model includes a text encoder, an image encoder and an audio encoder, and also includes a pooling network module and a temporal recursive network module.
  • the output of the text encoder is the input of the pooling network module, and the output of the image encoder is the input of the temporal recursive network module.
  • the training input module includes:
  • the training text encoding unit is used to input training text into the text encoder to obtain multiple initial text encodings
  • the training pooling processing unit is used to input multiple initial text encodings into the pooling network module to obtain training text encoding data;
  • the training audio coding unit is used to input training video frames into the image encoder to obtain multiple initial image codes, and input the training audio into the audio encoder to obtain initial audio codes;
  • the training image coding unit is used to input multiple initial image codes into the time recursive network module to obtain intermediate image codes
  • the training splicing unit is used to splice the intermediate image encoding and the initial audio encoding to obtain training non-text encoding data.
  • the text encoder and image encoder belong to the language-image comparison learning pre-training model, and the audio encoder has been pre-trained;
  • Parameter adjustment module including:
  • the partial adjustment unit is used to adjust the parameters of the pooling network module and the time recursive network module in the initial model based on the loss value.
  • the training data processing module includes:
  • the target template selection unit is used to select a target text template from the preset template library
  • the vector mapping unit is used to perform vector mapping processing on the target text template and emotion labels to obtain template vectors and label vectors;
  • the text vector splicing unit is used to splice template vectors and label vectors to obtain training text data.
  • the model determination module includes:
  • the test unit is used to use test data to test the accuracy of the initial model after parameter adjustment and obtain test results;
  • the determination unit is used to determine that the training completion condition is met if the test result is greater than the preset threshold.
  • test data includes multiple sets of test sub-data, including target test sub-data, and the target test sub-data includes target test video, target test audio and target test label;
  • Test unit including:
  • the test data processing subunit is used to determine multiple target test video frames in the target test video, and generate multiple target test text data using each emotion tag in the tag set; wherein the target test text data corresponds to at least one text template ;
  • the test input subunit is used to adjust the initial model of the target test video frame, target test text data and target test audio input parameters to obtain target non-text encoded data and multiple target text encoded data;
  • the test calculation subunit is used to calculate the test similarity data between the target non-text encoded data and each target text encoded data, and use the test similarity data to determine at least one maximum similarity data corresponding to at least one text template respectively;
  • the prediction result determination subunit is used to determine at least one emotion label corresponding to the maximum similarity data as the initial prediction result corresponding to the target test video, and to perform a maximum number of screening on the initial prediction results to obtain the prediction result;
  • the sub-result determination sub-unit is used to determine the test sub-result corresponding to the target test sub-data based on the prediction result and the target test label;
  • the statistics subunit is used to count all test sub-results corresponding to the test data and obtain the test results.
  • the electronic device provided by the embodiment of the present application is introduced below.
  • the electronic device described below and the emotion recognition model training method and/or the emotion recognition method described above can correspond to each other and refer to each other.
  • the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
  • the processor 101 is used to control the overall operation of the electronic device 100 to complete the above-mentioned emotion recognition model training method and/or all or part of the steps in the emotion recognition method;
  • the memory 102 is used to store various types of data to support operation on the electronic device 100; such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data.
  • the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, or one or more of magnetic disks and optical disks.
  • Multimedia components 103 may include screen and audio components.
  • the screen may be a touch screen, for example, and the audio component is used to output and/or input audio signals.
  • the audio component may include a microphone for receiving external audio signals.
  • the received audio signals may be further stored in memory 102 or sent via communication component 105 .
  • the audio component also includes at least one speaker for outputting audio signals.
  • the I/O interface 104 provides an interface between the processor 101 and other interface modules.
  • the other interface modules may be keyboards, mice, buttons, etc. These buttons can be virtual buttons or physical buttons.
  • the communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi part, a Bluetooth part and an NFC part.
  • the electronic device 100 may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for executing the methods of the above embodiments.
  • the non-volatile readable storage medium provided by the embodiment of the present application is introduced below.
  • the non-volatile readable storage medium described below and the emotion recognition model training method and/or the emotion recognition method described above can correspond to each other and be referred to in conjunction.
  • This application also provides a non-volatile readable storage medium.
  • a computer program is stored on the non-volatile readable storage medium;
  • when the computer program is executed by a processor, the steps of the above emotion recognition model training method and/or the above emotion recognition method are implemented.
  • the storage medium can include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An emotion recognition method, apparatus, device and storage medium, applied to the field of neural network technology. The emotion recognition method includes: obtaining a video to be tested and audio to be tested (S201); determining multiple video frames to be tested in the video to be tested, and splicing each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label (S202); inputting the video frames to be tested, the text data to be tested and the audio to be tested into an emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested (S203); generating similarity data to be tested from the non-text encoded data to be tested and each piece of text encoded data to be tested (S204); and determining the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result of the video to be tested (S205). The method introduces the semantic information contained in the labels themselves and improves accuracy.

Description

Emotion recognition method, apparatus, device and storage medium
Cross-reference to related application
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 30, 2022, with application number 202210760941.X and titled "情绪识别方法、装置、设备及存储介质" (Emotion recognition method, apparatus, device and storage medium), the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of neural network technology, and in particular to an emotion recognition method, apparatus, electronic device and non-volatile readable storage medium.
Background
With the maturity of face recognition technology, finding the faces of people of interest in pictures or videos is now a relatively mature technique, so current research on emotion recognition focuses on facial emotion recognition. Researchers usually divide facial emotion recognition into static facial emotion recognition and dynamic facial emotion recognition: the former identifies a person's emotion from a single face image, while the latter identifies it from moving images or video. Since the expression of emotion is a dynamic process, it is sometimes difficult to determine a person's true emotion from a single picture alone. However, current dynamic facial emotion recognition methods have poor recognition accuracy and lack the transfer ability of zero-shot learning.
Summary
In view of this, the purpose of this application is to provide an emotion recognition method, apparatus, electronic device and non-volatile readable storage medium, so as to improve emotion recognition accuracy and model versatility.
To solve the above technical problem, this application provides an emotion recognition method, including:
obtaining a video to be tested and audio to be tested;
determining multiple video frames to be tested in the video to be tested, and splicing each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label;
inputting the video frames to be tested, the text data to be tested and the audio to be tested into an emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested;
generating similarity data to be tested from the non-text encoded data to be tested and each piece of text encoded data to be tested respectively;
determining the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
Optionally, splicing each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label includes:
selecting the text template to be tested from a preset template library;
performing vector mapping on the text template to be tested and each emotion label respectively, to obtain one template vector to be tested and each label vector;
splicing the template vector with each label vector respectively, to obtain the text data to be tested.
Optionally, the training process of the emotion recognition model includes:
obtaining a training video, training audio and an emotion label;
determining multiple training video frames in the training video, and generating training text data using the emotion label;
inputting the training video frames, the training text data and the training audio into an initial model to obtain training text encoded data and training non-text encoded data;
generating similarity data from the training text encoded data and the training non-text encoded data;
generating a loss value from the similarity data, and adjusting the parameters of the initial model based on the loss value;
if it is detected that a training completion condition is met, determining the parameter-adjusted initial model as the emotion recognition model.
Optionally, the initial model includes a text encoder, an image encoder and an audio encoder, and further includes a pooling network module and a temporal recursive network module; the output of the text encoder is the input of the pooling network module, and the output of the image encoder is the input of the temporal recursive network module.
Optionally, inputting the training video frames, the training text data and the training audio into the initial model to obtain the training text encoded data and the training non-text encoded data includes:
inputting the training text into the text encoder to obtain multiple initial text encodings;
inputting the multiple initial text encodings into the pooling network module to obtain the training text encoded data;
inputting the training video frames into the image encoder to obtain multiple initial image encodings, and inputting the training audio into the audio encoder to obtain an initial audio encoding;
inputting the multiple initial image encodings into the temporal recursive network module to obtain an intermediate image encoding;
splicing the intermediate image encoding and the initial audio encoding to obtain the training non-text encoded data.
Optionally, the text encoder and the image encoder belong to a language-image contrastive learning pre-training model, and the audio encoder has been pre-trained.
Optionally, adjusting the parameters of the initial model based on the loss value includes:
adjusting the parameters of the pooling network module and the temporal recursive network module in the initial model based on the loss value.
Optionally, generating the training text data using the emotion label includes:
selecting a target text template from the preset template library;
performing vector mapping on the target text template and the emotion label to obtain a template vector and a label vector;
splicing the template vector and the label vector to obtain the training text data.
Optionally, detecting that the training completion condition is met includes:
using test data to test the accuracy of the parameter-adjusted initial model to obtain a test result;
if the test result is greater than a preset threshold, determining that the training completion condition is met.
Optionally, the test data includes multiple sets of test sub-data, including target test sub-data, and the target test sub-data includes a target test video, target test audio and a target test label.
Optionally, using the test data to test the accuracy of the parameter-adjusted initial model to obtain the test result includes:
determining multiple target test video frames in the target test video, and generating multiple pieces of target test text data using each emotion label in the label set, where the target test text data corresponds to at least one text template;
inputting the target test video frames, the target test text data and the target test audio into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data;
calculating test similarity data between the target non-text encoded data and each piece of target text encoded data, and using the test similarity data to determine at least one maximum similarity data corresponding to the at least one text template respectively;
determining the emotion labels corresponding to the at least one maximum similarity data as initial prediction results corresponding to the target test video, and performing maximum-count screening on the initial prediction results to obtain a prediction result;
determining a test sub-result corresponding to the target test sub-data based on the prediction result and the target test label;
counting all test sub-results corresponding to the test data to obtain the test result.
Optionally, detecting that the training completion condition is met includes:
determining that the training completion condition is met when it is detected that the training duration reaches a preset duration limit;
or determining that the training completion condition is met when it is detected that the number of training rounds reaches a preset number of training times.
This application also provides an emotion recognition apparatus, including:
a test acquisition module, used to obtain the video to be tested and the audio to be tested;
a test data processing module, used to determine multiple video frames to be tested in the video to be tested, and to splice each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label;
a test input module, used to input the video frames to be tested, the text data to be tested and the audio to be tested into the emotion recognition model to obtain the non-text encoded data to be tested and the text encoded data to be tested corresponding to each piece of text data to be tested;
a test similarity generation module, used to generate the similarity data to be tested from the non-text encoded data to be tested and each piece of text encoded data to be tested;
a recognition result determination module, used to determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
This application also provides an electronic device, including a memory and a processor, wherein:
the memory is used to store a computer program;
the processor is used to execute the computer program to implement the above emotion recognition method.
This application also provides a non-volatile readable storage medium for storing a computer program, wherein, when executed by a processor, the computer program implements the above emotion recognition method.
In the emotion recognition method provided by this application, a video to be tested and audio to be tested are obtained; multiple video frames to be tested are determined in the video to be tested, and each emotion label in the label set is spliced with the text template to be tested to generate the text data to be tested corresponding to each emotion label; the video frames to be tested, the text data to be tested and the audio to be tested are input into the emotion recognition model to obtain the non-text encoded data to be tested and the text encoded data to be tested corresponding to each piece of text data to be tested; the similarity data to be tested is generated from the non-text encoded data to be tested and each piece of text encoded data to be tested; and the emotion label corresponding to the maximum similarity data to be tested is determined as the emotion recognition result corresponding to the video to be tested.
It can be seen that this method converts the emotion recognition process from the original probability prediction problem into a similarity matching problem and introduces the semantic information contained in the labels themselves, which not only improves accuracy but also gives the model a certain zero-shot learning transfer ability. Specifically, when recognizing emotions, this application uses the various emotion labels together with the same text template to be tested to generate multiple pieces of text data to be tested. The emotion recognition model has been trained and can learn the semantic information carried by the emotion labels; by generating the similarity between the non-text encoded data to be tested of the video and the text encoded data to be tested corresponding to each emotion label, the maximum similarity data to be tested is selected and the most similar emotion label is determined, improving the accuracy of emotion recognition. At the same time, even if an emotion label not involved in training is added at application time, the emotion recognition model can distinguish it from other emotion labels based on its semantic information, giving the model a certain zero-shot learning capability and improving model versatility.
In addition, this application also provides an apparatus, an electronic device and a non-volatile readable storage medium, which likewise have the above beneficial effects.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of this application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of this application, and for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Figure 1 is a flow chart of an emotion recognition model training method provided by an embodiment of this application;
Figure 2 is a flow chart of an emotion recognition method provided by an embodiment of this application;
Figure 3 is a specific data processing flow chart provided by an embodiment of this application;
Figure 4 is a schematic structural diagram of an identification terminal provided by an embodiment of this application;
Figure 5 is a schematic structural diagram of an emotion recognition apparatus provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
At present, the main approach to dynamic facial emotion recognition is to use multi-modal fusion of visual and audio information. That is, the visual images and the audio in a video are extracted with separate feature extractors, fused with a feature fusion network, and finally used to predict a fixed set of predefined emotion categories. However, this approach completely ignores the semantic information contained in the emotion labels themselves and instead maps the emotion labels directly to a fixed number of category indexes (numbers). This not only limits the versatility of the model, leaving it without the transfer/prediction ability of zero-shot learning and requiring additional training data before the model can be applied to new scenarios, but also leads to low emotion recognition accuracy.
This application draws on the way humans recognize emotions: when watching a video, people associate the visual features in the video (whether seen before or not) with the features of natural language in their minds, rather than with numbers/indexes. Therefore, this application adopts an unconventional training method that mines the semantic information of the label text during training and associates it with the corresponding video features. This not only enhances the semantics of the video representation and improves recognition accuracy, but also gives the model a certain zero-shot learning transfer ability.
Specifically, please refer to Figure 1, which is a flow chart of an emotion recognition model training method provided by an embodiment of this application. The method includes:
S101: Obtain a training video, training audio and an emotion label.
It should be noted that each step in this application can be completed by a designated electronic device, which can take any form such as a server or a computer. The number of electronic devices can be one or more; that is, all steps can be executed by a single electronic device, or multiple electronic devices can each execute some of the steps and cooperate to complete the process of model training and/or emotion recognition.
The training video, training audio and emotion label correspond to one another. The training video is a video that records changes in facial emotion. The training audio is audio corresponding to the training video, which usually records sounds corresponding to the facial emotion changes recorded in the training video, such as crying or laughter. The emotion label is the text name corresponding to the emotion expressed by the training video and training audio, such as happy, angry, sad or fear.
S102: Determine multiple training video frames in the training video, and generate training text data using the emotion label.
A training video frame can be any video frame in the training video, and the number of training video frames is multiple, for example M, where M is a fixed positive integer. Multiple training video frames can characterize the emotional changes of the face in the training video along the temporal direction. The way the training video frames are determined is not limited: in one implementation, training video frames can be extracted from the first frame of the training video at a preset time interval; in another implementation, the number of training video frames can be determined first, and the training video is then sampled at evenly spaced intervals based on that number to obtain the training video frames.
Training text data is data used to represent the semantic information of the emotion label, and its specific form is not limited; for example, it can be in text form or in vector form. In one implementation, the emotion label can be used directly as the training text data, or the emotion label can be mapped from text to vector to obtain a corresponding label vector, which is determined as the training text data. In another implementation, a preset text template (prompt) can be obtained, and the text template and the emotion label can be used jointly to generate the training text data so as to provide additional semantic information. The specific content of the text template is not limited; for example, it can be "The person seems to express the feeling of the [CLASS]" or "From this video, we can see that the person is [CLASS]", where the [CLASS] position is used to insert the emotion label.
In another implementation, since different prompt sentence patterns may cause the model to learn different semantic information, multiple text templates can be preset to form a preset template library so that the choice of text template does not affect the training effect. When generating the training text data, a target text template can be selected from the preset template library, either randomly or in sequence. Vector mapping is performed on the target text template and the emotion label respectively to obtain a template vector and a label vector; the specific vector mapping method is not limited. After the mapping is completed, the template vector and the label vector are spliced to obtain the training text data. This approach enables the model to adapt to various prompt sentence patterns.
S103: Input the training video frames, the training text data and the training audio into the initial model to obtain training text encoded data and training non-text encoded data.
After the training video frames and the training text data are obtained, they are input into the initial model together with the training audio, and the initial model encodes them to obtain training text encoded data representing text features and training non-text encoded data representing non-text features. The training text encoded data is obtained from the training text data and can characterize the emotional semantic features of the emotion label. The non-text features are obtained from the training video frames and the training audio and can characterize the emotional features conveyed by the images and the sound.
The initial model is the emotion recognition model that has not yet finished training; after iterative training and parameter adjustment, its feature extraction ability improves and it becomes the emotion recognition model. The specific type of the initial model is not limited, and any feasible neural network architecture can be used. In one feasible implementation, the initial model includes a text encoder, an image encoder and an audio encoder: the text encoder processes the training text data to obtain the training text encoded data, while the image encoder and the audio encoder process the training video frames and the training audio respectively, and together produce the training non-text encoded data. In another implementation, to extract temporal information and thereby improve recognition accuracy, the initial model can also include a pooling network module and a temporal recursive network module, where the output of the text encoder is the input of the pooling network module and the output of the image encoder is the input of the temporal recursive network module. The temporal recursive network module can specifically be an LSTM (Long Short-Term Memory) network, and the pooling network module is specifically used to perform temporal pooling on the output of the text encoder.
This embodiment does not limit the way in which the initial model obtains the training text encoded data and the training non-text encoded data; the specific generation method depends on the model structure of the initial model. In one implementation, if the initial model has the above structure with a text encoder, image encoder, audio encoder, pooling network module and temporal recursive network module, the training text can be input into the text encoder to obtain multiple initial text encodings, the number of which is the same as the number of training video frames; the multiple initial text encodings are then input into the pooling network module to obtain the training text encoded data. In addition, the training video frames can be input into the image encoder to obtain multiple initial image encodings, and the training audio can be input into the audio encoder to obtain an initial audio encoding; the multiple initial image encodings are then input into the temporal recursive network module to obtain an intermediate image encoding; finally, the intermediate image encoding and the initial audio encoding are spliced to obtain the training non-text encoded data. The specific splicing order is not limited: the initial audio encoding can come first, or the intermediate image encoding can come first.
S104:利用训练文本编码数据和训练非文本编码数据生成相似度数据。
S105:利用相似度数据生成损失值,并基于损失值对初始模型进行参数调节。
为了便于说明,将S104和S105两个步骤合并说明。
本申请将情绪识别过程由原本的概率预测问题转换为了相似匹配问题,因此在进行训练时,通过利用训练文本编码数据和训练非文本编码数据生成相似度数据,利用相似度数据来表征训练文本编码数据和训练非文本编码数据之间的差距。由于情绪标签和训练视频、训练音频表征了相同的情绪,因此该差距即可表征初始模型在特征提取方面的缺陷,即损失值,进而可以基于损失值对初始模型进行参数调节,使得初始模型学习到该如何准确提取文本类型的情绪特征和非文本类型的情绪特征。
相似度数据的计算方式可以根据需要设定,例如在一种实施方式中,训练文本编码数据和训练非文本编码数据均为向量形式,此时可以计算余弦相似度作为相似度数据。损失值的具体类型也不做限定,例如可以为交叉熵损失值。
在进行参数调节时,可以根据需要对整个初始模型进行参数调节,或者对其中的部分进行参数调节。例如在一种实施方式中,若初始模型为上述的包括文本编码器、图像编码器、音频编码器、池化网络模块和时间递归网络模块的结构,文本编码器和图像编码器可以属于语言图像对比学习预训练模型,音频编码器也被预训练完毕,此时在参数调节时,可以基于损失值对初始模型中的池化网络模块和时间递归网络模块进行参数调节。语言图像对比学习预训练模型即为CLIP(Contrastive Language-Image Pre-Training)模型,经过大规模预训练的处理,其已经具备了较优的模型参数,无需继续调参。音频编码器(或称为声音编码器)可以采用的是YAMNET模型,该模型是在AudioSet数据集(一个大型音频、视频数据集)上训练的音频事件分类器。YAMNET整体网络架构采用MobileNet v1(深度可分离卷积架构),提取声音的特征维度为1024维。
在参数调节完毕后,可以检测是否满足训练完成条件,该检测可以周期执行,例如每完成若干轮迭代训练后检测一次。若不满足训练完成条件,则继续执行S101步骤,继续进行训练,否则执行S106步骤。
S106:若检测到满足训练完成条件,则将参数调节后的初始模型确定为情绪识别模型。
训练完成条件,是指表示对初始模型的训练可以结束的条件,其数量和内容不做限定,例如可以为对训练时长进行限制的条件,或者可以为对训练轮数进行限制的条件,或者可以为对初始模型的检测准确率进行限制的条件。在一个、部分或全部的训练完成条件被满足时,可以将参数调节后的初始模型确定为情绪识别模型,表征训练完毕。
可以理解的是,根据训练完成条件的内容不同,检测是否满足的方式不同。例如当训练完成条件为对训练时长进行限制的条件,则可以在检测到训练时长达到预设时长限值的情况下,确定出满足训练完成条件;当训练完成条件可以为对训练轮数进行限制的条件,则可以在检测到训练轮数达到预设训练次数的情况下,确定满足训练完成条件;当训练完成条件为准确率条件时,可以利用测试数据对参数调节后的初始模型进行准确率测试,得到测试结果,若测试结果大于预设阈值,则确定满足训练完成条件。
Specifically, the test data may include multiple groups of test sub-data, among which is target test sub-data; the target test sub-data may be any group of test sub-data and includes a target test video, target test audio, and a target test label. During testing, multiple target test video frames are determined in the target test video, and multiple pieces of target test text data are generated using the emotion labels in a label set. It should be noted that the target test text data corresponds to at least one text template; that is, when there are multiple text templates, each emotion label may be combined with each text template to generate the corresponding target test text data. The target test video frames, the target test text data, and the target test audio are input into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data, where each piece of target text encoded data corresponds one-to-one to a piece of target test text data. The test similarity data between the target non-text encoded data and each piece of target text encoded data is then computed.
A larger test similarity value indicates greater similarity. Since the maximum similarity value indicates the closest match, the test similarity data is used to determine, for each of the at least one text template, a corresponding maximum similarity value; each maximum similarity value represents the most reliable prediction obtained when that text template is used for emotion recognition. The emotion labels corresponding to the at least one maximum similarity value are determined as the initial prediction results of the target test video, and maximum-count filtering is applied to the initial prediction results to obtain the prediction result; that is, among the initial prediction results of the multiple text templates, the result that occurs most often is taken as the prediction result, as in the voting sketch below. The test sub-result of the target test sub-data is then determined based on the prediction result and the target test label: if they are the same, the test sub-result indicates a correct prediction, otherwise an incorrect one. The test result is obtained by aggregating all test sub-results of the test data.
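A sketch of the maximum-count filtering (a per-template prediction followed by a vote across templates), assuming a hypothetical helper compute_similarities that returns one similarity value per emotion label for a given template:

```python
from collections import Counter
from typing import Callable, List, Sequence

def predict_with_voting(compute_similarities: Callable[[str, Sequence[str]], List[float]],
                        templates: Sequence[str],
                        labels: Sequence[str]) -> str:
    """For each template pick the most similar label, then majority-vote over templates."""
    votes = []
    for template in templates:
        sims = compute_similarities(template, labels)          # one similarity per label
        best = max(range(len(labels)), key=lambda i: sims[i])  # label with maximum similarity
        votes.append(labels[best])                             # per-template initial prediction
    return Counter(votes).most_common(1)[0][0]                 # maximum-count filtering
```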
After the emotion recognition model is obtained, it can be used for emotion recognition. Please refer to FIG. 2, which is a flowchart of an emotion recognition method provided by an embodiment of this application, including:
S201: Obtain a video to be tested and audio to be tested.
S202: Determine multiple video frames to be tested in the video to be tested, and splice each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label.
S203: Input the video frames to be tested, the text data to be tested, and the audio to be tested into the emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested.
S204: Generate similarity data to be tested using the non-text encoded data to be tested and each piece of text encoded data to be tested.
S205: Determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result of the video to be tested.
The emotion recognition model is obtained by any of the emotion recognition model training methods described above. In practical applications, the label set contains various emotion labels, which may include some or all of the emotion labels used during training and may also include emotion labels not used during training. Since the emotion expressed by the video to be tested cannot be determined in advance, each emotion label is used to generate one corresponding piece of text data to be tested. If a text template is used to generate the text data to be tested, each piece may use the same or a different text template. Specifically, the generation process may be: selecting the text template to be tested from the preset template library; performing vector mapping on the text template to be tested and each emotion label to obtain one template vector to be tested and each label vector; and concatenating the template vector with each label vector to obtain the text data to be tested. This process is similar to that of training and is not repeated here.
After processing by the emotion recognition model, the non-text encoded data to be tested corresponding to the video frames and audio to be tested is obtained, as well as the text encoded data to be tested corresponding to each piece of text data to be tested. Similarity data to be tested is generated from the non-text encoded data and each piece of text encoded data; the resulting multiple similarity values represent the similarity between the features expressed by the video to be tested and each emotion label. The closest one, i.e., the maximum similarity value, is selected, and its emotion label is taken as the emotion recognition result of the video to be tested, as in the inference sketch below.
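A minimal inference sketch under the same assumptions as above: the non-text encoding of the video to be tested is compared with the text encoding of every label in the label set, and the label with the maximum cosine similarity is returned. Tensor shapes and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def recognize_emotion(final_vid: torch.Tensor,
                      label_text_encodings: torch.Tensor,
                      labels: list) -> str:
    """final_vid: (D,); label_text_encodings: (C, D); labels: C label strings."""
    v = F.normalize(final_vid, dim=-1)
    t = F.normalize(label_text_encodings, dim=-1)
    sims = t @ v                              # cosine similarity with each label
    return labels[int(torch.argmax(sims))]    # emotion label with the maximum similarity
```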
Please refer to FIG. 3, which is a specific data processing flowchart provided by an embodiment of this application. During training, a target text template and an emotion label are obtained and mapped by text preprocessing into a prompt embedding vector and a label embedding vector respectively, which are concatenated into a generalized text vector, i.e., the training text data. The generalized text vector is input into the text encoder of a CLIP model built from CLIP pre-trained weights to obtain the training text encoded data. In addition, frames are extracted from the video to obtain training video frames, which are input into the visual encoder, while the training audio is input into the sound encoder; the vectors output on the visual side and the sound side are concatenated to obtain the training non-text encoded data. The similarity between the training text encoded data and the training non-text encoded data is computed, and the cross-entropy loss is generated based on the similarity.
In this application, let y denote the label set of emotion labels and x denote the training video or the video to be tested. The emotion label corresponding to the maximum similarity data to be tested can then be expressed as y_pred, specifically:

$y_{pred} = \arg\max_{i \in \{1,\dots,C\}} \; \mathrm{sim}\big(f_{vid}(E_1(x)),\ f_{txt}([E_T(p); E_T(y_i)])\big)$

where argmax selects the label with the maximum similarity, sim denotes the similarity measure (e.g., cosine similarity), and p denotes the target text template. f_vid denotes the video-side encoder; here the sound encoder, the visual encoder, and the LSTM temporal module are combined as the video-side encoder, so f_vid(E_1(x)) denotes the non-text encoded data to be tested. f_txt denotes the text encoder, so f_txt([E_T(p); E_T(y_i)]) denotes the text encoded data to be tested for label y_i. C denotes the number of emotion categories in the label set. E_1 and E_T denote video preprocessing (i.e., frame extraction) and text preprocessing (i.e., vector mapping), respectively.
During training, a cross-entropy loss, denoted Loss, may be used. Written over the similarity scores of the C candidate labels, it takes the form:

$Loss = -\log \dfrac{\exp\big(\mathrm{sim}(f_{vid}(E_1(x)),\ f_{txt}([E_T(p); E_T(y_{gt})]))\big)}{\sum_{i=1}^{C} \exp\big(\mathrm{sim}(f_{vid}(E_1(x)),\ f_{txt}([E_T(p); E_T(y_i)]))\big)}$

where y_gt denotes the ground-truth emotion label corresponding to the training video.
The whole training process includes the following steps:
a. Input a face video; after preprocessing, a fixed number of M frame images are selected.
b. Sample a prompt from the manually defined prompt set, denoted p.
c. The label vector y (here specifically the vector of the emotion label corresponding to the training video) and the vector p each undergo text preprocessing and are then concatenated into a text embedding vector t.
d. Input the text embedding vector t and the M frame images into the text encoder and the visual encoder to obtain M temporal text features and M temporal image features. The text encoder and the visual encoder load the VIT-CLIP large-scale pre-trained weights.
e. The M temporal text features are pooled along the temporal dimension to obtain the final text encoding vector final_t.
f. The M temporal image features pass through the LSTM model, and the feature of the last node is taken as the final image encoding feature final_img.
g. The sound features pass through the sound encoder to output a sound encoding vector, which is concatenated with final_img from step f to obtain the final video encoding vector final_vid.
h. Compute the cosine similarity between the text encoding vector final_t and final_vid, compute the cross-entropy loss, and use the loss to adjust the parameters of the pooling network module used for pooling and of the LSTM model.
During testing, the following steps may be performed:
a. Input a face video; after preprocessing, a fixed number of M frame images are selected.
b. Denote the manually defined prompt set as P and each prompt in it as p; steps c to h are performed for every p.
c. The vectors corresponding to the emotion labels in the label vector set y each undergo text preprocessing together with the vector p and are then concatenated into text embedding vectors t.
d. Input the text embedding vectors t and the M frame images into the text encoder and the visual encoder to obtain M temporal text features and M temporal image features. The text encoder and the visual encoder load the VIT-CLIP large-scale pre-trained weights.
e. The M temporal text features are pooled along the temporal dimension to obtain the final text encoding vector final_t.
f. The M temporal image features pass through the LSTM model, and the feature of the last node is taken as the final image encoding feature final_img.
g. The sound features pass through the sound encoder to output a sound encoding vector, which is concatenated with final_img from step f to obtain the final video encoding vector final_vid.
h. For each p, select the emotion category of the video according to the following formula:

$y_{pred} = \arg\max_{i \in \{1,\dots,C\}} \; \mathrm{sim}\big(f_{vid}(E_1(x)),\ f_{txt}([E_T(p); E_T(y_i)])\big)$

where f_vid(E_1(x)) denotes final_vid and f_txt([E_T(p); E_T(y_i)]) denotes final_t.
i. Obtain the final emotion category by voting over the results corresponding to each p.
During application, the following steps may be performed:
a. Input a face video; after preprocessing, a fixed number of M frame images are selected.
b. Denote the manually defined prompt set as P and each prompt in it as p; select a target template p0 from P.
c. The vectors corresponding to the emotion labels in the label vector set y each undergo text preprocessing together with the vector p0 and are then concatenated into text embedding vectors t0.
d. Input the text embedding vectors t0 and the M frame images into the text encoder and the visual encoder to obtain M temporal text features and M temporal image features. The text encoder and the visual encoder load the VIT-CLIP large-scale pre-trained weights.
e. The M temporal text features are pooled along the temporal dimension to obtain the final text encoding vector final_t0.
f. The M temporal image features pass through the LSTM model, and the feature of the last node is taken as the final image encoding feature final_img.
g. The sound features pass through the sound encoder to output a sound encoding vector, which is concatenated with final_img from step f to obtain the final video encoding vector final_vid.
h. For p0, select the emotion category of the video according to the following formula:

$y_{pred} = \arg\max_{i \in \{1,\dots,C\}} \; \mathrm{sim}\big(f_{vid}(E_1(x)),\ f_{txt}([E_T(p_0); E_T(y_i)])\big)$

where f_vid(E_1(x)) denotes final_vid and f_txt([E_T(p0); E_T(y_i)]) denotes final_t0.
By applying the emotion recognition model training and emotion recognition methods provided by the embodiments of this application, the emotion recognition process is converted from the original probability prediction problem into a similarity matching problem, and the semantic information contained in the labels themselves is introduced; this improves accuracy and also gives the model a certain zero-shot learning transfer capability. Specifically, when training the emotion recognition model, this application generates training text data from the emotion label and uses it to train the initial model, so that the initial model can learn the semantic information carried by the emotion label. After encoding, the loss value is computed from the similarity data and the parameters are adjusted, so that the encoding process of the initial model focuses on reflecting the degree of similarity between text and non-text. At application time, the most similar emotion label is likewise determined by the similarity between the non-text encoded data of the video to be tested and the text encoded data corresponding to each emotion label, which improves emotion recognition accuracy. Moreover, even if an emotion label not involved in training is added at application time, the emotion recognition model can distinguish it from other emotion labels based on its semantic information; the model thus has a certain zero-shot learning capability, which improves its versatility.
In addition, in practical applications the trained emotion recognition model may be applied to a recognition terminal. The recognition terminal may include a processor, a detection component, and a display screen, and may also include an input component. The processor is connected to the detection component, the input component, and the display screen. The processor can obtain the video to be tested and the audio to be tested; determine multiple video frames to be tested in the video to be tested, and splice each emotion label in the label set with the text template to be tested to generate text data to be tested corresponding to each emotion label; input the video frames to be tested, the text data to be tested, and the audio to be tested into the emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested; generate similarity data to be tested using the non-text encoded data and each piece of text encoded data; and determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result of the video to be tested. After the emotion recognition result is obtained, it can be shown on the display screen.
In practical applications, the detection component may include a detection interface and acquisition components (such as a camera and a microphone). The input component may include an input interface and an input keyboard; the input keyboard makes it easy for the user to input related instructions or data to the recognition terminal. In order to reduce wiring difficulty and satisfy data transmission requirements, a wireless transmission module may also be provided on the recognition terminal, for example a Bluetooth module or a Wi-Fi module.
FIG. 4 is a schematic structural diagram of a recognition terminal provided by an embodiment of this application. The recognition terminal may include a processor, a display screen 41, an input interface 42, an input keyboard 43, a detection interface 44, a camera 45, a microphone 46, and a wireless transmission module 47. When the display screen 41 is a touch screen, the input keyboard 43 may be a soft keyboard presented on the display screen 41. The input interface 42 may be used to connect to external devices; there may be multiple input interfaces, and FIG. 4 takes one input interface as an example. The detection interface 44 is connected to the acquisition components such as the camera 45 and the microphone 46. The processor is embedded inside the recognition terminal and is therefore not shown in FIG. 4.
The recognition terminal may be a smartphone, a tablet computer, a laptop computer, or a desktop computer; the embodiments of this application do not limit the form of the recognition terminal. When the recognition terminal is a smartphone or tablet computer, the input interface 42 may connect to external devices through a data cable, and the input keyboard 43 may be a soft keyboard presented on the display interface. When the recognition terminal is a laptop or desktop computer, the input interface 42 may be a USB interface for connecting external devices such as a USB flash drive, and the input keyboard 43 may be a hardware keyboard.
Taking a desktop computer as an example, in practical applications the user may copy the video and audio to be tested onto a USB flash drive and insert the drive into the input interface 42 of the recognition terminal. After obtaining the video and audio to be tested, the recognition terminal determines multiple video frames to be tested in the video, splices each emotion label in the label set with the text template to be tested to generate text data to be tested corresponding to each emotion label, inputs the video frames, text data, and audio to be tested into the emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data, generates similarity data to be tested from the non-text encoded data and each piece of text encoded data, determines the emotion label corresponding to the maximum similarity data as the emotion recognition result of the video to be tested, and shows the recognition result on the display screen 41. It should be noted that the functional modules of the recognition terminal shown in FIG. 4, such as the display screen 41, input interface 42, input keyboard 43, detection interface 44, camera 45, microphone 46, and wireless transmission module 47, are only examples; in practical applications the recognition terminal may include more or fewer functional modules based on actual needs, which is not limited here.
The emotion recognition method provided by the embodiments of this application can be deployed in FPGA (Field Programmable Gate Array)-based neural network acceleration applications or in the software platform of an AI (Artificial Intelligence) acceleration chip. It should be noted that, besides emotion recognition, the approach of the embodiments of this application can also be applied to LSTM (Long Short-Term Memory)-based temporal data processing, for example scenarios such as multi-object tracking.
The emotion recognition apparatus provided by the embodiments of this application is introduced below; the emotion recognition apparatus described below and the emotion recognition method and emotion recognition model training method described above may be referred to in correspondence with each other.
Please refer to FIG. 5, which is a schematic structural diagram of an emotion recognition apparatus provided by an embodiment of this application, including:
an acquisition module 51, configured to obtain a video to be tested and audio to be tested;
a data processing module 52, configured to determine multiple video frames to be tested in the video to be tested, and to splice each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label;
an input module 53, configured to input the video frames to be tested, the text data to be tested, and the audio to be tested into the emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested;
a similarity generation module 54, configured to generate similarity data to be tested using the non-text encoded data to be tested and each piece of text encoded data to be tested;
a recognition result determination module 55, configured to determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result of the video to be tested.
Optionally, the data processing module 52 includes:
a template determination unit, configured to select the text template to be tested from a preset template library;
a vector mapping unit, configured to perform vector mapping on the text template to be tested and each emotion label to obtain one template vector to be tested and each label vector;
a splicing unit, configured to concatenate the template vector with each label vector to obtain the text data to be tested.
Optionally, the apparatus further includes:
a training acquisition module, configured to obtain a training video, training audio, and an emotion label;
a training data processing module, configured to determine multiple training video frames in the training video and generate training text data using the emotion label;
a training input module, configured to input the training video frames, the training text data, and the training audio into an initial model to obtain training text encoded data and training non-text encoded data;
a training similarity generation module, configured to generate similarity data using the training text encoded data and the training non-text encoded data;
a parameter adjustment module, configured to generate a loss value using the similarity data and adjust parameters of the initial model based on the loss value;
a model determination module, configured to determine the parameter-adjusted initial model as the emotion recognition model if it is detected that a training completion condition is satisfied.
Optionally, the initial model includes a text encoder, an image encoder, and an audio encoder, and further includes a pooling network module and a temporal recursive network module; the output of the text encoder is the input of the pooling network module, and the output of the image encoder is the input of the temporal recursive network module.
Optionally, the training input module includes:
a training text encoding unit, configured to input the training text into the text encoder to obtain multiple initial text encodings;
a training pooling unit, configured to input the multiple initial text encodings into the pooling network module to obtain the training text encoded data;
a training audio encoding unit, configured to input the training video frames into the image encoder to obtain multiple initial image encodings, and to input the training audio into the audio encoder to obtain an initial audio encoding;
a training image encoding unit, configured to input the multiple initial image encodings into the temporal recursive network module to obtain an intermediate image encoding;
a training splicing unit, configured to concatenate the intermediate image encoding and the initial audio encoding to obtain the training non-text encoded data.
Optionally, the text encoder and the image encoder belong to a language-image contrastive learning pre-training model, and the audio encoder has been pre-trained;
the parameter adjustment module includes:
a partial adjustment unit, configured to adjust, based on the loss value, the parameters of the pooling network module and the temporal recursive network module in the initial model.
Optionally, the training data processing module includes:
a target template selection unit, configured to select one target text template from the preset template library;
a vector mapping unit, configured to perform vector mapping on the target text template and the emotion label to obtain a template vector and a label vector;
a text vector splicing unit, configured to concatenate the template vector and the label vector to obtain the training text data.
Optionally, the model determination module includes:
a testing unit, configured to perform an accuracy test on the parameter-adjusted initial model using test data to obtain a test result;
a determination unit, configured to determine that the training completion condition is satisfied if the test result is greater than a preset threshold.
Optionally, the test data includes multiple groups of test sub-data, among which is target test sub-data; the target test sub-data includes a target test video, target test audio, and a target test label;
the testing unit includes:
a test data processing sub-unit, configured to determine multiple target test video frames in the target test video and generate multiple pieces of target test text data using the emotion labels in the label set, where the target test text data corresponds to at least one text template;
a test input sub-unit, configured to input the target test video frames, the target test text data, and the target test audio into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data;
a test computation sub-unit, configured to compute test similarity data between the target non-text encoded data and each piece of target text encoded data and to determine, using the test similarity data, at least one maximum similarity value corresponding to the at least one text template;
a prediction result determination sub-unit, configured to determine the emotion labels corresponding to the at least one maximum similarity value as initial prediction results of the target test video and to perform maximum-count filtering on the initial prediction results to obtain a prediction result;
a sub-result determination sub-unit, configured to determine the test sub-result corresponding to the target test sub-data based on the prediction result and the target test label;
a statistics sub-unit, configured to aggregate all test sub-results corresponding to the test data to obtain the test result.
The electronic device provided by the embodiments of this application is introduced below; the electronic device described below and the emotion recognition model training method and/or emotion recognition method described above may be referred to in correspondence with each other.
Please refer to FIG. 6, which is a schematic structural diagram of an electronic device provided by an embodiment of this application. The electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps of the emotion recognition model training method and/or the emotion recognition method described above. The memory 102 is configured to store various types of data to support operation of the electronic device 100, for example instructions of any application program or method operating on the electronic device 100 and application-related data. The memory 102 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, for example one or more of static random access memory (Static Random Access Memory, SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The multimedia component 103 may include a screen and an audio component. The screen may, for example, be a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may be further stored in the memory 102 or sent through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules such as a keyboard, a mouse, or buttons, which may be virtual or physical. The communication component 105 is configured for wired or wireless communication between the electronic device 100 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi part, a Bluetooth part, and an NFC part.
The electronic device 100 may be implemented by one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (Digital Signal Processing Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field Programmable Gate Array, FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the emotion recognition model training method and/or the emotion recognition method given in the above embodiments.
The non-volatile readable storage medium provided by the embodiments of this application is introduced below; the non-volatile readable storage medium described below and the emotion recognition model training method and/or emotion recognition method described above may be referred to in correspondence with each other.
This application also provides a non-volatile readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the emotion recognition model training method and/or the emotion recognition method described above.
The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant parts may refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Finally, it should also be noted that in this document relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device.
Specific examples have been used herein to explain the principles and implementations of this application; the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of this application. In summary, the content of this specification should not be construed as limiting this application.

Claims (21)

  1. An emotion recognition method, comprising:
    obtaining a video to be tested and audio to be tested;
    determining multiple video frames to be tested in the video to be tested, and splicing each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label;
    inputting the video frames to be tested, the text data to be tested, and the audio to be tested into an emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested;
    generating similarity data to be tested using the non-text encoded data to be tested and each piece of text encoded data to be tested;
    determining the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
  2. The emotion recognition method according to claim 1, wherein splicing each emotion label in the label set with the text template to be tested to generate the text data to be tested corresponding to each emotion label comprises:
    selecting the text template to be tested from a preset template library;
    performing vector mapping on the text template to be tested and each emotion label to obtain one template vector to be tested and each label vector;
    concatenating the template vector with each label vector to obtain the text data to be tested.
  3. The emotion recognition method according to claim 1, wherein the training process of the emotion recognition model comprises:
    obtaining a training video, training audio, and an emotion label;
    determining multiple training video frames in the training video, and generating training text data using the emotion label;
    inputting the training video frames, the training text data, and the training audio into an initial model to obtain training text encoded data and training non-text encoded data;
    generating similarity data using the training text encoded data and the training non-text encoded data;
    generating a loss value using the similarity data, and adjusting parameters of the initial model based on the loss value;
    if it is detected that a training completion condition is satisfied, determining the parameter-adjusted initial model as the emotion recognition model.
  4. The emotion recognition model training method according to claim 3, wherein the training text encoded data is obtained based on the training text data and is used to represent emotional semantic features of the emotion label; the training non-text encoded data is obtained based on the training video frames and the training audio; and the training non-text encoded data is used to represent the emotion features represented by images and sound.
  5. The emotion recognition model training method according to claim 4, wherein the initial model comprises a text encoder, an image encoder, and an audio encoder, and further comprises a pooling network module and a temporal recursive network module; the output of the text encoder is the input of the pooling network module, and the output of the image encoder is the input of the temporal recursive network module.
  6. The emotion recognition model training method according to claim 5, wherein inputting the training video frames, the training text data, and the training audio into the initial model to obtain the training text encoded data and the training non-text encoded data comprises:
    inputting the training text into the text encoder to obtain multiple initial text encodings;
    inputting the multiple initial text encodings into the pooling network module to obtain the training text encoded data;
    inputting the training video frames into the image encoder to obtain multiple initial image encodings, and inputting the training audio into the audio encoder to obtain an initial audio encoding;
    inputting the multiple initial image encodings into the temporal recursive network module to obtain an intermediate image encoding;
    concatenating the intermediate image encoding and the initial audio encoding to obtain the training non-text encoded data.
  7. The emotion recognition model training method according to claim 5, wherein the text encoder and the image encoder belong to a language-image contrastive learning pre-training model, and the audio encoder has been pre-trained.
  8. The emotion recognition model training method according to claim 5, wherein the pooling network module is configured to perform a temporal pooling operation on the output of the text encoder.
  9. The emotion recognition model training method according to claim 7, wherein adjusting the parameters of the initial model based on the loss value comprises:
    adjusting, based on the loss value, the parameters of the pooling network module and the temporal recursive network module in the initial model.
  10. The emotion recognition model training method according to claim 3, wherein generating the training text data using the emotion label comprises:
    using the emotion label as the training text data.
  11. The emotion recognition model training method according to claim 3, wherein generating the training text data using the emotion label comprises:
    generating the training text data jointly using a preset text template and the emotion label.
  12. The emotion recognition model training method according to claim 11, wherein generating the training text data jointly using the preset text template and the emotion label comprises:
    selecting one target text template from a preset template library;
    performing vector mapping on the target text template and the emotion label to obtain a template vector and a label vector;
    concatenating the template vector and the label vector to obtain the training text data.
  13. The emotion recognition model training method according to claim 3, wherein detecting that the training completion condition is satisfied comprises:
    performing an accuracy test on the parameter-adjusted initial model using test data to obtain a test result;
    if the test result is greater than a preset threshold, determining that the training completion condition is satisfied.
  14. The emotion recognition model training method according to claim 13, wherein the test data comprises multiple groups of test sub-data, among which is target test sub-data; the target test sub-data comprises a target test video, target test audio, and a target test label.
  15. The emotion recognition model training method according to claim 14, wherein performing the accuracy test on the parameter-adjusted initial model using the test data to obtain the test result comprises:
    determining multiple target test video frames in the target test video, and generating multiple pieces of target test text data using the emotion labels in the label set, wherein the target test text data corresponds to at least one text template;
    inputting the target test video frames, the target test text data, and the target test audio into the parameter-adjusted initial model to obtain target non-text encoded data and multiple pieces of target text encoded data;
    computing test similarity data between the target non-text encoded data and each piece of target text encoded data, and determining, using the test similarity data, at least one maximum similarity value corresponding to the at least one text template;
    determining the emotion labels corresponding to the at least one maximum similarity value as initial prediction results corresponding to the target test video, and performing maximum-count filtering on the initial prediction results to obtain a prediction result;
    determining, based on the prediction result and the target test label, the test sub-result corresponding to the target test sub-data;
    aggregating all test sub-results corresponding to the test data to obtain the test result.
  16. The emotion recognition model training method according to claim 15, wherein determining, based on the prediction result and the target test label, the test sub-result corresponding to the target test sub-data comprises:
    if the prediction result is the same as the target test label, the test sub-result indicates a correct prediction; otherwise, the test sub-result indicates an incorrect prediction.
  17. The emotion recognition model training method according to claim 3, wherein detecting that the training completion condition is satisfied comprises:
    determining that the training completion condition is satisfied when it is detected that the training duration reaches a preset duration limit;
    or determining that the training completion condition is satisfied when it is detected that the number of training rounds reaches a preset number of training times.
  18. An emotion recognition apparatus, comprising:
    an acquisition module, configured to obtain a video to be tested and audio to be tested;
    a data processing module, configured to determine multiple video frames to be tested in the video to be tested, and to splice each emotion label in a label set with a text template to be tested to generate text data to be tested corresponding to each emotion label;
    an input module, configured to input the video frames to be tested, the text data to be tested, and the audio to be tested into an emotion recognition model to obtain non-text encoded data to be tested and text encoded data to be tested corresponding to each piece of text data to be tested;
    a similarity generation module, configured to generate similarity data to be tested using the non-text encoded data to be tested and each piece of text encoded data to be tested;
    a recognition result determination module, configured to determine the emotion label corresponding to the maximum similarity data to be tested as the emotion recognition result corresponding to the video to be tested.
  19. An emotion recognition model training apparatus, comprising:
    a training acquisition module, configured to obtain a training video, training audio, and an emotion label;
    a training data processing module, configured to determine multiple training video frames in the training video and generate training text data using the emotion label;
    a training input module, configured to input the training video frames, the training text data, and the training audio into an initial model to obtain training text encoded data and training non-text encoded data;
    a training similarity generation module, configured to generate similarity data using the training text encoded data and the training non-text encoded data;
    a parameter adjustment module, configured to generate a loss value using the similarity data and adjust parameters of the initial model based on the loss value;
    a model determination module, configured to determine the parameter-adjusted initial model as the emotion recognition model if it is detected that a training completion condition is satisfied.
  20. An electronic device, comprising a memory and a processor, wherein:
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program to implement the emotion recognition method according to any one of claims 1 to 17.
  21. A non-volatile readable storage medium, configured to store a computer program, wherein the computer program, when executed by a processor, implements the emotion recognition method according to any one of claims 1 to 17.
PCT/CN2022/121852 2022-06-30 2022-09-27 Emotion recognition method, apparatus, device, and storage medium WO2024000867A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210760941.XA CN115050077A (zh) 2022-06-30 2022-06-30 Emotion recognition method, apparatus, device, and storage medium
CN202210760941.X 2022-06-30

Publications (1)

Publication Number Publication Date
WO2024000867A1 true WO2024000867A1 (zh) 2024-01-04

Family

ID=83164944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121852 WO2024000867A1 (zh) 2022-06-30 2022-09-27 情绪识别方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN115050077A (zh)
WO (1) WO2024000867A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050077A (zh) * 2022-06-30 2022-09-13 浪潮电子信息产业股份有限公司 情绪识别方法、装置、设备及存储介质
CN116320611B (zh) * 2023-04-06 2024-05-03 湖南梵映教育科技有限公司 一种音视频的合成方法及系统
CN116229332B (zh) * 2023-05-06 2023-08-04 浪潮电子信息产业股份有限公司 一种视频预训练模型的训练方法、装置、设备及存储介质
CN116978106B (zh) * 2023-09-22 2024-01-05 华侨大学 批处理混合对比学习的跨模态情绪异常检测方法和装置
CN117217807B (zh) * 2023-11-08 2024-01-26 四川智筹科技有限公司 一种基于多模态高维特征的不良资产估值方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339913A (zh) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 一种视频中的人物情绪识别方法及装置
US20200286506A1 (en) * 2019-03-08 2020-09-10 Tata Consultancy Services Limited Method and system using successive differences of speech signals for emotion identification
CN112926525A (zh) * 2021-03-30 2021-06-08 中国建设银行股份有限公司 情绪识别方法、装置、电子设备和存储介质
CN113536999A (zh) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 人物情绪识别方法、系统、介质及电子设备
CN113920561A (zh) * 2021-09-23 2022-01-11 广东技术师范大学 一种基于零样本学习的人脸表情识别方法及装置
CN114550057A (zh) * 2022-02-24 2022-05-27 重庆邮电大学 一种基于多模态表示学习的视频情绪识别方法
CN115050077A (zh) * 2022-06-30 2022-09-13 浪潮电子信息产业股份有限公司 情绪识别方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11176484B1 (en) * 2017-09-05 2021-11-16 Amazon Technologies, Inc. Artificial intelligence system for modeling emotions elicited by videos
CN108197115B (zh) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 智能交互方法、装置、计算机设备和计算机可读存储介质
CN110781916B (zh) * 2019-09-18 2024-07-16 平安科技(深圳)有限公司 视频数据的欺诈检测方法、装置、计算机设备和存储介质
WO2021231484A1 (en) * 2020-05-13 2021-11-18 SESH Corp. Machine-learned prediction of decision state and generating feedback information for decision states
CN114120978A (zh) * 2021-11-29 2022-03-01 中国平安人寿保险股份有限公司 情绪识别模型训练、语音交互方法、装置、设备及介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807995A (zh) * 2024-02-29 2024-04-02 浪潮电子信息产业股份有限公司 一种情绪引导的摘要生成方法、系统、装置及介质
CN117807995B (zh) * 2024-02-29 2024-06-04 浪潮电子信息产业股份有限公司 一种情绪引导的摘要生成方法、系统、装置及介质
CN118230398A (zh) * 2024-05-24 2024-06-21 中国科学技术大学 一种微表情识别模型的训练方法、识别方法及相关设备
CN118312620A (zh) * 2024-06-07 2024-07-09 北京中关村科金技术有限公司 面向智慧数字人的页面交互信息挖掘方法及系统

Also Published As

Publication number Publication date
CN115050077A (zh) 2022-09-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22948956

Country of ref document: EP

Kind code of ref document: A1