CN112597889A - Emotion processing method and device based on artificial intelligence

Info

Publication number
CN112597889A
CN112597889A
Authority
CN
China
Prior art keywords
information
emotion
expression
image
target user
Prior art date
Legal status
Withdrawn
Application number
CN202011532100.0A
Other languages
Chinese (zh)
Inventor
张延雄
徐彩营
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202011532100.0A
Publication of CN112597889A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to an artificial-intelligence-based emotion processing method and device. A video file to be detected is acquired and segmented to obtain, respectively, the image information, audio information and text information related to a target user and to other users. Feature extraction, time sequence processing and information fusion are then carried out in turn to obtain fused feature information, the emotion result of the target user is obtained from the fused feature information, and the emotion result is input into a preset emotion relieving model to obtain the target emotion relieving measure corresponding to that result. The method can recognize the emotion of the target user accurately and reliably; unlike manual recognition, it is not affected by subjective factors, which further improves recognition accuracy, so that an emotion relieving measure meeting the requirements can be obtained from the emotion result and the emotion can be correctly relieved.

Description

Emotion processing method and device based on artificial intelligence
Technical Field
The invention relates to an emotion processing method and device based on artificial intelligence.
Background
With the rapid development of artificial intelligence, it can be used for data processing or intelligent control in many fields, emotion processing among them. Conventional emotion processing or emotion detection methods recognize emotion manually, but manual recognition is easily influenced by subjective factors, so recognition accuracy is low and correct emotion relieving measures cannot be provided.
Disclosure of Invention
The invention provides an emotion processing method and device based on artificial intelligence, which solve the technical problem of the low accuracy of manual emotion recognition.
The invention adopts the following technical scheme:
an artificial intelligence based emotion processing method, comprising:
acquiring a video file to be detected, wherein the video file to be detected is a video file for voice interaction between a target user and other users;
according to a voice interaction process, first multi-dimensional information related to the target user and second multi-dimensional information related to other users are divided from the video file to be detected, wherein the first multi-dimensional information and the second multi-dimensional information respectively comprise at least one corresponding image information, at least one corresponding audio information and at least one corresponding text information;
performing feature extraction on the first multi-dimensional information and the second multi-dimensional information to obtain first feature information and second feature information;
respectively carrying out time sequence processing on the first characteristic information and the second characteristic information to obtain first time sequence characteristic information and second time sequence characteristic information;
fusing the first time sequence characteristic information and the second time sequence characteristic information to obtain fused characteristic information;
acquiring an emotion result of the target user at the ending moment of the video file to be detected according to the fusion characteristic information;
and inputting the emotion result into a preset emotion relieving model to obtain a target emotion relieving measure corresponding to the emotion result, and outputting the target emotion relieving measure.
Preferably, the first feature information includes first image features corresponding to respective image information related to the target user, first audio features corresponding to respective audio information related to the target user, and first text features corresponding to respective text information related to the target user; the second feature information comprises second image features corresponding to the image information related to the other users, second audio features corresponding to the audio information related to the other users and second text features corresponding to the text information related to the other users;
the first time sequence feature information includes a first time sequence image feature, a first time sequence audio feature and a first time sequence text feature, and the second time sequence feature information includes a second time sequence image feature, a second time sequence audio feature and a second time sequence text feature.
Preferably, the feature extraction of the image information in the first multi-dimensional information specifically includes: identifying the expression of the target user in each image information, wherein the obtained first image characteristic is the character expression of the target user in each image information;
the specific step of extracting the features of the image information in the second multi-dimensional information is as follows: recognizing the expressions of the other users in the image information, wherein the obtained second image characteristic is the character expressions of the other users in the image information;
the specific steps of extracting the characteristics of the audio information in the first multi-dimensional information are as follows: acquiring a decibel maximum value of a voice waveform corresponding to each audio information of the target user, wherein the acquired first audio characteristic is the decibel maximum value of each audio information of the target user;
the specific steps of extracting the characteristics of the audio information in the second multi-dimensional information are as follows: acquiring a decibel maximum value of a voice waveform corresponding to each piece of audio information of the other users, wherein the acquired second audio characteristic is the decibel maximum value of each piece of audio information of the other users;
the specific steps of extracting the features of the text information in the first multi-dimensional information are as follows: acquiring preliminary emotion information in each text message of the target user, wherein the acquired first text characteristic is the preliminary emotion information of each text message of the target user;
the specific steps of extracting the features of the text information in the second multi-dimensional information are as follows: and acquiring the preliminary emotion information in the text information of the other users, wherein the acquired second text characteristic is the preliminary emotion information in the text information of the other users.
Preferably, the first time-series image characteristic is an expression change condition of the target user, the first time-series audio characteristic is a fluctuation condition of a decibel maximum value of the target user, and the first time-series text characteristic is a preliminary emotion change condition of the target user;
the second time sequence image characteristic is the expression change condition of other users, the second time sequence audio characteristic is the fluctuation condition of the decibel maximum value of other users, and the second time sequence text characteristic is the preliminary emotion change condition of other users.
Preferably, feature extraction is performed on the text information according to a preset detection model, wherein an acquisition process of the preset detection model includes:
acquiring at least two correction texts and acquiring actual emotion information of each correction text;
inputting the correction texts into an existing detection model to obtain detection emotion information of the correction texts;
and acquiring correction texts with the same actual emotion information and detection emotion information, and adjusting the existing detection model according to the correction texts with the same actual emotion information and detection emotion information to obtain the preset detection model.
Preferably, the process of recognizing the human expression in the image information is:
carrying out face recognition on the image information to obtain a face image;
and performing expression recognition on the face image to obtain the expression of the character.
Preferably, the expression recognition of the face image is performed to obtain the expression of the person specifically as follows:
acquiring at least two sample sets, including a first sample set and a second sample set, wherein the first sample set comprises at least two first expression sample images, and the second sample set comprises at least two second expression sample images;
labeling each first expression sample image in the first sample set to obtain a first expression category, labeling each second expression sample image in the second sample set to obtain a second expression category, wherein the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, then outputting the feature mark space to a softmax function, outputting the probabilities of two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating the initial expression category and the labeling data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the facial image into the expression recognition network to obtain the expression of the character.
Preferably, the obtaining of the correction texts with the same actual emotion information and the detection emotion information specifically includes:
acquiring correction texts of which the actual emotion information and the detected emotion information are positive emotions to obtain first correction texts and correction texts of which the actual emotion information and the detected emotion information are negative emotions to obtain second correction texts;
correspondingly, the obtaining of the at least two sample sets includes a first sample set and a second sample set, where the first sample set includes at least two first expression sample images, and the second sample set includes at least two second expression sample images specifically:
acquiring two sample sets, namely a first sample set and a second sample set, wherein a first expression sample image in the first sample set is a positive expression sample image, and a second expression sample image in the second sample set is a negative expression sample image;
correspondingly, the first expression category is positive expression, and the second expression category is negative expression.
An artificial intelligence based emotion processing apparatus comprising a memory and a processor, and a computer program stored on the memory and run on the processor, the processor implementing the artificial intelligence based emotion processing method as described above when executing the computer program.
The processing object of the artificial-intelligence-based emotion processing method provided by the invention is a video file to be detected, which records voice interaction between a target user and other users. According to the voice interaction process, first multi-dimensional information related to the target user and second multi-dimensional information related to the other users are divided from the video file to be detected; the first multi-dimensional information and the second multi-dimensional information each comprise at least one corresponding piece of image information, audio information and text information. Feature extraction is carried out on the first multi-dimensional information and the second multi-dimensional information, followed by time sequence processing to obtain first time sequence feature information and second time sequence feature information, and these are fused to obtain fused feature information. According to the fused feature information, the emotion result of the target user at the ending moment of the video file to be detected is obtained, and finally the target emotion relieving measure corresponding to that emotion result is obtained and output. Because the video file to be detected is segmented so that the image information, audio information and text information related to the target user and to the other users are separated, processed respectively and finally fused, the emotion of the target user can be recognized accurately and reliably.
Drawings
FIG. 1 is a flow chart of an emotion processing method based on artificial intelligence provided by the invention.
Detailed Description
The embodiment provides an emotion processing method based on artificial intelligence. The hardware that executes the method may be a computer device, a server device, an intelligent mobile terminal or the like; the embodiment does not specifically limit the executing hardware.
As shown in fig. 1, the emotion processing method includes:
step S1: acquiring a video file to be detected, wherein the video file to be detected is a video file for voice interaction between a target user and other users:
the video file to be detected is a data processing object of the emotion processing method, and the video file to be detected is a video file for voice interaction between the target user and other users, namely the video file to be detected is a section of video of the target user and other users during talking and communication. The emotion processing method is used for detecting the emotion of a target user and outputting corresponding emotion relieving measures according to emotion results. As a specific embodiment, the number of other users is one, and if the number of other users is at least two, the subsequent data processing process and the number are the same. In this embodiment, the target user first says one word, then the other users say one word, then the target user says one word, then the other users say one word, and so on, and the number of words spoken by each of the two users is set according to actual needs, for example: the target user says a first, then the other users say B, then the target user says C, then the other users say D, then the target user says E, and finally the other users say F.
It should be understood that the emotion processing method provided by this embodiment is not suitable for scenarios in which the target user and the other users speak simultaneously, because in that case the data information of the target user and that of the other users cannot be effectively separated.
Step S2: according to a voice interaction process, first multi-dimensional information related to the target user and second multi-dimensional information related to other users are segmented from the video file to be detected, and the first multi-dimensional information and the second multi-dimensional information respectively comprise corresponding at least one image information, at least one audio information and at least one text information:
the video file to be detected is a process of voice interaction between the target user and other users, namely a process of conversation. Then, the video file to be detected can be divided into two parts according to the voice interaction process, wherein one part is the first multi-dimensional information of the target user, and the other part is the second multi-dimensional information of other users. Wherein the first multi-dimensional information comprises at least one image information, at least one audio information and at least one text information of the target user; the second multi-dimensional information includes at least one image information, at least one audio information, and at least one text information of the other user. It should be understood that the image information is a video segment with only images, the audio information is an audio segment, and the text information is a segment of text.
As described above, the target user says A, then the other users say B, then the target user says C, then the other users say D, then the target user says E, and finally the other users say F. The audio of the entire video file to be detected is therefore divided so that the audio information of the target user is A, C and E and the audio information of the other users is B, D and F; that is, the target user has three pieces of audio information and the other users have three pieces of audio information. It should be understood that, since there is usually a certain time interval between two utterances during which nobody is speaking, the start point and end point of each piece of audio information can be detected by VAD (voice activity detection), and the audio can be segmented accordingly. Moreover, to improve recognition accuracy, the speaking order of the target user and the other users can be preset, so that the audio information of each side can be identified simply from the start point and end point of each piece of audio information. As another embodiment, this step may also be combined with manual segmentation, i.e. the video file to be detected is segmented manually.
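The following is a minimal Python sketch of the energy-based turn segmentation idea described above. It is not the patent's VAD implementation; the frame length, energy threshold, minimum-gap parameter and all function names are illustrative assumptions.

```python
import numpy as np

def segment_turns(samples, sample_rate, frame_ms=30, energy_thresh=1e-4, min_gap_s=0.5):
    """Split a mono waveform into speech turns using a simple energy threshold.

    Returns a list of (start_time, end_time) tuples. Alternate turns can then be
    assigned to the target user and the other users according to the preset
    speaking order (A, B, C, D, E, F, ...).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Per-frame mean energy decides "speech" vs "silence".
    energy = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    is_speech = energy > energy_thresh

    turns, start, silence_run = [], None, 0
    max_silence_frames = int(min_gap_s * 1000 / frame_ms)
    for i, speech in enumerate(is_speech):
        t = i * frame_ms / 1000
        if speech:
            if start is None:
                start = t
            silence_run = 0
        elif start is not None:
            silence_run += 1
            # A long enough pause closes the current turn (end time is approximate).
            if silence_run >= max_silence_frames:
                turns.append((start, t - min_gap_s))
                start, silence_run = None, 0
    if start is not None:
        turns.append((start, n_frames * frame_ms / 1000))
    return turns

# With the preset speaking order (target user speaks first):
# target_turns = turns[0::2]; other_turns = turns[1::2]
```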
After the audio information of the target user and of the other users is obtained, the images in the video file can be segmented according to the audio information: the start time and end time of each piece of audio information are recorded, and the video file to be detected is cut at those times to obtain a number of video segments. This yields the image information of the target user corresponding to each of the target user's pieces of audio information, so the pieces of image information of the target user are equal in number to, and in one-to-one correspondence with, the pieces of audio information; likewise, the image information of the other users corresponding to each of their pieces of audio information is obtained, equal in number to and in one-to-one correspondence with their audio information. It should be understood that, because the facial expressions of the target user and the other users need to be recognized, the image information of the target user corresponding to each piece of audio information includes a frontal face image of the target user, and the image information of the other users includes frontal face images of the other users.
The text information is obtained by performing voice recognition on the audio information, so that the text information of the target user is the same as and corresponds to the audio information in number, and the text information of other users is the same as and corresponds to the audio information in number.
Then, the audio information of the target user is A, C and E, the image information of the target user is image information corresponding to audio information a, image information corresponding to audio information C, and image information corresponding to audio information E, respectively, and the text information of the target user is text information corresponding to audio information a, text information corresponding to audio information C, and text information corresponding to audio information E, respectively. The audio information of the other users is B, D and F, the image information of the other users is image information corresponding to audio information B, image information corresponding to audio information D, and image information corresponding to audio information F, respectively, and the text information of the other users is text information corresponding to audio information B, text information corresponding to audio information D, and text information corresponding to audio information F, respectively.
Step S3: performing feature extraction on the first multi-dimensional information and the second multi-dimensional information to obtain first feature information and second feature information:
The first multi-dimensional information includes at least one piece of image information, audio information and text information of the target user, and the second multi-dimensional information includes at least one piece of image information, audio information and text information of the other users. Accordingly, the first feature information includes a first image feature corresponding to each piece of image information related to the target user, a first audio feature corresponding to each piece of audio information related to the target user, and a first text feature corresponding to each piece of text information related to the target user; the second feature information includes second image features, second audio features and second text features corresponding to the image information, audio information and text information related to the other users.
The embodiment provides an extraction process of image information features, which specifically includes the following steps:
The specific steps of feature extraction for the image information in the first multi-dimensional information are: identifying the expression of the target user in each piece of image information, the obtained first image feature being the facial expression of the target user in each piece of image information. The specific steps for the image information in the second multi-dimensional information are: recognizing the expressions of the other users in each piece of image information, the obtained second image feature being the facial expressions of the other users in each piece of image information.
Since the only difference between recognizing expressions in the image information of the first multi-dimensional information and in that of the second multi-dimensional information is which user appears, the recognition process is the same. The feature extraction for image information, i.e. the recognition of a person's expression in the image information, proceeds as follows: first, face recognition is carried out on the image information to obtain a face image; then, expression recognition is carried out on the face image to obtain the person's expression.
It should be understood that, since each piece of image information is a video segment, screenshots may be taken from it in order to obtain the expression accurately: at least one screenshot (frame) is captured, and the face image in each screenshot is identified through a face image recognition algorithm. A single screenshot may be taken per piece of image information, so that image information and screenshots correspond one to one, or multiple screenshots may be taken, so that one piece of image information corresponds to multiple screenshots.
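A short sketch of this per-screenshot face extraction, assuming OpenCV and its bundled frontal-face Haar cascade; the frame-sampling strategy and function names are illustrative, and any face detector could be substituted.

```python
import cv2

# Assumes OpenCV ships the frontal-face Haar cascade; any face detector could be used.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_images(video_segment_path, n_screenshots=3):
    """Grab a few frames from one piece of image information and crop the faces."""
    cap = cv2.VideoCapture(video_segment_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    faces = []
    for idx in range(n_screenshots):
        # Sample frames spread evenly across the segment.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(total * (idx + 0.5) / n_screenshots))
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
            faces.append(frame[y:y + h, x:x + w])  # cropped face image
    cap.release()
    return faces
```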
As a specific embodiment, a specific implementation process of performing expression recognition on a face image to obtain a human expression is given as follows:
At least two sample sets are obtained, comprising a first sample set and a second sample set; the first sample set comprises at least two first expression sample images and the second sample set comprises at least two second expression sample images. In this embodiment, expressions are divided into two types, positive expressions and negative expressions: positive expressions include happiness and the like, and negative expressions include sadness, crying and the like. Accordingly, obtaining at least two sample sets here means obtaining two sample sets, a first sample set and a second sample set, where the first expression sample images in the first sample set are positive-expression sample images and the second expression sample images in the second sample set are negative-expression sample images. It should be understood that, as another embodiment, the expressions may be further refined into more than two categories, in which case more than two sample sets need to be obtained, the expression sample images in each sample set corresponding to one expression category.
And labeling each first expression sample image in the first sample set to obtain a first expression category, and labeling each second expression sample image in the second sample set to obtain a second expression category. As a specific embodiment, the first expression category is positive expression, and the second expression category is negative expression. That is to say, the expression categories of the labels are divided into two categories, different indexes can be used to represent different expression categories, where the index 0 corresponds to a positive expression and the index 1 corresponds to a negative expression, and the labels can be further encoded by one-hot. The first expression category and the second expression category constitute annotation data.
The expression recognition network comprises an expression recognition encoder, a Flatten layer, a full connection layer and a softmax function.
The first sample set and the second sample set are input into the expression recognition encoder for feature extraction, and the encoder outputs a feature vector (such as the degree to which the corners of the mouth are open). The feature vector is input into the Flatten layer, which processes it into a one-dimensional feature vector. The one-dimensional feature vector is used as the input of the fully connected layer, which maps it to the label space and passes the result to the softmax function. The softmax function outputs the probabilities of the two expression categories, which sum to 1, and the corresponding initial expression category is determined from these output probabilities.
And calculating the obtained initial expression categories and the marking data through a cross entropy loss function, and optimizing parameters in the expression recognition network so that the output expression categories are gradually close to the real values.
Then, a face image is input into the expression recognition network for expression recognition. Specifically, the face image is input into the expression recognition encoder for feature extraction, and the encoder outputs a feature vector; the feature vector is input into the Flatten layer and processed into a one-dimensional feature vector, which is used as the input of the fully connected layer; the fully connected layer maps it to the label space and passes the result to the softmax function, which outputs the corresponding expression category, i.e. a positive expression or a negative expression.
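The network structure and training described above (encoder, Flatten layer, fully connected layer, softmax, cross-entropy loss) can be sketched in PyTorch as follows. The convolutional encoder, layer sizes and optimizer are illustrative assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn

class ExpressionRecognizer(nn.Module):
    """Encoder -> Flatten -> fully connected layer -> softmax over two classes."""
    def __init__(self, num_classes=2):
        super().__init__()
        # A small convolutional encoder stands in for the "expression recognition encoder".
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.flatten = nn.Flatten()                    # feature vector -> one-dimensional vector
        self.fc = nn.Linear(32 * 4 * 4, num_classes)   # map to the label space

    def forward(self, x):
        return self.fc(self.flatten(self.encoder(x)))  # logits; softmax applied at inference

model = ExpressionRecognizer()
criterion = nn.CrossEntropyLoss()                      # cross-entropy between prediction and labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images, labels):
    """images: (N, 3, H, W) face crops; labels: 0 = positive expression, 1 = negative."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()                                   # optimize the network parameters
    return loss.item()

def predict_expression(face_image):
    """Return 'positive' or 'negative'; the two softmax probabilities sum to 1."""
    with torch.no_grad():
        probs = torch.softmax(model(face_image.unsqueeze(0)), dim=1)[0]
    return ("positive", "negative")[int(probs.argmax())], probs.tolist()
```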
And obtaining the first image characteristic and the second image characteristic through the identification process.
It should be understood that the expression recognition process according to the face image may also adopt other existing recognition processes, and will not be described in detail.
Since the audio information is essentially a voice waveform signal composed of a plurality of continuous decibel values, the audio information is converted into a voice waveform diagram, and a maximum decibel value in the voice waveform diagram is obtained. Then, the specific steps of performing feature extraction on the audio information in the first multi-dimensional information are as follows: acquiring a decibel maximum value of a voice waveform corresponding to each audio information of a target user, wherein the acquired first audio characteristic is the decibel maximum value of each audio information of the target user; the specific steps of extracting the characteristics of the audio information in the second multi-dimensional information are as follows: and acquiring the decibel maximum value of the voice waveform corresponding to each piece of audio information of other users, wherein the acquired second audio characteristic is the decibel maximum value of each piece of audio information of other users.
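A minimal sketch of extracting the maximum decibel value from one piece of audio information, assuming the waveform is available as floating-point samples; the reference level is an assumption.

```python
import numpy as np

def max_decibel(samples, ref=1.0, eps=1e-12):
    """Maximum decibel value of one piece of audio information.

    `samples` is the waveform as floating-point amplitudes; the peak amplitude
    is converted to dB relative to `ref` (full scale is assumed here).
    """
    peak = np.max(np.abs(samples))
    return 20.0 * np.log10(max(peak, eps) / ref)

# First audio feature: one maximum decibel value per utterance of the target user.
# first_audio_feature = [max_decibel(utt) for utt in target_user_utterances]
```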
The specific steps of extracting the features of the text information in the first multi-dimensional information are as follows: acquiring preliminary emotion information in each text message of a target user, wherein the acquired first text features are the preliminary emotion information of each text message of the target user; the specific characteristic extraction of the text information in the second multi-dimensional information is as follows: and acquiring the preliminary emotion information in the text information of other users, wherein the acquired second text characteristics are the preliminary emotion information in the text information of other users.
In this embodiment, the text information is processed with a preset detection model to obtain the corresponding preliminary emotion information. The preset detection model may be a detection model constructed in advance that contains at least two texts and the preliminary emotion information corresponding to each text. It should be understood that, to improve detection accuracy, the number of texts in the detection model should be sufficient, i.e. it should contain, as far as possible, all texts known to occur. Also to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence; for example, for the complete sentence "I do not want to do it", the keyword may be "do not want to do".
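A toy sketch of such a keyword-based preset detection model; the keyword table, the "neutral" fallback and the function name are illustrative assumptions.

```python
# A toy keyword table; a real model would cover far more phrases.
keyword_emotions = {
    "do not want to do": "negative",
    "really happy": "positive",
    "thank you": "positive",
}

def preliminary_emotion(text):
    """Return the preliminary emotion of one piece of text information by keyword match."""
    for keyword, emotion in keyword_emotions.items():
        if keyword in text:
            return emotion
    return "neutral"  # no keyword hit; the patent does not prescribe this fallback
```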
The detection model described above may also be based on an existing detection model; as a specific implementation, the preset detection model is obtained by correcting an existing detection model, and its acquisition process is as follows:
(1) At least two correction texts are obtained. It should be appreciated that, in order to improve the reliability of the correction and the accuracy of the preset detection model, the correction texts in this step should cover a sufficiently wide range.
(2) Since the correction text is a text for correcting the existing detection model and is known, the actual emotion information corresponding to each correction text is also known, and the actual emotion information of each correction text is acquired.
(3) And inputting each correction text into the existing detection model to obtain the detection emotion information of each correction text.
(4) For each correction text, the actual emotion information and the detected emotion information are compared. The correction texts for which the two are the same are obtained, and the existing detection model is adjusted according to these correction texts to obtain the preset detection model. As a specific implementation, the correction texts whose actual emotion information and detected emotion information are both positive emotion are obtained as first correction texts, and the correction texts whose actual emotion information and detected emotion information are both negative emotion are obtained as second correction texts; the existing detection model is then adjusted according to the first correction texts and the second correction texts to obtain the preset detection model. Two adjustment methods are given below. First: a preset detection model is constructed directly from the first correction texts and the second correction texts, without considering the existing detection model. Second: among the texts in the existing detection model whose emotion information is positive emotion, those that do not satisfy the condition that the actual emotion information and the detected emotion information are both positive emotion are deleted; likewise, among the texts whose emotion information is negative emotion, those that do not satisfy the condition that the actual emotion information and the detected emotion information are both negative emotion are deleted. Through this correction, the detection precision of the preset detection model can be improved.
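A sketch of one reading of the second adjustment method, assuming the existing detection model is a keyword-to-emotion table; the data structures and the `detect` helper are hypothetical.

```python
def correct_model(existing_model, correction_texts, actual_labels, detect):
    """Keep only entries of the existing model whose detections were confirmed.

    existing_model: dict mapping keyword -> 'positive' / 'negative'
    correction_texts / actual_labels: parallel lists of texts and known emotions
    detect: hypothetical helper, detect(text, model) -> (matched_keyword, detected_emotion)
    """
    corrected = dict(existing_model)
    for text, actual in zip(correction_texts, actual_labels):
        keyword, detected = detect(text, existing_model)
        # Detection disagreed with the known emotion: drop the offending entry.
        if keyword is not None and detected != actual:
            corrected.pop(keyword, None)
    return corrected
```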
Therefore, the text information is input into the preset detection model, and the preliminary emotion information is acquired.
Step S4: respectively carrying out time sequence processing on the first characteristic information and the second characteristic information to obtain first time sequence characteristic information and second time sequence characteristic information:
and respectively carrying out time sequence processing on the first characteristic information and the second characteristic information to obtain first time sequence characteristic information and second time sequence characteristic information. The first time sequence feature information comprises a first time sequence image feature, a first time sequence audio feature and a first time sequence text feature, and the second time sequence feature information comprises a second time sequence image feature, a second time sequence audio feature and a second time sequence text feature.
In this embodiment, since the first image feature is the character expression of the target user in each image information, the first time-series image feature is the expression change condition of the target user. Since the first audio characteristic is the maximum decibel value of each audio information of the target user, the first time-series audio characteristic is the fluctuation condition of the maximum decibel value of the target user. Since the first text feature is the preliminary emotion information of each text information of the target user, the first time-series text feature is the preliminary emotion change condition of the target user.
Similarly, the second time sequence image characteristic is the expression change condition of other users, the second time sequence audio characteristic is the fluctuation condition of the decibel maximum value of other users, and the second time sequence text characteristic is the preliminary emotion change condition of other users.
Step S5: fusing the first time sequence characteristic information and the second time sequence characteristic information to obtain fused characteristic information:
and fusing the first time sequence characteristic information and the second time sequence characteristic information to obtain fused characteristic information. In this embodiment, the first time sequence image feature and the second time sequence image feature may be fused to obtain an image fusion feature; and fusing the first time sequence audio features and the second time sequence audio features to obtain audio fusion features, and fusing the first time sequence text features and the second time sequence text features to obtain text fusion features. Then, the image fusion feature is the expression change condition of the target user and the expression change conditions of other users, the audio fusion feature is the fluctuation condition of the decibel maximum value of the target user and the fluctuation condition of the decibel maximum value of other users, and the text fusion feature is the preliminary emotion change condition of the target user and the preliminary emotion change conditions of other users.
Step S6: acquiring an emotion result of the target user at the ending moment of the video file to be detected according to the fusion characteristic information:
and acquiring the emotion result of the target user at the ending moment of the video file to be detected according to the obtained fusion characteristic information, namely acquiring the emotion result of the target user corresponding to the video file to be detected.
Because the fusion feature information includes the image fusion feature, the audio fusion feature and the text fusion feature, the emotion result of the target user is obtained according to the image fusion feature, the audio fusion feature and the text fusion feature. In this embodiment, a first sub-emotion result is obtained according to the image fusion feature, a second sub-emotion result is obtained according to the audio fusion feature, a third sub-emotion result is obtained according to the text fusion feature, and an emotion result of the target user is obtained according to the first sub-emotion result, the second sub-emotion result, and the third sub-emotion result.
As a specific embodiment:
If the expression change of the target user and the expression change of the other users both change toward a positive expression (for example, from a negative expression to a positive expression) or remain positive throughout, the first sub-emotion result is a positive emotion result; if only one of them changes toward a positive expression or remains positive throughout, the first sub-emotion result is an intermediate emotion result; if neither of them changes toward a positive expression or remains positive throughout, the first sub-emotion result is a negative emotion result.
From the fluctuation of the target user's maximum decibel value, the maximum and minimum values are obtained and their difference is computed; this difference is the decibel fluctuation range value of the target user. Likewise, from the fluctuation of the other users' maximum decibel value, the maximum and minimum values and their difference, the decibel fluctuation range value of the other users, are obtained. Correspondingly, if the decibel fluctuation range value of the target user and that of the other users are both smaller than a preset range threshold, the second sub-emotion result is a positive emotion result; if only one of them is greater than or equal to the preset range threshold, the second sub-emotion result is an intermediate emotion result; and if both are greater than or equal to the preset range threshold, the second sub-emotion result is a negative emotion result.
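A sketch of the second sub-emotion result computed from the decibel fluctuation range values; the 20 dB threshold is only an example, the patent merely requires a preset threshold.

```python
def audio_sub_result(target_db, other_db, range_threshold=20.0):
    """Second sub-emotion result from the decibel fluctuation range of both users."""
    target_range = max(target_db) - min(target_db)
    other_range = max(other_db) - min(other_db)
    calm = [target_range < range_threshold, other_range < range_threshold]
    if all(calm):
        return "positive"      # both ranges below the preset threshold
    if any(calm):
        return "intermediate"  # exactly one range at or above the threshold
    return "negative"          # both ranges at or above the threshold
```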
The text fusion features are the preliminary emotion change condition of the target user and the preliminary emotion change conditions of other users.
If the preliminary emotion change of the target user and that of the other users both change toward positive emotion (for example, from negative emotion to positive emotion) or remain positive throughout, the third sub-emotion result is a positive emotion result; if only one of them changes toward positive emotion or remains positive throughout, the third sub-emotion result is an intermediate emotion result; if neither of them changes toward positive emotion or remains positive throughout, the third sub-emotion result is a negative emotion result.
The three sub-emotion results are then combined: if the first, second and third sub-emotion results are all positive emotion results, the emotion result of the target user is the first final emotion result; if two of the three are positive and the remaining one is intermediate, it is the second final emotion result; if two are positive and the remaining one is negative, or one is positive and the other two are intermediate, it is the third final emotion result; if one is positive, one is intermediate and one is negative, or all three are intermediate, it is the fourth final emotion result; if two are negative and the remaining one is positive, or one is negative and the other two are intermediate, it is the fifth final emotion result; if two are negative and the remaining one is intermediate, it is the sixth final emotion result; and if all three are negative emotion results, the emotion result of the target user is the seventh final emotion result.
The first final emotion result to the seventh final emotion result correspond to a gradual progression from positive emotion to negative emotion.
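The seven cases above reduce to summing a score per sub-result (positive = 0, intermediate = 1, negative = 2), as the following sketch shows; this compact form is an observation about the enumeration, not wording from the patent.

```python
SCORE = {"positive": 0, "intermediate": 1, "negative": 2}

def final_emotion_result(first, second, third):
    """Combine the three sub-emotion results into final result 1 (most positive) to 7 (most negative).

    Each enumerated case above corresponds to a distinct total score: e.g. two
    positives and one intermediate give 0 + 0 + 1 = 1, i.e. the second final result.
    """
    return 1 + SCORE[first] + SCORE[second] + SCORE[third]

# Example: positive, intermediate, negative -> 1 + 0 + 1 + 2 = 4 (fourth final emotion result).
```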
It should be understood that the above-mentioned process of obtaining the emotional result of the target user is only a specific embodiment, and as other embodiments, the specific setting may also be performed according to the actual situation.
Step S7: inputting the emotion result into a preset emotion relieving model to obtain a target emotion relieving measure corresponding to the emotion result, and outputting the target emotion relieving measure:
the method comprises the steps of presetting an emotion relieving model, wherein the preset emotion relieving model comprises at least two emotion results and emotion relieving measures corresponding to the emotion results. The emotion relieving measures in the preset emotion relieving model may include listening to relaxing music, watching a video of a fun, complaining to friends, asking a psychologist, and the like. It should be understood that in the preset emotional mitigation model, different emotional outcomes correspond to different emotional mitigation measures, such as: the emotion relieving measure corresponding to the first final emotion result is listening to relaxing music, the emotion relieving measure corresponding to the third final emotion result is watching a funny video, the emotion relieving measure corresponding to the fourth final emotion result is a pouring out to a good friend, and the emotion relieving measure corresponding to the seventh final emotion result is asking a psychologist. In addition, the emotion result in the preset emotion relieving model needs to include all actually acquired emotion results of the target user, so that the corresponding target emotion relieving measure can be obtained according to the emotion result of the target user.
The emotion result of the target user obtained in step S6 is input into the preset emotion relieving model to obtain the target emotion relieving measure corresponding to that emotion result, and the target emotion relieving measure is output, for example to a display screen or a corresponding terminal, so that the relevant personnel or the target user can take the corresponding measure to relieve the emotion.
The present embodiment also provides an artificial intelligence based emotion processing apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the artificial intelligence based emotion processing method provided in the present embodiment. Therefore, the emotion processing apparatus based on artificial intelligence is a software apparatus, and the essence thereof is still an emotion processing method based on artificial intelligence.

Claims (9)

1. An emotion processing method based on artificial intelligence, comprising:
acquiring a video file to be detected, wherein the video file to be detected is a video file for voice interaction between a target user and other users;
according to a voice interaction process, first multi-dimensional information related to the target user and second multi-dimensional information related to other users are divided from the video file to be detected, wherein the first multi-dimensional information and the second multi-dimensional information respectively comprise at least one corresponding image information, at least one corresponding audio information and at least one corresponding text information;
performing feature extraction on the first multi-dimensional information and the second multi-dimensional information to obtain first feature information and second feature information;
respectively carrying out time sequence processing on the first characteristic information and the second characteristic information to obtain first time sequence characteristic information and second time sequence characteristic information;
fusing the first time sequence characteristic information and the second time sequence characteristic information to obtain fused characteristic information;
acquiring an emotion result of the target user at the ending moment of the video file to be detected according to the fusion characteristic information;
and inputting the emotion result into a preset emotion relieving model to obtain a target emotion relieving measure corresponding to the emotion result, and outputting the target emotion relieving measure.
2. The artificial intelligence based emotion processing method of claim 1, wherein the first feature information includes a first image feature corresponding to each image information related to the target user, a first audio feature corresponding to each audio information related to the target user, and a first text feature corresponding to each text information related to the target user; the second feature information comprises second image features corresponding to the image information related to the other users, second audio features corresponding to the audio information related to the other users and second text features corresponding to the text information related to the other users;
the first time sequence feature information includes a first time sequence image feature, a first time sequence audio feature and a first time sequence text feature, and the second time sequence feature information includes a second time sequence image feature, a second time sequence audio feature and a second time sequence text feature.
3. The artificial intelligence based emotion processing method according to claim 2, wherein the feature extraction of the image information in the first multi-dimensional information specifically comprises: identifying the expression of the target user in each image information, wherein the obtained first image characteristic is the character expression of the target user in each image information;
the specific step of extracting the features of the image information in the second multi-dimensional information is as follows: recognizing the expressions of the other users in the image information, wherein the obtained second image characteristic is the character expressions of the other users in the image information;
the specific steps of extracting the characteristics of the audio information in the first multi-dimensional information are as follows: acquiring a decibel maximum value of a voice waveform corresponding to each audio information of the target user, wherein the acquired first audio characteristic is the decibel maximum value of each audio information of the target user;
the specific steps of extracting the characteristics of the audio information in the second multi-dimensional information are as follows: acquiring a decibel maximum value of a voice waveform corresponding to each piece of audio information of the other users, wherein the acquired second audio characteristic is the decibel maximum value of each piece of audio information of the other users;
the specific steps of extracting the features of the text information in the first multi-dimensional information are as follows: acquiring preliminary emotion information in each text message of the target user, wherein the acquired first text characteristic is the preliminary emotion information of each text message of the target user;
the specific steps of extracting the features of the text information in the second multi-dimensional information are as follows: and acquiring the preliminary emotion information in the text information of the other users, wherein the acquired second text characteristic is the preliminary emotion information in the text information of the other users.
4. The artificial intelligence based emotion processing method of claim 3, wherein the first time-series image feature is the change in the target user's expression over time, the first time-series audio feature is the fluctuation of the target user's maximum decibel values over time, and the first time-series text feature is the change in the target user's preliminary emotion information over time;
the second time-series image feature is the change in the other users' expressions over time, the second time-series audio feature is the fluctuation of the other users' maximum decibel values over time, and the second time-series text feature is the change in the other users' preliminary emotion information over time.
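How the change and fluctuation over time are summarized is not fixed by the claim; the sketch below uses one simple choice (change points for expressions, frame-to-frame differences for decibel values) purely as an assumption for illustration.

```python
# Sketch: summarizing per-frame features as change over time.
from typing import List, Tuple

def expression_changes(expressions: List[str]) -> List[Tuple[str, str]]:
    """Return the (previous, current) pairs where the expression changed."""
    return [(a, b) for a, b in zip(expressions, expressions[1:]) if a != b]

def decibel_fluctuation(db_values: List[float]) -> List[float]:
    """Return frame-to-frame differences of the maximum decibel values."""
    return [b - a for a, b in zip(db_values, db_values[1:])]

print(expression_changes(["neutral", "neutral", "frown", "smile"]))
print(decibel_fluctuation([-12.0, -6.0, -9.0]))
```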
5. The artificial intelligence based emotion processing method according to claim 2, wherein the feature extraction is performed on the text information according to a preset detection model, and the process of obtaining the preset detection model comprises:
acquiring at least two correction texts and the actual emotion information of each correction text;
inputting each correction text into an existing detection model to obtain its detected emotion information;
and acquiring the correction texts whose actual emotion information is the same as their detected emotion information, and adjusting the existing detection model according to these correction texts to obtain the preset detection model.
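The correction step can be read as: keep only the correction texts on which the existing model already agrees with the ground truth, then use them to adjust the model. The sketch below illustrates that filtering; `detect` and `fine_tune` are hypothetical stand-ins for the existing detection model and its (unspecified) adjustment procedure.

```python
# Sketch of the correction step: filter agreeing texts, then adjust the model.
from typing import Callable, List, Tuple

def build_preset_model(
    corrections: List[Tuple[str, str]],                     # (text, actual_emotion)
    detect: Callable[[str], str],                           # existing detection model
    fine_tune: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
) -> Callable[[str], str]:
    agreeing = [(text, emo) for text, emo in corrections if detect(text) == emo]
    return fine_tune(agreeing)                              # the "preset detection model"

# Toy usage with a trivial keyword "model".
toy_detect = lambda t: "positive" if "good" in t else "negative"
toy_tune = lambda data: toy_detect                          # pretend an adjusted model is returned
preset = build_preset_model([("good day", "positive"), ("bad day", "negative")],
                            toy_detect, toy_tune)
print(preset("good news"))                                  # -> positive
```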
6. The artificial intelligence based emotion processing method of claim 5, wherein the process of recognizing the expression of a person in the image information comprises:
performing face recognition on the image information to obtain a face image;
and performing expression recognition on the face image to obtain the expression of the person.
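A two-stage pipeline of this kind can be sketched as follows. The claim does not name a particular face detector; OpenCV's Haar cascade is used here only as one possible choice, and `classify_expression` is a hypothetical stand-in for the expression recognition network of claim 7.

```python
# Sketch: face detection followed by expression classification on each face crop.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def recognize_expressions(image_bgr, classify_expression):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        face_img = image_bgr[y:y + h, x:x + w]   # the "face image"
        results.append(classify_expression(face_img))
    return results
```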
7. The artificial intelligence based emotion processing method of claim 6, wherein performing expression recognition on the face image to obtain the expression of the person specifically comprises:
acquiring at least two sample sets, including a first sample set and a second sample set, wherein the first sample set comprises at least two first expression sample images and the second sample set comprises at least two second expression sample images;
labeling each first expression sample image in the first sample set to obtain a first expression category, and labeling each second expression sample image in the second sample set to obtain a second expression category, wherein the first expression category and the second expression category form the labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction; inputting the feature vector output by the expression recognition encoder into a Flatten layer, which processes it into a one-dimensional feature vector; using the one-dimensional feature vector as the input of a fully connected layer, which maps it into the feature label space; passing the result to a softmax function, which outputs the probabilities of the two expression categories; and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
computing a cross-entropy loss between the initial expression category and the labeling data, and optimizing the parameters of the expression recognition network accordingly;
and inputting the face image into the expression recognition network to obtain the expression of the person.
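The encoder / Flatten / fully connected / softmax / cross-entropy structure described above can be illustrated with a minimal PyTorch sketch. The layer sizes, input resolution, optimizer and random stand-in data below are assumptions, not values taken from the specification.

```python
# Minimal PyTorch sketch of the expression recognition network in claim 7.
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(                    # expression recognition encoder
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.flatten = nn.Flatten()                      # Flatten layer -> 1-D feature vector
        self.fc = nn.Linear(32 * 16 * 16, num_classes)   # fully connected layer

    def forward(self, x):
        return self.fc(self.flatten(self.encoder(x)))    # logits for the two categories

model = ExpressionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                        # cross-entropy against the labeling data

images = torch.randn(8, 3, 64, 64)                       # stand-in for the two sample sets
labels = torch.randint(0, 2, (8,))                       # 0 = first category, 1 = second category

logits = model(images)
probs = torch.softmax(logits, dim=1)                     # probabilities of the two categories
loss = criterion(logits, labels)                         # optimize the network parameters
loss.backward()
optimizer.step()
```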
8. The artificial intelligence based emotion processing method of claim 7, wherein acquiring the correction texts whose actual emotion information is the same as their detected emotion information specifically comprises:
acquiring the correction texts whose actual emotion information and detected emotion information are both positive emotion, to obtain first correction texts, and acquiring the correction texts whose actual emotion information and detected emotion information are both negative emotion, to obtain second correction texts;
correspondingly, acquiring at least two sample sets, including a first sample set comprising at least two first expression sample images and a second sample set comprising at least two second expression sample images, specifically comprises:
acquiring two sample sets, namely a first sample set and a second sample set, wherein the first expression sample images in the first sample set are positive expression sample images and the second expression sample images in the second sample set are negative expression sample images;
correspondingly, the first expression category is positive expression, and the second expression category is negative expression.
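The narrowing in this claim reduces both the text and the image branches to a binary positive/negative split. A trivial sketch of the text-side partition is shown below; the tuple layout is an assumption carried over from the earlier illustration.

```python
# Sketch: partition agreeing correction texts into positive / negative groups.
def split_corrections(agreeing):
    """agreeing: list of (text, emotion) with emotion in {"positive", "negative"}."""
    first = [t for t, e in agreeing if e == "positive"]    # first correction texts
    second = [t for t, e in agreeing if e == "negative"]   # second correction texts
    return first, second

print(split_corrections([("good day", "positive"), ("bad day", "negative")]))
```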
9. An artificial intelligence based emotion processing apparatus, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the artificial intelligence based emotion processing method of any one of claims 1-8 when executing the computer program.
CN202011532100.0A 2020-12-22 2020-12-22 Emotion processing method and device based on artificial intelligence Withdrawn CN112597889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011532100.0A CN112597889A (en) 2020-12-22 2020-12-22 Emotion processing method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011532100.0A CN112597889A (en) 2020-12-22 2020-12-22 Emotion processing method and device based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN112597889A true CN112597889A (en) 2021-04-02

Family

ID=75200574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011532100.0A Withdrawn CN112597889A (en) 2020-12-22 2020-12-22 Emotion processing method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN112597889A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837072A (en) * 2021-09-24 2021-12-24 厦门大学 Method for sensing emotion of speaker by fusing multidimensional information
CN117423423A (en) * 2023-12-18 2024-01-19 四川互慧软件有限公司 Health record integration method, equipment and medium based on convolutional neural network
CN117423423B (en) * 2023-12-18 2024-02-13 四川互慧软件有限公司 Health record integration method, equipment and medium based on convolutional neural network

Similar Documents

Publication Publication Date Title
US10438586B2 (en) Voice dialog device and voice dialog method
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN108305641B (en) Method and device for determining emotion information
CN108305643B (en) Method and device for determining emotion information
CN111046133A (en) Question-answering method, question-answering equipment, storage medium and device based on atlas knowledge base
EP3617946B1 (en) Context acquisition method and device based on voice interaction
CN111739539B (en) Method, device and storage medium for determining number of speakers
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN112597889A (en) Emotion processing method and device based on artificial intelligence
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN112966568A (en) Video customer service quality analysis method and device
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN110910898A (en) Voice information processing method and device
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN112614510B (en) Audio quality assessment method and device
CN113128284A (en) Multi-mode emotion recognition method and device
CN116257816A (en) Accompanying robot emotion recognition method, device, storage medium and equipment
CN114120425A (en) Emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20210402