CN114492579A - Emotion recognition method, camera device, emotion recognition device and storage device - Google Patents

Emotion recognition method, camera device, emotion recognition device and storage device

Info

Publication number
CN114492579A
Authority
CN
China
Prior art keywords
audio
text
features
information
feature
Prior art date
Legal status
Pending
Application number
CN202111605408.8A
Other languages
Chinese (zh)
Inventor
易冠先
陈波扬
刘德龙
王康
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202111605408.8A
Publication of CN114492579A
Status: Pending

Classifications

    • G06F 18/241 Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253 Pattern recognition; fusion techniques of extracted features
    • G06F 40/30 Handling natural language data; semantic analysis
    • G10L 15/26 Speech recognition; speech-to-text systems
    • G10L 25/30 Speech or voice analysis characterised by the analysis technique, using neural networks
    • G10L 25/63 Speech or voice analysis specially adapted for estimating an emotional state

Abstract

The application discloses an emotion recognition method, a camera device, an emotion recognition device and a storage device. The method comprises the following steps: acquiring audio information and text information of a target; wherein the text information is derived based on the audio information; respectively extracting audio features of the audio information and text features of the text information; respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain audio correction features and text correction features; and integrating the audio correction features and the text correction features to perform emotion recognition to obtain emotion classification of the target. According to the scheme, the emotion recognition accuracy can be improved.

Description

Emotion recognition method, camera device, emotion recognition device and storage device
Technical Field
The present application relates to the field of emotion recognition technologies, and in particular, to an emotion recognition method, an image capture device, an emotion recognition device, and a storage device.
Background
With the continuous development of artificial intelligence technology, people's expectations for interaction experience keep rising, and emotion recognition, as a human-computer interaction technology, is now widely applied in various scenarios.
For example, in the service field, service satisfaction can be judged from the emotion of a calling user; in the medical field, detected changes in a patient's emotions can serve as a basis for diagnosis and treatment; in the field of education, teaching can be adjusted according to the emotion changes of students in a classroom. Most emotion recognition is performed based on information such as the vision, voice, text, behavior and physiological signals of a target to be detected. Although the target's emotion can be detected in each of these scenarios, the recognition accuracy is currently low, which affects the emotion recognition effect.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide an emotion recognition method, a camera device, an emotion recognition device and a storage device that can improve the accuracy of emotion recognition.
In order to solve the above problem, a first aspect of the present application provides an emotion recognition method, including: acquiring audio information and text information of a target; wherein the text information is derived based on the audio information; respectively extracting audio features of the audio information and text features of the text information; respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain audio correction features and text correction features; and integrating the audio correction features and the text correction features to perform emotion recognition to obtain emotion classification of the target.
In order to solve the above problems, a second aspect of the present application provides an emotion recognition apparatus, which includes a memory and a processor coupled to each other, wherein the memory stores program data, and the processor is configured to execute the program data to implement any one of the steps of the emotion recognition method.
In order to solve the above problem, a third aspect of the present application provides a storage device storing program data executable by a processor, the program data being for implementing any one of the steps of the emotion recognition method described above.
In order to solve the above problem, a fourth aspect of the present application provides an image pickup apparatus comprising a camera component and a recognition component, wherein the camera component is configured to acquire audio or video of a target, and the recognition component is configured to carry out the steps of the emotion recognition method described above using the audio or video of the target.
According to the above scheme, the audio information and text information of the target are acquired, and the audio features of the audio information and the text features of the text information are extracted respectively. Because the text information is derived from the audio information, the audio features and the text features are corrected respectively based on the correlation between them to obtain the audio correction features and the text correction features, so that information from both the audio and the text is referenced when correcting each modality. The audio correction features and the text correction features are then fused for emotion recognition to obtain the emotion classification of the target. Since emotion recognition is performed with feature information from both the audio and the text, the accuracy of emotion recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the present application, the drawings required in the description of the embodiments will be briefly introduced below. The drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort. Wherein:
FIG. 1 is a schematic flow chart diagram of an embodiment of the emotion recognition method of the present application;
FIG. 2 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 3 is a schematic flow chart illustrating another embodiment of step S11 in FIG. 1;
FIG. 4 is a schematic diagram of an embodiment of a waveform diagram of audio information according to the present application;
FIG. 5 is a diagram of an embodiment of a spectrogram of audio information of the present application;
FIG. 6 is a flowchart illustrating an embodiment of step S12 of FIG. 1;
FIG. 7 is a schematic structural diagram of an embodiment of an audio extraction model of the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a text extraction model of the present application;
FIG. 9 is a flowchart illustrating an embodiment of step S13 of FIG. 1 of the present application;
FIG. 10 is a schematic flow chart illustrating another embodiment of step S13 in FIG. 1;
FIG. 11 is a schematic structural diagram of an embodiment of the present application for modifying audio and text features;
FIG. 12 is a schematic structural diagram of an embodiment of an image capturing apparatus according to the present application;
FIG. 13 is a schematic structural diagram of a first embodiment of the emotion recognition apparatus of the present application;
FIG. 14 is a schematic structural diagram of a second embodiment of the emotion recognition apparatus of the present application;
FIG. 15 is a schematic structural diagram of a third embodiment of the emotion recognition apparatus of the present application;
FIG. 16 is a schematic structural diagram of an embodiment of a memory device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The present application provides the following examples, each of which is specifically described below.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of an emotion recognition method according to the present application. The method may comprise the steps of:
s11: acquiring audio information and text information of a target; wherein the text information is derived based on the audio information.
Emotion refers to psychological experiences such as happiness, anger, sadness, joy and fear, and is a reflection of a person's attitude toward objective things. Emotion recognition can be performed on a speaker through the speaker's manner of speaking, tone and attitude, and the like.
In the process of emotion recognition of the target, audio information of the target may be acquired, and text information of the target may be acquired based on the audio information, for example, text information obtained by performing text conversion on the audio information of the target.
In some embodiments, the target may include one or more targets to be detected, and audio information and text information covering a plurality of targets to be detected may be acquired to perform emotion recognition on each of them.
In some embodiments, for example in the service field, the call audio of a target user may be acquired as the audio information of the target user, so that the text information of the target user is obtained by performing text conversion on the audio information.
In some embodiments, for example in the medical field where the emotion of a patient is recognized, the audio of an inquiry record with the target patient may be acquired, and/or the patient's interactive conversation over a period of time may be acquired, as the audio information of the target patient, so that the text information of the target patient is obtained based on the audio information.
In some embodiments, for example in the field of education, audio or video of a teacher or a student in a classroom may be acquired with a camera device, so that the classroom conversation audio of the teacher or student serves as the audio information of the target, and the text information of the target is obtained based on the audio information.
The audio information and the text information of the target can be acquired in other manners, and the method and the device are not limited to this.
S12: and respectively extracting audio features of the audio information and text features of the text information.
In some embodiments, feature extraction may be performed on the target's audio information: spectral analysis is applied to the audio information to obtain discriminative speech characteristics, for example Mel-frequency cepstral coefficients (MFCCs), linear prediction coefficients, cochlear cepstral coefficients and prosodic features of the audio, and the extracted features are used as the audio features.
In some embodiments, the time-frequency characteristics of the speech signal in the audio information may be converted into a spectrogram, which contains features of the speech signal such as energy, formants and pitch frequency. Extracting spectral features from the spectrogram links the time domain and the frequency domain and intuitively displays the distribution of speech energy over time and frequency. Features are then extracted from the spectrogram of the audio information in the spatial and temporal dimensions to obtain the audio features of the audio information.
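As an illustration of this step, the following is a minimal sketch of computing a log-Mel spectrogram with the librosa library (MFCCs could be computed analogously with librosa.feature.mfcc); the sampling rate, FFT size, hop length and Mel-band count are assumed values for illustration, not taken from this application.

```python
# Hedged sketch: turn a speech clip into a log-Mel spectrogram, the kind of
# time-frequency representation fed to the audio feature extractor.
import numpy as np
import librosa

def log_mel_spectrogram(wav_path: str, sr: int = 16000, n_mels: int = 96) -> np.ndarray:
    signal, sr = librosa.load(wav_path, sr=sr)               # mono waveform
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)              # logarithmic scaling

# spec = log_mel_spectrogram("target_utterance.wav")         # shape: (n_mels, frames)
```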
In some embodiments, semantic analysis may be performed on the text in the text information, and the semantic information of the text is extracted so that the text features of the text information can be obtained based on that semantic information.
In some embodiments, the audio information and the text information may be input to a neural network to extract audio features of the audio information and text features of the text information, respectively.
S13: and respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain audio correction features and text correction features.
Because the text information is obtained based on the audio information, the audio information is related to the text information, and the audio features and the text features can each be corrected based on the correlation between them, so as to obtain the audio correction features and the text correction features.
In some embodiments, the audio features may be corrected using the text features to obtain the audio correction features, and the text features may be corrected using the audio features to obtain the text correction features, so that the speech features and the text features correct each other.
In some embodiments, the audio features may be treated as the external factor: a first correlation of the audio features to the text features is obtained, and the text features, as the internal factor, are corrected using this first correlation to obtain the text correction features. Similarly, a second correlation of the text features (external factor) to the audio features (internal factor) can be obtained, and the audio features are corrected using this second correlation to obtain the audio correction features.
S14: and integrating the audio correction features and the text correction features to perform emotion recognition to obtain emotion classification of the target.
The audio correction features and the text correction features are fused, and the fused features are passed through an Add layer, a normalization layer, an average pooling layer and a decision classifier for emotion recognition, so as to obtain the emotion classification of the target. The Add layer superimposes the fused information, which increases the amount of information describing the target in each dimension. The decision classifier is a feedforward neural network classifier formed by two fully connected layers.
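The following PyTorch sketch illustrates such a fusion head: an Add layer, a normalization layer, average pooling over time and a two-layer feedforward decision classifier. The feature dimension, the number of emotion classes and the activation are assumptions for illustration, not values stated in this application.

```python
# Hedged sketch of the fusion head: add, normalize, average-pool, classify.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int = 256, num_emotions: int = 5):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Sequential(            # two fully connected layers
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_emotions))

    def forward(self, audio_corr: torch.Tensor, text_corr: torch.Tensor) -> torch.Tensor:
        fused = audio_corr + text_corr              # Add layer (superimpose features)
        fused = self.norm(fused)                    # normalization layer
        pooled = fused.mean(dim=1)                  # average pooling over time steps
        return self.classifier(pooled)              # emotion logits

# logits = FusionClassifier()(audio_corr, text_corr)   # both inputs: (batch, time, dim)
# probs = logits.softmax(dim=-1)                       # classification probabilities
```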
The emotion classifications may include happiness, anger, sadness, joy, fear, and the like, and the obtained emotion classification of the target may include the emotion class, its classification probability, and so on.
In some embodiments, the emotions may also be ranked. For example, with emotion classifications including happiness, anger, sadness, joy, fear, and the like, the classification "happiness" may be graded into first-level happiness, second-level happiness, third-level happiness, and so on.
In some embodiments, the emotions may be divided into more detailed categories; for example, the emotion categories may include interest, pleasure, surprise, sadness, anger, disgust, contempt, fear, shyness, and so on. The present application does not limit the emotion classification.
In some embodiments, the emotion recognition result, i.e., the emotion classification, may be represented by an emotion figure: for example, a happy emotion may be represented by an image with a smiling face, and a sad emotion by an image with a crying expression, and so on. The present application is not limited in this respect.
In some embodiments, emotion portrait processing can further be performed on the recognized emotion classifications: the target's emotion classifications can be processed and counted in real time to analyze changes in the target's emotions, and the emotions can be visualized in real time and aggregated into an overall emotion analysis report covering a period of time.
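As a toy illustration of such aggregation (the helper below is hypothetical and not part of this application), per-utterance emotion classifications collected over a period can be counted into a simple report:

```python
# Hypothetical helper: aggregate a stream of emotion labels into proportions.
from collections import Counter

def emotion_report(classifications: list) -> dict:
    counts = Counter(classifications)                 # e.g. {"happy": 12, "sad": 3}
    total = sum(counts.values()) or 1
    return {emotion: round(n / total, 3) for emotion, n in counts.items()}

# emotion_report(["happy", "happy", "sad", "fear"])
# -> {'happy': 0.5, 'sad': 0.25, 'fear': 0.25}
```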
In some embodiments, the audio information, text information, audio features, text features, recognized emotion classifications and other data of the target may be stored and, when authorization is obtained, transmitted to a device such as a cloud server or a client, together with historical data over a period of time, to facilitate subsequent tracing.
In this embodiment, the audio information and text information of the target are acquired, and the audio features of the audio information and the text features of the text information are extracted respectively. Because the text information is derived from the audio information, the audio features and the text features are corrected respectively based on the correlation between them to obtain the audio correction features and the text correction features, so that information from both the audio and the text is referenced when correcting each modality. The audio correction features and the text correction features are then fused for emotion recognition to obtain the emotion classification of the target. Since emotion recognition is performed with feature information from both the audio and the text, the accuracy of emotion recognition can be improved.
In some embodiments, referring to fig. 2, the step S11 of obtaining the audio information and the text information of the target may include the following steps:
s111: video or audio of the target is acquired.
The target can be photographed by a camera to obtain a video of the target. The target can also be recorded by the recording device to obtain the audio frequency of the target.
S112: and extracting audio information of the target in the video or audio.
For video, the audio information of the target in the video may be extracted. Specifically, the video collected by the camera device can be separated into an audio part and a video part; the audio part is retained and used as the target's audio information, while the video part can be discarded.
For audio, the audio collected by the recording device may be used directly as the target's audio information.
S113: and performing character conversion on the audio information of the target to obtain text information of the target.
After the audio information of the target is obtained, voice recognition can be performed on the audio information, that is, the audio information can be subjected to character conversion, and the text obtained by conversion is used as the text information of the target.
In some embodiments, before the text information is obtained from the target's audio information, the audio information, which is a speech signal, may be denoised and endpoint-trimmed, and the transcript obtained through a speech recognition API (interface) may then be used as the target's text information.
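As one possible realization of this step, the sketch below uses the third-party SpeechRecognition package as the speech recognition API; this particular package and the language code are assumptions, and any ASR service that returns a transcript would serve the same purpose.

```python
# Hedged sketch: transcribe a (denoised, endpoint-trimmed) clip to text.
import speech_recognition as sr

def transcribe(wav_path: str, language: str = "zh-CN") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)             # read the whole clip
    return recognizer.recognize_google(audio, language=language)

# text_information = transcribe("denoised_target_audio.wav")
```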
In other embodiments, referring to fig. 3, the step S11 of obtaining the audio information and the text information of the target may include the following steps:
s111: video or audio of the target is acquired.
S112: and extracting audio information of the target in the video or audio.
S113: and performing character conversion on the audio information of the target to obtain text information of the target.
For the implementation of steps S111 to S113 in this embodiment, reference may be made to the corresponding process in the above embodiment, which is not repeated here.
S114: carrying out audio preprocessing on the audio information of the target to obtain preprocessed audio information; wherein the audio pre-processing comprises at least one of: frame windowing, discrete Fourier transform, logarithmic scaling, denoising and normalization.
Referring to fig. 4 and 5, the audio information extracted from the video or audio may be a speech signal, which can be represented by a waveform diagram, i.e. the original sound wave diagram of the speech signal, for example the waveform of roughly 3 seconds of speech collected in a classroom, where the horizontal axis represents the time domain and the vertical axis represents the amplitude. The speech signal can be converted into a spectrogram, which is calculated according to the spectrogram extraction process and the corresponding theoretical formulas. The spectrogram contains features such as energy, formants and pitch frequency; its horizontal axis represents the time domain and its vertical axis represents the frequency domain, and a unit of speech energy can be uniquely determined by a given time and frequency.
Carrying out audio preprocessing on a target voice signal to obtain preprocessed audio information; wherein the audio pre-processing comprises at least one of: frame windowing, discrete Fourier transform processing, logarithmic scaling, denoising, and normalization.
For example, in the audio preprocessing, the speech signal is divided into frames with a frame length of 25 ms and a frame shift of 20 ms. After framing, Hamming windowing is applied with a window length of 20 ms. A discrete Fourier transform over 800 sampling points is then applied to each frame of the speech signal, followed by logarithmic scaling, denoising and normalization, to obtain the preprocessed audio information. In addition, considering that the energy of the high-frequency components of the speech signal is not concentrated, the components above 4 kHz can be removed, keeping only the 0-4 kHz components of the speech signal.
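The numpy sketch below follows this description (framing, Hamming windowing, an 800-point DFT, logarithmic scaling, normalization and the 0-4 kHz cut); the 16 kHz sampling rate is an assumption, denoising is omitted, and the frame/shift values are a reading of the text rather than a definitive implementation.

```python
# Hedged sketch of the audio preprocessing described above.
import numpy as np

def preprocess(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    frame_len, hop = int(0.025 * sr), int(0.020 * sr)        # 25 ms frames, 20 ms shift
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)                  # Hamming window per frame
    spectrum = np.abs(np.fft.rfft(frames, n=800, axis=1))    # 800-point DFT per frame
    keep = int(4000 / (sr / 2) * spectrum.shape[1])          # keep only the 0-4 kHz bins
    log_spec = np.log(spectrum[:, :keep] + 1e-8)             # logarithmic scaling
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)   # normalization
```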
S115: performing text preprocessing on the target's text information to obtain preprocessed text information; wherein the text preprocessing comprises at least one of: word segmentation, stop-word filtering and part-of-speech tagging.
Text preprocessing is performed on the text information obtained by converting the audio information into characters. Specifically, Chinese word segmentation, for example segmentation with part-of-speech awareness, can be applied to the transcript of the audio information; stop-word filtering can be applied to remove stop words such as "oh" and "okay" from the transcript; and part-of-speech tagging can be performed on the resulting words. The present application may also apply other preprocessing to the transcript, and is not limited in this respect.
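A sketch of this text preprocessing is given below, assuming the jieba toolkit for Chinese word segmentation and part-of-speech tagging; the stop-word list is a tiny illustrative placeholder rather than one specified by this application.

```python
# Hedged sketch: segmentation, stop-word filtering and POS tagging with jieba.
import jieba.posseg as pseg

STOP_WORDS = {"哦", "好的", "嗯"}        # placeholder stop words ("oh", "okay", "hmm")

def preprocess_text(transcript: str) -> list:
    tokens = []
    for word, pos in pseg.cut(transcript):     # segmentation + part-of-speech tagging
        if word.strip() and word not in STOP_WORDS:
            tokens.append((word, pos))         # keep (word, part-of-speech) pairs
    return tokens

# tokens = preprocess_text(transcript)         # list of (word, POS-tag) pairs
```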
In some embodiments, referring to fig. 6, the step S12 may include the following steps:
s121: processing the audio information by using an audio extraction model to obtain audio features of the audio information; the audio extraction model comprises a convolutional neural network and a bidirectional long-short term memory network.
For extracting the audio features of the audio information, the audio information may be processed by using an audio extraction model to obtain the audio features of the audio information. Wherein the audio extraction model may be a neural network model.
Referring to fig. 7, the audio extraction model includes a convolutional neural network and a bidirectional long short-term memory network (BiLSTM). The BiLSTM is formed by combining a forward long short-term memory network (LSTM) and a backward LSTM.
In some embodiments, the convolutional neural network may include a plurality of convolutional layers and pooling layers. For example, the convolutional block of the convolutional neural network comprises, in order, convolutional layer 1 (Conv_1, BN_1, 48 × 60 × 96), pooling layer 1 (max pooling, 23 × 29 × 96), convolutional layer 2 (Conv_2, BN_2, 23 × 29 × 256), pooling layer 2 (max pooling, 11 × 14 × 256), convolutional layer 3 (Conv_3, 11 × 14 × 384), convolutional layer 4 (Conv_4, 11 × 14 × 384), convolutional layer 5 (Conv_5, 11 × 14 × 256), and pooling layer 3 (max pooling, 11 × 14 × 256). The convolutional layers perform convolution on the audio information and the pooling layers perform pooling. The convolutional neural network may also use other sizes and other distributions of convolutional blocks; the application is not limited thereto.
In some embodiments, a spectrogram of the audio information may be input into the convolutional neural network of the audio extraction model, so that the audio information is convolved to obtain the spatial features of the audio. In this process, feature extraction is performed successively by the convolutional and pooling layers; as the number of channels gradually increases, the extracted feature information becomes richer and increasingly high-level, so the high-level spatial features of the spectrogram, i.e. the spatial features of the audio, are obtained in an order that extracts low-level features first and high-level features afterwards.
In some implementations, the spatial features of the audio, modeled from different receptive fields, may be converted into temporal features in sequence. The spatial features of the audio are input into the BiLSTM network connected to the convolutional neural network, and the bidirectional long short-term memory network of the audio extraction model encodes them to obtain the spatio-temporal features of the audio as the audio features of the audio information.
In this embodiment, since the audio information (spectrogram) contains both spatial and temporal information, the spatial features are extracted with the convolutional neural network and the temporal features with the BiLSTM network, which improves the encoding of the audio information and the ability to extract audio features.
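The following simplified PyTorch sketch shows the shape of such a model: a stack of convolution and pooling blocks over the spectrogram followed by a BiLSTM. The channel counts loosely follow the block description above, while the kernel sizes, strides and hidden size are assumptions.

```python
# Hedged sketch of a CNN + BiLSTM audio extraction model.
import torch
import torch.nn as nn

class AudioExtractor(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                           # convolutional block
            nn.Conv2d(1, 96, 3, padding=1), nn.BatchNorm2d(96), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        self.bilstm = nn.LSTM(input_size=256, hidden_size=hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq, time)
        spatial = self.conv(spectrogram)                     # spatial features of the audio
        seq = spatial.mean(dim=2).transpose(1, 2)            # pool frequency -> (batch, time', 256)
        temporal, _ = self.bilstm(seq)                       # encode temporal dependencies
        return temporal                                      # spatio-temporal audio features
```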
S122: processing the text information by using a text extraction model to obtain text characteristics of the text information; the text extraction model comprises a language model and a bidirectional long-short term memory network.
For extracting the text features of the text information, the text information can be processed by using a text extraction model to obtain the text features of the text information. Wherein the text extraction model may be a neural network model.
Referring to fig. 8, the text extraction model includes a language model and a bidirectional long short-term memory network. The language model may include a relative segment encoding layer (Relative Segment), a position embedding layer (Positional Embedding), a permutation mask layer (Permutation Mask), and 12 stacked XLNet layers, where each XLNet layer sequentially includes a multi-head attention layer (Multi-head Attention), a residual and normalization layer (Add & Norm), a feed-forward layer (Feed Forward), another residual and normalization layer (Add & Norm), and a memory layer (Memory).
In some embodiments, the text information (W1, W2, …, Wn) may be input into the text extraction model, and the language model of the text extraction model performs semantic processing on the text information to obtain semantic information. Specifically, the relative segment layer of the language model applies relative segment encoding to the text information, the position embedding layer embeds positions to obtain the relative positional relations between word sequences, and the permutation mask layer (i.e., a permutation language model) constructs different word-order inputs so that bidirectional context information can be used. The processed text information is input into the 12 stacked XLNet layers, which extract the semantics of the text information to obtain its semantic information. The semantic information is then input into the bidirectional long short-term memory network of the text extraction model, which encodes it to obtain the text features of the text information.
In this embodiment, the text features of the text information are obtained by encoding the semantic information with the bidirectional long short-term memory network; adding the BiLSTM network to the text feature extraction strengthens the relations between words in the text information, so the extracted text features are more effective.
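A hedged sketch of such a text extraction model is shown below, using a pretrained XLNet encoder from the Hugging Face Transformers library as a stand-in for the 12-layer XLNet stack described above (for Chinese text a Chinese XLNet checkpoint would be substituted), followed by a BiLSTM over its outputs. The checkpoint name and hidden size are assumptions.

```python
# Hedged sketch of an XLNet + BiLSTM text extraction model.
import torch.nn as nn
from transformers import XLNetModel

class TextExtractor(nn.Module):
    def __init__(self, hidden: int = 128, pretrained: str = "xlnet-base-cased"):
        super().__init__()
        self.xlnet = XLNetModel.from_pretrained(pretrained)   # 12-layer base encoder
        self.bilstm = nn.LSTM(self.xlnet.config.d_model, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask=None):
        semantic = self.xlnet(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        text_features, _ = self.bilstm(semantic)              # strengthen word-order relations
        return text_features                                  # (batch, seq_len, 2 * hidden)
```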
In some embodiments, referring to fig. 9, in the step S13, modifying the text feature based on the first correlation between the audio feature and the text feature to obtain the text modified feature may include the following steps:
s1311: based on the audio features and the text features, a first attention weight representing a first relevance is obtained.
Referring to fig. 11, text information may be used as a modality, audio information may be used as a modality, and the text information of the internal factor may be subjected to correlation calculation by using the audio feature as the external factor. That is, the importance of the text information of one modality is judged by using the audio information of the other modality, so that the fusion relationship between the two modalities can be established.
The text features and the audio features may each be corrected using an attention mechanism. Let A = [a_1, a_2, …, a_n] denote the audio features and T = [t_1, t_2, …, t_n] denote the text features, where A, T ∈ R^{d×n} and a_i, t_i ∈ R^d. A first attention weight representing the first correlation may be obtained based on the audio features and the text features.
Specifically, when the text features are corrected, the correlation between the audio features and the text features can be used to score the text features, that is, to evaluate their importance. The scoring method may be a scaled dot product, computed from the audio features and the text features to obtain the feature score value. The scoring process can be expressed by the following formula:
s(A, T) = (A^⊤ T) / √d    (1)
In formula (1), s denotes the scoring function, A denotes the audio features, T denotes the text features, s(A, T) denotes the feature score value, and d denotes the constant used to scale the dot product.
An attention operation is performed on the feature score value, that is, the feature score value is used to obtain an attention weight distribution matrix α ∈ R^{n×n}, which is the first attention weight representing the first correlation. This process can be expressed by the following formula:
α = softmax(s(A, T))    (2)
In formula (2), α denotes the first attention weight, and softmax() denotes the function performing the attention operation.
S1312: and correcting the text features by using the first attention weight to obtain text correction features.
After the first attention weight α is obtained, the text features are corrected with α to obtain the text correction feature X, where X = [x_1, x_2, …, x_n]^⊤ ∈ R^{n×d}. The text correction feature X can be expressed by the following formula:
X = α · T^⊤    (3)
Combining the above formulas, the corrected text features, i.e. the text correction feature X, can be expressed as:
X = softmax((A^⊤ T) / √d) · T^⊤    (4)
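The numpy sketch below restates the correction in formulas (1) to (4): the audio features score the text features with a scaled dot product, the scores are softmax-normalized into attention weights, and the weights reshape the text features into the text correction feature X; swapping the roles of A and T yields the audio correction feature Y in the same way. Shapes follow the notation above; this is an illustrative sketch, not the application's reference implementation.

```python
# Hedged sketch of the cross-modal attention correction, formulas (1)-(4).
import numpy as np

def cross_modal_correct(A: np.ndarray, T: np.ndarray) -> np.ndarray:
    """A: (d, n) external-factor features; T: (d, n) features to be corrected; returns (n, d)."""
    d = A.shape[0]
    scores = A.T @ T / np.sqrt(d)                  # s(A, T): scaled dot product, (n, n)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return alpha @ T.T                             # X = alpha · T^T, shape (n, d)

# X = cross_modal_correct(audio_features, text_features)   # text correction feature
# Y = cross_modal_correct(text_features, audio_features)   # audio correction feature
```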
in other embodiments, referring to fig. 10, in the step S13, modifying the audio feature based on the second correlation between the audio feature and the text feature to obtain the audio modified feature may include the following steps:
s1321: based on the audio features and the text features, a second attention weight representing a second relevance is obtained.
Referring to fig. 11, text information may be used as a modality, audio information may be used as a modality, and the text feature may be used as an external factor to perform correlation calculation on the internal factor audio information. That is, the text information of one modality is used to determine the importance of the audio information of the other modality, so that a fusion relationship between the two modalities can be established.
Based on the audio features and the text features, a second attention weight representing a second relevance is obtained. Wherein, the audio features and the text features can be utilized to carry out scaling dot products to obtain feature score values; attention operation is performed on the feature score value to obtain a second attention weight representing a second correlation.
For the specific implementation process of this step, reference may be made to the specific implementation process of step S1311, which is not described herein again.
S1322: and correcting the audio features by using the second attention weight to obtain audio corrected features.
After the second attention weight is obtained, the audio features are corrected with it to obtain the audio correction feature Y, where Y = [y_1, y_2, …, y_n]^⊤ ∈ R^{n×d}. The new audio features, i.e. the audio correction feature Y, can be expressed as:
Y = softmax((T^⊤ A) / √d) · A^⊤    (5)
in some embodiments, the text feature and the audio feature may be modified separately by using an attention mechanism, where the attention mechanism uses features of all time steps, including audio features of all time steps and text features of all time steps, and an attention layer is set for the text feature and the audio feature separately, and the two types of modal information may be mutually utilized by crossing the audio feature and the text feature with each other as an external factor of the attention layer, so that information in both text and audio aspects is considered during fusion. Namely, the voice stream (voice characteristic) utilizes the space-time characteristic to automatically capture key information in the text characteristic to eliminate noise information, and utilizes the key information to remold new text characteristic; similarly, text streams (text features) use the text features to capture important information in speech features, eliminate secondary information, and use the important information to reconstruct new speech features.
In some embodiments, the text correction feature X and the audio correction feature Y are fused, information in the text and the audio is considered comprehensively, invalid features in the audio features can be corrected in an auxiliary manner for the text features, the invalid features in the text features can be corrected in an auxiliary manner for the audio features, effective fusion of information in two modes can be completed better, the features of the text information and the audio information are complemented and mutually corrected, a better emotion recognition effect can be obtained, and the influence of the introduction of the invalid features and noise features on an emotion recognition result during fusion of the two modes can be reduced, so that the accuracy of emotion recognition is improved.
The above-described embodiments can be applied to various emotion recognition scenarios. The present application is described by taking its application in a smart camera device for emotion detection of a target as an example, but the application is not limited thereto.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of the image capturing apparatus of the present application. The imaging device 50 includes an imaging part 51 and a recognition part 52. The steps of any of the above-described emotion recognition methods may be performed by the image pickup part 51 and the recognition part 52.
The camera 51 is used to capture audio or video of a target. For example, in a classroom, the camera 50 may be used to capture video or audio of a subject. Extracting audio information of a target in video or audio; and performing character conversion on the audio information of the target to obtain text information of the target so as to obtain the audio information and the text information of the target.
The recognition component 52 is adapted to perform any of the steps of the emotion recognition method described above using the audio or video of the object. The recognition component 52 may acquire audio information and text information of the object using audio or video of the object; wherein the text information is derived based on the audio information. And respectively extracting audio features of the audio information and text features of the text information. Respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain audio correction features and text correction features; and integrating the audio correction features and the text correction features to perform emotion recognition to obtain emotion classification of the target.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
In view of the above embodiments, the present application also provides an emotion recognition apparatus. Referring to fig. 13, fig. 13 is a schematic structural diagram of a first embodiment of the emotion recognition apparatus of the present application. The emotion recognition apparatus 20 may include an acquisition unit 21, a feature extraction unit 22, a feature correction unit 23, and a recognition unit 24, which are connected to each other.
The acquiring unit 21 is used for acquiring audio information and text information of a target; wherein the text information is derived based on the audio information.
The feature extraction unit 22 is configured to extract an audio feature of the audio information and a text feature of the text information, respectively.
The feature correction unit 23 is configured to correct the audio features and the text features respectively based on the correlation between the audio features and the text features, so as to obtain audio correction features and text correction features;
the recognition unit 24 is configured to perform emotion recognition by fusing the audio correction features and the text correction features, so as to obtain an emotion classification of the target.
In some embodiments, the emotion recognition device 20 may be an image pickup device, an electronic apparatus, a sound recording apparatus, or the like. For example, the emotion recognition device may be a smart camera that films a teacher, a student, or the like in a classroom in real time, and the audio information and text information are obtained from the captured video. The emotion recognition device 20 of the present application may also be another device having a camera or sound recording function, but the application is not limited thereto.
In some embodiments, the emotion recognition device 20 may also be a device capable of acquiring text information and audio information, and the emotion recognition device 20 may be connected to the device with camera or audio recording function to acquire video or audio transmitted by the device with camera or audio recording function to acquire text information and audio information for emotion recognition.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
In some embodiments, please refer to fig. 14 to 15, fig. 14 is a schematic structural diagram of a second embodiment of the emotion recognition device of the present application. The emotion recognition apparatus 20 may include an acquisition unit 21, a feature extraction unit 22, a feature correction unit 23, and a recognition unit 24.
The acquisition unit 21 may include an acquisition unit 211 and a separation unit 212. In some embodiments, the acquisition unit 211 may be a camera, a camera device, a sound recording device, or the like, and is used to obtain video or audio of the target. The separation unit 212 is used to extract the audio information of the target from the video or audio. For example, the separation unit 212 may be configured to separate, in a local storage space, the audio and video collected by the camera, that is, split the audio part from the video part, retaining only the audio part and discarding the video part.
The feature extraction unit 22 includes a speech preprocessing unit 221, a speech feature unit 222, a text preprocessing unit 223, and a text feature unit 224.
The voice preprocessing unit 221 is configured to perform audio preprocessing on the target audio information to obtain preprocessed audio information; wherein the audio pre-processing comprises at least one of: frame windowing, discrete Fourier transform processing, logarithmic scaling, denoising, and normalization.
The voice feature unit 222 is configured to process the audio information by using an audio extraction model to obtain an audio feature of the audio information; the audio extraction model comprises a convolutional neural network and a bidirectional long-short term memory network. Specifically, carrying out convolution processing on audio information by using a convolution neural network of an audio extraction model to obtain spatial characteristics of the audio; and coding the spatial characteristics of the audio by using a bidirectional long-short term memory network of the audio extraction model to obtain the spatio-temporal characteristics of the audio as the audio characteristics of the audio information.
The text preprocessing unit 223 may be configured to perform text conversion on the target's audio information to obtain the target's text information, and to perform text preprocessing on the target's text information to obtain preprocessed text information; wherein the text preprocessing comprises at least one of: word segmentation, stop-word filtering, and part-of-speech tagging.
The text feature unit 224 is configured to process the text information by using a text extraction model to obtain a text feature of the text information; the text extraction model comprises a language model and a bidirectional long-short term memory network. Specifically, semantic processing can be performed on the text information by using a language model of the text extraction model to obtain semantic information; and coding the semantic information by using a bidirectional long-short term memory network of the text extraction model to obtain the text characteristics of the text information.
The feature correcting unit 23 includes a voice correcting unit 231 and a text correcting unit 232.
The voice modification unit 231 is configured to modify the audio feature based on a second correlation between the audio feature and the text feature, so as to obtain an audio modified feature. Specifically, a second attention weight representing a second relevance may be obtained based on the audio feature and the text feature; and correcting the audio features by using the second attention weight to obtain audio corrected features.
The text modification unit 232 is configured to modify the text feature based on a first correlation between the audio feature and the text feature, so as to obtain a text modified feature. Specifically, a first attention weight representing a first relevance may be obtained based on the audio feature and the text feature; and correcting the text features by using the first attention weight to obtain text correction features.
In some implementations, obtaining a first attention weight representing the first relevance or a second attention weight representing the second relevance based on the audio features and the text features includes: performing scaling dot product by using the audio features and the text features to obtain feature score values; and performing attention operation on the feature score value to obtain a first attention weight or a second attention weight.
The emotion recognition apparatus 20 further includes a storage unit 25. The storage unit 25 is used to store the audio or video, as well as the data generated by each stage of the emotion recognition process. For example, from the acquisition of the audio and video data, through the processing units, to the forward inference of the feedforward network that completes the final emotion recognition, the audio data, text data and detection results generated along the way are temporarily stored in the storage unit on the device, so that the stored data can be scheduled in real time, for example to provide data support for a classroom emotion portrait unit. In addition, with authorization, the storage unit of the device can transmit historical data to a cloud server so that it can be traced later.
The recognition unit 24 is configured to perform emotion recognition by fusing the audio correction features and the text correction features, so as to obtain an emotion classification of the target.
The recognition unit 24 may also perform further processing on the emotion classifications, such as real-time processing and statistics of the classification results and display of classroom emotions in a friendly visualization interface, which may include a real-time visual display of emotions and an overall statistical report over a period of time.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
With respect to the above embodiments, the present application provides an emotion recognition apparatus, please refer to fig. 15, where fig. 15 is a schematic structural diagram of a third embodiment of a computer device according to the present application. The computer device 30 comprises a memory 31 and a processor 32, wherein the memory 31 and the processor 32 are coupled to each other, the memory 31 stores program data, and the processor 32 is configured to execute the program data to implement the steps of any of the embodiments of the emotion recognition method described above.
In the present embodiment, the processor 32 may also be referred to as a CPU (Central Processing Unit). The processor 32 may be an integrated circuit chip having signal processing capabilities. The processor 32 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 32 may be any conventional processor or the like.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
For the method of the above embodiment, it can be implemented in the form of a computer program, so that the present application provides a storage device, please refer to fig. 16, where fig. 16 is a schematic structural diagram of an embodiment of the storage device of the present application. The storage device 40 has stored therein program data 41 executable by a processor, the program data 41 being executable by the processor to implement the steps of any of the embodiments of the emotion recognition method described above.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
The storage device 40 of the present embodiment may be a medium that can store the program data 41, such as a usb disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or may be a server that stores the program data 41, and the server may transmit the stored program data 41 to another device for operation, or may operate the stored program data 41 by itself.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a storage device, which is a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, and they may be centralized on a single computing device or distributed across a network of multiple computing devices. They may alternatively be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (13)

1. A method of emotion recognition, the method comprising:
acquiring audio information and text information of a target; wherein the text information is derived from the audio information;
respectively extracting audio features of the audio information and text features of the text information;
respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain audio correction features and text correction features;
and integrating the audio correction features and the text correction features to perform emotion recognition to obtain an emotion classification of the target.
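By way of illustration only: claim 1 does not fix how the corrected features are "integrated", so the sketch below assumes one common choice, mean pooling over time followed by concatenation and a linear classifier. The 256-dimensional features, the pooling operator and the six emotion classes are assumptions for the example, not part of the disclosure.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuses corrected audio and text features and predicts an emotion class."""

    def __init__(self, feat_dim=256, n_classes=6):
        super().__init__()
        # Linear head over the concatenated, time-pooled features.
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, audio_corr, text_corr):
        # audio_corr: (batch, T_audio, feat_dim), text_corr: (batch, T_text, feat_dim)
        fused = torch.cat([audio_corr.mean(dim=1), text_corr.mean(dim=1)], dim=-1)
        return self.head(fused)   # (batch, n_classes) emotion logits

# Hypothetical usage with assumed shapes.
logits = FusionClassifier()(torch.randn(2, 120, 256), torch.randn(2, 30, 256))
print(logits.shape)   # torch.Size([2, 6])
```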
2. The method of claim 1, wherein the respectively correcting the audio features and the text features based on the correlation between the audio features and the text features to obtain the audio correction features and the text correction features comprises:
correcting the text features based on a first correlation between the audio features and the text features to obtain the text correction features; and
correcting the audio features based on a second correlation between the audio features and the text features to obtain the audio correction features.
3. The method of claim 2,
the correcting the text features based on the first correlation between the audio features and the text features to obtain the text correction features comprises: obtaining a first attention weight representing the first correlation based on the audio features and the text features; and correcting the text features by using the first attention weight to obtain the text correction features; and/or,
the correcting the audio features based on the second correlation between the audio features and the text features to obtain the audio correction features comprises: obtaining a second attention weight representing the second correlation based on the audio features and the text features; and correcting the audio features by using the second attention weight to obtain the audio correction features.
4. The method of claim 3, wherein the obtaining a first attention weight representing the first correlation or a second attention weight representing the second correlation based on the audio features and the text features comprises:
performing a scaled dot product on the audio features and the text features to obtain a feature score value;
and performing an attention operation on the feature score value to obtain the first attention weight or the second attention weight.
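A minimal sketch of the scaled dot product and attention operation of claims 3 and 4 is given below, assuming both modalities have already been projected to a common feature dimension and that the "correction" takes the usual form of an attention-weighted combination; the tensor names, shapes and the query/key assignment are assumptions, not taken from the claims.

```python
import torch
import torch.nn.functional as F

def cross_modal_correction(query_feat, key_feat, value_feat):
    """Scaled dot-product attention between two modalities.

    query_feat: (batch, T_q, d)  features of the modality being corrected
    key_feat / value_feat: (batch, T_k, d)  features of the other modality
    Returns corrected features of shape (batch, T_q, d).
    """
    d = query_feat.size(-1)
    # Feature score value: scaled dot product of the two modalities.
    scores = torch.matmul(query_feat, key_feat.transpose(-2, -1)) / (d ** 0.5)
    # Attention operation on the score value yields the attention weights.
    weights = F.softmax(scores, dim=-1)
    # Correct the target features with the attention weights.
    return torch.matmul(weights, value_feat)

# Hypothetical usage: correct the text features against audio and vice versa.
audio = torch.randn(2, 120, 256)   # assumed audio features
text = torch.randn(2, 30, 256)     # assumed text features
text_corrected = cross_modal_correction(text, audio, audio)
audio_corrected = cross_modal_correction(audio, text, text)
```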
5. The method of claim 1, wherein the extracting the audio feature of the audio information comprises:
processing the audio information by using an audio extraction model to obtain audio features of the audio information; wherein the audio extraction model comprises a convolutional neural network and a bidirectional long-short term memory network.
6. The method of claim 5, wherein the processing the audio information using an audio extraction model to obtain the audio features of the audio information comprises:
performing convolution processing on the audio information by using the convolutional neural network of the audio extraction model to obtain spatial features of the audio;
and encoding the spatial features of the audio by using the bidirectional long-short term memory network of the audio extraction model to obtain spatio-temporal features of the audio as the audio features of the audio information.
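The audio branch of claims 5 and 6 (a convolutional neural network for spatial features, followed by a bidirectional LSTM for spatio-temporal features) could look roughly like the sketch below; the log-Mel spectrogram input, the layer counts and the channel sizes are all assumptions for illustration, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class AudioExtractor(nn.Module):
    """CNN front end + bidirectional LSTM, loosely following claims 5-6."""

    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        # Convolution over the time-frequency plane extracts spatial features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        # The bidirectional LSTM encodes the spatial features over time.
        self.bilstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, n_mels), e.g. a log-Mel spectrogram.
        x = self.cnn(spectrogram)                     # (batch, 64, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.bilstm(x)                       # (batch, time/4, 2*hidden)
        return out

# Hypothetical usage: batch of 4 spectrograms, 200 frames, 64 Mel bins.
feats = AudioExtractor()(torch.randn(4, 1, 200, 64))  # -> (4, 50, 256)
```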
7. The method of claim 1, wherein extracting the text feature of the text information comprises:
processing the text information by using a text extraction model to obtain text features of the text information; wherein the text extraction model comprises a language model and a bidirectional long-short term memory network.
8. The method of claim 7, wherein the processing the text information using the text extraction model to obtain the text features of the text information comprises:
performing semantic processing on the text information by using a language model of the text extraction model to obtain semantic information;
and encoding the semantic information by using the bidirectional long-short term memory network of the text extraction model to obtain the text features of the text information.
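For the text branch of claims 7 and 8, a rough sketch is given below; the choice of "bert-base-chinese" as the language model and the hidden sizes are assumptions, since the claims do not name a specific model.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextExtractor(nn.Module):
    """Language model for semantics + bidirectional LSTM encoder (claims 7-8)."""

    def __init__(self, lm_name="bert-base-chinese", hidden=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)
        # The bidirectional LSTM encodes the token-level semantic vectors.
        self.bilstm = nn.LSTM(self.lm.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)

    @torch.no_grad()
    def forward(self, sentences):
        tokens = self.tokenizer(sentences, padding=True, return_tensors="pt")
        semantics = self.lm(**tokens).last_hidden_state   # (batch, seq, lm_dim)
        out, _ = self.bilstm(semantics)                    # (batch, seq, 2*hidden)
        return out

# Hypothetical usage (downloads the assumed pretrained model on first run):
# text_feats = TextExtractor()(["I am very happy today"])
```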
9. The method of claim 1, wherein the obtaining the audio information and the text information of the target comprises:
acquiring video or audio of the target;
extracting audio information of the target in the video or audio;
and performing speech-to-text conversion on the audio information of the target to obtain the text information of the target.
10. The method of claim 9, wherein after the acquiring of the video or audio of the target, the method further comprises:
performing audio preprocessing on the audio information of the target to obtain preprocessed audio information, wherein the audio preprocessing comprises at least one of: framing and windowing, discrete Fourier transform processing, logarithmic scaling, denoising and normalization processing; and
performing text preprocessing on the text information of the target to obtain preprocessed text information, wherein the text preprocessing comprises at least one of: word segmentation processing, word retention filtering processing and part-of-speech tagging processing.
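The preprocessing of claim 10 can be illustrated with plain NumPy for the audio side and a segmentation library for the text side; the frame and hop lengths, the jieba segmenter and the filtering against a stop-word set are assumptions, and denoising is omitted from the sketch.

```python
import numpy as np
import jieba  # assumed Chinese word-segmentation library, not named in the claim

def preprocess_audio(wave, frame_len=400, hop=160):
    """Framing + windowing + discrete Fourier transform + log scaling + normalization.

    wave: 1-D float array with at least frame_len samples; 400/160 correspond to
    25 ms frames with a 10 ms hop at 16 kHz (assumed values).
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])             # framing + windowing
    spectrum = np.abs(np.fft.rfft(frames, axis=1))             # discrete Fourier transform
    log_spec = np.log(spectrum + 1e-8)                         # logarithmic scaling
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)  # normalization

def preprocess_text(sentence, stop_words=frozenset()):
    """Word segmentation followed by simple word filtering."""
    return [w for w in jieba.lcut(sentence) if w.strip() and w not in stop_words]

# Hypothetical usage on one second of low-level noise at 16 kHz.
features = preprocess_audio(np.random.randn(16000) * 0.01)
print(features.shape)   # (98, 201) with the assumed parameters
```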
11. An image pickup apparatus, characterized by comprising:
a camera component, configured to collect audio or video of a target; and
a recognition component, configured to perform the steps of the method according to any one of claims 1 to 10 by using the audio or video of the target.
12. An emotion recognition apparatus, comprising a memory and a processor coupled to each other, the memory having stored therein program data, the processor being configured to execute the program data to implement the steps of the method as claimed in any one of claims 1 to 10.
13. A storage device, characterized in that program data executable by a processor is stored therein, the program data, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 10.
CN202111605408.8A 2021-12-25 2021-12-25 Emotion recognition method, camera device, emotion recognition device and storage device Pending CN114492579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111605408.8A CN114492579A (en) 2021-12-25 2021-12-25 Emotion recognition method, camera device, emotion recognition device and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111605408.8A CN114492579A (en) 2021-12-25 2021-12-25 Emotion recognition method, camera device, emotion recognition device and storage device

Publications (1)

Publication Number Publication Date
CN114492579A true CN114492579A (en) 2022-05-13

Family

ID=81495618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111605408.8A Pending CN114492579A (en) 2021-12-25 2021-12-25 Emotion recognition method, camera device, emotion recognition device and storage device

Country Status (1)

Country Link
CN (1) CN114492579A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024008215A3 (en) * 2022-07-08 2024-02-29 顺丰科技有限公司 Speech emotion recognition method and apparatus

Similar Documents

Publication Publication Date Title
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
WO2020119630A1 (en) Multi-mode comprehensive evaluation system and method for customer satisfaction
WO2015158017A1 (en) Intelligent interaction and psychological comfort robot service system
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN112686048B (en) Emotion recognition method and device based on fusion of voice, semantics and facial expressions
Dhuheir et al. Emotion recognition for healthcare surveillance systems using neural networks: A survey
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN110175526A (en) Dog Emotion identification model training method, device, computer equipment and storage medium
Paleari et al. Features for multimodal emotion recognition: An extensive study
CN112016367A (en) Emotion recognition system and method and electronic equipment
CN113380271B (en) Emotion recognition method, system, device and medium
CN111192659A (en) Pre-training method for depression detection and depression detection method and device
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN110348409A (en) A kind of method and apparatus that facial image is generated based on vocal print
CN114495217A (en) Scene analysis method, device and system based on natural language and expression analysis
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN116484318A (en) Lecture training feedback method, lecture training feedback device and storage medium
Hantke et al. EAT- The ICMI 2018 Eating Analysis and Tracking Challenge
Wang et al. Fastlts: Non-autoregressive end-to-end unconstrained lip-to-speech synthesis
Tsai et al. Sentiment analysis of pets using deep learning technologies in artificial intelligence of things system
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Shanthi et al. An integrated approach for mental health assessment using emotion analysis and scales
Dweik et al. Read my lips: Artificial intelligence word-level arabic lipreading system
Tong et al. Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning
CN112699236B (en) Deepfake detection method based on emotion recognition and pupil size calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination