CN115062691A - Attribute identification method and device - Google Patents

Attribute identification method and device

Info

Publication number
CN115062691A
Authority
CN
China
Prior art keywords
attribute
text
features
modality
feature
Prior art date
Legal status
Pending
Application number
CN202210581712.1A
Other languages
Chinese (zh)
Inventor
顾艳梅
王涛
王志铭
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

The embodiments of this specification describe an attribute identification method and device. According to the method of the embodiments, raw data for identifying an attribute is first obtained from at least two modalities, and attribute-feature mining is then performed on the raw data of each modality separately. After the obtained attribute features of each modality are fused, the identification result of the attribute can be obtained from the fused attribute features. Because the scheme identifies the attribute by fusing data of different modalities, the advantages of each modality's data for attribute identification are fully exploited, information beneficial to attribute identification is not omitted, and the accuracy of attribute identification can be improved.

Description

Attribute identification method and device
Technical Field
One or more embodiments of the present specification relate to the field of computer technology, and more particularly, to an attribute identification method and apparatus.
Background
Attribute identification is a technical means frequently used in service scenarios. For example, in a human-computer interaction service, if a machine cannot accurately identify the emotional state of a user, it cannot take appropriate actions to serve that user, which results in a poor user experience. Accurate attribute identification therefore helps improve the user experience.
However, the accuracy of identifying attributes is currently low.
Disclosure of Invention
One or more embodiments of the present specification describe an attribute identification method and apparatus, which can improve accuracy of identifying an attribute.
According to a first aspect, there is provided an attribute identification method comprising:
obtaining raw data from at least two modalities for identifying the attributes; wherein the similarity of the semantics of the raw data of the at least two modalities is greater than a predetermined value;
respectively carrying out feature mining on the original data of the at least two modes to obtain attribute features corresponding to the modes; wherein the attribute features are features that can affect the attributes;
fusing the obtained attribute characteristics corresponding to each mode to obtain fused characteristics;
and obtaining the identification result of the attribute by using the fusion characteristic.
In one possible implementation, when the at least two modalities include a speech modality, the attribute feature includes a speech feature vector and a speech alignment matrix;
the acquiring raw data from at least two modalities for identifying the attribute includes: acquiring a voice signal from a voice modality for identifying the attribute;
the feature mining of the original data of the at least two modalities to obtain attribute features corresponding to the modalities includes:
the voice signal is cut into at least two frames according to a preset first time length to obtain a time domain cut signal;
performing Fourier transform on the time domain segmentation signal to obtain frequency domain segmentation characteristics;
performing attribute feature extraction on the frequency domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features;
respectively obtaining a voice feature vector and a voice alignment matrix corresponding to the voice modality according to the extracted features; and the dimension size of the representation frame in the voice alignment matrix is equal to the dimension size of the representation frame in the voice feature vector.
In a possible implementation manner, signals of two adjacent frames in the time-domain sliced signal have a time overlap of a preset second time length, and the second time length is smaller than the first time length.
In one possible implementation, the at least one feature extraction convolution kernel includes a first convolution kernel, a second convolution kernel, and a third convolution kernel; the first convolution kernel and the second convolution kernel are different in size, and the second convolution kernel and the third convolution kernel are the same in size;
the performing attribute feature extraction on the frequency domain segmentation features by using at least one feature extraction convolution kernel to obtain the extracted features comprises:
performing feature extraction on the frequency domain segmentation features by using the first convolution kernel to obtain first extraction features;
performing feature extraction on the first extraction features by using the second convolution kernel to obtain second extraction features;
and performing feature extraction on the second extraction features by using the third convolution kernel to obtain the extraction features.
In a possible implementation manner, the obtaining a speech feature vector corresponding to the speech modality according to the extracted feature includes:
performing local attribute feature mining on the extracted features by using a self-attention mechanism to obtain a voice feature vector corresponding to the voice modality; wherein the local attribute feature is a feature capable of characterizing details of the attribute.
In a possible implementation manner, the obtaining a speech alignment matrix corresponding to the speech modality according to the extracted features includes:
scanning the matrix corresponding to the extracted features by using a preset fourth convolution kernel; wherein the size of the fourth convolution kernel satisfies: enabling a dimension size for characterizing a frame in the speech alignment matrix to be equal to a dimension size for characterizing a frame in the speech feature vector;
and determining a matrix formed by the maximum value in each scanning result as a voice alignment matrix of the voice modality.
In one possible implementation, when the at least two modalities include a text modality, the attribute feature includes a text feature vector and a text alignment matrix;
the acquiring raw data from at least two modalities for identifying the attribute includes: acquiring text data from a text modality for identifying the attributes;
the feature mining of the original data of the at least two modalities to obtain attribute features corresponding to the modalities includes:
inputting the text data into a pre-trained text extraction model for feature extraction to obtain the text feature vector; the training method of the text extraction model comprises the following steps: training by utilizing at least one group of sample sets; each group of sample sets comprises character information and coding information;
linearly transforming the text feature vector by using a preset linear transformation parameter to obtain the text alignment matrix; and the column dimension of the text alignment matrix is equal to the column dimension of the speech feature vector obtained from the speech signal.
In one possible implementation, when the at least two modalities include a speech modality and a text modality;
the fusing the obtained attribute features corresponding to each modality to obtain a fused feature includes:
calculating the influence of the attribute characteristics of the text mode on the attribute characteristics of the voice mode to obtain an influence matrix of the voice mode; wherein the attribute features of the text modality are features capable of characterizing details of the attribute;
calculating the influence of the attribute features of the voice modality on the attribute features of the text modality to obtain an influence matrix of the text modality; wherein the attribute features of the voice modality are features capable of characterizing details of the attribute;
and splicing the influence matrix of the voice mode, the influence matrix of the text mode and the matrixes corresponding to the attribute characteristics of the modes to obtain the fusion characteristics.
In one possible implementation manner, the attribute features of the speech modality include a speech feature vector and a speech alignment matrix, and the attribute features of the text modality include a text feature vector and a text alignment matrix;
the calculating the influence of the attribute features of the text modality on the attribute features of the voice modality to obtain an influence matrix of the voice modality includes:
calculating an influence matrix of the speech modality by using the following calculation formula:
$E_a = \delta(W_a \cdot X_s)$
wherein $E_a$ denotes the influence matrix of the speech modality, $W_a$ denotes the speech alignment matrix, $X_s$ denotes the text feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function;
and/or,
the calculating the influence of the attribute characteristics of the voice modality on the attribute characteristics of the text modality to obtain an influence matrix of the text modality includes:
calculating an influence matrix of the text modality by using the following calculation formula:
$E_s = \delta(W_s \cdot X_a)$
wherein $E_s$ denotes the influence matrix of the text modality, $W_s$ denotes the text alignment matrix, $X_a$ denotes the speech feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function.
In one possible implementation, the attribute features of the speech modality include speech feature vectors, and the attribute features of the text modality include text feature vectors;
the splicing the influence matrix of the voice modality, the influence matrix of the text modality and the matrixes corresponding to the attribute characteristics of the modalities to obtain the fusion characteristics comprises:
calculating the sum of the influence matrix of the voice modality and the voice feature vector to obtain a fusion matrix of the voice modality;
calculating the sum of the influence matrix of the text mode and the text characteristic vector to obtain a fusion matrix of the text mode;
calculating the fusion feature using the following calculation:
$A = \mathrm{Concat}(\hat{X}_a, \hat{X}_s)$
wherein $A$ denotes the fusion feature, $\hat{X}_a$ denotes the fusion matrix of the speech modality, $\hat{X}_s$ denotes the fusion matrix of the text modality, and $\mathrm{Concat}(\cdot)$ denotes matrix concatenation.
In a possible implementation manner, the obtaining, by using the fusion feature, the identification result of the attribute includes:
calculating at least one mathematical characteristic quantity of the fusion characteristic; wherein the mathematical characteristic quantity includes: mean and variance;
inputting the at least one mathematical characteristic quantity into a pre-trained attribute recognition model to obtain a recognition result of the attribute; the training method of the attribute recognition model comprises the following steps: training by utilizing at least one group of sample training sets; each group of sample training set comprises a mathematical characteristic quantity of the attributes and a recognition result of the attributes.
In one possible implementation, the attributes include: the emotional state of the user.
According to a second aspect, there is provided an attribute identification apparatus comprising: the system comprises a data acquisition module, a feature mining module, a feature fusion module and an attribute identification module;
the data acquisition module is configured to acquire raw data from at least two modalities for identifying the attributes; wherein the similarity of the semantics of the raw data of the at least two modalities is greater than a predetermined value;
the characteristic mining module is configured to respectively perform characteristic mining on the original data of the at least two modalities acquired by the data acquisition module to obtain attribute characteristics corresponding to each modality; wherein the attribute feature is a feature that can affect the attribute;
the feature fusion module is configured to fuse the attribute features corresponding to the modalities, which are obtained by the feature mining module, to obtain fusion features;
and the attribute identification module is configured to obtain an identification result of the attribute by using the fusion feature obtained by the feature fusion module.
According to a third aspect, there is provided a computing device comprising: a memory having executable code stored therein, and a processor that when executing the executable code implements the method of any of the first aspects described above.
According to the method and device provided in the embodiments of this specification, when an attribute is identified, raw data for identifying the attribute can first be obtained from at least two modalities, and attribute-feature mining can then be performed on the raw data of each modality separately. After the obtained attribute features of each modality are fused, the identification result of the attribute can be obtained from the fused attribute features. Because the scheme identifies the attribute by fusing data of different modalities, the advantages of each modality's data for attribute identification are fully exploited, information beneficial to attribute identification is not omitted, and the accuracy of attribute identification can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present specification, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of an attribute identification method provided in one embodiment of the present description;
FIG. 2 is a flow diagram of a method for attribute feature mining provided in one embodiment of the present description;
FIG. 3 is a flow diagram of a method for attribute feature extraction provided in one embodiment of the present description;
FIG. 4 is a flow diagram of another method for attribute feature extraction provided by one embodiment of the present description;
FIG. 5 is a flow diagram of yet another method of attribute feature extraction provided by one embodiment of the present description;
FIG. 6 is a flow diagram of a method of feature fusion provided in one embodiment of the present description;
FIG. 7 is a flow diagram of another method for attribute identification provided by one embodiment of the present description;
fig. 8 is a schematic diagram of an attribute identification device according to an embodiment of the present disclosure.
Detailed Description
Artificial intelligence recognition technology is a commonly used technical means in business scenarios, for example image recognition, character recognition, and speech recognition. Recognition technology improves convenience for users and enhances the user experience, especially when certain attributes are recognized in a business scenario. For example, emotion recognition may be performed on a user's speech or text to determine how the human-computer interaction should proceed; as another example, a user's attitude toward a person, object, or event may be recognized from text or speech.
Currently, attribute identification in business scenarios is often performed using data of a single modality. For example, in a security business scenario, a large amount of human-computer interaction is required, and in order to serve a customer with accurate conversational content, the emotional state of the user is usually recognized from the speech uttered by the user or from the user's facial image. However, attribute identification using data of a single modality may ignore some information. For example, a user may sometimes express an emotion or attitude by raising the volume or changing the tone, which cannot be recognized from text alone. Therefore, the accuracy of attribute identification using data of a single modality is low.
Based on this, the present scheme fuses multi-modal data and identifies the attribute based on the fused data. In this way, the advantages of each modality's data for attribute identification are fully exploited and information beneficial to attribute identification is not omitted, so the accuracy of attribute identification can be improved.
As shown in fig. 1, an embodiment of the present specification provides an attribute identification method, which may include the following steps:
step 101: obtaining raw data from at least two modalities for identifying attributes; the similarity of the semantics of the original data of at least two modalities is greater than a preset value;
step 103: respectively carrying out feature mining on the original data of at least two modes to obtain attribute features corresponding to each mode; the attribute features are features capable of influencing the attributes;
step 105: fusing the obtained attribute characteristics corresponding to each mode to obtain fused characteristics;
step 107: and obtaining the identification result of the attribute by utilizing the fusion characteristic.
In this embodiment of the specification, when an attribute is identified, raw data for identifying the attribute may first be obtained from at least two modalities, and attribute-feature mining may then be performed on the raw data of each modality separately. After the obtained attribute features of each modality are fused, the identification result of the attribute can be obtained from the fused attribute features. Because the scheme identifies the attribute by fusing data of different modalities, the advantages of each modality's data for attribute identification are fully exploited, information beneficial to attribute identification is not omitted, and the accuracy of attribute identification can be improved.
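For a concrete picture of the flow of steps 101 to 107, the following is a minimal, illustrative Python sketch for a speech modality and a text modality; all function bodies are toy stand-ins with assumed shapes and names, not the actual implementation described in the embodiments below.

```python
import numpy as np

# Toy stand-ins for each step (hypothetical functions and shapes, for illustration only).
def mine_speech_features(signal):
    return np.random.randn(256, 80)          # placeholder "speech attribute features"

def mine_text_features(text):
    return np.random.randn(256, 80)          # placeholder "text attribute features"

def fuse_features(speech_feat, text_feat):
    return np.concatenate([speech_feat, text_feat], axis=0)   # step 105: fusion

def recognize(fused, labels=("happy", "angry", "impatient")):
    scores = fused.mean(axis=1)[: len(labels)]                 # step 107: toy classifier
    return labels[int(np.argmax(scores))]

speech_signal = np.random.randn(16000)       # step 101: 1 s of audio at an assumed 16 kHz
text_data = "I am very satisfied with the service"
fused = fuse_features(mine_speech_features(speech_signal),     # step 103 + step 105
                      mine_text_features(text_data))
print(recognize(fused))                                        # step 107
```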
The steps in FIG. 1 are described below with reference to specific examples.
First in step 101, raw data from at least two modalities for identifying an attribute is acquired.
The raw data from different modalities may be different types of data, for example a speech signal of a speech modality, text data of a text modality, picture data of an image modality, or video data of a video modality. It is of course readily understood that the semantic similarity of the acquired raw data of the at least two modalities should be greater than a predetermined value.
For example, when performing emotion recognition of a user, voice information uttered by the user may be first acquired, and then the voice information may be recognized to recognize text semantics expressed by the user. The original data of the plurality of modes for recognizing the emotion of the user can be voice information sent by the user and text semantic data recognized according to the voice information. Of course, the text semantic data in the multimodal raw data may also be directly input by the user.
For another example, when the attitude of the user about a certain article is recognized, voice information uttered by the user, user image information when the user indicates the attitude, and text information may be acquired. Therefore, by fusing the tone and tone information in the voice mode, the expression information of the user about the object to be recognized in the image mode, the text information of the user about the object to be recognized in the text mode and the like, the fusion of the advantage information of different modes is realized, and the attitude of the user about the object to be recognized can be recognized more accurately.
Then, in step 103, feature mining is performed on the original data of at least two modalities to obtain attribute features corresponding to each modality.
As mentioned above in step 101, the raw data for performing attribute recognition may be from multiple modalities such as voice, text, image, and video, and step 103 is described below in terms of a voice modality and a text modality.
The following explains a case of a speech modality.
When the at least two modalities in step 101 include a speech modality, the attribute feature may include a speech feature vector and a speech alignment matrix; step 101 may now acquire a speech signal from a speech modality for identifying an attribute when acquiring raw data from at least two modalities for identifying an attribute. The voice signal may be a voice signal directly collected and sent by a user, or may be a sound recording message played. For example, in the field of artificial intelligence, when a user issues a command to an artificial intelligence device by speaking, the speech uttered by the user is a speech signal in a speech mode.
At this time, when performing feature mining on raw data of at least two modalities to obtain attribute features corresponding to each modality, as shown in fig. 2, step 103 may include the following steps:
performing feature mining on original data of at least two modes to obtain attribute features corresponding to each mode, wherein the attribute features comprise:
step 201: dividing the voice signal into at least two frames according to a preset first time length to obtain a time domain division signal;
step 203: carrying out Fourier transform on the time domain segmentation signal to obtain frequency domain segmentation characteristics;
step 205: performing attribute feature extraction on the frequency domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features;
step 207: respectively obtaining a voice feature vector and a voice alignment matrix corresponding to the voice modality according to the extracted features; and the dimension of the representation frame in the voice alignment matrix is equal to the dimension of the representation frame in the voice feature vector.
In the embodiment of this specification, when feature mining is performed on a speech signal, the speech signal is first sliced into at least two frames according to a preset first time length to obtain a time-domain sliced signal, and the time-domain sliced signal is then converted into frequency-domain segmentation features through Fourier transform. Further, attribute feature extraction is performed on the frequency-domain segmentation features using convolution kernels to obtain extracted features, and the speech feature vector and speech alignment matrix corresponding to the speech modality are obtained from the extracted features. Because the speech signal is divided into segments by frame when the attribute features are mined, data processing efficiency is not degraded by an overly large single data volume, and the stability and efficiency of data processing can be improved.
Step 201 is explained below.
In this step, an audio signal of indefinite length is sliced into small segments of fixed length, for example into frames of 25 ms each.
In addition, in order to avoid losing information at the window boundaries, that is, losing information between two adjacent frames that are no longer contiguous after slicing, the signals of two adjacent frames in the time-domain sliced signal are given a time overlap of a preset second time length when the speech signal is sliced, where the second time length is smaller than the first time length of each frame. For example, the first time length of each frame is 25 ms and the second time length of the overlap is 5 ms.
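The following is a minimal Python sketch of this framing with overlap (step 201) using the 25 ms / 5 ms values above; the 16 kHz sampling rate and the function name are assumptions not stated in this specification.

```python
import numpy as np

def slice_frames(signal, sample_rate=16000, frame_ms=25, overlap_ms=5):
    """Slice a 1-D speech signal into overlapping fixed-length frames (step 201)."""
    frame_len = int(sample_rate * frame_ms / 1000)            # e.g. 400 samples per 25 ms frame
    hop = int(sample_rate * (frame_ms - overlap_ms) / 1000)   # adjacent frames overlap by 5 ms
    frames = [signal[start:start + frame_len]
              for start in range(0, len(signal) - frame_len + 1, hop)]
    return np.stack(frames)                                    # shape: (num_frames, frame_len)

signal = np.random.randn(16000)                                # 1 second of toy audio
print(slice_frames(signal).shape)
```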
Step 203 is explained below.
Step 103 needs to implement the mining of attribute features, and for the speech signal, information in the frequency-domain dimension better reflects the features of the attribute; for example, different pronunciations, tones, and volumes may correspond to different frequencies. Therefore, after the speech signal is segmented by frames, the time-domain sliced signal is converted by Fourier transform into frequency-domain segmentation features in the frequency-domain dimension. In this way, the attribute features can be extracted and mined in the next step.
Fbank (filter bank) is a front-end processing method that processes audio in a manner similar to the human ear, improving speech recognition performance. Therefore, the signal slicing in step 201 and the Fourier transform in step 203 can be implemented by processing the speech signal with the Fbank method.
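The following is a minimal sketch of step 203 that converts each time-domain frame into an 80-dimensional frequency-domain feature; averaging FFT magnitude bins is used here as a simple stand-in for a full Fbank (mel filter bank) front end, and the 80-bin size merely matches the 1 × 80 per-frame data mentioned below.

```python
import numpy as np

def frames_to_freq_features(frames, n_bins=80):
    """Fourier-transform each frame and reduce it to n_bins frequency features (step 203)."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))             # magnitude spectrum per frame
    # Average adjacent FFT bins down to n_bins values per frame; a real Fbank front end
    # would apply mel filter banks and a log here instead (assumption).
    trimmed = spectrum[:, : (spectrum.shape[1] // n_bins) * n_bins]
    return trimmed.reshape(frames.shape[0], n_bins, -1).mean(axis=2)

frames = np.random.randn(48, 400)                               # 48 frames of 400 samples each
print(frames_to_freq_features(frames).shape)                    # (48, 80)
```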
Step 205 is explained below.
In step 205, when attribute feature extraction is performed on the frequency domain segmentation features using at least one feature extraction convolution kernel, in one possible implementation the at least one feature extraction convolution kernel may include a first convolution kernel, a second convolution kernel, and a third convolution kernel, where the first convolution kernel and the second convolution kernel are different in size and the second convolution kernel and the third convolution kernel are the same size. Thus, as shown in fig. 3, step 205 may be implemented by:
step 301: performing feature extraction on the frequency domain segmentation features by using a first convolution kernel to obtain first extraction features;
step 303: performing feature extraction on the first extraction features by using a second convolution kernel to obtain second extraction features;
step 305: and performing feature extraction on the second extraction features by using a third convolution kernel to obtain extraction features.
In this embodiment, when the convolution kernel is used to perform attribute feature extraction, the first convolution kernel may be used to perform feature extraction on the frequency domain segmentation features to obtain first extraction features, the second convolution kernel may be used to perform feature extraction on the first extraction features to obtain second extraction features, and the third convolution kernel may be used to perform feature extraction on the second extraction features to obtain extraction features. Therefore, more information beneficial to attribute identification can be extracted in a mode of carrying out feature extraction layer by layer, and the loss of the information is reduced.
For example, for each frame, data having a size of 1 × 80 is extracted as data having a size of 256 × 80 when the attribute feature extraction is performed. Of course, when the attribute feature extraction is performed, the 1 × 80 data may be directly extracted by one convolution kernel to the extraction feature of 256 × 80 size. It is also possible to extract the data of 1 × 80 size to be the first extracted feature of 64 × 80 size by the first convolution kernel, then extract the first extracted feature of 64 × 80 size to be the second extracted feature of 128 × 80 size by the second convolution kernel, and finally extract the second extracted feature of 128 × 80 size to be the extracted feature of 256 × 80 size by the third convolution kernel. The size of the first convolution kernel may be 5 × 1, and the size of each of the second convolution kernel and the third convolution kernel may be 2 × 1.
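The following is a minimal PyTorch sketch of this layer-by-layer extraction (steps 301 to 305), using the 1 → 64 → 128 → 256 channel widths and the 5 × 1 / 2 × 1 kernel sizes from the example above; the ReLU activations and the padding scheme that keeps the 80-bin dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFeatureExtractor(nn.Module):
    """Layer-by-layer attribute feature extraction (steps 301-305), applied per frame."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 64, kernel_size=5, padding=2)    # first kernel, 5 x 1
        self.conv2 = nn.Conv1d(64, 128, kernel_size=2)             # second kernel, 2 x 1
        self.conv3 = nn.Conv1d(128, 256, kernel_size=2)            # third kernel, 2 x 1

    def forward(self, x):                          # x: (num_frames, 1, 80) frequency features
        x = F.relu(self.conv1(x))                  # -> (num_frames, 64, 80) first extracted features
        x = F.relu(self.conv2(F.pad(x, (1, 0))))   # padding keeps 80 bins -> (num_frames, 128, 80)
        x = F.relu(self.conv3(F.pad(x, (1, 0))))   # -> (num_frames, 256, 80) extracted features
        return x

frames = torch.randn(48, 1, 80)                    # 48 frames of 80-dim frequency features
print(SpeechFeatureExtractor()(frames).shape)      # torch.Size([48, 256, 80])
```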
Step 207 is explained below.
In this step, after the extracted features are obtained by the attribute feature extraction, the speech feature vector and the speech alignment matrix of the speech modality can be further obtained according to the extracted features. For example, in one possible implementation, local attribute feature mining may be performed on the extracted features by using a self-attention mechanism to obtain a speech feature vector corresponding to the speech modality. The local attribute features are features capable of representing details of the attributes, so that the detailed features which are beneficial to attribute recognition in the extracted features can be further mined, and the recognition accuracy can be improved when the attributes are recognized based on the voice feature vectors.
The Conformer is a hybrid network structure that enhances representation learning by combining convolution operations with a self-attention mechanism; it fuses local features and global representations at different resolutions in an interactive manner and can preserve both to the greatest extent. Therefore, in this embodiment, the extracted features may be input into the Conformer network to learn the context information of the speech signal and obtain the speech feature vector of the speech modality. For example, information such as volume, pitch, and timbre in the speech signal, which cannot be obtained from data of other modalities, can be learned through the Conformer network, and such information is more advantageous for recognizing certain attributes. For instance, the user's pitch, tone, and volume are more conducive to recognizing the user's emotional state.
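The following is a minimal sketch of self-attention over the frame sequence; a single torch.nn.MultiheadAttention layer stands in for a full Conformer block, and the embedding dimension and head count are assumptions.

```python
import torch
import torch.nn as nn

# Treat the 256-dim per-frame extracted features as a sequence and let each frame attend
# to its context (one attention layer standing in for a Conformer block; embed_dim and
# num_heads are assumptions).
extracted = torch.randn(48, 256)                        # (num_frames, feature_dim)
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
speech_feature_vector, _ = attn(extracted.unsqueeze(0),  # query
                                extracted.unsqueeze(0),  # key
                                extracted.unsqueeze(0))  # value
print(speech_feature_vector.shape)                       # torch.Size([1, 48, 256])
```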
Of course, in another possible implementation manner, the speech alignment matrix of the speech modality can also be obtained by extracting the features. For example, as shown in fig. 4, step 207 may also obtain a speech alignment matrix by:
step 401: scanning the matrix corresponding to the extracted features by using a preset fourth convolution kernel; wherein the size of the fourth convolution kernel satisfies: enabling the dimension size used for characterizing the frame in the speech alignment matrix to be equal to the dimension size used for characterizing the frame in the speech feature vector;
step 403: and determining a matrix formed by the maximum value in each scanning result as a voice alignment matrix of the voice modality.
In this embodiment, when determining the speech alignment matrix, a preset fourth convolution kernel may be first used to scan a matrix corresponding to the extracted features, and then a matrix formed by a maximum value in each scanning result is determined as the speech alignment matrix of the speech modality. The dimension size of the representing frame in the speech alignment matrix can be equal to the dimension size of the representing frame in the speech feature vector by the fourth convolution kernel, so that the subsequent feature fusion can be better ensured.
When scanning is performed, in step 403, the maximum value in each scanning result is obtained, and the maximum value is the characteristic information that can most represent the current scanning. In this way, the alignment matrix of the speech modality consisting of the maximum value in each scan result can characterize the most prominent feature information in the speech signal.
Moreover, by setting an appropriate size for the fourth convolution kernel, a speech alignment matrix with the corresponding dimensionality can be obtained, enabling fusion of the speech alignment matrix with the other attribute features. In one possible implementation, steps 401 and 403 for determining the speech alignment matrix may be implemented with max pooling (MaxPooling): for each scanned window of feature values, only the maximum value is retained. In this way the speech alignment matrix contains the strongest feature information, and the alignment of the matrix can be achieved more easily for subsequent feature fusion.
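The following is a minimal sketch of steps 401 and 403 using max pooling; the pooling window size is an assumption chosen only so that the resulting dimension matches the per-frame dimension of the speech feature vector.

```python
import torch
import torch.nn as nn

extracted = torch.randn(1, 256, 80)             # (batch, channels, frequency bins), as above
# MaxPool keeps only the maximum of each scanned window (steps 401-403); the kernel size
# here is an assumption that collapses the 80 frequency bins so the remaining dimension
# matches the per-frame dimension of the speech feature vector.
pool = nn.MaxPool1d(kernel_size=80)
speech_alignment = pool(extracted).squeeze(-1)  # -> (1, 256): strongest response per channel
print(speech_alignment.shape)
```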
The text modality is explained below.
When the at least two modalities in step 101 include a text modality, the attribute feature may include a text feature vector and a text alignment matrix; step 101 may now acquire text data from the text modality for identifying the attribute while acquiring raw data from at least two modalities for identifying the attribute. The text data may be directly input by the user or may be recognized. For example, after a user sends out a voice signal, the voice signal is recognized to obtain semantic information to be expressed by the user, so that text data for recognizing attributes is formed.
At this time, when performing feature mining on the raw data of at least two modalities to obtain attribute features corresponding to each modality, as shown in fig. 5, step 103 may include the following steps:
step 501: inputting text data into a pre-trained text extraction model for feature extraction to obtain a text feature vector; the training method of the text extraction model comprises the following steps: training by using at least one group of sample sets; each group of sample sets comprises character information and coding information;
step 503: carrying out linear change on the text characteristic vector by using preset linear transformation parameters to obtain a text alignment matrix; the column dimension of the text alignment matrix is equal to the column dimension of the voice feature vector obtained through the voice signal.
In this embodiment, for the text modality, the obtained text data may be input into a pre-trained text extraction model for feature extraction to obtain a text feature vector. The text feature vector is then linearly transformed using pre-trained linear transformation parameters to obtain a text alignment matrix. For the subsequent feature fusion operation, step 503 uses a linear transformation so that the obtained text alignment matrix meets the dimension-size requirement, thereby ensuring the feasibility of the subsequent fusion calculation.
The text extraction model can be obtained by training with the BERT network framework.
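The following is a minimal sketch of steps 501 and 503 using the Hugging Face transformers library; the bert-base-chinese checkpoint and the 256-dimensional target size of the linear transformation are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

# Step 501: a pre-trained BERT model stands in for the text extraction model
# (the checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

text = "我对这次服务非常满意"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    text_feature_vector = bert(**inputs).last_hidden_state     # (1, num_tokens, 768)

# Step 503: preset linear transformation so the column dimension of the text alignment
# matrix matches that of the speech feature vector (256 is an assumed target size).
align = nn.Linear(768, 256)
text_alignment_matrix = align(text_feature_vector)             # (1, num_tokens, 256)
print(text_feature_vector.shape, text_alignment_matrix.shape)
```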
Further in step 105, the obtained attribute features corresponding to each modality are fused to obtain a fused feature.
After feature mining is performed on the original data of at least two modes to obtain attribute features corresponding to each mode, the obtained attribute features are considered to be fused. For example, when the at least two modalities include a speech modality and a text modality, and the step 105 is implemented when performing attribute feature fusion, as shown in fig. 6, by the following steps:
step 601: calculating the influence of the attribute characteristics of the text mode on the attribute characteristics of the voice mode to obtain an influence matrix of the voice mode; the attribute features of the text modality are features capable of representing the details of the attributes;
step 603: calculating the influence of the attribute characteristics of the voice mode on the attribute characteristics of the text mode to obtain an influence matrix of the text mode; the attribute features of the voice modality are features capable of representing the details of the attributes;
step 605: and splicing the influence matrix of the voice mode, the influence matrix of the text mode and the matrixes corresponding to the attribute characteristics of the modes to obtain the fusion characteristics.
In this embodiment, when attribute feature fusion is performed, the influence of the attribute features of the text modality on the attribute features of the speech modality is first calculated, and then the influence of the attribute features of the speech modality on the attribute features of the text modality is calculated. Finally, the influence matrix of the speech modality, the influence matrix of the text modality, and the matrices corresponding to the attribute features of each modality are spliced to obtain the fusion feature. Because the influence of each modality's attribute features on the other's is taken into account, the correlation between the text modality and the speech modality in terms of attribute features can be fully considered, and the attribute can be identified more accurately based on fusion features that carry this correlation.
For example, when the influence matrix of the speech modality is calculated, the attribute feature of the text modality may be a feature capable of characterizing details of the attribute, and the attribute feature of the speech modality may be a feature representing the strong feature information about the attribute in the speech signal; when the influence matrix of the text modality is calculated, the attribute feature of the speech modality may be a feature capable of characterizing details of the attribute, and the attribute feature of the text modality may be a feature representing the strong feature information about the attribute in the text data. Therefore, based on influence matrices obtained from the similarity relations between attribute features of different modalities and at different representation levels, more data beneficial to attribute identification can be obtained after fusion.
Step 601 will be explained below.
When performing attribute feature fusion, firstly, the influence of the attribute features of the text modality on the attribute features of the voice modality is considered to be calculated. At this time, the attribute feature of the voice modality may include a voice alignment matrix, and the attribute feature of the text modality may include a text feature vector. In step 601, when the influence of the attribute feature of the text modality on the attribute feature of the speech modality is calculated to obtain the influence matrix of the speech modality, the influence matrix can be obtained by the following calculation formula:
$E_a = \delta(W_a \cdot X_s)$
wherein $E_a$ denotes the influence matrix of the speech modality, $W_a$ denotes the speech alignment matrix, $X_s$ denotes the text feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function.
Step 603 will be explained below.
When the attribute features are fused, after the influence of the attribute features of the text modality on the attribute features of the speech modality is calculated, the influence of the attribute features of the speech modality on the attribute features of the text modality can be further calculated. At this time, the attribute feature of the voice modality may include a voice feature vector, and the attribute feature of the text modality may include a text alignment matrix. Thus, when the influence of the attribute feature of the speech modality on the attribute feature of the text modality is calculated in step 603 to obtain the influence matrix of the text modality, the influence matrix can be obtained by the following calculation formula:
$E_s = \delta(W_s \cdot X_a)$
wherein $E_s$ denotes the influence matrix of the text modality, $W_s$ denotes the text alignment matrix, $X_a$ denotes the speech feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function.
Through steps 601 and 603, cross fusion of the attribute features of the speech modality and the attribute features of the text modality is achieved, and fusion between features of different modalities and different levels is fully realized, so that the obtained fusion features are better suited to attribute recognition, that is, the obtained attribute recognition result is more accurate.
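The following is a minimal sketch of steps 601 and 603 using the two formulas above; the matrix shapes are assumptions chosen only so the products are defined, and the matrices are random placeholders rather than features mined from real data.

```python
import torch

num_frames, num_tokens, dim = 48, 12, 256           # assumed sizes
W_a = torch.randn(num_frames, dim)                   # speech alignment matrix
X_s = torch.randn(dim, num_tokens)                   # text feature vectors
W_s = torch.randn(num_tokens, dim)                   # text alignment matrix
X_a = torch.randn(dim, num_frames)                   # speech feature vectors

E_a = torch.sigmoid(W_a @ X_s)   # influence of text features on the speech modality (step 601)
E_s = torch.sigmoid(W_s @ X_a)   # influence of speech features on the text modality (step 603)
print(E_a.shape, E_s.shape)
```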
Step 605 is explained below.
In one possible implementation, the attribute features of the speech modality may include speech feature vectors, and the attribute features of the text modality may include text feature vectors. At this time, in step 605, when the influence matrix of the voice modality, the influence matrix of the text modality, and the matrices corresponding to the attribute features of each modality are spliced to obtain the fusion feature, the sum of the influence matrix of the voice modality and the voice feature vector may be first calculated to obtain the fusion matrix of the voice modality, and then the sum of the influence matrix of the text modality and the text feature vector may be calculated to obtain the fusion matrix of the text modality. Finally, the fusion characteristics are calculated by using the following calculation formula:
$A = \mathrm{Concat}(\hat{X}_a, \hat{X}_s)$
wherein $A$ denotes the fusion feature, $\hat{X}_a$ denotes the fusion matrix of the speech modality, $\hat{X}_s$ denotes the fusion matrix of the text modality, and $\mathrm{Concat}(\cdot)$ denotes matrix concatenation.
In this embodiment, in order to further avoid the reduction of the recognition accuracy due to the loss of information during attribute feature fusion, it is considered that the speech feature vector of the speech modality is fused into the influence matrix of the speech modality and the text feature vector of the text modality is fused into the influence matrix of the text modality before the influence matrix of the speech modality and the influence matrix of the text modality are fused. Compared with the influence matrix, the speech feature vector and the text feature vector contain more original features, so that the possibility of information loss can be further reduced, and the aim of improving the attribute identification accuracy is fulfilled.
The above calculation formula is the fusion formula for a single frame; when there are multiple frames, the fusion matrices of the other frames are spliced in sequence, for example:
$A = \mathrm{Concat}(\hat{X}_a^{1}, \hat{X}_s^{1}, \hat{X}_a^{2}, \hat{X}_s^{2}, \hat{X}_a^{3}, \hat{X}_s^{3}, \ldots)$
where the superscripts 1, 2, 3, ... of the fusion matrices denote the sequence numbers of the frames.
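As an illustration of step 605, the following minimal sketch sums each influence matrix with the corresponding feature vector and concatenates the results; here the influence matrices are random placeholders rather than computed with the formulas above, and all shapes are assumptions.

```python
import torch

T, D = 48, 256                                     # assumed frame count and feature dimension
E_a = torch.sigmoid(torch.randn(T, D))             # influence matrix of the speech modality
X_a = torch.randn(T, D)                            # speech feature vectors
E_s = torch.sigmoid(torch.randn(T, D))             # influence matrix of the text modality
X_s = torch.randn(T, D)                            # text feature vectors

fused_speech = E_a + X_a                           # fusion matrix of the speech modality
fused_text = E_s + X_s                             # fusion matrix of the text modality
A = torch.cat([fused_speech, fused_text], dim=-1)  # Concat(...) -> fusion feature
print(A.shape)                                     # torch.Size([48, 512])
```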
Finally, in step 107, the recognition result of the attribute is obtained by using the fusion feature.
For example, the attribute may include an emotional state of the user, and the attribute recognition result of the user may be a state of happiness, anger, impatience, or the like.
For another example, the attribute may also include the user's attitude toward a certain viewpoint, and the attribute identification result of the user may be a supportive attitude, an opposing attitude, an indifferent attitude, and the like.
In one possible implementation manner, when the step 107 obtains the identification result of the attribute by using the fusion feature, as shown in fig. 7, the following steps may be implemented:
step 701: calculating at least one mathematical characteristic quantity of the fusion characteristic; wherein the mathematical characteristic quantities include: mean and variance;
step 703: inputting at least one mathematical characteristic quantity into a pre-trained attribute recognition model to obtain a recognition result of the attribute; the training method of the attribute recognition model comprises the following steps: training by utilizing at least one group of sample training sets; each group of sample training set comprises a mathematical characteristic quantity of the attributes and a recognition result of the attributes.
In this embodiment, when the attribute identification result is obtained using the fusion features, at least one mathematical characteristic quantity of the fusion features may first be calculated, for example the mean and the variance of the fusion features. The obtained mathematical characteristic quantities are then input into a pre-trained attribute recognition model to obtain the identification result of the attribute. In this way, frame-level feature data is converted into feature data at the level of mathematical characteristic quantities, which facilitates classification by the classification model; that is, the mathematical characteristic quantities make it easier for the attribute recognition model to identify the attribute.
When the mathematical characteristic quantities are input into the attribute recognition model, the model may output the probabilities of the various possible results of the attribute, and the final result may then be determined from these probability values. For example, when the user's emotional state is recognized, if the mathematical characteristic quantities of the fusion features are input into the attribute recognition model and the obtained recognition result is a probability of 90% for happiness, 5% for anger, and 5% for impatience, the user's emotional state is determined to be happy.
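The following is a minimal sketch of steps 701 and 703; the small softmax classifier stands in for the pre-trained attribute recognition model, and its layer sizes and the three emotion classes are assumptions.

```python
import torch
import torch.nn as nn

A = torch.randn(48, 512)                            # fusion features for 48 frames (assumed)
# Step 701: frame-level features -> mathematical characteristic quantities (mean and variance).
stats = torch.cat([A.mean(dim=0), A.var(dim=0)])    # shape: (1024,)

# Step 703: a small classifier stands in for the pre-trained attribute recognition model.
classifier = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 3))
probs = torch.softmax(classifier(stats), dim=-1)    # probabilities over the assumed classes
labels = ["happy", "angry", "impatient"]
print(labels[int(probs.argmax())], probs.tolist())  # highest-probability emotional state
```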
As shown in fig. 8, the present specification also provides an attribute identification apparatus, which may include: a data acquisition module 801, a feature mining module 802, a feature fusion module 803 and an attribute identification module 804;
a data acquisition module 801 configured to acquire raw data from at least two modalities for identifying attributes; the similarity of the semantics of the original data of at least two modals is larger than a preset value;
a feature mining module 802 configured to perform feature mining on the original data of the at least two modalities acquired by the data acquisition module 801 to obtain attribute features corresponding to the modalities; the attribute features are features capable of influencing the attributes;
a feature fusion module 803 configured to fuse the attribute features corresponding to the modalities obtained by the feature mining module 802 to obtain fusion features;
an attribute identification module 804 configured to obtain an identification result of the attribute by using the fusion feature obtained by the feature fusion module 803.
In one possible implementation, when the at least two modalities include a speech modality, the attribute feature includes a speech feature vector and a speech alignment matrix;
the data acquisition module 801, when acquiring raw data for identifying an attribute from at least two modalities, is configured to acquire a voice signal for identifying an attribute from a voice modality;
the feature mining module 802 is configured to perform the following operations when performing feature mining on raw data of at least two modalities to obtain attribute features corresponding to each modality:
dividing the voice signal into at least two frames according to a preset first time length to obtain a time domain division signal;
performing Fourier transform on the time domain segmentation signal to obtain frequency domain segmentation characteristics;
performing attribute feature extraction on the frequency domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features;
respectively obtaining a voice feature vector and a voice alignment matrix corresponding to the voice modality according to the extracted features; and the dimension of the representation frame in the voice alignment matrix is equal to the dimension of the representation frame in the voice feature vector.
In one possible implementation, the signals of two adjacent frames in the time-domain sliced signal obtained by the feature mining module have a time overlap of a preset second time length, and the second time length is smaller than the first time length.
In one possible implementation, the at least one feature extraction convolution kernel includes a first convolution kernel, a second convolution kernel, and a third convolution kernel; the first convolution kernel and the second convolution kernel are different in size, and the second convolution kernel and the third convolution kernel are the same in size;
the feature mining module 802 is configured to perform the following operations when performing attribute feature extraction on the frequency domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features:
performing feature extraction on the frequency domain segmentation features by using a first convolution kernel to obtain first extraction features;
performing feature extraction on the first extraction features by using a second convolution kernel to obtain second extraction features;
and performing feature extraction on the second extraction features by using a third convolution kernel to obtain extraction features.
In one possible implementation, when obtaining the speech feature vector corresponding to the speech modality according to the extracted feature, the feature mining module 802 is configured to perform the following operations:
performing local attribute feature mining on the extracted features by using a self-attention mechanism to obtain a voice feature vector corresponding to the voice mode; the local attribute feature is a feature capable of characterizing details of the attribute.
In one possible implementation, when obtaining the speech alignment matrix corresponding to the speech modality according to the extracted features, the feature mining module 802 is configured to perform the following operations:
scanning the matrix corresponding to the extracted features by using a preset fourth convolution kernel; wherein the size of the fourth convolution kernel satisfies: enabling the dimension size used for characterizing the frame in the speech alignment matrix to be equal to the dimension size used for characterizing the frame in the speech feature vector;
and determining a matrix formed by the maximum value in each scanning result as a voice alignment matrix of the voice modality.
In one possible implementation, when the at least two modalities include a text modality, the attribute feature includes a text feature vector and a text alignment matrix;
the data acquisition module 801, when acquiring raw data for identifying attributes from at least two modalities, is configured to acquire text data for identifying attributes from a text modality;
the feature mining module 802 is configured to perform the following operations when performing feature mining on raw data of at least two modalities to obtain attribute features corresponding to each modality:
inputting text data into a pre-trained text extraction model for feature extraction to obtain a text feature vector; the training method of the text extraction model comprises the following steps: training by using at least one group of sample sets; each group of sample sets comprises character information and coding information;
linearly transforming the text feature vector by using preset linear transformation parameters to obtain a text alignment matrix; the column dimension of the text alignment matrix is equal to the column dimension of the speech feature vector obtained from the speech signal.
In one possible implementation, when the at least two modalities include a speech modality and a text modality;
when the obtained attribute features corresponding to the respective modalities are fused to obtain a fused feature, the feature fusion module 803 is configured to perform the following operations:
calculating the influence of the attribute characteristics of the text mode on the attribute characteristics of the voice mode to obtain an influence matrix of the voice mode; the attribute features of the text modality are features capable of representing the details of the attributes;
calculating the influence of the attribute characteristics of the voice mode on the attribute characteristics of the text mode to obtain an influence matrix of the text mode; the attribute features of the voice modality are features capable of representing the details of the attributes;
and splicing the influence matrix of the voice mode, the influence matrix of the text mode and the matrixes corresponding to the attribute characteristics of the modes to obtain the fusion characteristics.
In one possible implementation manner, the attribute features of the speech modality include a speech feature vector and a speech alignment matrix, and the attribute features of the text modality include a text feature vector and a text alignment matrix;
the feature fusion module 803, when calculating the influence of the attribute features of the text modality on the attribute features of the speech modality to obtain an influence matrix of the speech modality, is configured to perform the following operations:
the influence matrix of the speech modality is calculated using the following calculation formula:
$E_a = \delta(W_a \cdot X_s)$
wherein $E_a$ denotes the influence matrix of the speech modality, $W_a$ denotes the speech alignment matrix, $X_s$ denotes the text feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function;
in one possible implementation, when calculating the influence of the attribute features of the speech modality on the attribute features of the text modality to obtain an influence matrix of the text modality, the feature fusion module 803 is configured to perform the following operations:
the influence matrix of the text modality is calculated using the following calculation:
$E_s = \delta(W_s \cdot X_a)$
wherein $E_s$ denotes the influence matrix of the text modality, $W_s$ denotes the text alignment matrix, $X_a$ denotes the speech feature vector, and $\delta(\cdot)$ denotes the Sigmoid activation function.
In one possible implementation, the attribute features of the speech modality include speech feature vectors, and the attribute features of the text modality include text feature vectors;
when the feature fusion module 803 splices the influence matrix of the speech modality, the influence matrix of the text modality, and the matrices corresponding to the attribute features of the modalities to obtain a fusion feature, it is configured to perform the following operations:
calculating the sum of the influence matrix of the voice mode and the voice characteristic vector to obtain a fusion matrix of the voice mode;
calculating the sum of the influence matrix of the text mode and the text characteristic vector to obtain a fusion matrix of the text mode;
the fusion features are calculated using the following calculation:
A = Concat(F_a, F_s)

wherein A is used for characterizing the fusion feature, F_a is used for characterizing the fusion matrix of the speech modality, F_s is used for characterizing the fusion matrix of the text modality, and Concat(·) is used for characterizing matrix splicing.
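Purely as a sketch, and assuming for simplicity that all matrices share the same square shape so that the element-wise sums are defined (an assumption made here, not stated in the specification), the fusion step might look like this:

```python
import torch

torch.manual_seed(0)
d = 64                                   # assumed common dimension

X_a = torch.randn(d, d)                  # speech feature vector (assumed d x d)
X_s = torch.randn(d, d)                  # text feature vector (assumed d x d)
E_a = torch.sigmoid(torch.randn(d, d))   # influence matrix of the speech modality
E_s = torch.sigmoid(torch.randn(d, d))   # influence matrix of the text modality

F_a = E_a + X_a                          # fusion matrix of the speech modality
F_s = E_s + X_s                          # fusion matrix of the text modality

A = torch.cat([F_a, F_s], dim=-1)        # fused feature A = Concat(F_a, F_s)
print(A.shape)                           # torch.Size([64, 128])
```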
In one possible implementation, the attribute identification module 804, when obtaining the identification result of the attribute by using the fused feature, is configured to perform the following operations:
calculating at least one mathematical characteristic quantity of the fusion characteristic; wherein the mathematical characteristic quantities include: mean and variance;
inputting at least one mathematical characteristic quantity into a pre-trained attribute recognition model to obtain a recognition result of the attribute; the training method of the attribute recognition model comprises the following steps: training by utilizing at least one group of sample training set; each group of sample training set comprises a mathematical characteristic quantity of the attributes and a recognition result of the attributes.
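As a rough sketch only, the mean and variance of the fused feature could be computed and passed to a small classifier standing in for the pre-trained attribute recognition model; the model architecture, the fused-feature shape, and the number of attribute classes below are assumptions, not the trained model described here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A = torch.randn(64, 128)                              # fused feature (assumed shape)

mean = A.mean()                                       # mathematical characteristic quantity: mean
variance = A.var()                                    # mathematical characteristic quantity: variance
stats = torch.stack([mean, variance]).unsqueeze(0)    # shape (1, 2)

# Stand-in for the pre-trained attribute recognition model (hypothetical architecture).
attribute_model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 4))
logits = attribute_model(stats)                       # scores over candidate attribute values
recognition_result = logits.argmax(dim=-1)            # e.g. index of the recognized emotional state
```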
In one possible implementation, the attributes identified by the attribute identification module 804 include a user emotional state.
The present specification also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
The present specification also provides a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method in any of the embodiments of the specification.
It should be understood that the configuration illustrated in the embodiments of the present specification does not constitute a specific limitation on the attribute identification apparatus. In other embodiments of the specification, the attribute identification apparatus may include more or fewer components than illustrated, some components may be combined, some components may be split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Since the apparatus embodiments are based on the same concept as the method embodiments of the present specification, details of the information interaction and execution processes between the units in the apparatus can be found in the description of the method embodiments and are not repeated here.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code.
The above embodiments further describe in detail the purpose, technical solutions, and advantages of the present specification. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the scope of the present invention; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. An attribute identification method, comprising the following steps:
obtaining raw data from at least two modalities for identifying the attributes; wherein the similarity of the semantics of the raw data of the at least two modalities is greater than a predetermined value;
respectively carrying out feature mining on the original data of the at least two modes to obtain attribute features corresponding to the modes; wherein the attribute features are features that can affect the attributes;
fusing the obtained attribute characteristics corresponding to each mode to obtain fused characteristics;
and obtaining the identification result of the attribute by utilizing the fusion characteristic.
2. The method of claim 1, wherein, when a voice modality is included in the at least two modalities, the attribute features include a voice feature vector and a voice alignment matrix;
the obtaining raw data from at least two modalities for identifying the attribute includes: acquiring a voice signal from a voice modality for identifying the attribute;
the feature mining of the original data of the at least two modalities to obtain attribute features corresponding to the modalities includes:
cutting the voice signal into at least two frames according to a preset first time length to obtain a time-domain segmentation signal;
performing Fourier transform on the time-domain segmentation signal to obtain frequency-domain segmentation features;
performing attribute feature extraction on the frequency-domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features;
respectively obtaining a voice feature vector and a voice alignment matrix corresponding to the voice modality according to the extracted features; and the dimension size of the representation frame in the voice alignment matrix is equal to the dimension size of the representation frame in the voice feature vector.
3. The method of claim 2, wherein the signals of two adjacent frames in the time-domain segmentation signal have a time overlap of a preset second time length, and the second time length is smaller than the first time length.
4. The method of claim 2, wherein the at least one feature extraction convolution kernel includes a first convolution kernel, a second convolution kernel, and a third convolution kernel; the first convolution kernel and the second convolution kernel have different sizes, and the second convolution kernel and the third convolution kernel have the same size;
the performing attribute feature extraction on the frequency-domain segmentation features by using at least one feature extraction convolution kernel to obtain extracted features comprises the following steps:
performing feature extraction on the frequency-domain segmentation features by using the first convolution kernel to obtain first extracted features;
performing feature extraction on the first extracted features by using the second convolution kernel to obtain second extracted features;
and performing feature extraction on the second extracted features by using the third convolution kernel to obtain the extracted features.
5. The method according to claim 2, wherein the obtaining of the speech feature vector corresponding to the speech modality according to the extracted features comprises:
performing local attribute feature mining on the extracted features by using a self-attention mechanism to obtain a voice feature vector corresponding to the voice mode; wherein the local attribute feature is a feature capable of characterizing details of the attribute.
6. The method according to claim 2, wherein the obtaining of the speech alignment matrix corresponding to the speech modality according to the extracted features comprises:
scanning the matrix corresponding to the extracted features by using a preset fourth convolution kernel; wherein the size of the fourth convolution kernel satisfies: enabling the dimension size for a characterizing frame in the speech alignment matrix to be equal to the dimension size for a characterizing frame in the speech feature vector;
and determining a matrix formed by the maximum value in each scanning result as a voice alignment matrix of the voice modality.
7. The method of claim 1, wherein, when a text modality is included in the at least two modalities, the attribute features include a text feature vector and a text alignment matrix;
the acquiring raw data from at least two modalities for identifying the attribute includes: acquiring text data from a text modality for identifying the attributes;
the feature mining of the original data of the at least two modalities to obtain attribute features corresponding to the modalities includes:
inputting the text data into a pre-trained text extraction model for feature extraction to obtain the text feature vector; the training method of the text extraction model comprises the following steps: training by using at least one group of sample sets; each group of sample sets comprises character information and coding information;
carrying out linear transformation on the text feature vector by using preset linear transformation parameters to obtain the text alignment matrix; and the column dimension of the text alignment matrix is equal to the column dimension of the voice feature vector obtained through the voice signal.
8. The method of claim 1, wherein, when the at least two modalities include a speech modality and a text modality,
the fusing the obtained attribute features corresponding to the modalities to obtain fused features includes:
calculating the influence of the attribute characteristics of the text mode on the attribute characteristics of the voice mode to obtain an influence matrix of the voice mode; the attribute features of the text modality are features capable of representing the details of the attributes;
calculating the influence of the attribute features of the voice modality on the attribute features of the text modality to obtain an influence matrix of the text modality; wherein the attribute features of the voice modality are features capable of characterizing details of the attribute;
and splicing the influence matrix of the voice mode, the influence matrix of the text mode and the matrixes corresponding to the attribute characteristics of the modes to obtain the fusion characteristics.
9. The method of claim 8, wherein the attribute features of the speech modality comprise a speech feature vector and a speech alignment matrix, and the attribute features of the text modality comprise a text feature vector and a text alignment matrix;
the calculating the influence of the attribute features of the text modality on the attribute features of the voice modality to obtain an influence matrix of the voice modality includes:
calculating an influence matrix of the speech modality by using the following calculation formula:
E_a = δ(W_a · X_s)

wherein E_a is used for characterizing an influence matrix of the speech modality, W_a is used for characterizing the speech alignment matrix, X_s is used for characterizing the text feature vector, and δ(·) is used for characterizing the Sigmoid activation function;
and/or,
the calculating the influence of the attribute features of the voice modality on the attribute features of the text modality to obtain an influence matrix of the text modality includes:
calculating an influence matrix of the text modality by using the following calculation formula:
E_s = δ(W_s · X_a)

wherein E_s is used for characterizing an influence matrix of the text modality, W_s is used for characterizing the text alignment matrix, X_a is used for characterizing the speech feature vector, and δ(·) is used for characterizing the Sigmoid activation function.
10. The method of claim 8, wherein the attribute features of the speech modality comprise speech feature vectors and the attribute features of the text modality comprise text feature vectors;
the splicing the influence matrix of the voice modality, the influence matrix of the text modality and the matrixes corresponding to the attribute characteristics of the modalities to obtain the fusion characteristics comprises:
calculating the sum of the influence matrix of the voice modality and the voice feature vector to obtain a fusion matrix of the voice modality;
calculating the sum of the influence matrix of the text mode and the text characteristic vector to obtain a fusion matrix of the text mode;
calculating the fusion feature using the following calculation:
A = Concat(F_a, F_s)

wherein A is used for characterizing the fusion feature, F_a is used for characterizing the fusion matrix of the speech modality, F_s is used for characterizing the fusion matrix of the text modality, and Concat(·) is used for characterizing matrix splicing.
11. The method according to any one of claims 1 to 10, wherein the obtaining of the identification result of the attribute using the fused feature comprises:
calculating at least one mathematical characteristic quantity of the fusion characteristic; wherein the mathematical characteristic quantities include: mean and variance;
inputting the at least one mathematical characteristic quantity into a pre-trained attribute recognition model to obtain a recognition result of the attribute; the training method of the attribute recognition model comprises the following steps: training by utilizing at least one group of sample training set; each group of sample training set comprises a mathematical characteristic quantity of the attributes and a recognition result of the attributes.
12. The method of any of claims 1-10, wherein the attributes comprise: the emotional state of the user.
13. An attribute identification device comprising: the system comprises a data acquisition module, a feature mining module, a feature fusion module and an attribute identification module;
the data acquisition module is configured to acquire raw data from at least two modalities for identifying the attributes; wherein the similarity of the semantics of the raw data of the at least two modalities is greater than a predetermined value;
the characteristic mining module is configured to respectively perform characteristic mining on the original data of the at least two modalities acquired by the data acquisition module to obtain attribute characteristics corresponding to each modality; wherein the attribute features are features that can affect the attributes;
the feature fusion module is configured to fuse the attribute features corresponding to the modalities, which are obtained by the feature mining module, to obtain fusion features;
and the attribute identification module is configured to obtain an identification result of the attribute by using the fusion feature obtained by the feature fusion module.
14. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-12.
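For orientation only, the speech feature-mining steps recited in claims 2 to 6 can be sketched roughly as below; the frame lengths, kernel sizes, channel counts, and the frequency-axis pooling are all assumptions made for illustration and are not the claimed implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # preset first time length (25 ms, assumed)
hop = int(0.015 * sample_rate)         # adjacent frames overlap by a shorter second length

signal = torch.randn(sample_rate * 3)  # 3 s of raw speech (placeholder data)

# Claim 2: cut the signal into overlapping frames (time-domain segmentation signal).
frames = signal.unfold(0, frame_len, hop)                # (num_frames, frame_len)

# Claim 2: Fourier transform of each frame (frequency-domain segmentation features).
spec = torch.fft.rfft(frames).abs()                      # (num_frames, freq_bins)
spec = spec.unsqueeze(0).unsqueeze(0)                    # (1, 1, num_frames, freq_bins)

# Claim 4: cascaded feature-extraction convolution kernels; the first differs in
# size from the second, while the second and third share the same size.
conv1 = nn.Conv2d(1, 8, kernel_size=5, padding=2)
conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
conv3 = nn.Conv2d(16, 16, kernel_size=3, padding=1)
extracted = conv3(conv2(conv1(spec)))                    # extracted features

# Collapse the frequency axis so each frame is one token (a simplification).
tokens = extracted.mean(dim=3).permute(0, 2, 1)          # (1, num_frames, 16)

# Claim 5: self-attention over frames to mine local attribute features,
# giving the speech feature vector.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)
speech_feature_vector, _ = attn(tokens, tokens, tokens)  # (1, num_frames, 16)

# Claim 6: scan the extracted features and keep the maximum of each window,
# so the frame dimension matches that of the speech feature vector.
speech_alignment_matrix = nn.functional.max_pool2d(
    extracted, kernel_size=(1, 4), stride=(1, 4))        # frame dimension preserved
print(speech_feature_vector.shape, speech_alignment_matrix.shape)
```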
CN202210581712.1A 2022-05-26 2022-05-26 Attribute identification method and device Pending CN115062691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210581712.1A CN115062691A (en) 2022-05-26 2022-05-26 Attribute identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210581712.1A CN115062691A (en) 2022-05-26 2022-05-26 Attribute identification method and device

Publications (1)

Publication Number Publication Date
CN115062691A true CN115062691A (en) 2022-09-16

Family

ID=83197563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210581712.1A Pending CN115062691A (en) 2022-05-26 2022-05-26 Attribute identification method and device

Country Status (1)

Country Link
CN (1) CN115062691A (en)

Similar Documents

Publication Publication Date Title
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
KR102222317B1 (en) Speech recognition method, electronic device, and computer storage medium
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN110990685B (en) Voiceprint-based voice searching method, voiceprint-based voice searching equipment, storage medium and storage device
CN108536654A (en) Identify textual presentation method and device
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN115827854A (en) Voice abstract generation model training method, voice abstract generation method and device
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115062691A (en) Attribute identification method and device
CN115019788A (en) Voice interaction method, system, terminal equipment and storage medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN114281948A (en) Summary determination method and related equipment thereof
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination