CN115618270B - Multi-modal intention recognition method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN115618270B (application CN202211621367.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- modal
- node
- representation
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Image Analysis (AREA)
Abstract
The application provides a multi-modal intention recognition method and device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities, each with a different data type; encoding the data to be recognized to obtain a representation sequence for each modality; constructing a multi-modal heterogeneous graph with the representation sequence of each modality as node features; encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the graph; and classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result. The method can effectively fuse multi-modal information, improves the recognition accuracy of the user's interaction intention by adopting the multi-modal heterogeneous graph, and realizes natural and flexible human-computer interaction.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for recognizing a multi-modal intention, an electronic device, and a storage medium.
Background
Intention recognition analyzes the core requirements of a user and outputs the information most relevant to a query input. In the prior art, common task-oriented dialogue intention recognition generally addresses only single-modality intention recognition: word vectors and context word vectors are obtained from sample texts to train an intention recognition model, which determines the intention corresponding to the user input and then generates and executes a series of behaviors and strategies to interact with the user. In real life, however, people often need to judge real intentions by comprehensively using multiple kinds of modal information (such as natural language, video, and audio signals). Besides the most common text, multi-modal data such as pictures, video, and audio can also be used to assist in understanding user intent and so improve the accuracy of information services.
For example, in the field of power systems, scenes that are hard to describe in words are often encountered in power failure repair: in a customer service session, a user sends not only plain text but also image and voice information. The repair or installation of a charging pile, for instance, cannot be described directly through text; the user usually reports the fault or makes an inquiry by taking a photograph, and the user's intention can be determined accurately only by considering the text and image information together.
At present, however, most intention benchmark datasets contain only text-modality information, and the human-computer interaction data is monotonous. The few existing approaches to multi-modal intention recognition train a multi-modal pre-training model fused with an attention mechanism to obtain an intention recognition model, but their recognition accuracy is not high and their modality fusion is simplistic, which greatly limits the development of multi-modal intention understanding; multi-modal intention recognition in the power failure repair field has been studied even less.
Therefore, it is an urgent problem to improve the recognition accuracy of multi-modal intent recognition.
Disclosure of Invention
In view of the above, an object of the present application is to provide a multi-modal intention recognition method, apparatus, electronic device, and storage medium for the field of power failure repair, which can solve the existing problems described above.
In a first aspect, based on the above object, the present application provides a multi-modal intention recognition method, comprising: acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities, each with a different data type; encoding the data to be recognized to obtain a representation sequence for each modality; constructing a multi-modal heterogeneous graph with the representation sequence of each modality as node features; encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the graph; and classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
Optionally, the data to be recognized includes text data, picture data, and audio data, and encoding the data to be recognized to obtain a representation sequence for each modality includes: performing word segmentation on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information; extracting image features from the picture data to obtain a plurality of image regions, and encoding the image regions to obtain second encoding information; extracting audio features from the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information; and using the first, second, and third encoding information as the input of a tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding respectively to the text, picture, and audio data.
Optionally, for the text data, the tri-modal pre-training model is trained by minimizing a negative log-likelihood function to obtain the text sequence; for the picture data, the model is trained with a first function and a second function to obtain the picture sequence; and for the audio data, the model is trained with a third function and a fourth function to obtain the audio sequence.
Optionally, constructing a multi-modal heterogeneous graph with the representation sequence of each modality as node features includes: deriving different node types from the different modalities, and determining the number of nodes of each node type from the number of elements in the representation sequence of the corresponding modality; the number of text nodes is obtained from the number of words in the text data, the number of picture nodes from the number of image regions in the picture data, and the number of audio nodes from the number of audio segments in the audio data.
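The node-count rule above can be sketched as follows. This is an illustrative reconstruction, not the patent's code; the modality names and placeholder sequence elements are assumptions.

```python
# Illustrative sketch: derive node types and per-type node counts
# from the representation sequence of each modality. One node is
# created per element (word, image region, or audio segment).

def build_node_table(sequences):
    """Map each modality name to its node type and node count."""
    return {
        modality: {"node_type": modality + "_node", "num_nodes": len(seq)}
        for modality, seq in sequences.items()
    }

# Hypothetical example: 4 segmented words, 3 image regions, 4 audio segments.
table = build_node_table({
    "text": ["old cell", "charging pile", "how", "apply"],
    "picture": ["roi_1", "roi_2", "roi_3"],
    "audio": ["seg_1", "seg_2", "seg_3", "seg_4"],
})
```

In a real system the sequence elements would be the encoded feature vectors produced by the tri-modal pre-training model, not strings.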
Optionally, encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the graph includes: calculating attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph; calculating a hidden vector for each node under each modality; obtaining the representation of a node from its attention weight and its hidden vectors under the different modalities; and obtaining the representation of the multi-modal heterogeneous graph from the representations of all the nodes.
Optionally, calculating attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph includes: activating the node vectors through a nonlinear activation function, and normalizing the activated node vectors to obtain the attention weights. The relationship between two nodes is either a parallel relationship or a progressive relationship: a parallel relationship indicates that the two nodes belong to the same modality, and a progressive relationship indicates that they belong to different modalities.
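The "activate, then normalize" step can be sketched with scalar node scores. This is a minimal illustration, assuming tanh as the nonlinear activation and softmax as the normalization; the patent does not name the specific functions.

```python
import math

def attention_weights(scores):
    """Apply a nonlinear activation (tanh, as an assumption) to raw
    per-node scores, then normalize with softmax so the weights are
    positive and sum to 1."""
    activated = [math.tanh(s) for s in scores]
    exps = [math.exp(a) for a in activated]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical node scores; the largest score keeps the largest weight.
w = attention_weights([0.5, 1.2, -0.3])
```

Because tanh is monotonic, softmax normalization preserves the ordering of the original scores.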
Optionally, classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result includes: obtaining intention-label prediction probabilities based on the representation of the multi-modal heterogeneous graph; and calculating a loss value for the prediction probabilities according to a loss function, the intention recognition result being obtained when the loss value stays within a preset range.
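The label-probability step above amounts to a softmax classification over the graph-level representation. The sketch below is an assumption-laden stand-in: `weight` plays the role of a learned projection matrix (one score row per label), and the label names are invented for illustration.

```python
import math

def predict_intent(graph_repr, weight, labels):
    """Project the graph representation onto per-label logits, then
    softmax them into prediction probabilities; return the most
    probable label and the full distribution."""
    logits = [sum(g * w for g, w in zip(graph_repr, row)) for row in weight]
    m = max(logits)                      # subtract max for numeric stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))], probs

# Hypothetical 2-d graph representation and two intent labels.
label, probs = predict_intent([1.0, 0.0],
                              [[0.2, 0.5], [0.9, 0.1]],
                              ["report_fault", "apply_install"])
```

In training, the cross-entropy between `probs` and the true label would supply the loss value mentioned above.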
In a second aspect, for the above purpose, the present application further provides a multi-modal intention recognition apparatus, comprising: a data acquisition module for acquiring data to be recognized, the data comprising at least two modalities, each with a different data type; a pre-training module for encoding the data to be recognized to obtain a representation sequence for each modality; a heterogeneous graph creation module for constructing a multi-modal heterogeneous graph with the representation sequence of each modality as node features; a heterogeneous graph representation module for encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the graph; and a classification module for classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
In a third aspect, the present embodiment also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to any one of the first aspect.
In a fourth aspect, the present embodiments also provide a computer-readable storage medium, on which a computer program is stored, wherein the program is executed by a processor to implement the method according to any one of the first aspect.
Generally, the advantages of the present application and the experience brought to the user are as follows:
the embodiment provides a multi-modal intention recognition method that obtains data to be recognized in different modalities and encodes it to obtain a representation sequence for each modality; constructs a multi-modal heterogeneous graph with the representation sequence of each modality as node features, offering a new approach to the problem of multi-modal dialogue intention understanding; encodes the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the graph; and classifies according to that representation to obtain an intention recognition result. The method can effectively fuse multi-modal information, improves the recognition accuracy of the user's interaction intention by adopting the multi-modal heterogeneous graph, and realizes natural and flexible human-computer interaction.
Drawings
In the drawings, like reference characters designate like or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 illustrates a flow diagram of the multi-modal intent recognition method of the present application;
FIG. 2 illustrates a schematic structural diagram of the tri-modal pre-training model in accordance with an example of the present application;
FIG. 3 shows a schematic diagram of a multi-modal heterogeneous graph in accordance with one example of the present application;
FIG. 4 illustrates a flow diagram for deriving a representation of the multi-modal heterogeneous graph in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a multi-modal intent recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 7 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates a flow diagram of a multimodal intent recognition method of the present application. As shown in FIG. 1, the multi-modal intent recognition method includes the following steps S101 to S105:
s101, obtaining data to be identified.
The execution body of this embodiment may be a server or an intelligent terminal. For example, a user may interact with a back-end server through an intelligent terminal, or with the terminal itself through an application or applet installed on it, and may input the content to be queried by voice control, text input, or picture input. The embodiment can be applied to the field of power customer-service fault repair, and also to other scenarios such as online shopping.
In one example, since the embodiment needs to recognize the user's multi-modal intent, the data to be recognized input by the user must first be obtained, where the data comprises at least two modalities with different data types: for example, text and voice information; text and picture information; or text, picture, and voice information together.
Considering that an application scenario of this embodiment is power failure reporting, where scenes that are difficult to describe in words are often encountered because a user may send not only pure text but also image and voice information in a customer service session, this embodiment takes data comprising text, picture, and audio data as an example: the text data may be power-failure text entered by the user through a terminal, the picture data may be pictures of the faulty part uploaded by the user through the terminal, and the audio data may be the user's spoken description of the power failure captured by a voice acquisition device.
And S102, coding the data to be identified to obtain a representation sequence of each modal data.
In this embodiment, a tri-modal pre-training model, OPT (Omni-Perception Pre-Trainer), is used to encode the data to be recognized and obtain a representation sequence for each modality; the encoded features are then used as node features.
Specifically, encoding the data to be recognized to obtain a representation sequence for each modality includes: performing word segmentation on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information; extracting image features from the picture data to obtain a plurality of image regions, and encoding the image regions to obtain second encoding information; extracting audio features from the audio data to obtain a plurality of audio segments, and encoding the segments to obtain third encoding information; and using the first, second, and third encoding information as the input of the tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding respectively to the text, picture, and audio data.
The text data can be segmented with WordPiece to obtain a plurality of words; for example, "how to apply for a charging pile in an old residential cell" is segmented into words such as "old cell", "charging pile", and "how to apply". For the picture data, region-of-interest (ROI) features can be extracted with Faster R-CNN, yielding a plurality of image regions. For the audio data, wav2vec can be used to divide the audio into a plurality of segments: wav2vec is a convolutional neural network that takes raw audio as input and computes a general representation that can be fed to a speech recognition system.
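The segmentation step common to all three modalities is just "split the input into units". The sketch below illustrates it for the audio case only, using a fixed-length framing that stands in for the windowing a wav2vec-style encoder performs; `segment_len` is a hypothetical parameter, since the real model learns its own receptive fields.

```python
def segment_audio(samples, segment_len):
    """Split a raw sample list into consecutive fixed-length segments;
    the final segment may be shorter. Each segment would later become
    one audio node in the heterogeneous graph."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples), segment_len)]

# 10 hypothetical samples, window of 4: yields segments of sizes 4, 4, 2.
segments = segment_audio(list(range(10)), 4)
```

The word list from WordPiece and the ROI list from Faster R-CNN play the same role for the text and picture modalities.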
In this embodiment, FIG. 2 is a schematic structural diagram of the OPT model. As shown in FIG. 2, the OPT model includes a segment-level mask layer, a modality-level mask layer, a modality encoding layer, an OPT training layer, a cross-modal encoder, and a decoding layer. The segment-level mask layer masks the words of the text data, the image regions of the picture data, and the audio segments of the audio data according to a preset ratio. The modality-level mask layer converts the words, image regions, and audio segments into the form corresponding to their modality. The modality encoding layer comprises a Text Encoder, a Vision Encoder, and an Audio Encoder, which encode the data of the respective modalities. The OPT training layer performs the OPT model training, and the cross-modal encoder fuses the multiple modalities. The decoding layer comprises a Text Decoder, a Vision Decoder, and an Audio Decoder, which output the results for the corresponding modalities. In this embodiment, the OPT training layer includes a Masked Language Model (MLM), a Masked Vision Model (MVM), and a Masked Audio Model (MAM), which train on the data of the different modalities respectively.
As shown in FIG. 2, the input of the OPT model comprises three parts: a text input, a picture input, and an audio input. In this embodiment, each of the three inputs in FIG. 2 has "MASK" placed over the part to be concealed when the OPT model is pre-trained with an objective function, and the training target is to predict the concealed part.
In this embodiment, for the text input, after the text data is segmented, the words can be encoded by combining Token Embedding and Position Embedding to obtain the first encoding information: Token Embedding maps each word to a low-dimensional continuous word vector, Position Embedding encodes the position of each word, and the word embedding and position encoding are added to obtain the embedding input of the text, which the Text Encoder encodes into the first encoding information.
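The Token + Position Embedding addition can be shown with toy vectors. This is a deliberately simplified sketch: the `token_table` lookup stands in for a learned embedding matrix, and the constant-per-position vector is a placeholder for a real position encoding.

```python
def embed(tokens, token_table, dim):
    """For each token, add a (hypothetical) token embedding and a
    trivial position encoding, mirroring the element-wise addition
    of Token Embedding and Position Embedding described above."""
    out = []
    for pos, tok in enumerate(tokens):
        token_vec = token_table[tok]
        pos_vec = [pos] * dim          # placeholder position encoding
        out.append([t + p for t, p in zip(token_vec, pos_vec)])
    return out

# Two hypothetical tokens with 2-d embeddings.
vecs = embed(["a", "b"], {"a": [1.0, 1.0], "b": [2.0, 2.0]}, 2)
```

Real models use learned or sinusoidal position encodings rather than this constant vector, but the summation structure is the same.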
For the picture input, Faster R-CNN is used to extract ROI features from the original image to obtain a plurality of image regions; the image information and position information of the regions are fed to a fully connected layer that maps them into the same space, their respective encodings are added to obtain the embedding input of the picture, and the Vision Encoder encodes it into the second encoding information.
For the audio input, after the plurality of audio segments is obtained, they are encoded by the Audio Encoder to obtain the third encoding information.
In this embodiment, the first, second, and third encoding information are used as the inputs of the tri-modal pre-training model, and the text, picture, and audio sequences corresponding respectively to the text, picture, and audio data are obtained through OPT.
In one example, considering that different kinds of data have different characteristics, this embodiment trains the tri-modal pre-training model with a different training method for each type of data to obtain the corresponding representation sequences.
In one example, assume that for the text data the text sequence obtained by word segmentation is

T = (t_1, t_2, ..., t_N),

where N is the number of words after segmentation. For the picture data, the picture sequence represented by the ROI features is

V = (v_1, v_2, ..., v_K),

where K is the number of picture regions. For the audio data, the audio sequence represented by the segments obtained with wav2vec 2.0 is

A = (a_1, a_2, ..., a_Q),

where Q is the number of audio segments.
Then, in this embodiment, pre-training of the model is performed on the basis of T, V, and A, and the training method and the objective function of each part are as follows:
specifically, for text data, a three-mode pre-training model is trained through a minimized negative log-likelihood function, and a text sequence is obtained.
For the text sequence, the OPT model randomly conceals 15% of the words to obtain the word-level mask representation of the segment-level mask layer shown in FIG. 2. The training target is to predict the concealed words, which is achieved by minimizing the negative log-likelihood:

L_{MLM}(\theta) = -\mathbb{E}_{(T,V,A) \sim D} \log P_\theta(t_m \mid t_{\setminus m}, V, A),

where L_{MLM} is the minimized negative log-likelihood of the masked language model MLM, t_m denotes the concealed words, t_{\setminus m} the words that are not concealed, V the picture sequence, and A the audio sequence.
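The 15% masking step can be sketched as follows. This is a simplified illustration; real MLM-style pre-training also keeps or replaces a fraction of the selected tokens rather than always substituting the mask symbol.

```python
import random

def mask_tokens(tokens, ratio=0.15, mask="[MASK]", seed=0):
    """Randomly conceal ~ratio of the tokens; return the masked
    sequence and the concealed indices, which become the prediction
    targets of the masked language model."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    n = max(1, round(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n))
    masked = [mask if i in idx else t for i, t in enumerate(tokens)]
    return masked, sorted(idx)

# 10 hypothetical tokens: round(10 * 0.15) = 2 are concealed.
masked, idx = mask_tokens(list("abcdefghij"))
```

The same masking pattern is applied to image regions for the MVM objective and to audio segments for the MAM objective.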
Specifically, for the picture data, a first function and a second function are set to train the tri-modal pre-training model and obtain the picture sequence.
For the picture sequence, the OPT model likewise randomly conceals 15% of the picture regions to obtain the picture mask representation of the segment-level mask layer shown in FIG. 2, and the training target is to reconstruct the picture. Because the visual features are high-dimensional, directly using a likelihood function is not feasible, so two functions are used for the representation. The objective is

L_{MVM}(\theta) = \mathbb{E}_{(T,V,A) \sim D} f_\theta(v_m \mid v_{\setminus m}, T, A),

where L_{MVM} is the minimized objective of the masked vision model MVM and f_\theta takes one of two forms, a first function and a second function.

The first function directly compares the encoder output, after conversion through the fully connected layer, with the input:

f^{(1)}_\theta(v_m \mid v_{\setminus m}, T, A) = \sum_m \| h_\theta(v_m) - r(v_m) \|^2,

where v_m is the concealed picture input, v_{\setminus m} the picture input that is not concealed, and h_\theta(v_m) the picture feature obtained by converting the concealed region through the fully connected layer.

The second function classifies the concealed region:

f^{(2)}_\theta(v_m \mid v_{\setminus m}, T, A) = \sum_m \mathrm{CE}(c(v_m), g_\theta(v_m)),

where g_\theta(v_m) denotes the label vector converted from v_m, c(v_m) the true label vector of v_m, and CE the cross-entropy between them; the classification result is output when the cross-entropy error is minimal.
Specifically, for the audio data, a third function and a fourth function are set to train the tri-modal pre-training model and obtain the audio sequence. For the audio sequence, the objective is

L_{MAM}(\theta) = \mathbb{E}_{(T,V,A) \sim D} f_\theta(a_m \mid a_{\setminus m}, T, V),

where L_{MAM} is the minimized objective of the masked audio model MAM and f_\theta has two expressions, a third function and a fourth function.

The third function compares the encoder output, after conversion through the fully connected layer, with the input:

f^{(3)}_\theta(a_m \mid a_{\setminus m}, T, V) = \sum_m \| h_\theta(a_m) - r(a_m) \|^2,

where a_m is the concealed audio input vector, a_{\setminus m} the audio input vector that is not concealed, and h_\theta(a_m) the audio vector converted from the concealed audio through the fully connected layer.

The fourth function maximizes the mutual information between masks through contrastive learning, constructing positive and negative samples from the concealed and unconcealed audio input vectors.
S103, constructing a multi-modal heterogeneous graph with the representation sequence of each modality as node features.
In this embodiment, constructing the multi-modal heterogeneous graph includes: deriving different node types from the different modalities, and determining the number of nodes of each node type from the number of elements in the representation sequence of the corresponding modality. The data to be recognized is thus modeled as a multi-modal heterogeneous graph in which data of different modalities are represented by different types of nodes.
The node number of the text nodes is obtained according to the number of words in the text data, the node number of the picture nodes is obtained according to the number of image areas of the picture data, and the node number of the audio nodes is obtained according to the number of audio clips of the audio data. For the input of the text modality, each segmented word is represented by one node, for the input of the picture modality, each picture region is represented by one node, and for the input of the audio modality, each audio clip is represented by one node.
It should be noted that the relationship between each pair of nodes in this embodiment is either a parallel relationship or a progressive relationship: the parallel relationship represents that the two nodes belong to the same type of modal data, and the progressive relationship represents that the two nodes belong to different types of modal data.
Fig. 3 is a schematic diagram of a multi-modal heterogeneous graph. As shown in Fig. 3, the picture data has 3 nodes (picture node 1, picture node 2 and picture node 3), the text data has 4 nodes ('old cells', 'charging piles', 'what' and 'application'), and the audio data has 4 nodes (audio node 1 through audio node 4). Nodes belonging to the same type of modal data have a parallel relationship: for example, picture node 1, picture node 2 and picture node 3 are mutually parallel. Nodes belonging to different types of modal data have a progressive relationship: for example, picture node 1, picture node 2 and picture node 3 each have a progressive relationship with the text node 'charging pile'.
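The node and edge layout described for Fig. 3 can be sketched as follows. The tuple representation and the exhaustive pairwise edge set are illustrative assumptions; the patent does not state that every node pair is connected:

```python
from itertools import combinations

def build_hetero_graph(words, regions, clips):
    # one node per segmented word, image region, and audio clip
    nodes = ([("text", w) for w in words] +
             [("picture", r) for r in regions] +
             [("audio", c) for c in clips])
    edges = []
    for (i, (ti, _)), (j, (tj, _)) in combinations(enumerate(nodes), 2):
        # parallel joins nodes of the same modality,
        # progressive joins nodes of different modalities
        edges.append((i, j, "parallel" if ti == tj else "progressive"))
    return nodes, edges
```

With 4 words, 3 picture regions and 4 audio clips, this yields 11 nodes, 15 parallel edges and 40 progressive edges.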
S104, encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph.
In this embodiment, on the basis of the constructed heterogeneous graph, the graph is encoded through a global view based on an attention mechanism. Under the global view, attention aggregation is first performed on the nodes within each modality to obtain a per-modality representation, and the modalities are then aggregated to obtain the representation of the whole graph.
As shown in FIG. 4, obtaining the representation of the multi-modal heterogeneous graph includes the following steps S401-S404:
S401, calculating the attention weight according to the relationship between the nodes in the multi-modal heterogeneous graph.
In this embodiment, the relationship between each pair of nodes is either a parallel relationship or a progressive relationship: the parallel relationship represents that the two nodes belong to the same type of modal data, and the progressive relationship represents that the two nodes belong to different types of modal data. In the global view, there are three node types N = { TN, PN, AN } and two edge types E = { CO, PG }, where TN denotes a word node of the text modality, PN denotes an image region node of the picture modality, AN denotes an audio clip node of the audio modality, CO denotes the parallel relationship, and PG denotes the progressive relationship. After the representation sequence of each modal data is obtained through the pre-training model, all nodes are mapped into the same vector space.
In this embodiment, calculating the attention weight according to the relationship between each node in the multi-modal heteromorphic graph includes: and activating the node vectors through a nonlinear activation function, and normalizing the activated node vectors to obtain the attention weight.
Specifically, the node vectors are activated through the nonlinear activation function, and the attention weight is obtained after the activated vectors are normalized by softmax:
wherein the symbols denote, respectively, the neighborhood of node i, an indicator that nodes k and i belong to the same modality, the concatenation of the two vectors, a mapping matrix, and the nonlinear activation function.
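In code, this attention-weight computation might look like the sketch below, using LeakyReLU as the nonlinear activation and a learned attention vector `a` applied to the concatenated, mapped node vectors (both are assumptions; the patent does not name a specific activation):

```python
import numpy as np

def attention_weights(h_i, neighbors, W, a):
    # score each neighbor h_k: LeakyReLU(a . [W h_i || W h_k]),
    # then softmax-normalize the scores over the neighborhood of node i
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)
    scores = np.array([leaky_relu(a @ np.concatenate([W @ h_i, W @ h_k]))
                       for h_k in neighbors])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```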
S402, calculating the hidden vector of each node under different modal data.
After the attention weight is obtained, the hidden vector calculation formula of the node i under the modality p is as follows:
And S403, obtaining the representation of the node according to the attention weight of the node and the hidden vectors of the node under different modal data.
After the representation of the node i under the modality p is obtained, different modalities are aggregated, and the representation of the node under the global view is obtained as follows:
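These two aggregation levels — a per-modality hidden vector formed from the attention weights, then a cross-modality combination — can be sketched as below; taking the unweighted mean over modalities is an assumption standing in for the patent's aggregation formula:

```python
import numpy as np

def node_representation(alpha, neighbors, W):
    # alpha[p]: attention weights of node i's neighbors in modality p
    # neighbors[p]: the corresponding neighbor vectors
    hidden = {p: sum(a * (W @ h) for a, h in zip(alpha[p], neighbors[p]))
              for p in alpha}                      # hidden vector per modality
    # global-view node representation: aggregate the per-modality vectors
    return np.mean(list(hidden.values()), axis=0)
```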
S404, obtaining the representation of the multi-modal heteromorphic graph according to the representation of each node.
After the representations of all nodes are obtained by calculation, averaging all nodes to obtain the representation of the final graph:
wherein the symbol denotes the set of all nodes in the graph, and AVG denotes averaging over the node vectors.
By the above method, the representation of the multi-modal heterogeneous graph is obtained, a better multi-modal information fusion effect is achieved, and the recognition accuracy of the user interaction intention is improved.
S105, classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention prediction result.
In this embodiment, classifying according to the representation of the multi-modal heteromorphic graph to obtain an intention recognition result includes: obtaining an intention label prediction probability based on the representation of the multi-modal heteromorphic graph; and calculating a loss value of the intention label prediction probability according to a loss function, and obtaining an intention identification result under the condition that the loss value is kept in a preset range.
Specifically, after the graph representation vector is obtained, the intent calculation result is:
wherein W is a parameter matrix to be trained; the result is converted into the intention label prediction probability through softmax normalization processing.
The loss function is:
wherein the two terms denote, respectively, the label of the user's true intention and the intention prediction probability.
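Putting the classification head together — a graph representation averaged from the node representations, a trainable matrix W, softmax, and a cross-entropy loss against the true intention label — a sketch might read:

```python
import numpy as np

def predict_intent(node_reprs, W):
    # graph representation: AVG over all node representations
    g = np.mean(node_reprs, axis=0)
    logits = W @ g
    p = np.exp(logits - logits.max())
    return p / p.sum()            # intention-label prediction probabilities

def intent_loss(p, true_label):
    # cross entropy against the user's true intention label;
    # a smaller loss means a better classification effect
    return -np.log(p[true_label])
```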
The smaller the loss value L is, the better the classification effect of the model is, and when the loss value is stabilized in a preset range, the convergence of the model is determined, so that the output intention prediction result is more accurate.
In the multi-modal intention recognition method provided by this embodiment, data to be identified of different modalities are obtained, and the data are encoded to obtain the representation sequence of each modal data; the representation sequence of each modal data is used as a node feature to construct a multi-modal heterogeneous graph, which provides a new approach to multi-modal dialog intention understanding; the multi-modal heterogeneous graph is encoded through a global view based on an attention mechanism to obtain its representation; and classification is performed according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result. The method can effectively fuse multi-modal information, improve the recognition accuracy of the user interaction intention through the multi-modal heterogeneous graph, and realize natural and flexible human-computer interaction.
The present embodiment also provides a multi-modal intention recognition apparatus, and referring to fig. 5, the multi-modal intention recognition apparatus 500 includes:
a data obtaining module 501, configured to obtain data to be identified, where the data to be identified includes data of at least two modalities, and each modality data has a different data type;
a pre-training module 502, configured to encode the data to be identified to obtain a representation sequence of each modal data;
a heterogeneous graph creating module 503, configured to use the representation sequence of each modal data as a node feature to construct a multi-modal heterogeneous graph;
a heterogeneous graph representation module 504, configured to encode the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph;
and the classification module 505 is used for classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
In a feasible example, the pre-training module 502 is further configured to perform word segmentation on the text data to obtain a plurality of words, and encode the words to obtain first encoded information; extracting image features of the image data to obtain a plurality of image areas, and coding the image areas to obtain second coding information; extracting audio features of the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information; and taking the first coding information, the second coding information and the third coding information as the input of a three-mode pre-training model to obtain a text sequence, a picture sequence and an audio sequence which respectively correspond to the text data, the picture data and the audio data.
In a possible example, the pre-training module 502 is further configured to train the tri-modal pre-training model by minimizing a negative log-likelihood function with respect to the text data to obtain the text sequence; for the picture data, training the three-mode pre-training model by setting a first function and a second function to obtain the picture sequence; and for the audio data, training the three-mode pre-training model by setting a third function and a fourth function to obtain the audio sequence.
In a feasible example, the heterogeneous graph creating module 503 is configured to obtain different node types according to different modal data, and determine the number of nodes of each node type according to the number of elements in the representation sequence of each modal data; the node number of the text node is obtained according to the number of the words in the text data, the node number of the picture node is obtained according to the number of the image areas of the picture data, and the node number of the audio node is obtained according to the number of the audio clips of the audio data.
In one possible example, the heterogeneous graph representation module 504 is configured to calculate an attention weight according to a relationship between each node in the multi-modal heterogeneous graph; calculating the hidden vector of each node under different modal data; obtaining the representation of the node according to the attention weight of the node and the hidden vectors of the node under different modal data; and obtaining the representation of the multi-modal heteromorphic graph according to the representation of each node.
In a feasible example, the heterogeneous graph representation module 504 is configured to activate a node vector through a nonlinear activation function, and perform normalization processing on the activated node vector to obtain an attention weight; the relationship between each pair of nodes is either a parallel relationship or a progressive relationship, the parallel relationship represents that the two nodes belong to the same type of modal data, and the progressive relationship represents that the two nodes belong to different types of modal data.
In one possible example, the classification module 505 is configured to derive an intention tag prediction probability based on the representation of the multi-modal heteromorphic graph; and calculating a loss value of the intention label prediction probability according to a loss function, and obtaining an intention identification result under the condition that the loss value is kept in a preset range.
The multi-modal intention recognition device provided by the embodiment of the application and the multi-modal intention recognition method provided by the embodiment of the application have the same advantages as the method adopted, operated or realized by the stored application program.
The embodiment of the application also provides an electronic device corresponding to the multi-modal intention recognition method provided by the previous embodiment, so as to execute the multi-modal intention recognition method. The embodiments of the present application are not limited.
Please refer to fig. 6, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 6, the electronic device 20 includes: a processor 200, a memory 201, a bus 202 and a communication interface 203, wherein the processor 200, the communication interface 203 and the memory 201 are connected through the bus 202; the memory 201 stores a computer program that can be executed on the processor 200, and the processor 200 executes the computer program to perform the multi-modal intention recognition method provided by any of the foregoing embodiments of the present application.
The Memory 201 may include a Random Access Memory (RAM) and a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 200. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the application and the multi-modal intention recognition method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic device.
Referring to fig. 7, the computer-readable storage medium is an optical disc 30 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the multi-modal intention recognition method provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application and the multi-modal intention recognition method provided by the embodiment of the present application have the same advantages as the method adopted, run or implemented by the application program stored in the computer-readable storage medium.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system is apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Those skilled in the art will appreciate that the modules in the devices in an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification, and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except for at least some of such features and/or processes or elements being mutually exclusive. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a virtual machine creation system according to embodiments of the present application. The present application may also be embodied as apparatus or system programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the disclosure. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application.
Claims (9)
1. A method of multi-modal intent recognition, the method comprising:
acquiring data to be identified, wherein the data to be identified comprises data of at least two modalities, each modality data has different data types, and the data to be identified comprises text data, picture data and audio data;
coding the data to be identified to obtain a representation sequence of each modal data;
constructing a multi-modal heteromorphic graph by taking the representation sequence of each modal data as a node feature;
coding the multi-modal heteromorphic graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heteromorphic graph;
classifying according to the representation of the multi-modal heteromorphic graph to obtain an intention recognition result;
wherein the encoding of the multi-modal heteromorphic graph through the attention mechanism-based global view to obtain the representation of the multi-modal heteromorphic graph comprises: calculating attention weight according to the relation between each node in the multi-mode heteromorphic graph, wherein the attention weight activates a node vector through a nonlinear activation function, and the activated node vector is obtained through normalization processing; calculating the hidden vector of each node under different modal data; obtaining the representation of the node according to the attention weight of the node and the hidden vectors of the node under different modal data; and obtaining the representation of the multi-modal heteromorphic graph according to the representation of each node.
2. The method according to claim 1, wherein the data to be recognized comprises text data, picture data and audio data, and the encoding the data to be recognized to obtain the representation sequence of each modal data comprises:
performing word segmentation processing on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information;
extracting image features of the image data to obtain a plurality of image areas, and coding the image areas to obtain second coding information;
extracting audio features of the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information;
and taking the first coding information, the second coding information and the third coding information as the input of a three-mode pre-training model to obtain a text sequence, a picture sequence and an audio sequence which respectively correspond to the text data, the picture data and the audio data.
3. The multi-modal intent recognition method of claim 2, further comprising:
for the text data, training the three-mode pre-training model through a minimized negative log-likelihood function to obtain the text sequence;
for the picture data, training the three-mode pre-training model by setting a first function and a second function to obtain the picture sequence;
for the audio data, training the three-mode pre-training model by setting a third function and a fourth function to obtain the audio sequence;
wherein, the first function and the third function both represent that the output of the coder is compared with the input after being converted by the full connection layer; the second function represents classification by a hidden region; the fourth function represents the construction of positive and negative samples from the concealed audio input vector and the non-concealed audio input vector.
4. The method according to claim 2, wherein constructing a multi-modal heteromorphic graph by using the representation sequence of each modal data as a node feature comprises:
obtaining different node types according to different modal data, and determining the number of nodes of each node type according to the number of elements in the representation sequence of each modal data;
the node number of the text node is obtained according to the number of the words in the text data, the node number of the picture node is obtained according to the number of the image areas of the picture data, and the node number of the audio node is obtained according to the number of the audio clips of the audio data.
5. The multi-modal intent recognition method of claim 1, wherein the relationships between each node include a parallel relationship and a progressive relationship, the parallel relationship characterizing two nodes belonging to the same type of modal data, and the progressive relationship characterizing two nodes belonging to different types of modal data.
6. The multi-modal intent recognition method of claim 1, wherein the classifying from the representation of the multi-modal heteromorphic graph to obtain intent recognition results comprises:
obtaining an intention label prediction probability based on the representation of the multi-modal heteromorphic graph;
and calculating a loss value of the intention label prediction probability according to a loss function, and obtaining an intention identification result under the condition that the loss value is kept in a preset range.
7. A multimodal intent recognition apparatus, the apparatus comprising:
the data acquisition module is used for acquiring data to be identified, wherein the data to be identified comprises data of at least two modalities, each modality data has different data types, and the data to be identified comprises text data, picture data and audio data;
the pre-training module is used for coding the data to be identified to obtain a representation sequence of each modal data;
the heterogeneous graph creating module is used for taking the representation sequence of each modal data as a node characteristic to construct a multi-modal heterogeneous graph;
the heterogeneous graph representation module is used for coding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph;
the classification module is used for classifying according to the representation of the multi-modal heteromorphic graph to obtain an intention recognition result;
wherein the encoding of the multi-modal heteromorphic graph through the attention mechanism-based global view to obtain the representation of the multi-modal heteromorphic graph comprises: calculating attention weight according to the relation between each node in the multi-modal heteromorphic graph, wherein the attention weight activates a node vector through a nonlinear activation function, and the activated node vector is obtained through normalization processing; calculating the hidden vector of each node under different modal data; obtaining the representation of the node according to the attention weight of the node and the hidden vectors of the node under different modal data; and obtaining the representation of the multi-modal heteromorphic graph according to the representation of each node.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any one of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211621367.6A CN115618270B (en) | 2022-12-16 | 2022-12-16 | Multi-modal intention recognition method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115618270A CN115618270A (en) | 2023-01-17 |
CN115618270B true CN115618270B (en) | 2023-04-11 |
Family
ID=84880955
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211621367.6A Active CN115618270B (en) | 2022-12-16 | 2022-12-16 | Multi-modal intention recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115618270B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854091B (en) * | 2024-01-15 | 2024-06-07 | 金锋馥(滁州)科技股份有限公司 | Method for extracting information of multi-surface dense labels of packages based on image feature detection |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112200317A (en) * | 2020-09-28 | 2021-01-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Multi-modal knowledge graph construction method |
CN114169408A (en) * | 2021-11-18 | 2022-03-11 | 杭州电子科技大学 | Emotion classification method based on multi-mode attention mechanism |
CN114186069A (en) * | 2021-11-29 | 2022-03-15 | 江苏大学 | Deep video understanding knowledge graph construction method based on multi-mode heteromorphic graph attention network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805089B (en) * | 2018-06-14 | 2021-06-29 | 南京云思创智信息科技有限公司 | Multi-modal-based emotion recognition method |
CN115099234A (en) * | 2022-07-15 | 2022-09-23 | 哈尔滨工业大学 | Chinese multi-mode fine-grained emotion analysis method based on graph neural network |
Also Published As
Publication number | Publication date |
---|---|
CN115618270A (en) | 2023-01-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |