CN115618270B - Multi-modal intention recognition method and device, electronic equipment and storage medium - Google Patents

Multi-modal intention recognition method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115618270B
CN115618270B CN202211621367.6A
Authority
CN
China
Prior art keywords
data
modal
node
representation
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211621367.6A
Other languages
Chinese (zh)
Other versions
CN115618270A (en)
Inventor
张烁
刘芳
陈曦
杨睿
安业腾
张惠民
张妍
赵伟
王晨飞
徐李阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Co ltd Customer Service Center
Original Assignee
State Grid Co ltd Customer Service Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Co ltd Customer Service Center filed Critical State Grid Co ltd Customer Service Center
Priority to CN202211621367.6A priority Critical patent/CN115618270B/en
Publication of CN115618270A publication Critical patent/CN115618270A/en
Application granted granted Critical
Publication of CN115618270B publication Critical patent/CN115618270B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a multi-modal intention recognition method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities and each modality has a different data type; encoding the data to be recognized to obtain a representation sequence of each modality; constructing a multi-modal heterogeneous graph by taking the representation sequence of each modality as node features; encoding the multi-modal heterogeneous graph through an attention-based global view to obtain a representation of the multi-modal heterogeneous graph; and classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result. The method can effectively fuse multi-modal information, improves the accuracy of user interaction intention recognition by adopting the multi-modal heterogeneous graph, and realizes natural and flexible human-computer interaction.

Description

Multi-modal intention recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for recognizing a multi-modal intention, an electronic device, and a storage medium.
Background
Intention recognition analyzes the core requirements of a user and outputs the information most relevant to the query input. Task-oriented dialogue intention recognition in the prior art generally addresses only single-intention recognition: word vectors and contextual word vectors in sample texts are used to train an intention recognition model, and the model determines the intention corresponding to the user input in order to generate and execute a series of behaviors and strategies for interacting with the user. In real life, however, people often need to judge the real intention comprehensively from multiple kinds of modal information (such as natural language, video, and audio signals). Besides the most common text, multi-modal data such as pictures, videos, and audio can also be used to help understand the user's intention and improve the accuracy of information services.
For example, in the field of power systems, scenes that are hard to describe in words are often encountered in power failure repair, because in a customer service session a user may send not only plain text but also image and voice information. For instance, the repair or installation of a charging pile cannot be described directly through text; the user usually reports the failure or makes an inquiry by taking photos, and the user's intention can be determined accurately only by considering the text and image information together.
However, most intention benchmark datasets currently contain only text modality information, and human-machine interaction data are monotonous. The few approaches to multi-modal intention recognition fuse a multi-modal pre-training model with an attention mechanism to train a multi-modal intention recognition model, but their recognition accuracy is not high and the modality fusion is simplistic, which greatly limits the development of multi-modal intention understanding; multi-intention recognition in the power failure repair field has been studied even less.
Therefore, improving the accuracy of multi-modal intention recognition is an urgent problem.
Disclosure of Invention
In view of the above, an object of the present application is to provide a multi-modal intention recognition method, apparatus, electronic device, and storage medium for the field of power failure repair, which can solve the existing problems.
In a first aspect, based on the above object, the present application provides a multi-modal intention recognition method, comprising: acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities and each modality has a different data type; encoding the data to be recognized to obtain a representation sequence of each modality; constructing a multi-modal heterogeneous graph by taking the representation sequence of each modality as node features; encoding the multi-modal heterogeneous graph through an attention-based global view to obtain a representation of the multi-modal heterogeneous graph; and classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
Optionally, the data to be recognized includes text data, picture data, and audio data, and encoding the data to be recognized to obtain a representation sequence of each modality includes: performing word segmentation on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information; extracting image features from the picture data to obtain a plurality of image regions, and encoding the image regions to obtain second encoding information; extracting audio features from the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information; and taking the first, second, and third encoding information as the input of a tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding to the text data, the picture data, and the audio data, respectively.
Optionally, for the text data, the tri-modal pre-training model is trained by minimizing a negative log-likelihood function to obtain the text sequence; for the picture data, the tri-modal pre-training model is trained with a first function and a second function to obtain the picture sequence; and for the audio data, the tri-modal pre-training model is trained with a third function and a fourth function to obtain the audio sequence.
Optionally, constructing a multi-modal heterogeneous graph by taking the representation sequence of each modality as node features includes: obtaining different node types from the different modalities, and determining the number of nodes of each node type from the number of elements in the representation sequence of that modality; the number of text nodes is obtained from the number of words in the text data, the number of picture nodes is obtained from the number of image regions in the picture data, and the number of audio nodes is obtained from the number of audio segments in the audio data.
Optionally, encoding the multi-modal heterogeneous graph through an attention-based global view to obtain a representation of the multi-modal heterogeneous graph includes: calculating attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph; calculating a hidden vector of each node under each modality; obtaining the representation of each node from its attention weights and its hidden vectors under the different modalities; and obtaining the representation of the multi-modal heterogeneous graph from the representations of all nodes.
Optionally, calculating attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph includes: activating the node vectors through a nonlinear activation function, and normalizing the activated node vectors to obtain the attention weights; the relationships between nodes include a parallel relationship and a progressive relationship, where a parallel relationship indicates that two nodes belong to the same type of modal data and a progressive relationship indicates that two nodes belong to different types of modal data.
Optionally, classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result includes: obtaining an intention label prediction probability based on the representation of the multi-modal heterogeneous graph; and calculating a loss value of the intention label prediction probability according to a loss function, and obtaining the intention recognition result when the loss value stays within a preset range.
In a second aspect, for the above purpose, the present application further provides a multi-modal intention recognition apparatus, comprising: a data acquisition module for acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities and each modality has a different data type; a pre-training module for encoding the data to be recognized to obtain a representation sequence of each modality; a heterogeneous graph creation module for constructing a multi-modal heterogeneous graph by taking the representation sequence of each modality as node features; a heterogeneous graph representation module for encoding the multi-modal heterogeneous graph through an attention-based global view to obtain a representation of the multi-modal heterogeneous graph; and a classification module for classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
In a third aspect, the present embodiment also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to any one of the first aspect.
In a fourth aspect, the present embodiments also provide a computer-readable storage medium, on which a computer program is stored, wherein the program is executed by a processor to implement the method according to any one of the first aspect.
In general, the advantages of the present application and the experience it brings to the user are as follows:
This embodiment provides a multi-modal intention recognition method: data to be recognized in different modalities are acquired and encoded to obtain a representation sequence of each modality; the representation sequence of each modality is used as node features to construct a multi-modal heterogeneous graph, which provides a new idea for multi-modal dialogue intention understanding; the multi-modal heterogeneous graph is encoded through an attention-based global view to obtain its representation; and classification is performed according to that representation to obtain an intention recognition result. The method can effectively fuse multi-modal information, improves the accuracy of user interaction intention recognition by adopting the multi-modal heterogeneous graph, and realizes natural and flexible human-computer interaction.
Drawings
In the drawings, like reference characters designate like or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 illustrates a flow diagram of a multi-modal intent recognition method of the present application;
FIG. 2 illustrates a schematic structural diagram of a tri-modal pre-training model in accordance with an example of the present application;
FIG. 3 shows a schematic diagram of a multi-modal heterogeneous graph in accordance with one example of the present application;
FIG. 4 illustrates a flow diagram for deriving a representation of a multi-modal heterogeneous graph in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a multi-modal intent recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 7 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates a flow diagram of a multimodal intent recognition method of the present application. As shown in FIG. 1, the multi-modal intent recognition method includes the following steps S101 to S105:
S101, acquiring the data to be recognized.
The execution subject of this embodiment may be a server or an intelligent terminal. For example, a user may interact with a backend server through an intelligent terminal, or interact with the terminal through an application or applet installed on it, and the user may input the content to be queried through voice, text, or pictures. The embodiment can be applied to the field of power customer service fault repair, and can also be applied to other scenarios such as online shopping.
In one example, since this embodiment needs to recognize the user's multi-modal intention, the data to be recognized input by the user must first be obtained. The data to be recognized includes data of at least two modalities, each with a different data type; for example, text and voice information, text and picture information, or text, picture, and voice information.
Considering that an application scenario of this embodiment is power failure reporting, where scenes that are difficult to describe in words are common because a user may send not only plain text but also image and voice information in a customer service session, this embodiment takes as an example the case where the data to be recognized includes text data, picture data, and audio data. The text data may be power failure text entered by the user through a terminal, the picture data may be pictures of the faulty part uploaded by the user through the terminal, and the audio data may be the user's spoken description of the power failure captured by a voice acquisition device.
S102, encoding the data to be recognized to obtain a representation sequence of each modality.
In this embodiment, a tri-modal pre-training model (Omni-Perception Pre-Trainer, OPT) is used to encode the data to be recognized to obtain a representation sequence of each modality, and the encoded features are used as node features.
Specifically, encoding the data to be recognized to obtain a representation sequence of each modality includes: performing word segmentation on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information; extracting image features from the picture data to obtain a plurality of image regions, and encoding the image regions to obtain second encoding information; extracting audio features from the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information; and taking the first, second, and third encoding information as the input of the tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding to the text data, the picture data, and the audio data, respectively.
The text data can be segmented with WordPiece; for example, "how to apply for a charging pile in an old residential community" is segmented into the words "old residential community", "charging pile", "how", and "apply". For picture data, region-of-interest (ROI) features can be extracted from the picture using Fast R-CNN, yielding a plurality of image regions. For audio data, the audio representation can be obtained with wav2vec, a convolutional neural network that takes raw audio as input and computes a general representation that can be fed to a speech recognition system; the audio data are divided into a plurality of audio segments.
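As a rough illustration of this preprocessing, the sketch below prepares the three kinds of encoding information with common open-source tools; the checkpoint names and the detect_rois helper are assumptions made for the example and are not part of the patent.

```python
import torch
from transformers import BertTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model

def encode_text(sentence: str):
    """WordPiece-style segmentation plus token ids (basis of the first encoding information)."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    enc = tokenizer(sentence, return_tensors="pt")
    return enc["input_ids"], enc["attention_mask"]

def encode_image(image: torch.Tensor, detect_rois):
    """ROI feature extraction (basis of the second encoding information).
    detect_rois is a hypothetical callable wrapping a Fast/Faster R-CNN detector
    that returns per-region features and bounding boxes."""
    region_feats, region_boxes = detect_rois(image)            # [K, d], [K, 4]
    return torch.cat([region_feats, region_boxes], dim=-1)     # image + position info per region

def encode_audio(waveform: torch.Tensor, sampling_rate: int = 16000):
    """wav2vec 2.0 segment representations (basis of the third encoding information)."""
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    inputs = extractor(waveform.numpy(), sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state               # [1, Q, d] audio segment features
```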
In this embodiment, fig. 2 is a schematic structural diagram of the OPT model. As shown in fig. 2, the OPT model includes a token-level mask layer, a modality-level mask layer, a modality encoding layer, an OPT training layer, a cross-modal encoder, and a decoding layer. The token-level mask layer masks the words of the text data, the image regions of the picture data, and the audio segments of the audio data according to a preset ratio; the modality-level mask layer converts the words, image regions, and audio segments into their corresponding modality forms; the modality encoding layer includes a Text Encoder, a Vision Encoder, and an Audio Encoder, which encode the data of the different modalities respectively; the OPT training layer performs OPT model training; the cross-modal encoder fuses the multiple modalities; and the decoding layer includes a Text Decoder, a Vision Decoder, and an Audio Decoder, which output the results for the corresponding modalities. In this embodiment, the OPT training layer includes a Masked Language Model (MLM), a Masked Vision Model (MVM), and a Masked Audio Model (MAM), which train the data of the different modalities respectively.
As shown in fig. 2, the input of the OPT model consists of three parts: a text input, a picture input, and an audio input. In this embodiment, the parts to be masked in the text, picture, and audio inputs of fig. 2 are all marked "MASK" when the OPT model is pre-trained through the objective functions, and the training objective is to predict the masked parts.
In this embodiment, for the text input, each segmented word is encoded by combining its token embedding (a low-dimensional continuous word vector looked up at input time) with a position embedding that encodes its position; the two embeddings are added to obtain the embedding input of the text, which is then encoded by the Text Encoder to obtain the first encoding information.
For the picture input, ROI features are extracted from the original image with Fast R-CNN to obtain a plurality of image regions; the image features and position information of each region are fed to a fully-connected layer that maps them into the same space, their encodings are added to obtain the embedding input of the picture, and the Vision Encoder encodes it to obtain the second encoding information.
For the audio input, after the audio segments are obtained, they are encoded by the Audio Encoder to obtain the third encoding information.
In this embodiment, the first, second, and third encoding information are used as the inputs of the tri-modal pre-training model, and the text sequence, picture sequence, and audio sequence corresponding to the text data, picture data, and audio data are obtained through OPT.
In one example, considering that different data have different characteristics, this embodiment trains the tri-modal pre-training model with different training methods for the different types of data.
In one example, assume that for the text data the text sequence obtained by word segmentation is $T = \{t_1, t_2, \dots, t_N\}$, where $N$ is the number of words after segmentation. For the picture data, the picture sequence represented by the ROI features is $V = \{v_1, v_2, \dots, v_K\}$, where $K$ is the number of picture regions. For the audio data, the audio sequence represented by the audio segments obtained with wav2vec 2.0 is $A = \{a_1, a_2, \dots, a_Q\}$, where $Q$ is the number of audio segments.
Pre-training of the model is then performed on the basis of $T$, $V$, and $A$; the training method and objective function of each part are as follows.
specifically, for text data, a three-mode pre-training model is trained through a minimized negative log-likelihood function, and a text sequence is obtained.
For the text sequence, the OPT model randomly masks 15% of the words to obtain the token-level masked representation shown in fig. 2, and the training objective is to predict the masked words. The training is realized by minimizing the negative log-likelihood

$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\,\mathbb{E}\,\log P_{\theta}\!\left(t_m \mid t_{\setminus m}, V, A\right),$$

where $\mathcal{L}_{\mathrm{MLM}}$ is the minimized negative log-likelihood of the masked language model (MLM), $t_m$ denotes the masked words, $t_{\setminus m}$ denotes the unmasked words, $V$ denotes the picture sequence, and $A$ denotes the audio sequence.
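The following PyTorch sketch shows one way such a masked-word objective could be computed; opt_text_encoder, vocab_proj, and the mask id are assumptions for illustration, and the conditioning on the picture and audio sequences is left out for brevity.

```python
import torch
import torch.nn.functional as F

def mlm_loss(opt_text_encoder, vocab_proj, token_ids, mask_ratio=0.15, mask_id=103):
    """Masked-word objective: negative log-likelihood of the hidden words.
    opt_text_encoder and vocab_proj stand in for the OPT text encoder and its
    output projection; both are illustrative assumptions."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_ratio            # hide ~15% of the words
    masked_ids = token_ids.masked_fill(mask, mask_id)
    logits = vocab_proj(opt_text_encoder(masked_ids))          # [B, N, |vocab|]
    labels[~mask] = -100                                       # only masked positions are predicted
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100)
```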
Specifically, for the picture data, a first function and a second function are set to train the tri-modal pre-training model and obtain the picture sequence.
For the picture sequence, the OPT model likewise randomly masks 15% of the picture regions to obtain the masked picture representation shown in fig. 2, and the training objective is to reconstruct the masked regions. Because the visual features of a picture are high-dimensional, a likelihood function cannot be applied directly, so two functions are used. The objective function is

$$\mathcal{L}_{\mathrm{MVM}}(\theta) = \mathbb{E}\, f_{\theta}\!\left(v_m \mid v_{\setminus m}, T, A\right),$$

where $\mathcal{L}_{\mathrm{MVM}}$ is the minimized objective of the masked vision model (MVM) and $f_{\theta}$ takes one of two forms, a first function and a second function.

The first function directly compares the output of the encoder, after conversion through the fully-connected layer, with the masked input:

$$f^{(1)}_{\theta}\!\left(v_m \mid v_{\setminus m}, T, A\right) = \sum_{i}\left\| h_{\theta}\!\left(v^{(i)}_m\right) - r\!\left(v^{(i)}_m\right) \right\|^2_2,$$

where $v_m$ is the masked picture input, $v_{\setminus m}$ is the unmasked picture input, and $h_{\theta}(v^{(i)}_m)$ is the picture feature obtained by converting the masked region through the fully-connected layer.

The second function classifies the masked region:

$$f^{(2)}_{\theta}\!\left(v_m \mid v_{\setminus m}, T, A\right) = \sum_{i}\mathrm{CE}\!\left(c\!\left(v^{(i)}_m\right),\, g_{\theta}\!\left(v^{(i)}_m\right)\right),$$

where $g_{\theta}(v^{(i)}_m)$ denotes the label vector obtained by converting $v^{(i)}_m$, $c(v^{(i)}_m)$ denotes the true label vector of $v^{(i)}_m$, and $\mathrm{CE}$ denotes the cross-entropy between $c(v^{(i)}_m)$ and $g_{\theta}(v^{(i)}_m)$; the classification result is output when the cross-entropy error is minimal.
Specifically, for the audio data, a third function and a fourth function are set to train the tri-modal pre-training model and obtain the audio sequence. For the audio sequence, the objective function is

$$\mathcal{L}_{\mathrm{MAM}}(\theta) = \mathbb{E}\, f_{\theta}\!\left(a_m \mid a_{\setminus m}, T, V\right),$$

where $\mathcal{L}_{\mathrm{MAM}}$ is the minimized objective of the masked audio model (MAM) and $f_{\theta}$ has two expressions, a third function and a fourth function.

The third function compares the output of the encoder, after conversion through the fully-connected layer, with the masked input:

$$f^{(3)}_{\theta}\!\left(a_m \mid a_{\setminus m}, T, V\right) = \sum_{i}\left\| h_{\theta}\!\left(a^{(i)}_m\right) - r\!\left(a^{(i)}_m\right) \right\|^2_2,$$

where $a_m$ is the masked audio input vector, $a_{\setminus m}$ is the unmasked audio input vector, and $h_{\theta}(a^{(i)}_m)$ is the audio vector obtained by converting the masked audio through the fully-connected layer.

The fourth function maximizes the mutual information between the masked segments through contrastive learning, with positive and negative samples constructed from the masked audio input vectors and the unmasked audio input vectors.
S103, constructing a multi-modal heterogeneous graph by taking the representation sequence of each modality as node features.
In this embodiment, constructing the multi-modal heterogeneous graph includes: obtaining different node types from the different modalities, and determining the number of nodes of each node type from the number of elements in the representation sequence of that modality. The data to be recognized are modeled as a multi-modal heterogeneous graph in which data of different modalities are represented by different types of nodes.
The number of text nodes is obtained from the number of words in the text data, the number of picture nodes from the number of image regions in the picture data, and the number of audio nodes from the number of audio segments in the audio data. For the text modality each segmented word is represented by one node, for the picture modality each image region is represented by one node, and for the audio modality each audio segment is represented by one node.
It should be noted that the relationships between nodes in this embodiment include a parallel relationship and a progressive relationship: a parallel relationship indicates that two nodes belong to the same type of modal data, and a progressive relationship indicates that two nodes belong to different types of modal data.
Fig. 3 is a schematic diagram of a multi-modal heterogeneous graph. As shown in fig. 3, the picture data have 3 nodes, the text data have 4 nodes, and the audio data have 4 nodes. The 3 picture nodes are picture node 1, picture node 2, and picture node 3; the four text nodes are "old residential community", "charging pile", "how", and "apply"; and the 4 audio nodes are audio node 1, audio node 2, audio node 3, and audio node 4. Nodes belonging to the same type of modal data have a parallel relationship; for example, picture node 1, picture node 2, and picture node 3 are in a parallel relationship with each other, while picture node 1, picture node 2, and picture node 3 each have a progressive relationship with the text node "charging pile".
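To make the construction concrete, here is a small Python sketch that builds such a graph; the fully connected CO/PG edge pattern and the container names are assumptions for illustration, since the patent does not enumerate which node pairs are linked.

```python
from dataclasses import dataclass, field
from itertools import combinations, product

@dataclass
class MultiModalGraph:
    """Toy container for the heterogeneous graph: node ids are (modality, index);
    'CO' marks parallel edges within a modality, 'PG' marks progressive edges
    across modalities."""
    nodes: dict = field(default_factory=dict)   # (modality, idx) -> feature vector
    edges: list = field(default_factory=list)   # (node_a, node_b, edge_type)

def build_graph(text_seq, picture_seq, audio_seq):
    g = MultiModalGraph()
    for modality, seq in (("TN", text_seq), ("PN", picture_seq), ("AN", audio_seq)):
        for i, feat in enumerate(seq):
            g.nodes[(modality, i)] = feat        # one node per word / image region / audio segment
    # Parallel (CO) edges between nodes of the same modality.
    for modality in ("TN", "PN", "AN"):
        same = [n for n in g.nodes if n[0] == modality]
        g.edges += [(a, b, "CO") for a, b in combinations(same, 2)]
    # Progressive (PG) edges between nodes of different modalities.
    for m1, m2 in combinations(("TN", "PN", "AN"), 2):
        g.edges += [(a, b, "PG") for a, b in product(
            [n for n in g.nodes if n[0] == m1],
            [n for n in g.nodes if n[0] == m2])]
    return g
```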
S104, encoding the multi-modal heterogeneous graph through an attention-based global view to obtain a representation of the multi-modal heterogeneous graph.
In this embodiment, on the basis of the constructed heterogeneous graph, the graph is encoded through an attention-based global view. Under the global view, attention aggregation is first performed on the nodes within each modality to obtain a representation per modality, and the different modalities are then aggregated to obtain the representation of the whole graph.
As shown in FIG. 4, obtaining the representation of the multi-modal heterogeneous graph includes the following steps S401 to S404:
s401, calculating attention weight according to the relation between each node in the multi-modal heteromorphic graph.
In this embodiment, the relationships between nodes include a parallel relationship and a progressive relationship; a parallel relationship indicates that two nodes belong to the same type of modal data, and a progressive relationship indicates that two nodes belong to different types of modal data. In the global view there are three node types N = {TN, PN, AN} and two edge types E = {CO, PG}, where TN denotes a word node of the text modality, PN denotes an image region node of the picture modality, AN denotes an audio segment node of the audio modality, CO denotes the parallel relationship, and PG denotes the progressive relationship. After the representation sequence of each modality is obtained through the pre-training model, all nodes are mapped into the same vector space.
In this embodiment, calculating the attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph includes: activating the node vectors through a nonlinear activation function, and normalizing the activated node vectors to obtain the attention weights.
Specifically, the node vectors are activated by a nonlinear activation function $\sigma$ and normalized by softmax to obtain the attention weight

$$\alpha_{ik} = \frac{\exp\!\big(\sigma\!\left(W\,[\,x_i \,\|\, x_k\,]\right)\big)}{\sum_{j \in \mathcal{N}_i^{p}}\exp\!\big(\sigma\!\left(W\,[\,x_i \,\|\, x_j\,]\right)\big)},$$

where $\mathcal{N}_i^{p}$ denotes the set of neighbor nodes of node $i$ within modality $p$ (nodes $k$ and $i$ belong to the same modality), $[\,\cdot \,\|\, \cdot\,]$ denotes vector concatenation, $W$ denotes a mapping matrix, and $\sigma$ is the nonlinear activation function.
S402, calculating the hidden vector of each node under different modal data.
After the attention weights are obtained, the hidden vector of node $i$ under modality $p$ is computed as

$$h_i^{p} = \sigma\!\Big(\sum_{k \in \mathcal{N}_i^{p}} \alpha_{ik}\, W\, x_k\Big),$$

where $W$ denotes the weight matrix and $\sigma$ is the nonlinear activation function.
And S403, obtaining the representation of the node according to the attention weight of the node and the hidden vectors of the node under different modal data.
After the representation of node $i$ under each modality $p$ is obtained, the different modalities are aggregated, and the representation of the node under the global view is

$$h_i = \sum_{p} \beta_{p}\, h_i^{p},$$

where $\beta_{p}$ denotes the weight of each modality.
S404, obtaining the representation of the multi-modal heteromorphic graph according to the representation of each node.
After the representations of all nodes are computed, they are averaged to obtain the representation of the final graph

$$h_G = \mathrm{AVG}_{i \in \mathcal{V}}\big(h_i\big),$$

where $\mathcal{V}$ denotes the set of all nodes in the graph and AVG denotes averaging of the node vectors.
In this way, the representation of the multi-modal heterogeneous graph is obtained, a better multi-modal information fusion effect can be achieved, and the accuracy of user interaction intention recognition is improved.
S105, classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
In this embodiment, classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result includes: obtaining an intention label prediction probability based on the representation of the multi-modal heterogeneous graph; and calculating a loss value of the intention label prediction probability according to a loss function, and obtaining the intention recognition result when the loss value stays within a preset range.
Specifically, after the graph representation vector is obtained, the intention is computed as

$$\hat{y} = \mathrm{softmax}\!\left(W\, h_G\right),$$

where $W$ is the parameter matrix to be trained and the softmax normalization converts the result into the prediction probability of the intention label. The loss function is

$$L = -\sum_{c} y_c \log \hat{y}_c,$$

where $y$ denotes the user's true intention label and $\hat{y}$ denotes the intention prediction probability.
The smaller the loss value $L$, the better the classification effect of the model; when the loss value stabilizes within the preset range, the model is judged to have converged, so the output intention prediction result is more accurate.
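A small sketch of this classification head, assuming the graph representation has already been computed; the dimensions and class count are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentClassifier(nn.Module):
    """Linear classification over the graph representation, matching the softmax
    and cross-entropy formulas above."""
    def __init__(self, graph_dim: int = 768, num_intents: int = 10):
        super().__init__()
        self.W = nn.Linear(graph_dim, num_intents, bias=False)   # parameter matrix to be trained

    def forward(self, h_graph: torch.Tensor) -> torch.Tensor:
        # h_graph: [batch, graph_dim] -> intention label probabilities [batch, num_intents]
        return F.softmax(self.W(h_graph), dim=-1)

def intent_loss(probs: torch.Tensor, true_label: torch.Tensor) -> torch.Tensor:
    # Cross-entropy between the true intention label and the predicted probabilities.
    return F.nll_loss(torch.log(probs), true_label)
```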
In the multi-modal intention recognition method provided by this embodiment, data to be recognized in different modalities are acquired and encoded to obtain a representation sequence of each modality; the representation sequence of each modality is used as node features to construct a multi-modal heterogeneous graph, which provides a new idea for multi-modal dialogue intention understanding; the multi-modal heterogeneous graph is encoded through an attention-based global view to obtain its representation; and classification is performed according to that representation to obtain the intention recognition result. The method can effectively fuse multi-modal information, improves the accuracy of user interaction intention recognition by adopting the multi-modal heterogeneous graph, and realizes natural and flexible human-computer interaction.
The present embodiment also provides a multi-modal intention recognition apparatus, and referring to fig. 5, the multi-modal intention recognition apparatus 500 includes:
a data obtaining module 501, configured to obtain data to be identified, where the data to be identified includes data of at least two modalities, and each modality data has a different data type;
a pre-training module 502, configured to encode the data to be identified to obtain a representation sequence of each modal data;
a heterogeneous graph creating module 503, configured to use the representation sequence of each modal data as a node feature to construct a multi-modal heterogeneous graph;
a heterogeneous graph representation module 504, configured to encode the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph;
and the classification module 505 is used for classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result.
In one possible example, the pre-training module 502 is further configured to perform word segmentation on the text data to obtain a plurality of words, and encode the words to obtain first encoding information; extract image features from the picture data to obtain a plurality of image regions, and encode the image regions to obtain second encoding information; extract audio features from the audio data to obtain a plurality of audio segments, and encode the audio segments to obtain third encoding information; and take the first, second, and third encoding information as the input of a tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding to the text data, the picture data, and the audio data, respectively.
In a possible example, the pre-training module 502 is further configured to train the tri-modal pre-training model by minimizing a negative log-likelihood function with respect to the text data to obtain the text sequence; for the picture data, training the three-mode pre-training model by setting a first function and a second function to obtain the picture sequence; and for the audio data, training the three-mode pre-training model by setting a third function and a fourth function to obtain the audio sequence.
In a feasible example, the heterogeneous graph creating module 503 is configured to obtain different node types according to different modal data, and determine the number of nodes of each node type according to the number of elements in the representation sequence of each modal data; the node number of the text node is obtained according to the number of the words in the text data, the node number of the picture node is obtained according to the number of the image areas of the picture data, and the node number of the audio node is obtained according to the number of the audio clips of the audio data.
In one possible example, the heterogeneous graph representation module 504 is configured to calculate attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph; calculate the hidden vector of each node under the different modalities; obtain the representation of each node from its attention weights and its hidden vectors under the different modalities; and obtain the representation of the multi-modal heterogeneous graph from the representations of all nodes.
In one possible example, the heterogeneous graph representation module 504 is configured to activate the node vectors through a nonlinear activation function, and normalize the activated node vectors to obtain the attention weights; the relationships between nodes include a parallel relationship and a progressive relationship, where a parallel relationship indicates that two nodes belong to the same type of modal data and a progressive relationship indicates that two nodes belong to different types of modal data.
In one possible example, the classification module 505 is configured to obtain an intention label prediction probability based on the representation of the multi-modal heterogeneous graph; and calculate a loss value of the intention label prediction probability according to a loss function, and obtain the intention recognition result when the loss value stays within a preset range.
The multi-modal intention recognition device provided by the embodiment of the application and the multi-modal intention recognition method provided by the embodiment of the application have the same advantages as the method adopted, operated or realized by the stored application program.
The embodiment of the application also provides an electronic device corresponding to the multi-modal intention recognition method provided by the previous embodiment, so as to execute the multi-modal intention recognition method; the embodiments of the present application are not limited in this respect.
Please refer to fig. 6, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 6, the electronic device 20 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203, where the processor 200, the communication interface 203, and the memory 201 are connected through the bus 202; the memory 201 stores a computer program that can be executed on the processor 200, and the processor 200 executes the computer program to perform the multi-modal intention recognition method provided by any of the foregoing embodiments of the present application.
The Memory 201 may include a Random Access Memory (RAM) and a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
Bus 202 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 201 is used for storing a program, and the processor 200 executes the program after receiving an execution instruction; the multi-modal intention recognition method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 200, or implemented by the processor 200.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 200. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EEPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the application and the multi-modal intention recognition method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic device.
Referring to fig. 7, the computer-readable storage medium is an optical disc 30 on which a computer program (i.e., a program product) is stored; when the computer program is executed by a processor, it performs the multi-modal intention recognition method provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application and the multi-modal intention recognition method provided by the embodiment of the present application have the same advantages as the method adopted, run or implemented by the application program stored in the computer-readable storage medium.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system is apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Those skilled in the art will appreciate that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification, and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations in which at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a virtual machine creation system according to embodiments of the present application. The present application may also be embodied as apparatus or system programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the disclosure. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application.

Claims (9)

1. A method of multi-modal intent recognition, the method comprising:
acquiring data to be recognized, wherein the data to be recognized comprises data of at least two modalities, each modality has a different data type, and the data to be recognized comprises text data, picture data and audio data;
encoding the data to be recognized to obtain a representation sequence of each modal data;
constructing a multi-modal heterogeneous graph by taking the representation sequence of each modal data as a node feature;
encoding the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph;
classifying according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result;
wherein the encoding of the multi-modal heterogeneous graph through the attention-mechanism-based global view to obtain the representation of the multi-modal heterogeneous graph comprises: calculating attention weights according to the relationships between the nodes in the multi-modal heterogeneous graph, wherein the attention weights are obtained by activating the node vectors through a nonlinear activation function and normalizing the activated node vectors; calculating the hidden vector of each node under the different modal data; obtaining the representation of each node according to its attention weights and its hidden vectors under the different modal data; and obtaining the representation of the multi-modal heterogeneous graph according to the representation of each node.
2. The method according to claim 1, wherein the data to be recognized comprises text data, picture data and audio data, and the encoding the data to be recognized to obtain the representation sequence of each modal data comprises:
performing word segmentation processing on the text data to obtain a plurality of words, and encoding the words to obtain first encoding information;
extracting image features of the image data to obtain a plurality of image areas, and coding the image areas to obtain second coding information;
extracting audio features of the audio data to obtain a plurality of audio segments, and encoding the audio segments to obtain third encoding information;
and taking the first encoding information, the second encoding information, and the third encoding information as the input of a tri-modal pre-training model to obtain a text sequence, a picture sequence, and an audio sequence corresponding to the text data, the picture data, and the audio data, respectively.
3. The multi-modal intent recognition method of claim 2, further comprising:
for the text data, training the tri-modal pre-training model through a minimized negative log-likelihood function to obtain the text sequence;
for the picture data, training the tri-modal pre-training model by setting a first function and a second function to obtain the picture sequence;
for the audio data, training the tri-modal pre-training model by setting a third function and a fourth function to obtain the audio sequence;
wherein the first function and the third function both compare the output of the encoder, after conversion through the fully-connected layer, with the input; the second function performs classification on the masked region; and the fourth function constructs positive and negative samples from the masked audio input vectors and the unmasked audio input vectors.
5. The method according to claim 2, wherein constructing a multi-modal heterogeneous graph by using the representation sequence of each modal data as a node feature comprises:
obtaining different node types according to different modal data, and determining the number of nodes of each node type according to the number of elements in the representation sequence of each modal data;
the node number of the text node is obtained according to the number of the words in the text data, the node number of the picture node is obtained according to the number of the image areas of the picture data, and the node number of the audio node is obtained according to the number of the audio clips of the audio data.
5. The multi-modal intention recognition method according to claim 1, wherein the relationships between the nodes include a parallel relationship and a progressive relationship, the parallel relationship characterizing two nodes belonging to the same modality's data, and the progressive relationship characterizing two nodes belonging to different modalities' data.
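A small sketch of the graph construction described in claims 4 and 5: one node per word, image region and audio segment, with "parallel" edges inside a modality and "progressive" edges across modalities. The claims do not state which node pairs are connected; connecting every pair, as below, is only one possible choice.

from dataclasses import dataclass
from itertools import combinations

@dataclass
class Node:
    index: int
    modality: str  # "text", "picture" or "audio"

def build_graph(num_words, num_regions, num_clips):
    nodes = []
    for modality, count in (("text", num_words), ("picture", num_regions), ("audio", num_clips)):
        start = len(nodes)  # node count per type follows the element count of that modality's sequence
        nodes += [Node(start + i, modality) for i in range(count)]
    edges = []
    for a, b in combinations(nodes, 2):
        relation = "parallel" if a.modality == b.modality else "progressive"
        edges.append((a.index, b.index, relation))
    return nodes, edges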
6. The multi-modal intention recognition method according to claim 1, wherein classifying according to the representation of the multi-modal heterogeneous graph to obtain the intention recognition result comprises:
obtaining an intention label prediction probability based on the representation of the multi-modal heterogeneous graph;
and calculating a loss value for the intention label prediction probability according to a loss function, and obtaining the intention recognition result under the condition that the loss value stays within a preset range.
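A minimal sketch of this classification step, assuming a linear head over the graph representation and a negative log-likelihood loss (the claim does not name a specific loss function, and the threshold value below is illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentClassifier(nn.Module):
    def __init__(self, d: int, num_intents: int):
        super().__init__()
        self.head = nn.Linear(d, num_intents)

    def forward(self, graph_repr):
        # graph_repr: (B, d) -> intention label prediction probabilities (B, num_intents)
        return F.softmax(self.head(graph_repr), dim=-1)

def intent_result(probs, labels, loss_threshold=0.5):
    # loss value of the prediction probabilities; a result is returned only while
    # the loss stays within the preset range, mirroring the claim wording
    loss = F.nll_loss(torch.log(probs + 1e-9), labels)
    return probs.argmax(dim=-1) if loss.item() <= loss_threshold else None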
7. A multi-modal intention recognition apparatus, the apparatus comprising:
a data acquisition module, configured to acquire data to be identified, wherein the data to be identified comprises data of at least two modalities, the data of each modality has a different data type, and the data to be identified comprises text data, picture data and audio data;
a pre-training module, configured to encode the data to be identified to obtain a representation sequence of each modality's data;
a heterogeneous graph creation module, configured to construct a multi-modal heterogeneous graph by taking the representation sequence of each modality's data as node features;
a heterogeneous graph representation module, configured to encode the multi-modal heterogeneous graph through a global view based on an attention mechanism to obtain a representation of the multi-modal heterogeneous graph;
a classification module, configured to classify according to the representation of the multi-modal heterogeneous graph to obtain an intention recognition result;
wherein encoding the multi-modal heterogeneous graph through the attention-mechanism-based global view to obtain the representation of the multi-modal heterogeneous graph comprises: calculating an attention weight according to the relationship between each pair of nodes in the multi-modal heterogeneous graph, wherein the attention weight is obtained by activating the node vectors through a nonlinear activation function and normalizing the activated node vectors; calculating a hidden vector of each node under each modality's data; obtaining the representation of a node according to the attention weight of the node and the hidden vectors of the node under the different modalities' data; and obtaining the representation of the multi-modal heterogeneous graph according to the representations of the nodes.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202211621367.6A 2022-12-16 2022-12-16 Multi-modal intention recognition method and device, electronic equipment and storage medium Active CN115618270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621367.6A CN115618270B (en) 2022-12-16 2022-12-16 Multi-modal intention recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211621367.6A CN115618270B (en) 2022-12-16 2022-12-16 Multi-modal intention recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115618270A CN115618270A (en) 2023-01-17
CN115618270B true CN115618270B (en) 2023-04-11

Family

ID=84880955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211621367.6A Active CN115618270B (en) 2022-12-16 2022-12-16 Multi-modal intention recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115618270B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117854091B (en) * 2024-01-15 2024-06-07 金锋馥(滁州)科技股份有限公司 Method for extracting information of multi-surface dense labels of packages based on image feature detection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200317A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-modal knowledge graph construction method
CN114169408A (en) * 2021-11-18 2022-03-11 杭州电子科技大学 Emotion classification method based on multi-mode attention mechanism
CN114186069A (en) * 2021-11-29 2022-03-15 江苏大学 Deep video understanding knowledge graph construction method based on multi-mode heteromorphic graph attention network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805089B (en) * 2018-06-14 2021-06-29 南京云思创智信息科技有限公司 Multi-modal-based emotion recognition method
CN115099234A (en) * 2022-07-15 2022-09-23 哈尔滨工业大学 Chinese multi-mode fine-grained emotion analysis method based on graph neural network

Also Published As

Publication number Publication date
CN115618270A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
Xu et al. Identification framework for cracks on a steel structure surface by a restricted Boltzmann machines algorithm based on consumer‐grade camera images
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN111753092B (en) Data processing method, model training method, device and electronic equipment
CN111932555A (en) Image processing method and device and computer readable storage medium
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
CN111651573B (en) Intelligent customer service dialogue reply generation method and device and electronic equipment
CN114724386B (en) Short-time traffic flow prediction method and system under intelligent traffic and electronic equipment
CN112364238B (en) Deep learning-based user interest point recommendation method and system
CN115618270B (en) Multi-modal intention recognition method and device, electronic equipment and storage medium
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN116071077B (en) Risk assessment and identification method and device for illegal account
CN117217368A (en) Training method, device, equipment, medium and program product of prediction model
CN112183542A (en) Text image-based recognition method, device, equipment and medium
CN113239702A (en) Intention recognition method and device and electronic equipment
CN112418939A (en) Method for mining space-time correlation of house price based on neural network to predict house price
CN116797975A (en) Video segmentation method, device, computer equipment and storage medium
CN115905959A (en) Method and device for analyzing relevance fault of power circuit breaker based on defect factor
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN114510609A (en) Method, device, equipment, medium and program product for generating structure data
CN112597997A (en) Region-of-interest determining method, image content identifying method and device
CN116308738B (en) Model training method, business wind control method and device
CN116958811A (en) Road ponding area detection method, system, equipment and medium
CN112905987A (en) Account identification method, account identification device, server and storage medium
CN113919338B (en) Method and device for processing text data
CN115114930A (en) Non-continuous entity identification method based on sequence to forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant