CN114757209B

CN114757209B - Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition

Info

Publication number: CN114757209B
Application number: CN202210659318.5A
Authority: CN
Inventors: 张梅山; 卢攀忠; 林智超; 孙越恒
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-06-13
Filing date: 2022-06-13
Publication date: 2022-11-11
Anticipated expiration: 2042-06-13
Also published as: CN114757209A

Abstract

The invention provides a man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition, and relates to the technical field of semantic analysis in natural language processing. The method comprises the following steps: constructing a set of complete instruction semantic role labeling normal forms according to the characteristics of the human-computer interaction instruction; according to the instruction semantic role labeling paradigm, expanding a single-mode form of a semantic role labeling model into a visual text multi-mode form by combining image acquisition; training and learning the multi-mode form of the visual text of the semantic role labeling model, and completing multi-mode semantic role recognition to perform semantic analysis on the man-machine interaction instruction. According to the invention, a paradigm marked by multi-modal semantic roles is innovatively tried to carry out semantic analysis on the human-computer interaction instruction, so that the instruction which cannot be understood by a machine originally is converted into semantic structural output which can be understood by the machine, and the intention of a user can be executed more conveniently, safely and quickly.

Description

Man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition

Technical Field

The invention relates to the technical field of semantic analysis in natural language processing, in particular to a man-machine interaction instruction analysis method and device based on multi-mode semantic role recognition.

Background

Semantic role labeling is a shallow semantic analysis technology, and is used for extracting predicate-argument structures contained in sentences. The predicate is a core word in a statement that can trigger a semantic event, and the argument is a role participating in the semantic event, including an actor and a victim. In general, the core of semantic role labeling techniques is to enable machines to understand what "who and who do what, at what time and place" in a sentence. Currently, there are many applications that attempt to label semantic roles as a key ring in their technical links, such as knowledge question answering, dialogue robots, machine translation, etc.

With the development of the technology, the man-machine interaction technology has gradually become an important way for a user to control an unmanned device (such as a robot or an unmanned aerial vehicle). The command is given through voice, so that the unmanned equipment can understand the intention of an operator, corresponding commands are executed, the two hands of the operator can be liberated, and the unmanned equipment can be controlled more conveniently, safely and quickly. However, the existing instruction parsing technology has limited development, and cannot achieve targeted parsing of machine-understandable semantic structures from instructions. The invention plans to utilize the advantages of the semantic role marking technology to realize high-precision analysis of the intention semantics of the control command, so that the unmanned equipment can better serve the user and execute the operation with higher abstract difficulty.

At present, the whole semantic role labeling process is mainly divided into two types, one type is a pipeline-based mode, predicates in sentences are identified by using a sequence labeling method, and then semantic roles (arguments) in the sentences are identified, so that the problem of error propagation is serious. And the other method is a method for constructing a semantic graph to simultaneously extract predicates and semantic roles corresponding to the predicates, wherein all possible predicates and argument candidate fragments of the sentence are enumerated as nodes in the graph, then the semantic role relationship between the predicate fragments and the semantic role fragments is used as edges in the graph, and finally the structured output is obtained through accurately decoding the constructed semantic graph. Most of the existing unmanned equipment has two perceptions of vision and language, but most of the existing semantic role labeling methods are oriented to single text setting, and important complementary relation between image information and text information is ignored.

At present, the labeling paradigm of semantic role labeling data sets is mostly oriented to the general field, and a large blank is left in a special field such as an unmanned equipment instruction control instruction.

Disclosure of Invention

The invention provides a human-computer interaction instruction analysis method and device based on multi-mode semantic role recognition, aiming at the problem that in the prior art, a larger blank exists under an instruction control instruction of an unmanned device.

In order to solve the technical problems, the invention provides the following technical scheme:

on one hand, the man-machine interaction instruction analysis method based on multi-mode semantic role recognition is provided, and is applied to electronic equipment, and comprises the following steps:

s1: constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;

s2: according to the instruction semantic role labeling paradigm, expanding a single-mode form of a semantic role labeling model into a visual text multi-mode form by combining image acquisition;

s3: training and learning the multi-mode form of the visual text of the semantic role labeling model, and completing multi-mode semantic role recognition to perform semantic analysis on the man-machine interaction instruction.

Optionally, in step S1, according to characteristics of the human-computer interaction instruction, a semantic role labeling paradigm of the instruction is constructed, which includes:

s11: adopting a labeling mode of Verbatlas semantic role labeling data as a labeling reference;

s12: and expanding and modifying the pre-stored Chinese semantic role labeling paradigm, so that the expanded and modified Chinese semantic role labeling paradigm is suitable for semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

Optionally, in step S2, according to the instruction semantic role labeling paradigm, in combination with image acquisition, expanding a single-modal form of a semantic role labeling model into a visual text bimodal form, including:

s21: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining sequence target regions by adopting Faster-RCNN, forming the sequence target regions into an image region sequence, and extracting the image sequence characteristics;

s22: and performing auxiliary identification on the semantic role at the semantic text end through the extracted image sequence characteristics, and expanding the monomodal form of the semantic role labeling model into a visual text bimodal form.

Optionally, in step S3, training and learning the multi-modal form of the visual text of the semantic role labeling model to complete multi-modal semantic role recognition and perform semantic parsing on the human-computer interaction instruction, including:

s31: constructing a pre-training model according to a visual text multi-mode form of a semantic role labeling model;

s32: input instructions of the pre-training model

(ii) a Coding the instruction I by using a BERT pre-training model to obtain a word vector sequence corresponding to each word in the instruction I

；

S33: enumerating all strides in instruction I

Wherein

Obtaining a feature vector of each span; wherein the span is a preset value;

s34: generating candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;

s35: and a loss function is introduced to perfect the training loss of the model, and the multi-mode semantic role recognition is completed to perform semantic analysis on the human-computer interaction instruction.

Optionally, in S34, predicate candidate vectors are obtained by using two different MLP layers respectively

And semantic role candidate vectors

Wherein:

；

。

optionally, in S35, the introducing a loss function completes training loss of the model, including:

constructing a semantic role labeling loss function, and judging the integrity of predicates and argument structures predicted by the model;

wherein, the system comprises an MLP layer and a Biaffine layer; the MLP layer score layer is used for judging a semantic frame of a current predicate node, and the Biaffine layer score layer is used for judging each predicate in a sentence

Semantic roles

And the relationship between the two

Of the triad

Grading is carried out; calculating the loss of each triplet by cross entropy, wherein the semantic role labeling loss function is shown as the following formula (1):

constructing a mode matching function for mode matching of image and text cross-mode feature pairs, wherein the label of the function is defined as that if the fragment corresponding to the semantic role contains an object corresponding to the target area, the output label is 1, otherwise, the label is 0; defining, by a paradigm of multitask learning, a loss function of a mode matching function as the following equation (2):

in one aspect, a human-computer interaction instruction analysis device based on multi-modal semantic character recognition is provided, and the device is applied to electronic equipment and comprises:

the instruction semantic role labeling normal form construction module is used for constructing an instruction semantic role labeling normal form according to the characteristics of the human-computer interaction instruction;

the multi-mode construction module is used for expanding the single-mode form of the semantic role marking model into a visual text multi-mode form by combining image acquisition according to the instruction semantic role marking paradigm;

and the model training module is used for training and learning the multi-mode form of the visual text of the semantic role labeling model, and completing multi-mode semantic role recognition to perform semantic analysis on the man-machine interaction instruction.

Optionally, 9, the apparatus according to claim 8, wherein the instruction semantic role labeling paradigm constructing module is configured to use a labeling mode of verbsatlas semantic role labeling data as a labeling benchmark;

and expanding and modifying the pre-stored Chinese semantic role labeling paradigm, so that the expanded and modified Chinese semantic role labeling paradigm is suitable for semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

Optionally, the multi-modal construction module is configured to acquire an image through an unmanned system according to the instruction semantic role labeling paradigm, obtain a sequence target region by using fast-RCNN, form the sequence target region into an image region sequence, and extract features of the image sequence;

and performing auxiliary identification on the semantic role at the semantic text end through the extracted image sequence characteristics, and expanding the monomodal form of the semantic role labeling model into a visual text bimodal form.

In one aspect, an electronic device is provided, and the electronic device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above human-machine interaction instruction parsing method based on multi-modal semantic character recognition.

In one aspect, a computer-readable storage medium is provided, where at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the above method for human-computer interaction instruction parsing based on multi-modal semantic role recognition.

The technical scheme of the embodiment of the invention at least has the following beneficial effects:

in the scheme, the invention innovatively provides a semantic dependency graph representation scheme of the integrated utterance chapters, the sentence semantic dependency graph is expanded to the whole chapters, and the characteristic that the utterance semantic information is incomplete under a conversation scene is fully considered. The invention provides an integrated semantic dependency graph joint analysis model fusing the interior of the utterances and the utterances for the dialog text for the first time, and the sentence semantics and the chapter semantics are connected together by adopting an end-to-end modeling mode. In addition, the teacher-student network based on knowledge distillation adopted by the invention can also meet the high requirements on efficiency and delay in the practical application of the conversation system.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;

FIG. 2 is a flowchart of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;

FIG. 3 is a multi-modal semantic role labeling model diagram of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;

FIG. 4 is a multi-modal semantic role structured output diagram of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;

FIG. 5 is an exemplary diagram of a multi-modal semantic role labeling implementation of a human-computer interaction method for human-computer interaction instruction parsing based on multi-modal semantic role recognition according to an embodiment of the present invention;

FIG. 6 is a block diagram of a human-computer interaction instruction analysis device based on multi-modal semantic role recognition according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a man-machine interaction instruction analysis method based on multi-mode semantic role recognition, which can be realized by electronic equipment, wherein the electronic equipment can be a terminal or a server. As shown in fig. 1, a flow chart of a human-computer interaction instruction parsing method based on multi-modal semantic character recognition, a processing flow of the method may include the following steps:

s101: constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;

s102: according to the instruction semantic role labeling paradigm, combining image acquisition and expanding a single mode form of a semantic role labeling model into a visual text multi-mode form;

s103: training and learning the multi-mode form of the visual text of the semantic role labeling model, and completing multi-mode semantic role recognition to perform semantic analysis on the man-machine interaction instruction.

Optionally, in step S101, according to characteristics of the human-computer interaction instruction, a semantic role labeling paradigm of the instruction is constructed, which includes:

s111: adopting a labeling mode of Verbatlas semantic role labeling data as a labeling reference;

s112: and expanding and modifying the pre-stored Chinese semantic role labeling paradigm, so that the expanded and modified Chinese semantic role labeling paradigm is suitable for semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

Optionally, in step S102, according to the instruction semantic role labeling paradigm, in combination with image acquisition, expanding a single-modal form of a semantic role labeling model into a visual text bimodal form, including:

s121: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining sequence target regions by adopting Faster-RCNN, forming the sequence target regions into an image region sequence, and extracting the image sequence characteristics;

s122: and performing auxiliary identification on the semantic role at the semantic text end through the extracted image sequence characteristics, and expanding the monomodal form of the semantic role labeling model into a visual text bimodal form.

Optionally, in step S103, training and learning the multi-modal form of the visual text of the semantic role labeling model, and completing multi-modal semantic role recognition to perform semantic parsing on the human-computer interaction instruction, including:

s131: constructing a pre-training model according to a visual text multi-mode form of a semantic role labeling model;

s132: input instructions of the pre-training model

；

S133: enumerating all strides in instruction I

In which

Obtaining a feature vector of each span; wherein the span is a preset value;

s134: generating candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;

s135: and a loss function is introduced to perfect the training loss of the model, and the multi-mode semantic role recognition is completed to perform semantic analysis on the human-computer interaction instruction.

Optionally, in S134, two different MLP layers are used to obtain predicate candidate vectors respectively

And semantic role candidate vectors

Wherein:

；

。

optionally, in S135, the introducing loss function completes the training loss of the model, including:

wherein, the system comprises an MLP layer and a Biaffine layer; the MLP scoring layer is used for judging a semantic frame of a current predicate node, and the Biaffine scoring layer is used for judging each predicate in a sentence

Semantic roles

And the relationship between the two

Of the triad

optionally, in S135, the introducing a loss function refines a training loss of the model, and includes:

in the embodiment of the invention, image information is innovatively introduced into the existing single-mode semantic role labeling model in an attempt, so that the semantic analysis is carried out on input sentences by using the image information to assist the semantic role labeling model. A paradigm labeled by multi-modal semantic roles is tried to carry out semantic analysis on human-computer interaction instructions, so that the instructions which cannot be understood by a machine originally are converted into semantic structural output which can be understood by the machine, and the intention of a user can be executed more conveniently, safely and quickly.

The embodiment of the invention provides a man-machine interaction instruction analysis method based on multi-mode semantic role recognition, which can be realized by electronic equipment, wherein the electronic equipment can be a terminal or a server. As shown in fig. 2, a flow chart of a human-computer interaction instruction parsing method based on multi-modal semantic character recognition, a processing flow of the method may include the following steps:

s201: and adopting a labeling mode of Verbatlas semantic role labeling data as a labeling reference.

In a feasible implementation mode, the invention firstly constructs a set of perfect command semantic role labeling paradigm aiming at human-computer interaction commands based on the characteristics of the human-computer interaction commands. The conventional semantic role labeling paradigm is mostly oriented to general fields (such as news), and the setting of semantic roles is better in generality. However, in the field of human-computer interaction, the semantic role of each type of instruction has specificity, which cannot be covered by the semantic role of the general field.

S202: and expanding and modifying the pre-stored Chinese semantic role labeling paradigm, so that the expanded and modified Chinese semantic role labeling paradigm is suitable for semantic analysis of human-computer interaction instructions, and the instruction semantic role labeling paradigm is obtained.

In a feasible implementation mode, the invention aims to expand and modify the existing Chinese semantic role marking paradigm so that the Chinese semantic role marking paradigm is suitable for semantic analysis of human-computer interaction instructions.

The primary plan adopts a Verbatlas semantic role labeling data labeling mode as a labeling benchmark of the invention, which is mainly based on the following two considerations: (1) The concept of a semantic frame is added to predicate identification by the annotation benchmark, so that the specific semantics of each predicate are more accurate, and the ambiguity problem of the predicates caused by different contexts is relieved. (2) The marking standard is designed for multi-language scenes, and the marking standard for Chinese instructions can be conveniently designed. Table 1 shows the semantic framework, i.e. semantic roles, initially set by the present invention. The method comprises the following steps of (1) covering simple displacement instructions such as advancing, moving and the like and operation instructions such as taking, opening and the like with high difficulty; the semantic role comprises a control device and a control means which participate in the semantic event, and the time and the place of instruction execution.

S203: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining sequence target regions by adopting Faster-RCNN, forming the sequence target regions into an image region sequence, and extracting the image sequence characteristics;

s204: and performing auxiliary identification on the semantic role at the semantic text end through the extracted image sequence characteristics, and expanding the monomodal form of the semantic role labeling model into a visual text bimodal form.

In a possible implementation manner, in terms of model architecture, the invention adopts a double-tower model as shown in fig. 3 to solve the fusion of image-text characteristics between multi-modal semantic role tasks. The whole architecture is mainly divided into three parts, namely image sequence feature extraction at an image end, semantic graph feature extraction at a language end and a training function for feature fusion.

In one possible embodiment, the image sequence features: for images observed by unmanned system

The present invention adopts the existing Faster-RCNN to obtain a sequence of target regions, and the target regions are combined into an image region sequence

And obtaining the characteristic sequence corresponding to the region sequence

. For regional features in a sequence of features

The invention utilizes an MLP layer to carry out further feature abstraction to obtain the final image feature

:

S205: constructing a pre-training model according to a visual text multi-mode form of a semantic role labeling model;

s206: input instructions of the pre-training model

；

S207: enumerating all spans in instruction I

Wherein

Obtaining a feature vector of each span; wherein the span is a preset value;

s208: and generating candidate vectors corresponding to the predicate nodes and the semantic role nodes in the semantic graph according to the feature vector of each span.

In one possible embodiment, the text sequence feature: the invention adopts the current end-to-end semantic role labeling classical semantic graph neural network construction idea to obtain the implied predicates and the corresponding arguments in the sentences. For input instruction

Coding the word vector sequence by using a BERT pre-training model to obtain the word vector sequence corresponding to each word in the instruction

. Then enumerate all of the instructionsSpan (L)

Wherein

Consisting of a plurality of words in a sentence. The maximum length and the minimum length of each span are preset. For each span

Its feature vector is expressed as:

wherein

，

Representing the hidden layer representation to which each span start word and end word corresponds,

the length characteristic corresponding to each span is represented,

the Attention for each word in the span is calculated using the Self-Attention mechanism and the resulting vector is weighted and averaged according to Attention.

Corresponding representation for each span

And candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph need to be generated, so the method adopts two different MLP layers to respectively obtain the predicate candidate vectors and the semantic role candidate vectors

And

：

s209: and a loss function is introduced to perfect the training loss of the model, and the multi-mode semantic role recognition is completed to carry out semantic analysis on the human-computer interaction instruction.

In a feasible implementation mode, two different MLP layers are adopted to respectively obtain predicate candidate vectors

And semantic role candidate vectors

Wherein:

；

。

wherein, MLP ^P Is a multi-layer feedforward neural network, MLP, for obtaining predicate representations ^R Is a multi-layer feed-forward neural network for obtaining semantic role representations.

In one possible embodiment, the introducing the loss function completes the training loss of the model, including:

Semantic roles

And the relationship between the two

Of the triad

Grading is carried out; calculating the loss of each triple by cross entropy, wherein the semantic role labeling loss function is shown as the following formula (1):

in one possible embodiment, the present invention defines two loss functions for training the model, in terms of training loss. The first semantic role labeling loss function for judging the integrity of predicates and argument structures predicted by a model comprises an MLP (Multi-level processing) scoring layer for judging a semantic frame of a current predicate node and a Biaffine scoring layer for judging each predicate, semantic role and triple of the relationship between the predicate and the semantic role in a sentence

Scoring is carried out, and the specific definition is as follows:

wherein the content of the first and second substances,

representing a multi-layer feed-forward neural network for obtaining semantic framework class scores;

is a Biaffine weight matrix and,

is a matrix of linear weights that is,

is the bias term. After the score corresponding to each relationship is obtained, the loss of each triple is calculated by adopting cross entropy:

wherein

And

representing the corresponding semantic framework and semantic role set.

In one possible implementation, the introduction of the loss function refines the training loss of the model, and includes:

in a possible embodiment, the second type is a mode matching function for image and text cross-mode feature pairs, and the label of the function is defined by the present invention as that if the segment corresponding to the semantic role contains the object corresponding to the target region, the output label is 1, otherwise the label is 10. The invention also utilizes a Biaffine layer to calculate the triples of the image region characteristics, the semantic roles and the relationship between the image region characteristics and the semantic roles

The score is made and the score is given,

similarly, the corresponding loss function is:

the final loss function is defined by adopting a multi-task learning paradigm:

wherein

For adjusting the weights exerted by the two loss functions in the model training.

In the embodiment of the invention, the target of the multi-mode semantic role labeling is to give an input instruction and obtain the semantic structured output of the instruction, so that a machine can understand and execute the semantic structured output. The structured output of multi-modal semantic character recognition is shown in fig. 4.

In the embodiment of the invention, fig. 5 shows an analysis example of the multi-modal semantic role labeling model on a human-computer interaction instruction. For the instruction issued by the user, the multi-mode semantic role analysis system identifies the predicate in the multi-mode semantic role analysis system, the corresponding semantic frame and the semantic role belonging to the semantic frame, and organizes the semantic role into a machine-identifiable structured output.

In the embodiment of the invention, aiming at the fact that the existing semantic role labeling model is mostly set based on a single mode, image information is innovatively introduced into the existing single-mode semantic role labeling model in an attempt, so that the image information is used for assisting the semantic role labeling model to carry out semantic analysis on input sentences. A paradigm labeled by multi-mode semantic roles is tried to carry out semantic analysis on human-computer interaction instructions, so that the instructions which cannot be understood by the machine originally are converted into semantic structural output which can be understood by the machine, and the intention of a user can be executed more conveniently, safely and quickly.

FIG. 6 is a block diagram illustrating a human-machine interaction instruction parsing apparatus based on multi-modal semantic character recognition according to an example embodiment. Referring to fig. 6, the apparatus 300 includes:

the paradigm-constructing module 310 is configured to construct a set of complete instruction semantic role labeling paradigm according to characteristics of human-computer interaction instructions;

the multi-mode construction module 320 is used for expanding the single-mode form of the semantic role labeling model into a visual text multi-mode form by combining image acquisition according to the instruction semantic role labeling paradigm;

and the model training module 330 is used for training and learning the multi-modal form of the visual text of the semantic role labeling model, and completing the multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.

Optionally, the paradigm building module 310 is configured to use a labeling mode of the verbsaatlas semantic role labeling data as a labeling benchmark;

and expanding and modifying the pre-stored Chinese semantic role labeling paradigm, so that the expanded and modified Chinese semantic role labeling paradigm is suitable for semantic analysis of human-computer interaction instructions, and a set of complete instruction semantic role labeling paradigm is obtained.

Optionally, the multi-modal construction module 320 is configured to acquire an image through an unmanned system according to the instruction semantic role labeling paradigm, obtain a sequence target region by using fast-RCNN, form the sequence target region into an image region sequence, and extract features of the image sequence;

Optionally, the model training module 330 is configured to construct a pre-training model according to the visual text multimodal form of the semantic role labeling model;

input instructions of the pre-training model

；

Enumerating all strides in instruction I

Wherein

Obtaining a feature vector of each span; wherein the span is a preset value;

generating candidate vectors corresponding to predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;

and a loss function is introduced to perfect the training loss of the model, and the multi-mode semantic role recognition is completed to perform semantic analysis on the human-computer interaction instruction.

Optionally, the model training module 330 is configured to obtain predicate candidate vectors using two different MLP layers respectively

And semantic role candidate vectors

Wherein:

；

。

optionally, the model training module 330 is configured to construct a semantic role labeling loss function, and determine integrity of predicates and argument structures predicted by the model;

Semantic roles

And the relationship between the two

Of the triad

optionally, the model training module 330 is configured to construct a mode matching function for mode matching of a cross-mode feature pair of an image and a text, where a label of the function is defined as that if an object corresponding to the target region is included in a segment corresponding to the semantic role, an output label is 1, and otherwise, the label is 0; defining, by a paradigm of multitask learning, a loss function of a mode matching function as the following equation (2):

Fig. 7 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention, where the electronic device 400 may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the following steps of a human-machine interaction instruction parsing method based on multimodal semantic role recognition:

s1: constructing a set of complete instruction semantic role labeling normal forms according to the characteristics of human-computer interaction instructions;

s2: according to the instruction semantic role labeling paradigm, combining image acquisition and expanding a single mode form of a semantic role labeling model into a visual text multi-mode form;

In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the human-machine interaction instruction parsing method based on multi-modal semantic character recognition is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A man-machine interaction instruction analysis method based on multi-modal semantic role recognition is characterized by comprising the following steps:

s3: training and learning the multi-mode form of the visual text of the semantic role labeling model to complete the semantic analysis of the man-machine interaction instruction by multi-mode semantic role recognition;

in the step S3, training and learning the multi-modal form of the visual text of the semantic role labeling model to complete the multi-modal semantic role recognition and perform semantic analysis on the human-computer interaction instruction, including:

s32: input instructions of the pre-training model

；

S33: enumerating all strides in instruction I

In which

Obtaining a feature vector of each span; wherein the span is a preset value;

s35: and a loss function is introduced to perfect the training loss of the model, and the multi-mode semantic role recognition is completed to carry out semantic analysis on the human-computer interaction instruction.

2. The method according to claim 1, wherein in step S1, according to the characteristics of the human-computer interaction instruction, constructing an instruction semantic role labeling paradigm comprises:

3. The method according to claim 2, wherein in step S2, expanding the monomodal form of the semantic character labeling model into a visual text bimodal form according to the instruction semantic character labeling paradigm in combination with image acquisition comprises:

s21: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, acquiring sequence target regions by adopting fast-RCNN, forming the sequence target regions into an image region sequence, and extracting the image sequence characteristics;

4. The method according to claim 1, wherein in S34, predicate candidate vectors are respectively obtained by using two different layer perceptron MLP layers

And semantic role candidate vectors

Wherein:

。

5. the method of claim 4, wherein in S35, the introducing a loss function completes training loss of the model, and includes:

Semantic roles

And the relationship between the two

Of (2)

。

6. the method of claim 4, wherein in S35, the introducing a loss function completes training loss of the model, and includes:

。

7. a human-computer interaction command parsing device based on multi-modal semantic character recognition, wherein the device is suitable for the method of any one of the preceding claims 1-6, and the device comprises:

8. The device of claim 7, wherein the instruction semantic role labeling paradigm building module is configured to take a labeling mode of Verbatlas semantic role labeling data as a labeling benchmark;

9. The apparatus according to claim 7, wherein the multi-modal construction module is configured to collect images through an unmanned system according to the instruction semantic role labeling paradigm, obtain sequence target regions by using fast-RCNN, form the sequence of the sequence target regions into an image region sequence, and extract features of the image sequence;