CN114757209B - Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition - Google Patents

Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition

Info

Publication number
CN114757209B
Authority
CN
China
Prior art keywords
semantic
semantic role
instruction
labeling
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210659318.5A
Other languages
Chinese (zh)
Other versions
CN114757209A (en)
Inventor
张梅山
卢攀忠
林智超
孙越恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210659318.5A priority Critical patent/CN114757209B/en
Publication of CN114757209A publication Critical patent/CN114757209A/en
Application granted granted Critical
Publication of CN114757209B publication Critical patent/CN114757209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/35: Discourse or dialogue representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition, and relates to the technical field of semantic analysis in natural language processing. The method comprises the following steps: constructing a complete instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions; according to the instruction semantic role labeling paradigm, expanding the single-modal form of a semantic role labeling model into a visual-text multi-modal form in combination with image acquisition; and training the visual-text multi-modal form of the semantic role labeling model to complete multi-modal semantic role recognition and perform semantic analysis on human-computer interaction instructions. The invention innovatively applies a multi-modal semantic role labeling paradigm to the semantic analysis of human-computer interaction instructions, so that instructions a machine originally could not understand are converted into machine-understandable semantic structured output, and the user's intention can be executed more conveniently, safely and quickly.

Description

Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition
Technical Field
The invention relates to the technical field of semantic analysis in natural language processing, and in particular to a man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition.
Background
Semantic role labeling is a shallow semantic analysis technique used to extract the predicate-argument structures contained in sentences. The predicate is the core word in a sentence that triggers a semantic event, and the arguments are the roles participating in the semantic event, such as the agent and the patient. In short, the core of semantic role labeling is to enable machines to understand "who did what to whom, when, and where" in a sentence. Many applications currently adopt semantic role labeling as a key component of their pipelines, such as knowledge-based question answering, dialogue robots, and machine translation.
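For illustration only (this example is not part of the patent; the sentence and the frame and role names are hypothetical), the structured output that semantic role labeling aims for can be sketched as a small record:

```python
# Hypothetical predicate-argument structure for the sentence
# "The operator moved the robot to the warehouse at noon."
srl_output = {
    "predicate": "moved",            # core word triggering the semantic event
    "frame": "MOTION",               # semantic frame of the predicate
    "arguments": {
        "Agent": "The operator",     # who performs the action
        "Theme": "the robot",        # what is acted upon
        "Goal": "to the warehouse",  # where the action is directed
        "Time": "at noon",           # when the event happens
    },
}
```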
With the development of technology, human-computer interaction has gradually become an important way for users to control unmanned devices (such as robots or unmanned aerial vehicles). Issuing commands by voice allows an unmanned device to understand the operator's intention and execute the corresponding command; this frees the operator's hands and makes control of unmanned devices more convenient, safe, and fast. However, existing instruction parsing technology is limited and cannot extract machine-understandable semantic structures from instructions in a targeted manner. The invention exploits the advantages of semantic role labeling to achieve high-precision analysis of the intended semantics of control commands, so that unmanned devices can better serve the user and execute operations of higher abstraction difficulty.
At present, the overall semantic role labeling process falls into two main categories. The first is pipeline-based: a sequence labeling method first identifies the predicates in a sentence and then identifies the semantic roles (arguments), which suffers from serious error propagation. The second constructs a semantic graph to extract predicates and their corresponding semantic roles simultaneously: all possible predicate and argument candidate spans of the sentence are enumerated as nodes in the graph, the semantic role relations between predicate spans and role spans form the edges, and the structured output is obtained by decoding the constructed semantic graph. Most existing unmanned devices have both visual and linguistic perception, yet most existing semantic role labeling methods target a text-only setting and ignore the important complementary relation between image information and text information.
At present, the labeling paradigms of semantic role labeling datasets are mostly oriented to the general domain, leaving a large gap in special domains such as unmanned-device control instructions.
Disclosure of Invention
Aiming at the problem in the prior art that a large gap exists for unmanned-device control instructions, the invention provides a human-computer interaction instruction analysis method and device based on multi-modal semantic role recognition.
In order to solve the technical problems, the invention provides the following technical scheme:
on one hand, a man-machine interaction instruction analysis method based on multi-modal semantic role recognition is provided, applied to electronic equipment and comprising the following steps:
s1: constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;
s2: according to the instruction semantic role labeling paradigm, expanding the single-modal form of a semantic role labeling model into a visual-text multi-modal form in combination with image acquisition;
s3: training the visual-text multi-modal form of the semantic role labeling model, and completing multi-modal semantic role recognition to perform semantic analysis on the man-machine interaction instruction.
Optionally, in step S1, constructing the instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction comprises:
s11: adopting the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
s12: expanding and modifying a pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
Optionally, in step S2, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form according to the instruction semantic role labeling paradigm in combination with image acquisition comprises:
s21: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining target regions with Faster R-CNN, forming the target regions into an image region sequence, and extracting the image sequence features;
s22: using the extracted image sequence features to assist identification of semantic roles on the text side, and expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
Optionally, in step S3, training the visual-text multi-modal form of the semantic role labeling model to complete multi-modal semantic role recognition and perform semantic parsing of the human-computer interaction instruction comprises:
s31: constructing a pre-training model according to the visual-text multi-modal form of the semantic role labeling model;
s32: taking an instruction I = {w_1, w_2, …, w_n} as input to the pre-training model, and encoding the instruction I with a BERT pre-training model to obtain the word vector sequence X = {x_1, x_2, …, x_n} corresponding to each word in the instruction I;
s33: enumerating all spans s = (w_i, …, w_j) in the instruction I, where i ≤ j, and obtaining a feature vector for each span, the maximum and minimum span lengths being preset values;
s34: generating candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;
s35: introducing loss functions to refine the training loss of the model, and completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
Optionally, in S34, two different MLP layers are used to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s of each span s respectively, where:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
and h_s is the feature vector of span s.
Optionally, in S35, introducing the loss functions to refine the training loss of the model includes:
constructing a semantic role labeling loss function and judging the completeness of the predicate and argument structures predicted by the model;
wherein the model comprises an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triple (p, r, l) of predicate p, semantic role r, and the relation l between them in the sentence; the loss of each triple is calculated with cross entropy, and the semantic role labeling loss function is shown as the following formula (1):
L_srl = - Σ_{p∈P} Σ_{r∈R} log P(l_{p,r} | p, r)    (1)
Optionally, in S35, introducing the loss functions to refine the training loss of the model further includes:
constructing a modality matching function for matching image-text cross-modal feature pairs, whose label is defined as 1 if the segment corresponding to the semantic role contains the object in the target region and 0 otherwise; under the multi-task learning paradigm, the loss function of the modality matching function is defined as the following formula (2):
L_match = - Σ_{v∈V} Σ_{r∈R} log P(y_{v,r} | v, r)    (2)
in one aspect, a human-computer interaction instruction analysis device based on multi-modal semantic role recognition is provided, applied to electronic equipment and comprising:
an instruction semantic role labeling paradigm construction module, used for constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;
a multi-modal construction module, used for expanding the single-modal form of the semantic role labeling model into a visual-text multi-modal form in combination with image acquisition according to the instruction semantic role labeling paradigm;
and a model training module, used for training the visual-text multi-modal form of the semantic role labeling model and completing multi-modal semantic role recognition to perform semantic analysis on the man-machine interaction instruction.
Optionally, the instruction semantic role labeling paradigm construction module is configured to adopt the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
and to expand and modify the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
Optionally, the multi-modal construction module is configured to acquire images through an unmanned system according to the instruction semantic role labeling paradigm, obtain target regions with Faster R-CNN, form the target regions into an image region sequence, and extract the image sequence features;
and to use the extracted image sequence features to assist identification of semantic roles on the text side, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
In one aspect, an electronic device is provided, comprising a processor and a memory, wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the above human-computer interaction instruction parsing method based on multi-modal semantic role recognition.
In one aspect, a computer-readable storage medium is provided, wherein the storage medium stores at least one instruction that is loaded and executed by a processor to implement the above human-computer interaction instruction parsing method based on multi-modal semantic role recognition.
The technical scheme of the embodiment of the invention at least has the following beneficial effects:
in this scheme, the invention innovatively applies a multi-modal semantic role labeling paradigm to the semantic analysis of human-computer interaction instructions. Image information is introduced into the existing single-modal semantic role labeling model so that it assists the semantic analysis of input sentences, and instructions a machine originally could not understand are converted into machine-understandable semantic structured output, so that the user's intention can be executed more conveniently, safely and quickly.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;
FIG. 3 is a multi-modal semantic role labeling model diagram of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;
FIG. 4 is a multi-modal semantic role structured output diagram of a human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;
FIG. 5 is an exemplary diagram of multi-modal semantic role labeling for the human-computer interaction instruction parsing method based on multi-modal semantic role recognition according to an embodiment of the present invention;
FIG. 6 is a block diagram of a human-computer interaction instruction analysis device based on multi-modal semantic role recognition according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a man-machine interaction instruction analysis method based on multi-modal semantic role recognition, which can be implemented by an electronic device, where the electronic device may be a terminal or a server. As shown in fig. 1, a flow chart of the human-computer interaction instruction parsing method based on multi-modal semantic role recognition, the processing flow of the method may include the following steps:
s101: constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;
s102: according to the instruction semantic role labeling paradigm, expanding the single-modal form of a semantic role labeling model into a visual-text multi-modal form in combination with image acquisition;
s103: training the visual-text multi-modal form of the semantic role labeling model, and completing multi-modal semantic role recognition to perform semantic analysis on the man-machine interaction instruction.
Optionally, in step S101, constructing the instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction comprises:
s111: adopting the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
s112: expanding and modifying the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
Optionally, in step S102, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form according to the instruction semantic role labeling paradigm in combination with image acquisition comprises:
s121: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining target regions with Faster R-CNN, forming the target regions into an image region sequence, and extracting the image sequence features;
s122: using the extracted image sequence features to assist identification of semantic roles on the text side, and expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
Optionally, in step S103, training the visual-text multi-modal form of the semantic role labeling model to complete multi-modal semantic role recognition and perform semantic parsing of the human-computer interaction instruction comprises:
s131: constructing a pre-training model according to the visual-text multi-modal form of the semantic role labeling model;
s132: taking an instruction I = {w_1, w_2, …, w_n} as input to the pre-training model, and encoding the instruction I with a BERT pre-training model to obtain the word vector sequence X = {x_1, x_2, …, x_n} corresponding to each word in the instruction I;
s133: enumerating all spans s = (w_i, …, w_j) in the instruction I, where i ≤ j, and obtaining a feature vector for each span, the maximum and minimum span lengths being preset values;
s134: generating candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;
s135: introducing loss functions to refine the training loss of the model, and completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
Optionally, in S134, two different MLP layers are used to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s of each span s respectively, where:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
Optionally, in S135, introducing the loss functions to refine the training loss of the model includes:
constructing a semantic role labeling loss function and judging the completeness of the predicate and argument structures predicted by the model;
wherein the model comprises an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triple (p, r, l) of predicate p, semantic role r, and the relation l between them in the sentence; the loss of each triple is calculated with cross entropy, and the semantic role labeling loss function is shown as the following formula (1):
L_srl = - Σ_{p∈P} Σ_{r∈R} log P(l_{p,r} | p, r)    (1)
Optionally, in S135, introducing the loss functions to refine the training loss of the model further includes:
constructing a modality matching function for matching image-text cross-modal feature pairs, whose label is defined as 1 if the segment corresponding to the semantic role contains the object in the target region and 0 otherwise; under the multi-task learning paradigm, the loss function of the modality matching function is defined as the following formula (2):
L_match = - Σ_{v∈V} Σ_{r∈R} log P(y_{v,r} | v, r)    (2)
in the embodiment of the invention, image information is innovatively introduced into the existing single-modal semantic role labeling model, so that image information assists the semantic role labeling model in the semantic analysis of input sentences. A multi-modal semantic role labeling paradigm is applied to the semantic analysis of human-computer interaction instructions, so that instructions a machine originally could not understand are converted into machine-understandable semantic structured output, and the user's intention can be executed more conveniently, safely and quickly.
The embodiment of the invention provides a man-machine interaction instruction analysis method based on multi-modal semantic role recognition, which can be implemented by an electronic device, where the electronic device may be a terminal or a server. As shown in fig. 2, a flow chart of the human-computer interaction instruction parsing method based on multi-modal semantic role recognition, the processing flow of the method may include the following steps:
S201: adopting the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark.
In a feasible implementation, the invention first constructs a complete instruction semantic role labeling paradigm for human-computer interaction instructions based on their characteristics. Conventional semantic role labeling paradigms are mostly oriented to general domains (such as news), where the set of semantic roles is designed for generality. In the field of human-computer interaction, however, the semantic roles of each type of instruction are specific and cannot be covered by general-domain semantic roles.
S202: expanding and modifying the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
In a feasible implementation, the invention expands and modifies the existing Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions.
The initial plan adopts the VerbAtlas semantic role labeling mode as the labeling benchmark of the invention, mainly based on the following two considerations: (1) this benchmark adds the concept of a semantic frame to predicate identification, which makes the specific semantics of each predicate more precise and alleviates the predicate ambiguity caused by different contexts; (2) this benchmark is designed for multilingual scenarios, so a labeling standard for Chinese instructions can be designed conveniently. Table 1 shows the semantic frames and semantic roles initially defined by the invention. The frames cover simple displacement instructions such as advancing and moving as well as more difficult operation instructions such as fetching and opening; the semantic roles include the controlled device and the control means participating in the semantic event, and the time and place of instruction execution.
[Table 1, rendered as images in the source, lists the initially defined semantic frames and their semantic roles; the table content is not recoverable here.]
S203: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining target regions with Faster R-CNN, forming the target regions into an image region sequence, and extracting the image sequence features;
S204: using the extracted image sequence features to assist identification of semantic roles on the text side, and expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
In a possible implementation, in terms of model architecture, the invention adopts the two-tower model shown in fig. 3 to fuse image and text features for the multi-modal semantic role task. The overall architecture is divided into three parts: image sequence feature extraction on the image side, semantic graph feature extraction on the language side, and a training function for feature fusion.
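As a purely illustrative sketch (the module names, interfaces, and composition are assumptions, not the patented implementation), the two-tower arrangement can be expressed as a module that runs the image tower and the text tower and exposes the two scored outputs consumed by the training objectives:

```python
import torch.nn as nn

class TwoTowerSRL(nn.Module):
    """Skeleton of the two-tower model of fig. 3: an image tower over
    Faster R-CNN region features, a text tower producing span candidates,
    and two heads trained jointly (SRL loss + modality matching loss)."""
    def __init__(self, image_tower, text_tower, srl_head, match_head):
        super().__init__()
        self.image_tower = image_tower   # region features -> v_1..v_m
        self.text_tower = text_tower     # instruction -> (g_p, g_r) candidates
        self.srl_head = srl_head         # scores (predicate, role, label) triples
        self.match_head = match_head     # scores (region, role) matches

    def forward(self, region_feats, instruction):
        v = self.image_tower(region_feats)
        g_p, g_r = self.text_tower(instruction)
        return self.srl_head(g_p, g_r), self.match_head(v, g_r)
```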
In one possible embodiment, image sequence features: for the image observed by the unmanned system, the invention adopts the existing Faster R-CNN to obtain a sequence of target regions, which are combined into the image region sequence R = {r_1, r_2, …, r_m}, and the feature sequence O = {o_1, o_2, …, o_m} corresponding to the region sequence is obtained. For each region feature o_i in the feature sequence, the invention uses an MLP layer for further feature abstraction to obtain the final image feature v_i:
v_i = MLP(o_i)
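A minimal sketch of this image-side step is given below. It assumes pooled region feature vectors are already available from a Faster R-CNN detector (extracting them from the detector internals is omitted), and the feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class RegionFeatureAbstraction(nn.Module):
    """Abstracts Faster R-CNN region features o_i into final image
    features v_i with an MLP, as described above."""
    def __init__(self, region_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(region_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (m, region_dim), one row per detected target region
        return self.mlp(region_feats)   # (m, hidden_dim): v_1 .. v_m

# Usage with dummy features standing in for Faster R-CNN output:
o = torch.randn(5, 2048)              # 5 detected target regions
v = RegionFeatureAbstraction()(o)     # image feature sequence v_1..v_5
```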
S205: constructing a pre-training model according to a visual text multi-mode form of a semantic role labeling model;
s206: input instructions of the pre-training model
Figure 413505DEST_PATH_IMAGE001
(ii) a Coding the instruction I by using a BERT pre-training model to obtain a word vector sequence corresponding to each word in the instruction I
Figure 10840DEST_PATH_IMAGE002
Figure 766306DEST_PATH_IMAGE003
S207: enumerating all spans in instruction I
Figure 499907DEST_PATH_IMAGE004
Wherein
Figure 518678DEST_PATH_IMAGE005
Obtaining a feature vector of each span; wherein the span is a preset value;
s208: and generating candidate vectors corresponding to the predicate nodes and the semantic role nodes in the semantic graph according to the feature vector of each span.
In one possible embodiment, text sequence features: the invention adopts the classical end-to-end semantic-graph construction idea for semantic role labeling to obtain the predicates implied in a sentence and their corresponding arguments. The input instruction I = {w_1, w_2, …, w_n} is encoded with a BERT pre-training model to obtain the word vector sequence X = {x_1, x_2, …, x_n} corresponding to each word in the instruction. Then all spans s = (w_i, …, w_j) in the instruction are enumerated, each consisting of several consecutive words in the sentence; the maximum and minimum length of each span are preset. For each span s, its feature vector is expressed as:
h_s = [x_start; x_end; φ(len_s); x̂_s]
where x_start and x_end denote the hidden-layer representations of the span's start word and end word, φ(len_s) denotes the length feature of the span, and x̂_s is the weighted average of the word vectors in the span, with the attention weight of each word computed by a Self-Attention mechanism.
From the representation h_s of each span, candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph must be generated, so the invention adopts two different MLP layers to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s respectively:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
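The text-side steps above can be sketched as follows. This is an illustrative reconstruction rather than the patented implementation: the encoder name, dimensions, and span-length limit are assumptions, and the attention-weighted span head is simplified to a mean over the span's word vectors:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed encoder
bert = AutoModel.from_pretrained("bert-base-chinese")

def encode_instruction(instruction: str) -> torch.Tensor:
    """Encode instruction I into the word vector sequence X with BERT."""
    inputs = tokenizer(instruction, return_tensors="pt")
    return bert(**inputs).last_hidden_state.squeeze(0)  # (n, 768)

def enumerate_spans(n: int, max_len: int = 4):
    """Enumerate all spans (i, j) in the instruction up to a preset length."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]

def span_representation(x, i, j, len_emb):
    """h_s = [x_start; x_end; phi(len); span head], with the attention-weighted
    head simplified here to a mean over the span's word vectors."""
    head = x[i:j + 1].mean(dim=0)
    return torch.cat([x[i], x[j], len_emb(torch.tensor(j - i)), head])

# Two different MLP layers yield predicate and semantic-role candidate vectors.
dim = 768 * 3 + 32
mlp_p = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 512))
mlp_r = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 512))

x = encode_instruction("机器人前进五米")          # word vector sequence X
len_emb = nn.Embedding(16, 32)                    # span length feature phi(len)
spans = enumerate_spans(x.size(0))
h = torch.stack([span_representation(x, i, j, len_emb) for i, j in spans])
g_p, g_r = mlp_p(h), mlp_r(h)                     # candidate vectors for graph nodes
```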
S209: introducing loss functions to refine the training loss of the model, and completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
In a feasible implementation, two different MLP layers are adopted to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s respectively, where:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
wherein MLP_P is a multi-layer feed-forward neural network for obtaining predicate representations, and MLP_R is a multi-layer feed-forward neural network for obtaining semantic role representations.
In one possible embodiment, introducing the loss functions to refine the training loss of the model includes:
constructing a semantic role labeling loss function and judging the completeness of the predicate and argument structures predicted by the model;
wherein the model comprises an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triple (p, r, l) of predicate p, semantic role r, and the relation l between them in the sentence; the loss of each triple is calculated with cross entropy, and the semantic role labeling loss function is shown as the following formula (1):
L_srl = - Σ_{p∈P} Σ_{r∈R} log P(l_{p,r} | p, r)    (1)
in one possible embodiment, the present invention defines two loss functions for training the model, in terms of training loss. The first semantic role labeling loss function for judging the integrity of predicates and argument structures predicted by a model comprises an MLP (Multi-level processing) scoring layer for judging a semantic frame of a current predicate node and a Biaffine scoring layer for judging each predicate, semantic role and triple of the relationship between the predicate and the semantic role in a sentence
Figure 520526DEST_PATH_IMAGE013
Scoring is carried out, and the specific definition is as follows:
Figure 980457DEST_PATH_IMAGE036
Figure 351396DEST_PATH_IMAGE037
wherein the content of the first and second substances,
Figure 238580DEST_PATH_IMAGE038
representing a multi-layer feed-forward neural network for obtaining semantic framework class scores;
Figure 327759DEST_PATH_IMAGE039
is a Biaffine weight matrix and,
Figure 325802DEST_PATH_IMAGE040
is a matrix of linear weights that is,
Figure 816826DEST_PATH_IMAGE041
is the bias term. After the score corresponding to each relationship is obtained, the loss of each triple is calculated by adopting cross entropy:
Figure 874912DEST_PATH_IMAGE042
wherein
Figure 185808DEST_PATH_IMAGE043
And
Figure 315438DEST_PATH_IMAGE044
representing the corresponding semantic framework and semantic role set.
In one possible implementation, introducing the loss functions to refine the training loss of the model further includes:
constructing a modality matching function for matching image-text cross-modal feature pairs. The label of the function is defined as 1 if the segment corresponding to the semantic role contains the object in the target region, and 0 otherwise. The invention again uses a Biaffine layer to score the triple (v, r, y) of image region feature, semantic role, and the relation between them, analogously to the scoring above; under the multi-task learning paradigm, the corresponding loss function is defined as the following formula (2):
L_match = - Σ_{v∈V} Σ_{r∈R} log P(y_{v,r} | v, r)    (2)
The final loss function is defined using the multi-task learning paradigm:
L = L_srl + λ · L_match
where λ adjusts the weights exerted by the two loss functions in model training.
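Under the multi-task paradigm just described, the two losses combine as sketched below (λ is the tunable weight; modelling the matching loss as binary cross entropy over region-role pairs is an assumption consistent with the 0/1 labels above):

```python
import torch
import torch.nn as nn

def total_loss(loss_srl: torch.Tensor,
               match_scores: torch.Tensor,  # (M, R) region-role match logits
               match_labels: torch.Tensor,  # (M, R) 1 if the role segment
               lam: float = 0.5):           # contains the region's object
    loss_match = nn.functional.binary_cross_entropy_with_logits(
        match_scores, match_labels.float())
    return loss_srl + lam * loss_match      # L = L_srl + lambda * L_match
```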
In the embodiment of the invention, the goal of multi-modal semantic role labeling is, given an input instruction, to obtain its semantic structured output so that a machine can understand and execute it. The structured output of multi-modal semantic role recognition is shown in fig. 4.
In the embodiment of the invention, fig. 5 shows an analysis example of the multi-modal semantic role labeling model on a human-computer interaction instruction. For an instruction issued by the user, the multi-modal semantic role analysis system identifies the predicate, its corresponding semantic frame, and the semantic roles belonging to that frame, and organizes them into machine-recognizable structured output.
In the embodiment of the invention, since existing semantic role labeling models are mostly built on a single modality, image information is innovatively introduced into the existing single-modal semantic role labeling model so that it assists the semantic analysis of input sentences. A multi-modal semantic role labeling paradigm is applied to the semantic analysis of human-computer interaction instructions, so that instructions a machine originally could not understand are converted into machine-understandable semantic structured output, and the user's intention can be executed more conveniently, safely and quickly.
FIG. 6 is a block diagram illustrating a human-computer interaction instruction parsing apparatus based on multi-modal semantic role recognition according to an example embodiment. Referring to fig. 6, the apparatus 300 includes:
a paradigm construction module 310, used for constructing a complete instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;
a multi-modal construction module 320, used for expanding the single-modal form of the semantic role labeling model into a visual-text multi-modal form in combination with image acquisition according to the instruction semantic role labeling paradigm;
and a model training module 330, used for training the visual-text multi-modal form of the semantic role labeling model and completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
Optionally, the paradigm construction module 310 is configured to adopt the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
and to expand and modify the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining a complete instruction semantic role labeling paradigm.
Optionally, the multi-modal construction module 320 is configured to acquire images through an unmanned system according to the instruction semantic role labeling paradigm, obtain target regions with Faster R-CNN, form the target regions into an image region sequence, and extract the image sequence features;
and to use the extracted image sequence features to assist identification of semantic roles on the text side, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
Optionally, the model training module 330 is configured to construct a pre-training model according to the visual-text multi-modal form of the semantic role labeling model;
take an instruction I = {w_1, w_2, …, w_n} as input to the pre-training model, and encode the instruction I with a BERT pre-training model to obtain the word vector sequence X = {x_1, x_2, …, x_n} corresponding to each word in the instruction I;
enumerate all spans s = (w_i, …, w_j) in the instruction I, where i ≤ j, and obtain the feature vector of each span, the maximum and minimum span lengths being preset values;
generate candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;
and introduce loss functions to refine the training loss of the model, completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
Optionally, the model training module 330 is configured to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s using two different MLP layers respectively, where:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
optionally, the model training module 330 is configured to construct a semantic role labeling loss function, and determine integrity of predicates and argument structures predicted by the model;
wherein, the system comprises an MLP layer and a Biaffine layer; the MLP layer score layer is used for judging a semantic frame of a current predicate node, and the Biaffine layer score layer is used for judging each predicate in a sentence
Figure 524080DEST_PATH_IMAGE010
Semantic roles
Figure 160073DEST_PATH_IMAGE011
And the relationship between the two
Figure 462878DEST_PATH_IMAGE012
Of the triad
Figure 820041DEST_PATH_IMAGE013
Grading is carried out; calculating the loss of each triplet by cross entropy, wherein the semantic role labeling loss function is shown as the following formula (1):
Figure 233705DEST_PATH_IMAGE014
optionally, the model training module 330 is configured to construct a mode matching function for mode matching of a cross-mode feature pair of an image and a text, where a label of the function is defined as that if an object corresponding to the target region is included in a segment corresponding to the semantic role, an output label is 1, and otherwise, the label is 0; defining, by a paradigm of multitask learning, a loss function of a mode matching function as the following equation (2):
Figure 789451DEST_PATH_IMAGE015
in the embodiment of the invention, since existing semantic role labeling models are mostly built on a single modality, image information is innovatively introduced into the existing single-modal semantic role labeling model so that it assists the semantic analysis of input sentences. A multi-modal semantic role labeling paradigm is applied to the semantic analysis of human-computer interaction instructions, so that instructions a machine originally could not understand are converted into machine-understandable semantic structured output, and the user's intention can be executed more conveniently, safely and quickly.
Fig. 7 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the following steps of the human-computer interaction instruction parsing method based on multi-modal semantic role recognition:
s1: constructing a complete instruction semantic role labeling paradigm according to the characteristics of human-computer interaction instructions;
s2: according to the instruction semantic role labeling paradigm, expanding the single-modal form of a semantic role labeling model into a visual-text multi-modal form in combination with image acquisition;
s3: training the visual-text multi-modal form of the semantic role labeling model, and completing multi-modal semantic role recognition to perform semantic analysis on the man-machine interaction instruction.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the above human-computer interaction instruction parsing method based on multi-modal semantic role recognition. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (9)

1. A man-machine interaction instruction analysis method based on multi-modal semantic role recognition, characterized by comprising the following steps:
s1: constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;
s2: according to the instruction semantic role labeling paradigm, expanding the single-modal form of a semantic role labeling model into a visual-text multi-modal form in combination with image acquisition;
s3: training the visual-text multi-modal form of the semantic role labeling model to complete multi-modal semantic role recognition and semantic analysis of the man-machine interaction instruction;
wherein in step S3, training the visual-text multi-modal form of the semantic role labeling model to complete multi-modal semantic role recognition and perform semantic analysis on the human-computer interaction instruction comprises:
s31: constructing a pre-training model according to the visual-text multi-modal form of the semantic role labeling model;
s32: taking an instruction I = {w_1, w_2, …, w_n} as input to the pre-training model, and encoding the instruction I with a BERT pre-training model to obtain the word vector sequence X = {x_1, x_2, …, x_n} corresponding to each word in the instruction I;
s33: enumerating all spans s = (w_i, …, w_j) in the instruction I, where i ≤ j, and obtaining a feature vector for each span, the maximum and minimum span lengths being preset values;
s34: generating candidate vectors corresponding to the predicate nodes and semantic role nodes in the semantic graph according to the feature vector of each span;
s35: introducing loss functions to refine the training loss of the model, and completing multi-modal semantic role recognition to perform semantic analysis on the human-computer interaction instruction.
2. The method according to claim 1, wherein in step S1, constructing the instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction comprises:
s11: adopting the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
s12: expanding and modifying the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
3. The method according to claim 2, wherein in step S2, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form according to the instruction semantic role labeling paradigm in combination with image acquisition comprises:
s21: acquiring images through an unmanned system according to the instruction semantic role labeling paradigm, obtaining target regions with Faster R-CNN, forming the target regions into an image region sequence, and extracting the image sequence features;
s22: using the extracted image sequence features to assist identification of semantic roles on the text side, and expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
4. The method according to claim 1, wherein in S34, two different multi-layer perceptron (MLP) layers are used to obtain the predicate candidate vector g^p_s and the semantic role candidate vector g^r_s respectively, where:
g^p_s = MLP_P(h_s), g^r_s = MLP_R(h_s)
and h_s is the feature vector of span s.
5. The method of claim 4, wherein in S35, introducing the loss functions to refine the training loss of the model comprises:
constructing a semantic role labeling loss function and judging the completeness of the predicate and argument structures predicted by the model;
wherein the model comprises an MLP scoring layer and a Biaffine scoring layer; the MLP scoring layer judges the semantic frame of the current predicate node, and the Biaffine scoring layer scores each triple (p, r, l) of predicate p, semantic role r, and the relation l between them in the sentence; the loss of each triple is calculated with cross entropy, and the semantic role labeling loss function is shown as the following formula (1):
L_srl = - Σ_{p∈P} Σ_{r∈R} log P(l_{p,r} | p, r)    (1)
6. The method of claim 4, wherein in S35, introducing the loss functions to refine the training loss of the model comprises:
constructing a modality matching function for matching image-text cross-modal feature pairs, whose label is defined as 1 if the segment corresponding to the semantic role contains the object in the target region and 0 otherwise; under the multi-task learning paradigm, the loss function of the modality matching function is defined as the following formula (2):
L_match = - Σ_{v∈V} Σ_{r∈R} log P(y_{v,r} | v, r)    (2)
7. A human-computer interaction instruction parsing device based on multi-modal semantic role recognition, wherein the device is adapted to the method of any one of claims 1-6, and the device comprises:
an instruction semantic role labeling paradigm construction module, used for constructing an instruction semantic role labeling paradigm according to the characteristics of the human-computer interaction instruction;
a multi-modal construction module, used for expanding the single-modal form of the semantic role labeling model into a visual-text multi-modal form in combination with image acquisition according to the instruction semantic role labeling paradigm;
and a model training module, used for training the visual-text multi-modal form of the semantic role labeling model and completing multi-modal semantic role recognition to perform semantic analysis on the man-machine interaction instruction.
8. The device of claim 7, wherein the instruction semantic role labeling paradigm construction module is configured to adopt the labeling mode of the VerbAtlas semantic role labeling data as the labeling benchmark;
and to expand and modify the pre-stored Chinese semantic role labeling paradigm so that it is suitable for semantic analysis of human-computer interaction instructions, thereby obtaining the instruction semantic role labeling paradigm.
9. The apparatus according to claim 7, wherein the multi-modal construction module is configured to acquire images through an unmanned system according to the instruction semantic role labeling paradigm, obtain target regions with Faster R-CNN, form the target regions into an image region sequence, and extract the image sequence features;
and to use the extracted image sequence features to assist identification of semantic roles on the text side, expanding the single-modal form of the semantic role labeling model into a visual-text bimodal form.
CN202210659318.5A 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition Active CN114757209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659318.5A CN114757209B (en) 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210659318.5A CN114757209B (en) 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition

Publications (2)

Publication Number Publication Date
CN114757209A CN114757209A (en) 2022-07-15
CN114757209B (en) 2022-11-11

Family

ID=82336249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210659318.5A Active CN114757209B (en) 2022-06-13 2022-06-13 Man-machine interaction instruction analysis method and device based on multi-modal semantic role recognition

Country Status (1)

Country Link
CN (1) CN114757209B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571046A (en) * 2021-06-28 2021-10-29 深圳瑞鑫泰通信有限公司 Artificial intelligent speech recognition analysis method, system, device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9189742B2 (en) * 2013-11-20 2015-11-17 Justin London Adaptive virtual intelligent agent
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN111191620B (en) * 2020-01-03 2022-03-22 西安电子科技大学 Method for constructing human-object interaction detection data set
CN111274372A (en) * 2020-01-15 2020-06-12 上海浦东发展银行股份有限公司 Method, electronic device, and computer-readable storage medium for human-computer interaction
CN112201228A (en) * 2020-09-28 2021-01-08 苏州贝果智能科技有限公司 Multimode semantic recognition service access method based on artificial intelligence
CN113590776B (en) * 2021-06-23 2023-12-12 北京百度网讯科技有限公司 Knowledge graph-based text processing method and device, electronic equipment and medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571046A (en) * 2021-06-28 2021-10-29 深圳瑞鑫泰通信有限公司 Artificial intelligent speech recognition analysis method, system, device and storage medium

Also Published As

Publication number Publication date
CN114757209A (en) 2022-07-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant