CN111950298B - BERT model optimization method and system - Google Patents

BERT model optimization method and system

Info

Publication number
CN111950298B
CN111950298B
Authority
CN
China
Prior art keywords
sentence
context
semantic
semantic features
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010895250.1A
Other languages
Chinese (zh)
Other versions
CN111950298A (en)
Inventor
俞凯
金乐盛
陈露
赵晏彬
陈志�
朱苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010895250.1A priority Critical patent/CN111950298B/en
Publication of CN111950298A publication Critical patent/CN111950298A/en
Application granted granted Critical
Publication of CN111950298B publication Critical patent/CN111950298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

An embodiment of the invention provides a BERT model optimization method. The method comprises the following steps: determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split; determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair; determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features; and predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features. An embodiment of the invention also provides a BERT model optimization system. In the embodiments of the invention, auxiliary high-level semantic information and grammar information are embedded into the context in the language model for natural language reasoning, so that the trained language model is more sensitive to semantic information and the performance of the natural language reasoning task is greatly improved.

Description

BERT model optimization method and system
Technical Field
The invention relates to the field of natural language reasoning, in particular to a BERT model optimization method and system.
Background
NLI (Natural Language Inference), also known as recognizing textual entailment, requires determining whether a hypothesis sentence can be inferred from a given premise. As a key sentence-pair semantic matching task, NLI is closely related to several other NLP (Natural Language Processing) tasks, including question answering, semantic recognition and information retrieval.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
language models for natural language reasoning are trained on large plain-text corpora. With this training approach, pre-trained language models tend to learn simple contextual features but lack grammatical and semantic understanding. Experiments have also shown that deep learning models focus on simple context words and rarely understand the true meaning and high-level semantics of natural language text; this lack of semantic information degrades the effect of language reasoning.
Disclosure of Invention
The embodiments of the invention are intended at least to solve the problem in the prior art that language models trained for language reasoning lack semantic information.
In a first aspect, an embodiment of the present invention provides a method for optimizing a BERT model, including:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In a second aspect, an embodiment of the present invention provides an optimization system for a BERT model, including:
a context embedding program module, configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
a semantic feature extraction program module, configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the optimization method of the BERT model of any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the optimization method of the BERT model of any embodiment of the present invention.
The embodiment of the invention has the beneficial effects that: in the language model for natural language reasoning, auxiliary high-level semantic information and grammar information are embedded into the context, so that the trained language model is more sensitive to semantic information and the performance of the natural language reasoning task is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for optimizing BERT model according to an embodiment of the present invention;
FIG. 2 is a diagram of dependency tree+AMR of a BERT model optimization method according to an embodiment of the present invention;
FIG. 3 is a diagram of the AMR graphs of "two racing riders riding a motorcycle" (left) and "two people racing" (right) of a BERT model optimization method according to one embodiment of the present invention;
FIG. 4 is a diagram of neighborhood information of different node types and orders in a Levi graph of a BERT model optimization method according to an embodiment of the present invention;
FIG. 5 is a diagram of a U-BERT model structure of a BERT model optimization method according to an embodiment of the present invention;
FIG. 6 is an alignment structure diagram of a sub word sequence, a dependency tree and an AMR graph of a BERT model optimization method according to an embodiment of the present invention;
FIG. 7 is a data diagram of SNLI dataset and MNLI dataset of a BERT model optimization method according to an embodiment of the invention;
FIG. 8 is a graph of ablation analysis data of the dependency tree and the AMR graph on the SNLI dataset of a BERT model optimization method provided by an embodiment of the present invention;
FIG. 9 is a data diagram of different neighborhood information order models of a BERT model optimization method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an optimization system of BERT model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a BERT model optimization method according to an embodiment of the present invention, including the following steps:
S11: determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
S12: determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
S13: determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
S14: predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In this embodiment, the method proposes a novel model, U-BERT, which incorporates syntactic and semantic structure into the BERT model (Bidirectional Encoder Representations from Transformers). In actual use, the inference task can be carried out with only the AMR (Abstract Meaning Representation) graph or only the dependency tree, or with both the dependency tree and the AMR graph. For example, the dependency tree + AMR structure is shown in FIG. 2.
For step S11, natural language reasoning mainly determines the semantic relationship between two sentences (a premise and a hypothesis). After the sentence pair is prepared, the pre-trained BERT model is used for encoding.
First, sentences $S_a$ and $S_b$ are tokenized into subword sequences $\{s^a_1, s^a_2, \ldots, s^a_{l_a}\}$ and $\{s^b_1, s^b_2, \ldots, s^b_{l_b}\}$, where $l_a$ and $l_b$ are the numbers of subwords in the two sentences. We then join the subword sequences to form a new sequence, which is provided to the BERT model to obtain a contextual embedding of each subword. As a result, the contextual representations of the two sentences can be written as $S_a = \{\mathbf{s}^a_1, \ldots, \mathbf{s}^a_{l_a}\}$ and $S_b = \{\mathbf{s}^b_1, \ldots, \mathbf{s}^b_{l_b}\}$.
For step S12, a semantic representation graph of the sentence pair is determined by a semantic representation language parser. As one embodiment, the semantic representation language comprises AMR (Abstract Meaning Representation).
An AMR graph is a graph of key concepts that abstracts away from the syntactic representation on the basis of the abstract meaning representation language. Structured knowledge can help the model discard unimportant information and focus on the key points. In FIG. 3, two AMR graphs are parsed from a sentence pair in the SNLI dataset. We can find that the major parts of the two sentences are indeed very similar; however, the difference between the "car" node and the "motorcycle" node indicates that the relationship between the two sentences is a "contradiction". Nevertheless, little work has been done to explore the effectiveness of AMR graphs for NLI tasks, particularly when used in conjunction with dependency trees.
The method therefore investigates AMR graphs further. AMR representation format: the operation for extracting semantic information from the AMR graph is similar to that used for the dependency tree. Denote $\mathcal{G}^a_{amr}$ and $\mathcal{G}^b_{amr}$ as the corresponding AMR graphs. For graph $\mathcal{G}^a_{amr}$, the neighborhood information is $R^a_{amr}(K)$. The initial embeddings of the concept nodes are aligned with the dependency representation $D_a$ (described in detail below):

$$M^{(0)}_a = \mathrm{Pooling}(D_a),$$

where $M^{(0)}_a$ is the initial concept-node embedding of the AMR graph $\mathcal{G}^a_{amr}$, and $m_a$ is the number of AMR concept nodes aligned from the dependency tree. As with the dependency representation, the same graph encoding and interaction operation $\mathrm{GraphEnc}(\cdot)$ (defined below) can be applied directly:

$$M_a = \mathrm{GraphEnc}(M^{(0)}_a, R^a_{amr}(K)),$$
$$M_b = \mathrm{GraphEnc}(M^{(0)}_b, R^b_{amr}(K)),$$

where $M_a$ and $M_b$ are the final outputs of the contracted path.
For step S13, the semantic features of the sentence pair are determined as auxiliary information for the first context embedding.
Our model expands the AMR representations $M_a$ and $M_b$ back to the subword level. To achieve this, we perform the inverse of the pooling layer and restore the representation to its original structure: we record the positions of the nodes that are merged in the corresponding pooling layer and use this information to put the nodes back in their original locations. For example, if a word $x_i$ is composed of a series of subwords $S_i = \{s_1, s_2, \ldots, s_K\}$ and $e(x_i)$ from the previous layer is the current representation of $x_i$, the embeddings of the subword set $S_i$ are restored as:

$$S_i = \mathrm{UnPooling}(e(x_i)),$$
$$s_1 = \cdots = s_K = e(x_i), \quad s_k \in S_i.$$

The same method is used in the expansion from the AMR graph: for the AMR representations in the expanded path, $M_a$ and $M_b$ are un-pooled in the same way and expanded back toward the subword level.
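A minimal sketch of the un-pooling operation just described is given below; the function and variable names are illustrative assumptions rather than the patent's reference implementation:

```python
import torch

def unpooling(word_embeddings, subword_groups, num_subwords):
    """Copy each word-level embedding e(x_i) back to every subword position
    that the corresponding pooling layer had merged into word x_i."""
    dim = word_embeddings.size(-1)
    subword_embeddings = torch.zeros(num_subwords, dim)
    for word_idx, positions in enumerate(subword_groups):
        # s_1 = ... = s_K = e(x_i) for all subwords of word x_i
        subword_embeddings[positions] = word_embeddings[word_idx]
    return subword_embeddings

# Example: word 0 was pooled from subwords [0, 1], word 1 from subword [2].
words = torch.randn(2, 300)
restored = unpooling(words, [[0, 1], [2]], num_subwords=3)
```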
for step S14, output S with BERT a (first context embedding) the final representation
Figure BDA0002658242200000058
The (second context embedding) includes semantic information from the AMR map. In addition, comparison information having a semantic structure of another sentence is included. Finally, it is possible to pass->
Figure BDA0002658242200000059
All the embeddings in (1) are stored centrally to calculate the sentence level representation h a I.e.
Figure BDA00026582422000000510
Wherein alpha is i Is an attention weight vector calculated from a multidimensional attention. We can also obtain h in the same way b
As one embodiment, the second context is embedded and processed through an attention mechanism to generate a sentence-level representation of the sentence pair;
and carrying out language reasoning on the sentence level representation of the sentence pair based on a relation classifier, and predicting the inclusion relation of two sentences in the sentence pair.
Relation classifier, using h a And h b Two sentence-level representations, we can predict the inclusion relationship of two sentences,
p=FFN([h a ,h b ,h a ⊙h b ,|h a -h b |])
if in the training phase, the training goal is to minimize cross entropy loss.
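As an illustration of the relation classifier described above, the following sketch builds the feature vector [h_a, h_b, h_a ⊙ h_b, |h_a − h_b|] and trains with a cross-entropy objective; the layer sizes and the three-way label set are assumptions:

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """p = FFN([h_a, h_b, h_a * h_b, |h_a - h_b|]) over the NLI labels."""
    def __init__(self, dim=300, num_labels=3):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(4 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_labels),
        )

    def forward(self, h_a, h_b):
        features = torch.cat([h_a, h_b, h_a * h_b, (h_a - h_b).abs()], dim=-1)
        return self.ffn(features)

classifier = RelationClassifier()
logits = classifier(torch.randn(8, 300), torch.randn(8, 300))
# Training objective: minimize the cross-entropy loss against gold labels.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
```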
According to this embodiment, auxiliary high-level semantic information is embedded into the context in the language model for natural language reasoning, so that the trained language model is more sensitive to semantic information and the effect of language reasoning is improved.
As an implementation manner, in this embodiment, the method further includes:
establishing dependency trees of the sentence pair through a natural language analysis tool, and extracting grammar information of the sentence pair;
determining the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determining a third context embedding with the semantic features and the grammar information;
predicting the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
The extracting of the grammar information of the sentence pair includes:
performing bidirectional embedding updates on the word nodes and edge nodes in the dependency tree, and determining the grammar information based on the dependency relationships between the word nodes and the edge nodes;
the determining of the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding includes:
expanding the semantic representation onto the dependency tree through a pooling layer, combining the semantic features with the grammar information, and determining them as auxiliary information for the first context embedding.
The natural language analysis tool includes at least: coreNLP.
In this embodiment, U-BERT extracts grammar information from the dependency trees of the two sentences and fuses it into the sentence representations. Denote $\mathcal{G}^a_{dep}$ and $\mathcal{G}^b_{dep}$ as the dependency tree graphs. For graph $\mathcal{G}^a_{dep}$, we use $R^a_{dep}(K)$ to denote the 1st- to Kth-order neighborhood information of its word nodes and edge nodes. The embeddings of the edge nodes are randomly initialized, and the initial embeddings of the word nodes are aligned from the subword representation $S_a$:

$$D^{(0)}_a = \mathrm{Pooling}(S_a),$$

where $D^{(0)}_a$ is the initial word-node embedding of the dependency tree $\mathcal{G}^a_{dep}$, and $t_a$ is the number of words aligned from the subword sequence. The representation of the word nodes in the dependency tree is updated by the Levi-GAT:

$$\overrightarrow{D}_a = \mathrm{LeviGAT}(D^{(0)}_a, R^a_{dep}(K)),$$

where $\overrightarrow{D}_a$ is the representation updated based on the neighborhood information $R^a_{dep}(K)$. One thing to note is that $\mathcal{G}^a_{dep}$ is a directed graph, which means that information propagates through the graph along a pre-specified direction. However, one-way propagation may lose structural information from the opposite direction. To solve this problem, we also aggregate structural messages along the reverse edge direction. Denoting $\overleftarrow{R}^a_{dep}(K)$ as the corresponding neighborhood information in the opposite direction, we have:

$$\overleftarrow{D}_a = \mathrm{LeviGAT}(D^{(0)}_a, \overleftarrow{R}^a_{dep}(K)).$$

Since we have updated the node embeddings in both directions, the updated representation of the dependency tree graph is a combination of the bidirectional embeddings, namely:

$$\widetilde{D}_a = W_d[\overrightarrow{D}_a, \overleftarrow{D}_a],$$

where $W_d$ is a trainable projection matrix. After extracting the dependency information with the graph encoding layer, we obtain the representation $\widetilde{D}_a$ of sentence $S_a$ and the representation $\widetilde{D}_b$ of sentence $S_b$. We use an attention mechanism to make them interact with each other. The attention weight matrix is computed as:

$$A = (\widetilde{D}_a W_1)(\widetilde{D}_b W_2)^{\top},$$

where $W_1$ and $W_2$ are learnable projection matrices. For each word node in $\widetilde{D}_a$, the context representation gathered from $\widetilde{D}_b$ is:

$$C_a = \mathrm{softmax}(A)\,\widetilde{D}_b.$$

The final dependency representation from the graph encoding layer is a combination of the original embedding and the context representation from the other sentence, namely:

$$D_a = \mathrm{FFN}([\widetilde{D}_a, C_a]),$$

where FFN is a feed-forward network that includes two linear transformations. Similarly, we can obtain the final dependency representation $D_b$ of sentence $S_b$. For ease of description, we use $\mathrm{GraphEnc}(\cdot)$ to represent the operations in the above equations.
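A simplified sketch of the cross-sentence interaction described above follows. The exact attention formula in the original publication is an image and is not recoverable, so the bilinear score and the FFN sizes below are assumptions:

```python
import torch
import torch.nn as nn

class CrossSentenceFusion(nn.Module):
    """Fuse a sentence's dependency representation with context gathered from
    the other sentence (assumed bilinear attention followed by an FFN)."""
    def __init__(self, dim=300):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, d_a, d_b):
        # Attention weight matrix between the word nodes of the two sentences.
        scores = self.w1(d_a) @ self.w2(d_b).transpose(-1, -2)   # [t_a, t_b]
        context = scores.softmax(dim=-1) @ d_b                   # context gathered from S_b
        # Combine the original embedding with the cross-sentence context.
        return self.ffn(torch.cat([d_a, context], dim=-1))

d_a_final = CrossSentenceFusion()(torch.randn(7, 300), torch.randn(5, 300))
```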
For the dependency-tree level, we use the un-pooling layer to expand the concept embeddings of the AMR graph back to the dependency tree, namely:

$$\hat{M}_a = \mathrm{UnPooling}(M_a).$$

The output of the contracted path is used as a residual connection during the graph operations in the expanded path. For the dependency representation we have:

$$\hat{D}_a = \mathrm{GraphEnc}(\hat{M}_a + D_a).$$

Likewise, the final embedding of the subwords is expanded from the dependency representation:

$$\hat{S}_a = \mathrm{UnPooling}(\hat{D}_a) + S_a,$$
$$\hat{S}_b = \mathrm{UnPooling}(\hat{D}_b) + S_b.$$

Compared with the BERT output $S_a$, the final representation $\hat{S}_a$ includes semantic information from the dependency tree and the AMR graph, as well as comparison information from the semantic structure of the other sentence. Finally, all embeddings in $\hat{S}_a$ can be aggregated to compute the sentence-level representation $h_a$, i.e.

$$h_a = \sum_i \alpha_i \odot \hat{\mathbf{s}}^a_i,$$

where, likewise, $\alpha_i$ is an attention weight vector computed by multi-dimensional attention. We can obtain $h_b$ in the same way.
Relation classifier: using the two sentence-level representations $h_a$ and $h_b$, we can predict the entailment relationship of the two sentences:

$$p = \mathrm{FFN}([h_a, h_b, h_a \odot h_b, |h_a - h_b|]).$$
according to the embodiment, in the language model of natural language reasoning, auxiliary high-level semantic information and grammar information are embedded for the context, so that the trained language model is more sensitive to the semantic information, and the performance of a natural language reasoning task is greatly improved.
As an embodiment, the BERT model includes: a contracted path and an expanded path;
wherein the extracting of the semantic features of the sentence pair and the extracting of the grammar information of the sentence pair are performed in the contracted path;
and the third context embedding with the semantic features and the grammar information is determined in the expanded path.
In this embodiment, the method proposes a new U-BERT model that can integrate structured knowledge into BERT. It has two paths: a contracted path and an expanded path. In the contracted path, it takes the context representation from BERT, then extracts grammar knowledge from the dependency tree and semantic knowledge from the AMR graph. In the expanded path, it sequentially merges the grammatical and semantic features from the contracted path back into the context word representations.
It can be seen from this embodiment that a U-BERT model with a contracted path and an expanded path is used: semantic features and grammar information are extracted in the contracted path, and auxiliary high-level semantic information and grammar information are embedded into the context in the expanded path. The trained language model is thereby more sensitive to semantic information, and the performance of the natural language reasoning task is improved.
Describing the basis of the method in detail: first, the GAT (graph attention network) and its extension to graphs with labeled edges are introduced, which form the basis of the model of the method.
The graph attention network (GAT) is a special type of network that processes graph-structured data through an attention mechanism. Given a graph $\mathcal{G} = (V, E)$, where $V$ is the set of nodes $x_i$ and $E$ is the set of edges $e_{ij}$, let $\mathcal{N}(x_i)$ denote the nodes directly connected to $x_i$, and let $\widetilde{\mathcal{N}}(x_i)$ be the set composed of $x_i$ and all of its immediate neighbors; we have $\widetilde{\mathcal{N}}(x_i) = \mathcal{N}(x_i) \cup \{x_i\}$.

Each node $x_i$ in the graph has an initial feature $h^{(0)}_i \in \mathbb{R}^d$, where $d$ is the feature size. The representation of each node is iteratively updated by the graph attention operation. At the $l$-th step, each node $x_i$ aggregates context information by attending over its neighbors and itself. The updated representation $h^{(l+1)}_i$ is computed as a weighted average of the connected nodes:

$$h^{(l+1)}_i = \sum_{x_j \in \widetilde{\mathcal{N}}(x_i)} a_{ij}\, W^{(l)} h^{(l)}_j.$$

The attention coefficient $a_{ij}$ is computed as:

$$a_{ij} = \mathrm{softmax}_j\!\left(\sigma\!\left(q^{(l)\top}\big[W^{(l)} h^{(l)}_i,\, W^{(l)} h^{(l)}_j\big]\right)\right),$$

where $\sigma$ is a nonlinear activation function, such as the linear rectification function ReLU, and $W^{(l)}$ and $q^{(l)}$ are learnable projection parameters.
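A compact sketch of the single-head graph attention update described above is shown below; it is a generic GAT layer under the stated definitions, not the patent's exact layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Each node aggregates a weighted average of its closed neighbourhood,
    with scalar attention coefficients a_ij."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):
        # adj[i, j] = 1 if x_j is in the closed neighbourhood of x_i (self-loops included).
        z = self.proj(h)                                      # [n, dim]
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))   # [n, n]
        scores = scores.masked_fill(adj == 0, float("-inf"))
        a = scores.softmax(dim=-1)                            # attention coefficients a_ij
        return a @ z                                          # updated node representations

adj = torch.tensor([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=torch.float)
h_new = GATLayer(300)(torch.randn(3, 300), adj)
```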
Note that in the GAT update above, $a_{ij}$ is a scalar, which means that all dimensions of $W^{(l)} h^{(l)}_j$ are treated equally. This may limit the ability to model complex dependencies. We therefore replace the ordinary attention with MDA (Multi-Dimensional Attention). MDA has proven useful for handling context variation and ambiguity in many NLP (Natural Language Processing) tasks. For each embedding $W^{(l)} h^{(l)}_j$, instead of computing a single scalar score, MDA computes a feature-wise score vector $\hat{\mathbf{a}}_{ij}$. We have:

$$\hat{\mathbf{a}}_{ij} = \hat{a}_{ij} + f\!\left(W^{(l)} h^{(l)}_j\right),$$

where $\hat{a}_{ij}$ is the scalar of the previous equation before the softmax operation, and $f(W^{(l)} h^{(l)}_j)$ is a vector; the addition in the equation means that the scalar is added to each element of the vector. The function $f(\cdot)$ estimates the contribution of each feature dimension of $W^{(l)} h^{(l)}_j$:

$$f(z) = W^{f} z + b^{f},$$

where $W^{f}$ and $b^{f}$ are learnable parameters. Finally, a feature-wise multi-dimensional softmax (MD-softmax) is used to normalize the attention weight vectors $\hat{\mathbf{a}}_{ij}$. The update formula can thus be modified as:

$$h^{(l+1)}_i = \sum_{x_j \in \widetilde{\mathcal{N}}(x_i)} \mathrm{MD\text{-}softmax}_j(\hat{\mathbf{a}}_{ij}) \odot W^{(l)} h^{(l)}_j.$$

After $L$ steps, each node eventually has a context-aware representation $h^{(L)}_i$. To achieve a stable training process, we also use a residual connection followed by layer normalization between two graph attention layers.
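The following sketch illustrates the multi-dimensional attention weighting in isolation; the feature-scoring function f(·) and the exact way the scalar score is formed are assumptions, since the original formulas are images:

```python
import torch
import torch.nn as nn

class MultiDimAttention(nn.Module):
    """Feature-wise attention: a scalar pair score is broadened into a score
    vector and normalised per dimension over the neighbours (MD-softmax)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.feature_score = nn.Linear(dim, dim)   # assumed form of f(.)

    def forward(self, h, adj):
        z = self.proj(h)                                      # [n, dim]
        scalar = z @ z.transpose(0, 1) / z.size(-1) ** 0.5    # scalar score per pair (i, j)
        vector = self.feature_score(z)                        # per-feature contribution of node j
        scores = scalar.unsqueeze(-1) + vector.unsqueeze(0)   # [n, n, dim]
        scores = scores.masked_fill((adj == 0).unsqueeze(-1), float("-inf"))
        weights = scores.softmax(dim=1)                       # MD-softmax over neighbours j
        return (weights * z.unsqueeze(0)).sum(dim=1)          # feature-wise weighted average

out = MultiDimAttention(300)(torch.randn(5, 300), torch.ones(5, 5))
```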
Higher-order information: message propagation in the conventional GAT is handled only over first-order neighboring nodes. An important extension is to use higher-order neighborhood information, which can help the model explore the relationships between indirectly connected nodes. Denote $R(K) = \{\mathcal{N}^{1}(x_i), \ldots, \mathcal{N}^{K}(x_i)\}$ as the neighborhood information from the 1st order to the Kth order, where $\mathcal{N}^{k}(x_i)$ denotes the kth-order neighborhood, meaning that all nodes in $\mathcal{N}^{k}(x_i)$ are reachable from $x_i$ within $k$ hops ($k \ge 1$). Analogously to $\widetilde{\mathcal{N}}(x_i)$, we can obtain $\widetilde{\mathcal{N}}^{k}(x_i) = \mathcal{N}^{k}(x_i) \cup \{x_i\}$.

The K-order GAT integrates the neighborhood information $R(K)$. At the $l$-th update step, each $x_i$ interacts with its reachable neighbors of each order and computes the attention features independently. The updated representation $h^{(l+1)}_i$ is obtained by concatenating the features from the different orders:

$$h^{(l+1)}_i = W^{(l)}_o\Big[\big\Vert_{k=1}^{K} \sum_{x_j \in \widetilde{\mathcal{N}}^{k}(x_i)} \mathrm{MD\text{-}softmax}_j\big(\hat{\mathbf{a}}^{k}_{ij}\big) \odot W^{(l)} h^{(l)}_j\Big],$$

where $\Vert$ denotes concatenation, $\hat{\mathbf{a}}^{k}_{ij}$ is the attention weight vector of the kth order, and $W^{(l)}_o$ is a learnable projection weight. More generally, for $k = \infty$ we define $\mathcal{N}^{\infty}(x_i)$ as the set of nodes reachable from $x_i$ within any number of hops; $\mathcal{N}^{\infty}(x_i)$ can easily be incorporated into the above formula.
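As an illustration, the K-order neighborhood sets can be derived from the adjacency matrix as sketched below; the patent does not prescribe this particular construction:

```python
import torch

def k_order_neighborhoods(adj, K):
    """Return boolean matrices R[0..K-1], where R[k][i, j] is True iff node j
    is reachable from node i within k+1 hops (self-loops included)."""
    n = adj.size(0)
    reach = adj.bool() | torch.eye(n, dtype=torch.bool)   # closed 1st-order neighbourhood
    orders = []
    for _ in range(K):
        orders.append(reach.clone())
        reach = reach | ((reach.float() @ adj.float()) > 0)   # extend by one more hop
    return orders

adj = torch.tensor([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
R = k_order_neighborhoods(adj, K=3)   # R[0]: 1st order, R[1]: 2nd order, R[2]: 3rd order
```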
Conventional graph attention networks can only represent graph nodes and ignore the attributes of edges. However, both the dependency tree and the AMR graph have labeled edges. To address this problem, previous GNN-based models convert the traditional graph into its equivalent Levi graph by turning edges into additional relation nodes. As shown in FIG. 4, the edges in the corresponding Levi graph have no attributes. Under this setting, edge labels are treated the same as word nodes and are modeled in the same way. This means that edge labels and word nodes share the same semantic space, which is not ideal, since nodes and edges are typically different kinds of elements.
To solve this problem, we use different parameters to represent the different types of nodes in the Levi graph. As shown in FIG. 4, for a word node $w_i$ we use $\mathcal{N}_e(w_i)$ to denote its neighboring edge (relation) nodes, and $\mathcal{N}_c(w_i)$ to denote its neighboring word nodes (ignoring the relation nodes). This definition can also be extended to integrate higher-order information: denote $R_c(K)$ as the 1st- to Kth-order neighborhood information of the word nodes and, likewise, $R_e(K)$ as the different-order information of the edge nodes. Note that the edge between two edge nodes may point in the opposite direction (as in FIG. 4), but we still consider these two nodes to be neighboring edge nodes.

A word node $w_i$ sequentially aggregates its word-node neighbors $R_c(K)$ and its edge-node neighbors $R_e(K)$, and finally updates its representation:

$$h'_{w_i} = \mathrm{GAT}_c\big(h_{w_i}, R_c(K)\big),$$
$$h^{(l+1)}_{w_i} = \mathrm{GAT}_e\big(h'_{w_i}, R_e(K)\big).$$

We will use $\mathrm{LeviGAT}(\cdot, R(K))$ to represent the graph encoding operation set out above, where $R(K) = (R_c(K), R_e(K))$ is the neighborhood information of the different node types.
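A sketch of the Levi-graph conversion follows: each labeled edge of the dependency tree (or AMR graph) becomes an additional relation node, so that word nodes and edge nodes can later be treated with separate parameters. The data layout is an illustrative assumption:

```python
def to_levi_graph(words, labeled_edges):
    """Convert labelled edges (head_idx, label, dep_idx) into a Levi graph:
    word nodes keep their indices, each labelled edge becomes a relation node
    connected to its head and its dependent."""
    nodes = list(words)                       # word nodes first
    levi_edges = []
    for head, label, dep in labeled_edges:
        rel_idx = len(nodes)
        nodes.append(label)                   # new relation (edge) node
        levi_edges.append((head, rel_idx))    # head word -> relation node
        levi_edges.append((rel_idx, dep))     # relation node -> dependent word
    return nodes, levi_edges

words = ["bikers", "race"]
edges = [(1, "nsubj", 0)]                     # race --nsubj--> bikers
nodes, levi_edges = to_levi_graph(words, edges)
# nodes == ["bikers", "race", "nsubj"], levi_edges == [(1, 2), (2, 0)]
```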
In the present method, the proposed U-BERT architecture is described in detail. First, the NLI (Natural Language Inference) task is defined in a formal manner. Given two sentences $S_a = \{x^a_1, x^a_2, \ldots, x^a_{t_a}\}$ and $S_b = \{x^b_1, x^b_2, \ldots, x^b_{t_b}\}$, the goal of our model $f(S_a, S_b)$ is to predict whether $S_a$ and $S_b$ have an entailment relationship. Here, $x^a_i$ and $x^b_j$ denote the ith and jth words of the respective sentences, and $t_a$ and $t_b$ denote the numbers of words in the sentences.
The network architecture is shown in fig. 5. As in U-Net, the left side is the contracted path and the right side is the expanded path. In the contracted path, U-BERT obtains context information from BERT and then extracts semantic features from the dependency tree and the AMR graph. From sentence context to dependency relationships and finally to abstract meaning, the information becomes more and more abstract, which allows the model to learn semantic information step by step. The configuration of the expanded path is substantially symmetrical but in reverse order: along the expanded path, U-BERT merges the semantic information from the AMR graph and the dependency tree back in, in the reverse of the feature order of the contracted path. Finally, the classifier obtains a representation based on word-level and high-level semantic features.
We integrate two types of semantic features into the pre-trained language model: the dependency tree and the AMR graph. The dependency tree reflects explicit relationships between the different parts of a sentence. As shown in fig. 6 (b), the relationship between words is represented by a directed arc from the head word to the dependent word. The dependency tree retains all the words and the order of the sentence, while AMR is more abstract. AMR is a sentence-level semantic representation formalized as a rooted directed graph, where nodes are concepts and edges are semantic relations. Concepts are extracted from the sentence, and each concept is aligned with several words.
To integrate the semantic features, we need to fuse the original BERT embeddings with the semantic structure representations. Since the original pre-trained BERT operates on a sequence of subwords, the dependency tree on words, and the AMR graph on concepts, we need to align these representations of different granularities. As shown in fig. 6, we group the subwords of each word and use attentive pooling to obtain the word-level representation in the dependency tree.
For example, assume that a word $x_i$ is composed of a series of subwords $S_i = \{s_1, s_2, \ldots, s_K\}$, and denote $\{\mathbf{e}(s_1), \ldots, \mathbf{e}(s_K)\}$ as their representations from BERT. The word-level embedding of $x_i$ is then:

$$e(x_i) = \sum_{k=1}^{K} \alpha_k \odot \mathbf{e}(s_k),$$

where the attention weight vector $\alpha_k$ is computed by multi-dimensional attention. The concept-level representation in the AMR graph is obtained in the same way.
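A sketch of the attentive pooling used for this alignment is given below; for brevity it uses a scalar attention score rather than the multi-dimensional attention of the text:

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool the BERT embeddings of a word's subwords into one word-level vector."""
    def __init__(self, dim=300):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, subword_embs):
        # subword_embs: [K, dim] embeddings of the subwords of a single word
        alpha = self.score(subword_embs).softmax(dim=0)   # attention weights over subwords
        return (alpha * subword_embs).sum(dim=0)          # word-level embedding e(x_i)

word_vec = AttentivePooling()(torch.randn(3, 300))   # a word split into 3 subwords
```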
Experiments were performed on the method. Data and preprocessing: we performed experiments on two NLI benchmark datasets, the SNLI dataset and the MNLI dataset. The evaluation index is classification accuracy. We use the CoreNLP natural language analysis toolkit and CAMR to obtain the semantic relations of sentences. CoreNLP is a set of human language analysis tools developed by Stanford University. CAMR is a transition-based tree-to-graph parser for generating the AMR graph of a sentence. More specifically, we use CoreNLP to obtain part-of-speech (POS) tags and syntactic dependencies for each sentence. After the CoreNLP pipeline, CAMR parses the dependencies into AMR graphs.
Training details: for the model parameters, the representation dimension $d$ is set to 300 and $K = 1, 2, 3$; we also merge the reachable node set $\mathcal{N}^{\infty}(x_i)$ into the neighborhood information $R(K)$. We use BertAdam as our optimizer and cosine decay as our learning rate schedule:

$$l_r = \frac{1}{2}\, l_{r0} \left(1 + \cos\!\left(\pi \frac{t}{t_{all}}\right)\right),$$

where $t$ denotes the cumulative number of training steps and $t_{all}$ denotes the total number of decay steps. For SNLI, the initial learning rate $l_{r0}$ is set to 1.4e-5; for MNLI, the initial learning rate $l_{r0}$ is set to 2e-5. The batch size is 32 and the dropout rate for all layers is set to 0.2.
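The cosine decay schedule can be written roughly as below; this is a common cosine form under the stated settings, since the exact expression in the original is an image, and the total step count in the example is an assumption:

```python
import math

def cosine_decay_lr(step, total_steps, lr0):
    """Cosine-decayed learning rate from lr0 down to 0 over total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * progress))

# Example with the SNLI setting above (initial learning rate 1.4e-5).
lrs = [cosine_decay_lr(t, total_steps=10000, lr0=1.4e-5) for t in range(0, 10001, 2500)]
```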
Our baselines are BERT-base, BERT-large and SemBERT. SemBERT is an improved language representation model that utilizes contextual semantics on a BERT backbone. Unlike the AMR graph used in our model, SemBERT uses semantic role labels (SRL) as additional semantic information for BERT. SemBERT has excellent performance on natural language reasoning and has reached state-of-the-art levels on the SNLI and MNLI datasets. Fig. 7 shows our results on the SNLI and MNLI development and test sets. All models are trained on a single dataset without ensembling or additional unlabeled data.
Compared with the BERT-base baseline, our BERT-base-based model is 0.4% better on the SNLI test set, 0.6% better on the MNLI-m test set and 0.4% better on the MNLI-mm test set. Our BERT-base-based model can even reach the performance of BERT-large on the SNLI dataset. Compared with BERT-large, our BERT-large-based model improves the result by 0.5% on the SNLI test set, 1.0% on the MNLI-m test set and 0.6% on the MNLI-mm test set. Such significant improvements indicate that merging semantic information helps the pre-trained model perform better.
We also compare with SemBERT. On the SNLI test set, our BERT-base-based model performs 0.1% better than the SemBERT-base baseline, and U-BERT based on BERT-large performs the same as SemBERT-large. On the MNLI matched dataset, our BERT-base-based model performs 0.8% better than SemBERT-base. Compared with SemBERT-large, U-BERT-large performs 0.2% and 0.1% better on the two MNLI test sets.
To examine the contribution of key elements of the model, we initiated ablation experiments on the SNLI development set. The results are illustrated in fig. 8 and 9. We focus on two parts:
(1) the effects of the dependency tree and the AMR graph;
(2) the effects of the neighborhood order K and the reachable node set $\mathcal{N}^{\infty}(x_i)$.
From the results, we find that using the dependency tree or the AMR graph independently is already better than the baseline, but U-BERT with both still achieves the best performance. On the other hand, higher-order neighborhood information is also very important for U-BERT.
In the ablation experiments, we take U-BERT with both the dependency tree and the AMR graph as the base model, and compare it against the model without the dependency tree (-DEP) and the model without the AMR graph (-AMR). As shown in FIG. 8, all three models outperform the BERT-large baseline, which indicates that both the grammar information in the dependency tree and the semantic information in the AMR graph are beneficial to the NLI task. The original model with both the dependency tree and the AMR graph performs best. The model without the AMR graph performs slightly better than the model without the dependency tree. However, when both kinds of information are used, the improvement is limited. We consider two reasons:
(1) To some extent, the information in the dependency tree and the AMR graph is homogeneous.
(2) CAMR parses the AMR graph from the dependency tree by a transition-based method, so the performance of CAMR depends on the accuracy of the provided dependency tree. Error accumulation can affect the results of AMR parsing. In this case, our model cannot extract useful information from both semantic structures. In the future, we will try other end-to-end AMR parsers.
To better understand the effectiveness of different-order neighborhood information, we performed a series of ablation tests for different orders K. FIG. 9 shows the effect of different-order neighborhood information on the SNLI set. We test neighborhood information of orders K = 0, 1, 2, 3, where K = 0 means that a node in the graph can only interact with itself. Furthermore, we also discuss the impact of the reachable node set $\mathcal{N}^{\infty}$. The results show that when K > 0, all models perform better than the baseline; in contrast, when K = 0 the performance is not improved. K = 2 and K = 3 obtain the same score, indicating that the information in the 2nd-order neighborhood is already sufficient for the AMR graph.
The second part of FIG. 9 shows that the infinite-order neighborhood information $\mathcal{N}^{\infty}$ improves the performance of all models with K > 0, which means that the global information provided by the infinite-order neighborhood is useful only for models that also exploit local information.
Matching models based on deep learning have made tremendous progress in natural language reasoning with the release of large-scale annotated data such as SNLI and MNLI. There are two main frameworks. The first framework is based on the Siamese architecture, where the two sentences of a pair are encoded into high-level representations by two symmetric networks, respectively. The second framework applies explicit interaction between the sentence pair during encoding. Under this framework, models are better able to match sentences at multiple levels of granularity, so they generally perform better than the former. The present method is based on the second framework.
Pre-trained language models, such as GPT, BERT and XLNet, have shown powerful capabilities on NLP, achieving state-of-the-art results on several natural language understanding (NLU) benchmarks such as GLUE. Typically, a pre-trained language model is used as the encoder part of a downstream model, or is fine-tuned for a specific NLP task (e.g., NLI). In the present method, U-BERT uses a BERT-base or BERT-large backbone and incorporates the structured semantic information of the dependency tree and the AMR graph.
Language knowledge plays an important role in natural language processing. Recently, there has been a trend to combine linguistic knowledge with pre-trained models. In the model of the method, two structured semantic representations, a dependency tree and an AMR graph, are applied to combine with the output representation of BERT.
The present method proposes a novel BERT-based network, U-BERT, which incorporates structured semantic information from dependency trees and AMR graphs. Experiments show that our model greatly improves the performance of BERT on NLI tasks. This work demonstrates the effectiveness of dependency trees and AMR graphs in natural language processing.
Fig. 10 is a schematic structural diagram of a BERT model optimization system according to an embodiment of the present invention, where the system may execute the BERT model optimization method according to any of the foregoing embodiments and be configured in a terminal.
The optimization system of the BERT model provided in this embodiment includes: a context embedding program module 11, a semantic feature extraction program module 12, an auxiliary program module 13 and a prediction program module 14.
Wherein the context embedding program module 11 is configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split; the semantic feature extraction program module 12 is configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair; the auxiliary program module 13 is configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features; and the prediction program module 14 is configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
Further, the system further comprises:
a grammar information extraction program module, configured to establish dependency trees of the sentence pair through a natural language analysis tool and extract grammar information of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determine a third context embedding with the semantic features and the grammar information;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the optimization method of the BERT model in any method embodiment;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the method of optimizing the BERT model in any of the method embodiments described above.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the optimization method of the BERT model of any embodiment of the invention.
The client of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second are used solely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of optimization of a BERT model, comprising:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
and predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
2. The method of claim 1, wherein the method further comprises:
establishing dependency trees of the sentence pair through a natural language analysis tool, and extracting grammar information of the sentence pair;
determining the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determining a third context embedding with the semantic features and the grammar information;
and predicting the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
3. The method of claim 2, wherein the BERT model comprises: a contracted path and an expanded path;
wherein the extracting of the semantic features of the sentence pair and the extracting of the grammar information of the sentence pair are performed in the contracted path;
and the third context embedding with the semantic features and the grammar information is determined in the expanded path.
4. The method of claim 2, wherein the extracting of the grammar information of the sentence pair comprises:
performing bidirectional embedding updates on the word nodes and edge nodes in the dependency tree, and determining the grammar information based on the dependency relationships between the word nodes and the edge nodes;
and the determining of the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding comprises:
expanding the semantic representation onto the dependency tree through a pooling layer, combining the semantic features with the grammar information, and determining them as auxiliary information for the first context embedding.
5. The method of claim 1, wherein the predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features comprises:
processing the second context embedding through an attention mechanism to generate sentence-level representations of the sentence pair;
and performing language reasoning on the sentence-level representations of the sentence pair based on a relation classifier to predict the entailment relationship of the two sentences in the sentence pair.
6. The method of claim 2, wherein the semantic representation language comprises: AMR abstract meaning representation;
the natural language analysis tool includes at least: coreNLP.
7. An optimization system of a BERT model, comprising:
a context embedding program module, configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
a semantic feature extraction program module, configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
8. The system of claim 7, wherein the system further comprises:
a grammar information extraction program module, configured to establish dependency trees of the sentence pair through a natural language analysis tool and extract grammar information of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determine a third context embedding with the semantic features and the grammar information;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-6.
CN202010895250.1A 2020-08-31 2020-08-31 BERT model optimization method and system Active CN111950298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010895250.1A CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010895250.1A CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Publications (2)

Publication Number Publication Date
CN111950298A CN111950298A (en) 2020-11-17
CN111950298B true CN111950298B (en) 2023-06-23

Family

ID=73368181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010895250.1A Active CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Country Status (1)

Country Link
CN (1) CN111950298B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395393B (en) * 2020-11-27 2022-09-30 华东师范大学 Remote supervision relation extraction method based on multitask and multiple examples
CN114821257B (en) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516253A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Chinese spoken language semantic understanding method and system
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111488734A (en) * 2020-04-14 2020-08-04 西安交通大学 Emotional feature representation learning system and method based on global interaction and syntactic dependency

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9818401B2 (en) * 2013-05-30 2017-11-14 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
WO2018203147A2 (en) * 2017-04-23 2018-11-08 Voicebox Technologies Corporation Multi-lingual semantic parser based on transferred learning
CN112771564A (en) * 2018-07-18 2021-05-07 邓白氏公司 Artificial intelligence engine that generates semantic directions for web sites to map identities for automated entity seeking

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516253A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Chinese spoken language semantic understanding method and system
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111488734A (en) * 2020-04-14 2020-08-04 西安交通大学 Emotional feature representation learning system and method based on global interaction and syntactic dependency

Also Published As

Publication number Publication date
CN111950298A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN111914067B (en) Chinese text matching method and system
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111950298B (en) BERT model optimization method and system
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN111737974A (en) Semantic abstract representation method and device for statement
CN110678882A (en) Selecting answer spans from electronic documents using machine learning
CN110084323A (en) End-to-end semanteme resolution system and training method
CN114510946B (en) Deep neural network-based Chinese named entity recognition method and system
CN111723207B (en) Intention identification method and system
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114328814A (en) Text abstract model training method and device, electronic equipment and storage medium
CN114579605B (en) Table question-answer data processing method, electronic equipment and computer storage medium
CN113177123B (en) Optimization method and system for text-to-SQL model
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN117591543B (en) SQL sentence generation method and device for Chinese natural language
CN116227484B (en) Model training method, apparatus, device, storage medium and computer program product
CN113378543B (en) Data analysis method, method for training data analysis model and electronic equipment
Li et al. STCP: An Efficient Model Combining Subject Triples and Constituency Parsing for Recognizing Textual Entailment
CN116682419A (en) Training method of multi-domain multi-intention spoken language semantic understanding model
Bashir et al. Efficient Deep Learning based Code Retrieval using Unified Graph Structure and Semantic Graph Matching Encoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant
GR01 Patent grant