CN111950298B - BERT model optimization method and system - Google Patents

BERT model optimization method and system

Info

Publication number
CN111950298B
CN111950298B
Authority
CN
China
Prior art keywords
sentence
context
semantic
semantic features
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010895250.1A
Other languages
Chinese (zh)
Other versions
CN111950298A (en)
Inventor
俞凯
金乐盛
陈露
赵晏彬
陈志�
朱苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010895250.1A priority Critical patent/CN111950298B/en
Publication of CN111950298A publication Critical patent/CN111950298A/en
Application granted granted Critical
Publication of CN111950298B publication Critical patent/CN111950298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

An embodiment of the invention provides a BERT model optimization method. The method comprises the following steps: determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split; determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair; determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features; and predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features. An embodiment of the invention also provides a BERT model optimization system. In the embodiments of the invention, auxiliary high-level semantic information and grammar information are embedded into the context in the language model for natural language reasoning, so that the trained language model is more sensitive to semantic information and the performance of the natural language reasoning task is greatly improved.

Description

BERT model optimization method and system
Technical Field
The invention relates to the field of natural language reasoning, in particular to a BERT model optimization method and system.
Background
NLI (Natural Language Inference), also known as recognizing textual entailment, requires determining whether a hypothesis sentence can be inferred from a given premise. As a key sentence-pair semantic matching task, NLI is closely related to several other NLP (Natural Language Processing) tasks, including question answering, semantic recognition and information retrieval.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
language models for natural language reasoning are trained on large plain-text corpora. With this training approach, pre-trained language models tend to learn simple contextual features but lack grammatical and semantic understanding. Experiments have also shown that deep learning models focus on simple context words and rarely understand the true meaning and high-level semantics of natural language text; this lack of semantic information degrades the effect of language reasoning.
Disclosure of Invention
The embodiments of the invention are intended at least to solve the problem in the prior art that language models trained for language reasoning lack semantic information.
In a first aspect, an embodiment of the present invention provides a method for optimizing a BERT model, including:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In a second aspect, an embodiment of the present invention provides an optimization system for a BERT model, including:
a context embedding program module, configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
a semantic feature extraction program module, configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the optimization method of the BERT model of any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the optimization method of the BERT model of any embodiment of the present invention.
The embodiment of the invention has the beneficial effects that: in the language model for natural language reasoning, auxiliary high-level semantic information and grammar information are embedded into the context, so that the trained language model is more sensitive to semantic information and the performance of the natural language reasoning task is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for optimizing BERT model according to an embodiment of the present invention;
FIG. 2 is a diagram of dependency tree+AMR of a BERT model optimization method according to an embodiment of the present invention;
FIG. 3 is a diagram of the AMR graphs of "two racing riders riding a motorcycle" (left) and "two people racing" (right) of a BERT model optimization method according to one embodiment of the present invention;
FIG. 4 is a diagram of neighborhood information of different node types and orders in a Levi graph of a BERT model optimization method according to an embodiment of the present invention;
FIG. 5 is a diagram of a U-BERT model structure of a BERT model optimization method according to an embodiment of the present invention;
FIG. 6 is an alignment structure diagram of a sub word sequence, a dependency tree and an AMR graph of a BERT model optimization method according to an embodiment of the present invention;
FIG. 7 is a data diagram of SNLI dataset and MNLI dataset of a BERT model optimization method according to an embodiment of the invention;
FIG. 8 is a graph of ablation analysis data of the dependency tree and the AMR graph on the SNLI dataset of a BERT model optimization method provided by an embodiment of the present invention;
FIG. 9 is a data diagram of different neighborhood information order models of a BERT model optimization method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an optimization system of BERT model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a BERT model optimization method according to an embodiment of the present invention, including the following steps:
S11: determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
S12: determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
S13: determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
S14: predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
In this embodiment, the method proposes a novel model, U-BERT, which incorporates syntactic and semantic structure into the BERT model (Bidirectional Encoder Representations from Transformers). In actual use, the inference task can be carried out with only the AMR (Abstract Meaning Representation) graph or only the dependency tree, or with both the dependency tree and the AMR graph. For example, the dependency tree + AMR structure is shown in FIG. 2.
For step S11, natural language reasoning mainly determines the semantic relationship between two sentences (a premise and a hypothesis). After the sentence pair is prepared, the pre-trained BERT model is used for encoding.
First, sentences $S_a$ and $S_b$ are tokenized into subword sequences $\{s^a_1, s^a_2, \ldots, s^a_{l_a}\}$ and $\{s^b_1, s^b_2, \ldots, s^b_{l_b}\}$, where $l_a$ and $l_b$ are the numbers of subwords in the two sentences. We then join the subword sequences to form a new sequence, which is provided to the BERT model to obtain a contextual embedding of each subword. As a result, the contextual representations of the two sentences can be written as $S_a = \{\mathbf{s}^a_1, \ldots, \mathbf{s}^a_{l_a}\}$ and $S_b = \{\mathbf{s}^b_1, \ldots, \mathbf{s}^b_{l_b}\}$.
For step S12, a semantic representation graph of the sentence pair is determined by a semantic representation language parser. As one embodiment, the semantic representation language comprises AMR (Abstract Meaning Representation).
An AMR graph is a graph of key concepts that abstracts away from the syntactic representation on the basis of the abstract meaning representation language. Structured knowledge can help the model discard unimportant information and focus on the key points. In FIG. 3, two AMR graphs are parsed from a sentence pair in the SNLI dataset. We can find that the major parts of the two sentences are indeed very similar; however, the difference between the "car" node and the "motorcycle" node indicates that the relationship between the two sentences is a "contradiction". Nevertheless, little work has been done to explore the effectiveness of AMR graphs for NLI tasks, particularly when used in conjunction with dependency trees.
The method therefore investigates AMR graphs further. AMR representation format: the operation for extracting semantic information from the AMR graph is similar to that used for the dependency tree. Denote $\mathcal{G}^a_{amr}$ and $\mathcal{G}^b_{amr}$ as the corresponding AMR graphs. For graph $\mathcal{G}^a_{amr}$, the neighborhood information is $R^a_{amr}(K)$. The initial embeddings of the concept nodes are aligned with the dependency representation $D_a$ (described in detail below):

$$M^{(0)}_a = \mathrm{Pooling}(D_a),$$

where $M^{(0)}_a$ is the initial concept-node embedding of the AMR graph $\mathcal{G}^a_{amr}$, and $m_a$ is the number of AMR concept nodes aligned from the dependency tree. As with the dependency representation, the same graph encoding and interaction operation $\mathrm{GraphEnc}(\cdot)$ (defined below) can be applied directly:

$$M_a = \mathrm{GraphEnc}(M^{(0)}_a, R^a_{amr}(K)),$$
$$M_b = \mathrm{GraphEnc}(M^{(0)}_b, R^b_{amr}(K)),$$

where $M_a$ and $M_b$ are the final outputs of the contracted path.
For step S13, the semantic features of the sentence pair are determined as auxiliary information for the first context embedding.
Our model expands the AMR representations $M_a$ and $M_b$ back to the subword level. To achieve this, we perform the inverse of the pooling layer and restore the representation to its original structure: we record the positions of the nodes that are merged in the corresponding pooling layer and use this information to put the nodes back in their original locations. For example, if a word $x_i$ is composed of a series of subwords $S_i = \{s_1, s_2, \ldots, s_K\}$ and $e(x_i)$ from the previous layer is the current representation of $x_i$, the embeddings of the subword set $S_i$ are restored as:

$$S_i = \mathrm{UnPooling}(e(x_i)),$$
$$s_1 = \cdots = s_K = e(x_i), \quad s_k \in S_i.$$

The same method is used in the expansion from the AMR graph: for the AMR representations in the expanded path, $M_a$ and $M_b$ are un-pooled in the same way and expanded back toward the subword level.
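A minimal sketch of the un-pooling operation just described is given below; the function and variable names are illustrative assumptions rather than the patent's reference implementation:

```python
import torch

def unpooling(word_embeddings, subword_groups, num_subwords):
    """Copy each word-level embedding e(x_i) back to every subword position
    that the corresponding pooling layer had merged into word x_i."""
    dim = word_embeddings.size(-1)
    subword_embeddings = torch.zeros(num_subwords, dim)
    for word_idx, positions in enumerate(subword_groups):
        # s_1 = ... = s_K = e(x_i) for all subwords of word x_i
        subword_embeddings[positions] = word_embeddings[word_idx]
    return subword_embeddings

# Example: word 0 was pooled from subwords [0, 1], word 1 from subword [2].
words = torch.randn(2, 300)
restored = unpooling(words, [[0, 1], [2]], num_subwords=3)
```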
for step S14, output S with BERT a (first context embedding) the final representation
Figure BDA0002658242200000058
The (second context embedding) includes semantic information from the AMR map. In addition, comparison information having a semantic structure of another sentence is included. Finally, it is possible to pass->
Figure BDA0002658242200000059
All the embeddings in (1) are stored centrally to calculate the sentence level representation h a I.e.
Figure BDA00026582422000000510
Wherein alpha is i Is an attention weight vector calculated from a multidimensional attention. We can also obtain h in the same way b
As one embodiment, the second context is embedded and processed through an attention mechanism to generate a sentence-level representation of the sentence pair;
and carrying out language reasoning on the sentence level representation of the sentence pair based on a relation classifier, and predicting the inclusion relation of two sentences in the sentence pair.
Relation classifier, using h a And h b Two sentence-level representations, we can predict the inclusion relationship of two sentences,
p=FFN([h a ,h b ,h a ⊙h b ,|h a -h b |])
if in the training phase, the training goal is to minimize cross entropy loss.
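As an illustration of the relation classifier described above, the following sketch builds the feature vector [h_a, h_b, h_a ⊙ h_b, |h_a − h_b|] and trains with a cross-entropy objective; the layer sizes and the three-way label set are assumptions:

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """p = FFN([h_a, h_b, h_a * h_b, |h_a - h_b|]) over the NLI labels."""
    def __init__(self, dim=300, num_labels=3):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(4 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_labels),
        )

    def forward(self, h_a, h_b):
        features = torch.cat([h_a, h_b, h_a * h_b, (h_a - h_b).abs()], dim=-1)
        return self.ffn(features)

classifier = RelationClassifier()
logits = classifier(torch.randn(8, 300), torch.randn(8, 300))
# Training objective: minimize the cross-entropy loss against gold labels.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
```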
According to this embodiment, auxiliary high-level semantic information is embedded into the context in the language model for natural language reasoning, so that the trained language model is more sensitive to semantic information and the effect of language reasoning is improved.
As an implementation manner, in this embodiment, the method further includes:
establishing dependency trees of the sentence pair through a natural language analysis tool, and extracting grammar information of the sentence pair;
determining the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determining a third context embedding with the semantic features and the grammar information;
predicting the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
The extracting of the grammar information of the sentence pair includes:
performing bidirectional embedding updates on the word nodes and edge nodes in the dependency tree, and determining the grammar information based on the dependency relationships between the word nodes and the edge nodes;
the determining of the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding includes:
expanding the semantic representation onto the dependency tree through a pooling layer, combining the semantic features with the grammar information, and determining them as auxiliary information for the first context embedding.
The natural language analysis tool includes at least: coreNLP.
In this embodiment, U-BERT extracts grammar information from the dependency trees of the two sentences and fuses it into the sentence representations. Denote $\mathcal{G}^a_{dep}$ and $\mathcal{G}^b_{dep}$ as the dependency tree graphs. For graph $\mathcal{G}^a_{dep}$, we use $R^a_{dep}(K)$ to denote the 1st- to Kth-order neighborhood information of its word nodes and edge nodes. The embeddings of the edge nodes are randomly initialized, and the initial embeddings of the word nodes are aligned from the subword representation $S_a$:

$$D^{(0)}_a = \mathrm{Pooling}(S_a),$$

where $D^{(0)}_a$ is the initial word-node embedding of the dependency tree $\mathcal{G}^a_{dep}$, and $t_a$ is the number of words aligned from the subword sequence. The representation of the word nodes in the dependency tree is updated by the Levi-GAT:

$$\overrightarrow{D}_a = \mathrm{LeviGAT}(D^{(0)}_a, R^a_{dep}(K)),$$

where $\overrightarrow{D}_a$ is the representation updated based on the neighborhood information $R^a_{dep}(K)$. One thing to note is that $\mathcal{G}^a_{dep}$ is a directed graph, which means that information propagates through the graph along a pre-specified direction. However, one-way propagation may lose structural information from the opposite direction. To solve this problem, we also aggregate structural messages along the reverse edge direction. Denoting $\overleftarrow{R}^a_{dep}(K)$ as the corresponding neighborhood information in the opposite direction, we have:

$$\overleftarrow{D}_a = \mathrm{LeviGAT}(D^{(0)}_a, \overleftarrow{R}^a_{dep}(K)).$$

Since we have updated the node embeddings in both directions, the updated representation of the dependency tree graph is a combination of the bidirectional embeddings, namely:

$$\widetilde{D}_a = W_d[\overrightarrow{D}_a, \overleftarrow{D}_a],$$

where $W_d$ is a trainable projection matrix. After extracting the dependency information with the graph encoding layer, we obtain the representation $\widetilde{D}_a$ of sentence $S_a$ and the representation $\widetilde{D}_b$ of sentence $S_b$. We use an attention mechanism to make them interact with each other. The attention weight matrix is computed as:

$$A = (\widetilde{D}_a W_1)(\widetilde{D}_b W_2)^{\top},$$

where $W_1$ and $W_2$ are learnable projection matrices. For each word node in $\widetilde{D}_a$, the context representation gathered from $\widetilde{D}_b$ is:

$$C_a = \mathrm{softmax}(A)\,\widetilde{D}_b.$$

The final dependency representation from the graph encoding layer is a combination of the original embedding and the context representation from the other sentence, namely:

$$D_a = \mathrm{FFN}([\widetilde{D}_a, C_a]),$$

where FFN is a feed-forward network that includes two linear transformations. Similarly, we can obtain the final dependency representation $D_b$ of sentence $S_b$. For ease of description, we use $\mathrm{GraphEnc}(\cdot)$ to represent the operations in the above equations.
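A simplified sketch of the cross-sentence interaction described above follows. The exact attention formula in the original publication is an image and is not recoverable, so the bilinear score and the FFN sizes below are assumptions:

```python
import torch
import torch.nn as nn

class CrossSentenceFusion(nn.Module):
    """Fuse a sentence's dependency representation with context gathered from
    the other sentence (assumed bilinear attention followed by an FFN)."""
    def __init__(self, dim=300):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, d_a, d_b):
        # Attention weight matrix between the word nodes of the two sentences.
        scores = self.w1(d_a) @ self.w2(d_b).transpose(-1, -2)   # [t_a, t_b]
        context = scores.softmax(dim=-1) @ d_b                   # context gathered from S_b
        # Combine the original embedding with the cross-sentence context.
        return self.ffn(torch.cat([d_a, context], dim=-1))

d_a_final = CrossSentenceFusion()(torch.randn(7, 300), torch.randn(5, 300))
```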
For the dependency-tree level, we use the un-pooling layer to expand the concept embeddings of the AMR graph back to the dependency tree, namely:

$$\hat{M}_a = \mathrm{UnPooling}(M_a).$$

The output of the contracted path is used as a residual connection during the graph operations in the expanded path. For the dependency representation we have:

$$\hat{D}_a = \mathrm{GraphEnc}(\hat{M}_a + D_a).$$

Likewise, the final embedding of the subwords is expanded from the dependency representation:

$$\hat{S}_a = \mathrm{UnPooling}(\hat{D}_a) + S_a,$$
$$\hat{S}_b = \mathrm{UnPooling}(\hat{D}_b) + S_b.$$

Compared with the BERT output $S_a$, the final representation $\hat{S}_a$ includes semantic information from the dependency tree and the AMR graph, as well as comparison information from the semantic structure of the other sentence. Finally, all embeddings in $\hat{S}_a$ can be aggregated to compute the sentence-level representation $h_a$, i.e.

$$h_a = \sum_i \alpha_i \odot \hat{\mathbf{s}}^a_i,$$

where, likewise, $\alpha_i$ is an attention weight vector computed by multi-dimensional attention. We can obtain $h_b$ in the same way.
Relation classifier: using the two sentence-level representations $h_a$ and $h_b$, we can predict the entailment relationship of the two sentences:

$$p = \mathrm{FFN}([h_a, h_b, h_a \odot h_b, |h_a - h_b|]).$$
according to the embodiment, in the language model of natural language reasoning, auxiliary high-level semantic information and grammar information are embedded for the context, so that the trained language model is more sensitive to the semantic information, and the performance of a natural language reasoning task is greatly improved.
As an embodiment, the BERT model includes: a contracted path and an expanded path;
wherein the extracting of the semantic features of the sentence pair and the extracting of the grammar information of the sentence pair are performed in the contracted path;
and the third context embedding with the semantic features and the grammar information is determined in the expanded path.
In this embodiment, the method proposes a new U-BERT model that can integrate structured knowledge into BERT. It has two paths: a contracted path and an expanded path. In the contracted path, it takes the context representation from BERT, then extracts grammar knowledge from the dependency tree and semantic knowledge from the AMR graph. In the expanded path, it sequentially merges the grammatical and semantic features from the contracted path back into the context word representations.
It can be seen from this embodiment that a U-BERT model with a contracted path and an expanded path is used: semantic features and grammar information are extracted in the contracted path, and auxiliary high-level semantic information and grammar information are embedded into the context in the expanded path. The trained language model is thereby more sensitive to semantic information, and the performance of the natural language reasoning task is improved.
Describing the basis of the method in detail: first, the GAT (graph attention network) and its extension to graphs with labeled edges are introduced, which form the basis of the model of the method.
The graph attention network (GAT) is a special type of network that processes graph-structured data through an attention mechanism. Given a graph $\mathcal{G} = (V, E)$, where $V$ is the set of nodes $x_i$ and $E$ is the set of edges $e_{ij}$, let $\mathcal{N}(x_i)$ denote the nodes directly connected to $x_i$, and let $\widetilde{\mathcal{N}}(x_i)$ be the set composed of $x_i$ and all of its immediate neighbors; we have $\widetilde{\mathcal{N}}(x_i) = \mathcal{N}(x_i) \cup \{x_i\}$.

Each node $x_i$ in the graph has an initial feature $h^{(0)}_i \in \mathbb{R}^d$, where $d$ is the feature size. The representation of each node is iteratively updated by the graph attention operation. At the $l$-th step, each node $x_i$ aggregates context information by attending over its neighbors and itself. The updated representation $h^{(l+1)}_i$ is computed as a weighted average of the connected nodes:

$$h^{(l+1)}_i = \sum_{x_j \in \widetilde{\mathcal{N}}(x_i)} a_{ij}\, W^{(l)} h^{(l)}_j.$$

The attention coefficient $a_{ij}$ is computed as:

$$a_{ij} = \mathrm{softmax}_j\!\left(\sigma\!\left(q^{(l)\top}\big[W^{(l)} h^{(l)}_i,\, W^{(l)} h^{(l)}_j\big]\right)\right),$$

where $\sigma$ is a nonlinear activation function, such as the linear rectification function ReLU, and $W^{(l)}$ and $q^{(l)}$ are learnable projection parameters.
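A compact sketch of the single-head graph attention update described above is shown below; it is a generic GAT layer under the stated definitions, not the patent's exact layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Each node aggregates a weighted average of its closed neighbourhood,
    with scalar attention coefficients a_ij."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):
        # adj[i, j] = 1 if x_j is in the closed neighbourhood of x_i (self-loops included).
        z = self.proj(h)                                      # [n, dim]
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))   # [n, n]
        scores = scores.masked_fill(adj == 0, float("-inf"))
        a = scores.softmax(dim=-1)                            # attention coefficients a_ij
        return a @ z                                          # updated node representations

adj = torch.tensor([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=torch.float)
h_new = GATLayer(300)(torch.randn(3, 300), adj)
```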
Note that in the GAT update above, $a_{ij}$ is a scalar, which means that all dimensions of $W^{(l)} h^{(l)}_j$ are treated equally. This may limit the ability to model complex dependencies. We therefore replace the ordinary attention with MDA (Multi-Dimensional Attention). MDA has proven useful for handling context variation and ambiguity in many NLP (Natural Language Processing) tasks. For each embedding $W^{(l)} h^{(l)}_j$, instead of computing a single scalar score, MDA computes a feature-wise score vector $\hat{\mathbf{a}}_{ij}$. We have:

$$\hat{\mathbf{a}}_{ij} = \hat{a}_{ij} + f\!\left(W^{(l)} h^{(l)}_j\right),$$

where $\hat{a}_{ij}$ is the scalar of the previous equation before the softmax operation, and $f(W^{(l)} h^{(l)}_j)$ is a vector; the addition in the equation means that the scalar is added to each element of the vector. The function $f(\cdot)$ estimates the contribution of each feature dimension of $W^{(l)} h^{(l)}_j$:

$$f(z) = W^{f} z + b^{f},$$

where $W^{f}$ and $b^{f}$ are learnable parameters. Finally, a feature-wise multi-dimensional softmax (MD-softmax) is used to normalize the attention weight vectors $\hat{\mathbf{a}}_{ij}$. The update formula can thus be modified as:

$$h^{(l+1)}_i = \sum_{x_j \in \widetilde{\mathcal{N}}(x_i)} \mathrm{MD\text{-}softmax}_j(\hat{\mathbf{a}}_{ij}) \odot W^{(l)} h^{(l)}_j.$$

After $L$ steps, each node eventually has a context-aware representation $h^{(L)}_i$. To achieve a stable training process, we also use a residual connection followed by layer normalization between two graph attention layers.
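The following sketch illustrates the multi-dimensional attention weighting in isolation; the feature-scoring function f(·) and the exact way the scalar score is formed are assumptions, since the original formulas are images:

```python
import torch
import torch.nn as nn

class MultiDimAttention(nn.Module):
    """Feature-wise attention: a scalar pair score is broadened into a score
    vector and normalised per dimension over the neighbours (MD-softmax)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.feature_score = nn.Linear(dim, dim)   # assumed form of f(.)

    def forward(self, h, adj):
        z = self.proj(h)                                      # [n, dim]
        scalar = z @ z.transpose(0, 1) / z.size(-1) ** 0.5    # scalar score per pair (i, j)
        vector = self.feature_score(z)                        # per-feature contribution of node j
        scores = scalar.unsqueeze(-1) + vector.unsqueeze(0)   # [n, n, dim]
        scores = scores.masked_fill((adj == 0).unsqueeze(-1), float("-inf"))
        weights = scores.softmax(dim=1)                       # MD-softmax over neighbours j
        return (weights * z.unsqueeze(0)).sum(dim=1)          # feature-wise weighted average

out = MultiDimAttention(300)(torch.randn(5, 300), torch.ones(5, 5))
```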
Higher-order information: message propagation in the conventional GAT is handled only over first-order neighboring nodes. An important extension is to use higher-order neighborhood information, which can help the model explore the relationships between indirectly connected nodes. Denote $R(K) = \{\mathcal{N}^{1}(x_i), \ldots, \mathcal{N}^{K}(x_i)\}$ as the neighborhood information from the 1st order to the Kth order, where $\mathcal{N}^{k}(x_i)$ denotes the kth-order neighborhood, meaning that all nodes in $\mathcal{N}^{k}(x_i)$ are reachable from $x_i$ within $k$ hops ($k \ge 1$). Analogously to $\widetilde{\mathcal{N}}(x_i)$, we can obtain $\widetilde{\mathcal{N}}^{k}(x_i) = \mathcal{N}^{k}(x_i) \cup \{x_i\}$.

The K-order GAT integrates the neighborhood information $R(K)$. At the $l$-th update step, each $x_i$ interacts with its reachable neighbors of each order and computes the attention features independently. The updated representation $h^{(l+1)}_i$ is obtained by concatenating the features from the different orders:

$$h^{(l+1)}_i = W^{(l)}_o\Big[\big\Vert_{k=1}^{K} \sum_{x_j \in \widetilde{\mathcal{N}}^{k}(x_i)} \mathrm{MD\text{-}softmax}_j\big(\hat{\mathbf{a}}^{k}_{ij}\big) \odot W^{(l)} h^{(l)}_j\Big],$$

where $\Vert$ denotes concatenation, $\hat{\mathbf{a}}^{k}_{ij}$ is the attention weight vector of the kth order, and $W^{(l)}_o$ is a learnable projection weight. More generally, for $k = \infty$ we define $\mathcal{N}^{\infty}(x_i)$ as the set of nodes reachable from $x_i$ within any number of hops; $\mathcal{N}^{\infty}(x_i)$ can easily be incorporated into the above formula.
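As an illustration, the K-order neighborhood sets can be derived from the adjacency matrix as sketched below; the patent does not prescribe this particular construction:

```python
import torch

def k_order_neighborhoods(adj, K):
    """Return boolean matrices R[0..K-1], where R[k][i, j] is True iff node j
    is reachable from node i within k+1 hops (self-loops included)."""
    n = adj.size(0)
    reach = adj.bool() | torch.eye(n, dtype=torch.bool)   # closed 1st-order neighbourhood
    orders = []
    for _ in range(K):
        orders.append(reach.clone())
        reach = reach | ((reach.float() @ adj.float()) > 0)   # extend by one more hop
    return orders

adj = torch.tensor([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
R = k_order_neighborhoods(adj, K=3)   # R[0]: 1st order, R[1]: 2nd order, R[2]: 3rd order
```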
Conventional graph attention networks can only represent graph nodes and ignore the attributes of edges. However, both the dependency tree and the AMR graph have labeled edges. To address this problem, previous GNN-based models convert the traditional graph into its equivalent Levi graph by turning edges into additional relation nodes. As shown in FIG. 4, the edges in the corresponding Levi graph have no attributes. Under this setting, edge labels are treated the same as word nodes and are modeled in the same way. This means that edge labels and word nodes share the same semantic space, which is not ideal, since nodes and edges are typically different kinds of elements.
To solve this problem, we use different parameters to represent the different types of nodes in the Levi graph. As shown in FIG. 4, for a word node $w_i$ we use $\mathcal{N}_e(w_i)$ to denote its neighboring edge (relation) nodes, and $\mathcal{N}_c(w_i)$ to denote its neighboring word nodes (ignoring the relation nodes). This definition can also be extended to integrate higher-order information: denote $R_c(K)$ as the 1st- to Kth-order neighborhood information of the word nodes and, likewise, $R_e(K)$ as the different-order information of the edge nodes. Note that the edge between two edge nodes may point in the opposite direction (as in FIG. 4), but we still consider these two nodes to be neighboring edge nodes.

A word node $w_i$ sequentially aggregates its word-node neighbors $R_c(K)$ and its edge-node neighbors $R_e(K)$, and finally updates its representation:

$$h'_{w_i} = \mathrm{GAT}_c\big(h_{w_i}, R_c(K)\big),$$
$$h^{(l+1)}_{w_i} = \mathrm{GAT}_e\big(h'_{w_i}, R_e(K)\big).$$

We will use $\mathrm{LeviGAT}(\cdot, R(K))$ to represent the graph encoding operation set out above, where $R(K) = (R_c(K), R_e(K))$ is the neighborhood information of the different node types.
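A sketch of the Levi-graph conversion follows: each labeled edge of the dependency tree (or AMR graph) becomes an additional relation node, so that word nodes and edge nodes can later be treated with separate parameters. The data layout is an illustrative assumption:

```python
def to_levi_graph(words, labeled_edges):
    """Convert labelled edges (head_idx, label, dep_idx) into a Levi graph:
    word nodes keep their indices, each labelled edge becomes a relation node
    connected to its head and its dependent."""
    nodes = list(words)                       # word nodes first
    levi_edges = []
    for head, label, dep in labeled_edges:
        rel_idx = len(nodes)
        nodes.append(label)                   # new relation (edge) node
        levi_edges.append((head, rel_idx))    # head word -> relation node
        levi_edges.append((rel_idx, dep))     # relation node -> dependent word
    return nodes, levi_edges

words = ["bikers", "race"]
edges = [(1, "nsubj", 0)]                     # race --nsubj--> bikers
nodes, levi_edges = to_levi_graph(words, edges)
# nodes == ["bikers", "race", "nsubj"], levi_edges == [(1, 2), (2, 0)]
```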
In the present method, the proposed U-BERT architecture is described in detail. First, the NLI (Natural Language Inference) task is defined in a formal manner. Given two sentences $S_a = \{x^a_1, x^a_2, \ldots, x^a_{t_a}\}$ and $S_b = \{x^b_1, x^b_2, \ldots, x^b_{t_b}\}$, the goal of our model $f(S_a, S_b)$ is to predict whether $S_a$ and $S_b$ have an entailment relationship. Here, $x^a_i$ and $x^b_j$ denote the ith and jth words of the respective sentences, and $t_a$ and $t_b$ denote the numbers of words in the sentences.
The network architecture is shown in fig. 5. As in U-Net, the left side is the contracted path and the right side is the expanded path. In the contracted path, U-BERT obtains context information from BERT and then extracts semantic features from the dependency tree and the AMR graph. From sentence context to dependency relationships and finally to abstract meaning, the information becomes more and more abstract, which allows the model to learn semantic information step by step. The configuration of the expanded path is substantially symmetrical but in reverse order: along the expanded path, U-BERT merges the semantic information from the AMR graph and the dependency tree back in, in the reverse of the feature order of the contracted path. Finally, the classifier obtains a representation based on word-level and high-level semantic features.
We integrate two types of semantic features into the pre-trained language model: the dependency tree and the AMR graph. The dependency tree reflects explicit relationships between the different parts of a sentence. As shown in fig. 6 (b), the relationship between words is represented by a directed arc from the head word to the dependent word. The dependency tree retains all the words and the order of the sentence, while AMR is more abstract. AMR is a sentence-level semantic representation formalized as a rooted directed graph, where nodes are concepts and edges are semantic relations. Concepts are extracted from the sentence, and each concept is aligned with several words.
To integrate the semantic features, we need to fuse the original BERT embeddings with the semantic structure representations. Since the original pre-trained BERT operates on a sequence of subwords, the dependency tree on words, and the AMR graph on concepts, we need to align these representations of different granularities. As shown in fig. 6, we group the subwords of each word and use attentive pooling to obtain the word-level representation in the dependency tree.
For example, assume that a word $x_i$ is composed of a series of subwords $S_i = \{s_1, s_2, \ldots, s_K\}$, and denote $\{\mathbf{e}(s_1), \ldots, \mathbf{e}(s_K)\}$ as their representations from BERT. The word-level embedding of $x_i$ is then:

$$e(x_i) = \sum_{k=1}^{K} \alpha_k \odot \mathbf{e}(s_k),$$

where the attention weight vector $\alpha_k$ is computed by multi-dimensional attention. The concept-level representation in the AMR graph is obtained in the same way.
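A sketch of the attentive pooling used for this alignment is given below; for brevity it uses a scalar attention score rather than the multi-dimensional attention of the text:

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool the BERT embeddings of a word's subwords into one word-level vector."""
    def __init__(self, dim=300):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, subword_embs):
        # subword_embs: [K, dim] embeddings of the subwords of a single word
        alpha = self.score(subword_embs).softmax(dim=0)   # attention weights over subwords
        return (alpha * subword_embs).sum(dim=0)          # word-level embedding e(x_i)

word_vec = AttentivePooling()(torch.randn(3, 300))   # a word split into 3 subwords
```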
Experiments were performed on the method. Data and preprocessing: we performed experiments on two NLI benchmark datasets, the SNLI dataset and the MNLI dataset. The evaluation index is classification accuracy. We use the CoreNLP natural language analysis toolkit and CAMR to obtain the semantic relations of sentences. CoreNLP is a set of human language analysis tools developed by Stanford University. CAMR is a transition-based tree-to-graph parser for generating the AMR graph of a sentence. More specifically, we use CoreNLP to obtain part-of-speech (POS) tags and syntactic dependencies for each sentence. After the CoreNLP pipeline, CAMR parses the dependencies into AMR graphs.
Training details: for the model parameters, the representation dimension $d$ is set to 300 and $K = 1, 2, 3$; we also merge the reachable node set $\mathcal{N}^{\infty}(x_i)$ into the neighborhood information $R(K)$. We use BertAdam as our optimizer and cosine decay as our learning rate schedule:

$$l_r = \frac{1}{2}\, l_{r0} \left(1 + \cos\!\left(\pi \frac{t}{t_{all}}\right)\right),$$

where $t$ denotes the cumulative number of training steps and $t_{all}$ denotes the total number of decay steps. For SNLI, the initial learning rate $l_{r0}$ is set to 1.4e-5; for MNLI, the initial learning rate $l_{r0}$ is set to 2e-5. The batch size is 32 and the dropout rate for all layers is set to 0.2.
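The cosine decay schedule can be written roughly as below; this is a common cosine form under the stated settings, since the exact expression in the original is an image, and the total step count in the example is an assumption:

```python
import math

def cosine_decay_lr(step, total_steps, lr0):
    """Cosine-decayed learning rate from lr0 down to 0 over total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * progress))

# Example with the SNLI setting above (initial learning rate 1.4e-5).
lrs = [cosine_decay_lr(t, total_steps=10000, lr0=1.4e-5) for t in range(0, 10001, 2500)]
```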
Our baselines are BERT-base, BERT-large and SemBERT. SemBERT is an improved language representation model that utilizes contextual semantics on a BERT backbone. Unlike the AMR graph used in our model, SemBERT uses semantic role labels (SRL) as additional semantic information for BERT. SemBERT has excellent performance on natural language reasoning and has reached state-of-the-art levels on the SNLI and MNLI datasets. Fig. 7 shows our results on the SNLI and MNLI development and test sets. All models are trained on a single dataset without ensembling or additional unlabeled data.
Compared with the BERT-base baseline, our BERT-base-based model is 0.4% better on the SNLI test set, 0.6% better on the MNLI-m test set and 0.4% better on the MNLI-mm test set. Our BERT-base-based model can even reach the performance of BERT-large on the SNLI dataset. Compared with BERT-large, our BERT-large-based model improves the result by 0.5% on the SNLI test set, 1.0% on the MNLI-m test set and 0.6% on the MNLI-mm test set. Such significant improvements indicate that merging semantic information helps the pre-trained model perform better.
We also compare with SemBERT. On the SNLI test set, our BERT-base-based model performs 0.1% better than the SemBERT-base baseline, and U-BERT based on BERT-large performs the same as SemBERT-large. On the MNLI matched dataset, our BERT-base-based model performs 0.8% better than SemBERT-base. Compared with SemBERT-large, U-BERT-large performs 0.2% and 0.1% better on the two MNLI test sets.
To examine the contribution of key elements of the model, we initiated ablation experiments on the SNLI development set. The results are illustrated in fig. 8 and 9. We focus on two parts:
(1) the effects of the dependency tree and the AMR graph;
(2) the effects of the neighborhood order K and the reachable node set $\mathcal{N}^{\infty}(x_i)$.
From the results, we find that using the dependency tree or the AMR graph independently is already better than the baseline, but U-BERT with both still achieves the best performance. On the other hand, higher-order neighborhood information is also very important for U-BERT.
In the ablation experiments, we take U-BERT with both the dependency tree and the AMR graph as the base model, and compare it against the model without the dependency tree (-DEP) and the model without the AMR graph (-AMR). As shown in FIG. 8, all three models outperform the BERT-large baseline, which indicates that both the grammar information in the dependency tree and the semantic information in the AMR graph are beneficial to the NLI task. The original model with both the dependency tree and the AMR graph performs best. The model without the AMR graph performs slightly better than the model without the dependency tree. However, when both kinds of information are used, the improvement is limited. We consider two reasons:
(1) To some extent, the information in the dependency tree and the AMR graph is homogeneous.
(2) CAMR parses the AMR graph from the dependency tree by a transition-based method, so the performance of CAMR depends on the accuracy of the provided dependency tree. Error accumulation can affect the results of AMR parsing. In this case, our model cannot extract useful information from both semantic structures. In the future, we will try other end-to-end AMR parsers.
To better understand the effectiveness of different-order neighborhood information, we performed a series of ablation tests for different orders K. FIG. 9 shows the effect of different-order neighborhood information on the SNLI set. We test neighborhood information of orders K = 0, 1, 2, 3, where K = 0 means that a node in the graph can only interact with itself. Furthermore, we also discuss the impact of the reachable node set $\mathcal{N}^{\infty}$. The results show that when K > 0, all models perform better than the baseline; in contrast, when K = 0 the performance is not improved. K = 2 and K = 3 obtain the same score, indicating that the information in the 2nd-order neighborhood is already sufficient for the AMR graph.
The second part of FIG. 9 shows that the infinite-order neighborhood information $\mathcal{N}^{\infty}$ improves the performance of all models with K > 0, which means that the global information provided by the infinite-order neighborhood is useful only for models that also exploit local information.
Matching models based on deep learning have made tremendous progress in natural language reasoning with the release of large-scale annotated data such as SNLI and MNLI. There are two main frameworks. The first framework is based on the Siamese architecture, where the two sentences of a pair are encoded into high-level representations by two symmetric networks, respectively. The second framework applies explicit interaction between the sentence pair during encoding. Under this framework, models are better able to match sentences at multiple levels of granularity, so they generally perform better than the former. The present method is based on the second framework.
Pre-trained language models, such as GPT, BERT and XLNet, have shown powerful capabilities on NLP, achieving state-of-the-art results on several natural language understanding (NLU) benchmarks such as GLUE. Typically, a pre-trained language model is used as the encoder part of a downstream model, or is fine-tuned for a specific NLP task (e.g., NLI). In the present method, U-BERT uses a BERT-base or BERT-large backbone and incorporates the structured semantic information of the dependency tree and the AMR graph.
Language knowledge plays an important role in natural language processing. Recently, there has been a trend to combine linguistic knowledge with pre-trained models. In the model of the method, two structured semantic representations, a dependency tree and an AMR graph, are applied to combine with the output representation of BERT.
The present method proposes a novel BERT-based network, U-BERT, which incorporates structured semantic information from dependency trees and AMR graphs. Experiments show that our model greatly improves the performance of BERT on NLI tasks. This work demonstrates the effectiveness of dependency trees and AMR graphs in natural language processing.
Fig. 10 is a schematic structural diagram of a BERT model optimization system according to an embodiment of the present invention, where the system may execute the BERT model optimization method according to any of the foregoing embodiments and be configured in a terminal.
The optimization system of the BERT model provided in this embodiment includes: a context embedding program module 11, a semantic feature extraction program module 12, an auxiliary program module 13 and a prediction program module 14.
Wherein the context embedding program module 11 is configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split; the semantic feature extraction program module 12 is configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair; the auxiliary program module 13 is configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features; and the prediction program module 14 is configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
Further, the system further comprises:
a grammar information extraction program module, configured to establish dependency trees of the sentence pair through a natural language analysis tool and extract grammar information of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determine a third context embedding with the semantic features and the grammar information;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the optimization method of the BERT model in any method embodiment;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the method of optimizing the BERT model in any of the method embodiments described above.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the optimization method of the BERT model of any embodiment of the invention.
The client of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second are used solely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of optimization of a BERT model, comprising:
determining, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
determining a semantic representation graph of the sentence pair through a semantic representation language parser, and extracting semantic features of the sentence pair;
determining the semantic features of the sentence pair as auxiliary information for the first context embedding, and determining a second context embedding with the semantic features;
and predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
2. The method of claim 1, wherein the method further comprises:
establishing dependency trees of the sentence pair through a natural language analysis tool, and extracting grammar information of the sentence pair;
determining the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determining a third context embedding with the semantic features and the grammar information;
and predicting the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
3. The method of claim 2, wherein the BERT model comprises: a contracted path and an expanded path;
wherein the extracting of the semantic features of the sentence pair and the extracting of the grammar information of the sentence pair are performed in the contracted path;
and the third context embedding with the semantic features and the grammar information is determined in the expanded path.
4. The method of claim 2, wherein the extracting of the grammar information of the sentence pair comprises:
performing bidirectional embedding updates on the word nodes and edge nodes in the dependency tree, and determining the grammar information based on the dependency relationships between the word nodes and the edge nodes;
and the determining of the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding comprises:
expanding the semantic representation onto the dependency tree through a pooling layer, combining the semantic features with the grammar information, and determining them as auxiliary information for the first context embedding.
5. The method of claim 1, wherein the predicting the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features comprises:
processing the second context embedding through an attention mechanism to generate sentence-level representations of the sentence pair;
and performing language reasoning on the sentence-level representations of the sentence pair based on a relation classifier to predict the entailment relationship of the two sentences in the sentence pair.
6. The method of claim 2, wherein the semantic representation language comprises: AMR abstract meaning representation;
the natural language analysis tool includes at least: coreNLP.
7. An optimization system of a BERT model, comprising:
a context embedding program module, configured to determine, through the BERT model, a first context embedding for each subword in the subword sequences into which the sentences to be inferred are split;
a semantic feature extraction program module, configured to determine a semantic representation graph of the sentence pair through a semantic representation language parser and extract semantic features of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair as auxiliary information for the first context embedding and determine a second context embedding with the semantic features;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the second context embedding with the semantic features.
8. The system of claim 7, wherein the system further comprises:
a grammar information extraction program module, configured to establish dependency trees of the sentence pair through a natural language analysis tool and extract grammar information of the sentence pair;
an auxiliary program module, configured to determine the semantic features of the sentence pair and the grammar information as auxiliary information for the first context embedding, and determine a third context embedding with the semantic features and the grammar information;
and a prediction program module, configured to predict the entailment relationship of the two sentences in the sentence pair based on the third context embedding with the semantic features and the grammar information.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-6.
CN202010895250.1A 2020-08-31 2020-08-31 BERT model optimization method and system Active CN111950298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010895250.1A CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010895250.1A CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Publications (2)

Publication Number Publication Date
CN111950298A CN111950298A (en) 2020-11-17
CN111950298B true CN111950298B (en) 2023-06-23

Family

ID=73368181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010895250.1A Active CN111950298B (en) 2020-08-31 2020-08-31 BERT model optimization method and system

Country Status (1)

Country Link
CN (1) CN111950298B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395393B (en) * 2020-11-27 2022-09-30 华东师范大学 Remote supervision relation extraction method based on multitask and multiple examples
CN114821257B (en) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516253A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Chinese spoken language semantic understanding method and system
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111488734A (en) * 2020-04-14 2020-08-04 西安交通大学 Emotional feature representation learning system and method based on global interaction and syntactic dependency

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9818401B2 (en) * 2013-05-30 2017-11-14 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
WO2018203147A2 (en) * 2017-04-23 2018-11-08 Voicebox Technologies Corporation Multi-lingual semantic parser based on transferred learning
CN112771564A (en) * 2018-07-18 2021-05-07 邓白氏公司 Artificial intelligence engine that generates semantic directions for web sites to map identities for automated entity seeking

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516253A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Chinese spoken language semantic understanding method and system
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111488734A (en) * 2020-04-14 2020-08-04 西安交通大学 Emotional feature representation learning system and method based on global interaction and syntactic dependency

Also Published As

Publication number Publication date
CN111950298A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN111914067B (en) Chinese text matching method and system
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111950298B (en) BERT model optimization method and system
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN111737974A (en) Semantic abstract representation method and device for statement
CN110678882A (en) Selecting answer spans from electronic documents using machine learning
CN110084323A (en) End-to-end semanteme resolution system and training method
CN114510946B (en) Deep neural network-based Chinese named entity recognition method and system
CN111723207B (en) Intention identification method and system
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114328814A (en) Text abstract model training method and device, electronic equipment and storage medium
CN114579605B (en) Table question-answer data processing method, electronic equipment and computer storage medium
CN113177123B (en) Optimization method and system for text-to-SQL model
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN117591543B (en) SQL sentence generation method and device for Chinese natural language
CN116227484B (en) Model training method, apparatus, device, storage medium and computer program product
CN113378543B (en) Data analysis method, method for training data analysis model and electronic equipment
Li et al. STCP: An Efficient Model Combining Subject Triples and Constituency Parsing for Recognizing Textual Entailment
CN116682419A (en) Training method of multi-domain multi-intention spoken language semantic understanding model
Bashir et al. Efficient Deep Learning based Code Retrieval using Unified Graph Structure and Semantic Graph Matching Encoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant
GR01 Patent grant