CN114580385A - Text semantic similarity calculation method combined with grammar

Text semantic similarity calculation method combined with grammar

Info

Publication number
CN114580385A
CN114580385A (application CN202210252170.3A)
Authority
CN
China
Prior art keywords
sentences
grammar
semantic
feature vectors
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210252170.3A
Other languages
Chinese (zh)
Inventor
Long Jun
Xiang Yiping
Liu Lei
Li Haoran
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210252170.3A
Publication of CN114580385A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text semantic similarity calculation method combined with grammar, which comprises: acquiring two sentences whose semantic similarity is to be calculated; extracting semantic feature vectors of the two sentences through a deep semantic interaction model; constructing a syntactic dependency tree for each sentence and performing structured embedding to obtain the syntax-tree feature vector of each sentence; concatenating each sentence's semantic feature vector with its syntax-tree feature vector to obtain the final semantic feature vector of each sentence; and calculating the semantic similarity of the two sentences based on their final semantic feature vectors. The sentence features extracted by the scheme of the invention incorporate the grammatical information in the sentences, so the extracted features are more comprehensive and deeper and the contextual relations within the sentences are fully considered; the calculated similarity is more accurate, while accuracy and calculation efficiency remain balanced.

Description

Text semantic similarity calculation method combined with grammar
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text semantic similarity calculation method combined with grammar.
Background
Research on similarity calculation at the semantic level enables a computer to understand sentences semantically, and semantic understanding plays a very important role in many research areas. In information retrieval, semantic similarity calculation can find the search result that best matches a query; in community question answering, similar questions can be grouped through semantic similarity calculation, so that the answers to a given question are more concentrated; for translation software, semantic similarity can serve as an evaluation index between an original sentence and its translation. Therefore, semantic similarity calculation has important research significance and value in many fields.
Semantic similarity calculation is a difficult problem in the field of natural language processing and one of the most widely used techniques in text processing. Because of polysemous words, near-synonyms, complex grammatical structures, and similar phenomena in language, the same meaning can be expressed in very many sentence forms. Current semantic similarity calculation methods for sentences generally fall into two categories: semantic representation and semantic interaction. In the semantic representation approach, a semantic vector is computed separately for each sentence of a pair and the similarity is calculated from the two vectors; the semantic interaction approach models the two sentences jointly and computes a similarity score from the interaction features between them during modeling. Each approach has its strengths and weaknesses: the semantic representation approach generally has low computational complexity and high efficiency but relatively low accuracy, whereas the semantic interaction approach generally has higher complexity and lower efficiency but relatively higher accuracy.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a text semantic similarity calculation method combined with grammar, so as to solve the problem that existing semantic similarity calculation methods struggle to balance accuracy and calculation efficiency.
In order to achieve the above object, the present invention adopts the following technical solutions.
A text semantic similarity calculation method combined with grammar includes:
acquiring two sentences whose semantic similarity is to be calculated;
extracting semantic feature vectors of the two sentences through a deep semantic interaction model;
constructing a syntactic dependency tree for each of the two sentences and performing structured embedding to obtain the syntax-tree feature vector of each sentence;
concatenating each sentence's semantic feature vector with its syntax-tree feature vector to obtain the final semantic feature vector of each sentence;
and calculating the semantic similarity of the two sentences based on the final semantic feature vectors of the two sentences.
Further, the deep semantic interaction model is a BERT derivative model trained with whole-word masking.
Further, the process of constructing the syntactic dependency tree of the two sentences includes:
performing syntactic analysis on each of the two sentences;
and using the dependency-tree structure defined by Stanford Dependencies to derive the syntactic dependency tree of each sentence from the parsing result.
Further, the syntactic dependency trees of the two sentences are structurally embedded to obtain the syntax-tree feature vectors of the two sentences, and the process includes:
for the syntactic dependency tree of each sentence, defining the syntax sequence C_p of a dependency-tree node p as all of its child nodes, ordered by the original order of the words in the sentence;
defining a maximum sequence length l;
inputting all elements of the syntax sequence C_p into a word embedding model to obtain their word embeddings; if the number of elements in C_p is less than l, padding the word embeddings with zero matrices up to the maximum sequence length; if the number of elements in C_p exceeds l, truncating the excess elements, keeping only the first l, and denoting the newly obtained syntax sequence Ĉ_p;
computing Ĉ_p for every word in each of the two sentences to obtain the syntax sequences of the two sentences, Ĉ_A = {Ĉ_a1, Ĉ_a2, …, Ĉ_an} and Ĉ_B = {Ĉ_b1, Ĉ_b2, …, Ĉ_bm};
and inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network to obtain the syntax-tree embedding of each word in each sentence, then concatenating these to obtain the syntax-tree feature vector of each sentence.
Further, inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network to obtain the syntax-tree embedding of each word in each sentence, then concatenating these to obtain the syntax-tree feature vector of each sentence, specifically includes:
inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network whose output at time t is
h_t = w_f · h_t^f + w_b · h_t^b + b_t
where h_t^f denotes the forward output of the bidirectional LSTM neural network at time t, h_t^b denotes the backward output at time t, w_f and w_b denote the hidden-layer weights of the forward LSTM and the backward LSTM, and b_t denotes a bias;
for a word p, taking the final-layer states of its syntax sequence in the forward and backward networks of the bidirectional LSTM neural network and constructing the syntax-tree embedding of p as V = [E_w, f_m, b_n], where E_w denotes the word embedding of the word p, f_m denotes the final-layer computation result of the forward network in the bidirectional LSTM network, and b_n denotes the final-layer computation result of the backward network;
and obtaining the syntax-tree embedding of every word in each sentence and concatenating them to obtain the syntax-tree feature vector of each sentence.
Further, calculating the semantic similarity of the two sentences based on their final semantic feature vectors specifically includes:
inputting the final semantic feature vectors of the two sentences into a final prediction layer to calculate the final semantic similarity;
the calculation of the prediction layer includes: fusing the final semantic feature vectors of the two sentences and inputting them into a multilayer perceptron in which every hidden layer uses the hyperbolic tangent function tanh as activation function; the multilayer perceptron is computed as
s = W_2 · σ(W_1 · [M_A; M_B] + b_1) + b_2
where s denotes the output of the multilayer perceptron, W_1 and W_2 are parameters, b_1 and b_2 are biases, M_A and M_B denote the final semantic feature vectors of the two sentences, and σ denotes the tanh activation function;
and feeding the output of the multilayer perceptron into a fully connected layer with a sigmoid activation function to obtain the final similarity in the range [0, 1], i.e., the semantic similarity of the two sentences.
Advantageous effects
The invention provides a text semantic similarity calculation method combined with grammar: features of the sentence pair to be compared are first extracted by a deep semantic interaction model to obtain their semantic feature vectors; the grammatical structure of each sentence is analyzed to obtain its syntactic dependency tree; the dependency tree is processed by a neural network and converted into a syntax-tree feature vector; each sentence's semantic feature vector is concatenated with its syntax-tree feature vector to obtain a final semantic feature vector that incorporates grammar; and the vector distance between the sentences' final semantic feature vectors yields the semantic similarity of the pair. The sentence features extracted by this scheme comprise not only lexical and sentence encodings but also the grammatical features of the sentences; the extracted features are more comprehensive and deeper, the computed similarity is more accurate, the calculation remains efficient, and accuracy and calculation efficiency are balanced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a text semantic similarity calculation method in combination with grammar according to an embodiment of the present invention;
FIG. 2 is a diagram of a semantic feature vector extraction architecture provided by an embodiment of the present invention;
FIGS. 3(a) and 3(b) are two representations of an example syntactic dependency tree provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating a structure of extracting feature vectors of a syntax tree according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a prediction layer structure according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the technical solutions of the invention are described in detail below. It is to be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a text semantic similarity calculation method in combination with a grammar, including:
s1: and acquiring two sentences of semantic similarity to be calculated.
S2: and extracting semantic feature vectors of the two sentences through a deep semantic interaction model.
Specifically, because a node of the syntactic dependency tree is a word rather than the character granularity used by the BERT_base model, the deep semantic interaction model should be a BERT derivative model trained with whole-word masking, such as Chinese-BERT-wwm.
As shown in FIG. 2, for a sentence pair A = {a1, a2, …, an} and B = {b1, b2, …, bm} input into the BERT derivative model, the structured embedded representation is obtained by first concatenating the pair into a single sequence x = ([CLS], A, [SEP], B, [SEP]) and feeding it to the embedding layer:
H^(0) = Embedding(x)
H^(i) = Transformer(H^(i−1)), i = 1, …, L
where L denotes the number of layers of the BERT derivative model, N the maximum sequence length, and d the hidden-layer dimension, so that each hidden-layer result H^(i) ∈ R^(N×d). H^(0) is the layer-0 result, i.e., the initialized matrix vector; every subsequent layer takes the output of the previous layer as input, so the first layer is initialized with the sentence pair and each later layer consumes the previous layer's output. The algorithm takes the output of the last layer as the final semantic feature representation. The Embedding computation is the same as in standard word embedding models, and the Transformer computation follows the Transformer framework proposed by Google; both are prior art and are not described here again.
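As a concrete illustration of step S2, the following minimal sketch extracts the last-layer hidden states for a sentence pair. It assumes the HuggingFace transformers library and the publicly released hfl/chinese-bert-wwm checkpoint as the whole-word-mask BERT derivative; neither the library nor the checkpoint name is specified by the patent.

import torch
from transformers import BertModel, BertTokenizer

# Whole-word-mask Chinese BERT derivative (assumed checkpoint name).
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm")

def semantic_features(sentence_a: str, sentence_b: str) -> torch.Tensor:
    # Concatenate the pair into one sequence x = [CLS] A [SEP] B [SEP].
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the last layer H^(L) as the final semantic feature representation.
    return outputs.last_hidden_state   # shape (1, N, d)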
S3: and (3) respectively constructing grammar dependency trees of the two sentences, and carrying out structured embedding to respectively obtain grammar tree feature vectors of the two sentences.
Specifically, the process of constructing the syntactic dependency trees of the two sentences includes:
performing syntactic analysis on each of the two sentences;
and using the dependency-tree structure defined by Stanford Dependencies to derive the syntactic dependency tree of each sentence from the parsing result.
For each sentence, the syntactic dependency tree defines directed links between the words of the sentence that represent their dependencies: the connections of the tree encode grammatical relations, and the value of each tree node is a word of the sentence. In this way the grammatical structure of a sentence can be converted into a tree representation, and the different constituents of the sentence and the grammatical relations between them can be determined from the connections between nodes. FIG. 3 shows an example for the sentence "The red car quickly turned direction at the corner" in two display modes; it can be seen that the relation between the root node "turned" and its child node "direction" is dobj, which denotes a direct object.
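A parser sketch for this step is shown below; it assumes the Stanza toolkit (whose Universal Dependencies labels descend from Stanford Dependencies) as the syntactic analyzer, which is one possible choice rather than the one mandated by the patent.

import stanza

# Requires a one-time stanza.download("zh") to fetch the Chinese models.
nlp = stanza.Pipeline("zh", processors="tokenize,pos,lemma,depparse")

def dependency_edges(sentence: str):
    # Return (head, relation, dependent) triples; "ROOT" marks the tree root.
    doc = nlp(sentence)
    edges = []
    for sent in doc.sentences:
        for word in sent.words:
            head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            edges.append((head, word.deprel, word.text))
    return edges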
As shown in FIG. 4, the syntactic dependency trees of the two sentences are structurally embedded to obtain the syntax-tree feature vectors of the two sentences, and the process includes:
for the syntactic dependency tree of each sentence, defining the syntax sequence C_p of a dependency-tree node p as all of its child nodes, ordered by the original order of the words in the sentence;
to prevent the final dimensionality from exploding, defining a maximum sequence length l;
inputting all elements of the syntax sequence C_p into a word embedding model to obtain their word embeddings; to keep dimensions aligned, if the number of elements in C_p is less than l, the word embeddings are padded with zero matrices up to the maximum sequence length, and if the number of elements in C_p exceeds l, the excess elements are truncated, keeping only the first l, and the newly obtained syntax sequence is denoted Ĉ_p.
The syntax sequence is still the word sequence of all children of a single word. Taking the syntactic dependency tree in FIG. 3 as an example, for the word "turned" the word sequence C_p is [car, corner, quickly, of, direction]; if l is set to 3, the last two words are discarded, only the first 3 are kept, and the sequence is updated to [car, corner, quickly]. Similarly, if the sequence length is less than 3, zero-matrix padding is used in the calculation.
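The padding-and-truncation rule above can be sketched as follows; embed() is a hypothetical word-embedding lookup, and the dimensions are illustrative, not values fixed by the patent.

import numpy as np

def syntax_sequence_matrix(children, l, d, embed):
    # children: the child words of node p, in their original sentence order.
    kept = children[:l]            # truncate: keep only the first l words
    mat = np.zeros((l, d))         # the zero matrix doubles as padding
    for i, word in enumerate(kept):
        mat[i] = embed(word)
    return mat

# Usage with a stand-in random embedding lookup:
rng = np.random.default_rng(0)
embed = lambda w: rng.standard_normal(300)
m = syntax_sequence_matrix(["car", "corner", "quickly", "of", "direction"], 3, 300, embed)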
Computing Ĉ_p for every word in each of the two sentences yields the syntax sequences of the two sentences, Ĉ_A = {Ĉ_a1, Ĉ_a2, …, Ĉ_an} and Ĉ_B = {Ĉ_b1, Ĉ_b2, …, Ĉ_bm}.
The syntax sequences of the two sentences are respectively input into a bidirectional LSTM neural network, whose output at time t is
h_t = w_f · h_t^f + w_b · h_t^b + b_t
where h_t^f denotes the forward output of the bidirectional LSTM neural network at time t, h_t^b denotes the backward output at time t, w_f and w_b denote the hidden-layer weights of the forward LSTM and the backward LSTM, and b_t denotes a bias. The hidden state of the neural network is updated according to
h_i^p = LSTM(h_(i−1)^p, x_i^p)
where i indexes the i-th node in the syntax sequence, h_i^p is the hidden-layer state at step i for the word sequence of word p, and x_i^p is the state of the i-th word of the sequence, i.e., the input at that time. The hidden states are all computed in the same way; only the computation order differs, namely forward versus backward.
For a word p, the final-layer states of its syntax sequence in the forward and backward networks of the bidirectional LSTM neural network are taken, and the syntax-tree embedding of p is constructed as V = [E_w, f_m, b_n], where E_w denotes the word embedding of the word p, f_m denotes the final-layer computation result of the forward network in the bidirectional LSTM network, and b_n denotes the final-layer computation result of the backward network.
The syntax-tree embedding of every word in each sentence is obtained, and these are concatenated to give the syntax-tree feature vector of each sentence.
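A minimal sketch of this construction, assuming PyTorch and illustrative dimensions d (embedding size) and h (LSTM hidden size):

import torch
import torch.nn as nn

d, h, l = 300, 128, 3
bilstm = nn.LSTM(input_size=d, hidden_size=h, bidirectional=True, batch_first=True)

def syntax_tree_embedding(E_w: torch.Tensor, C_p: torch.Tensor) -> torch.Tensor:
    # C_p: (1, l, d) padded/truncated syntax sequence of the word p.
    _, (h_n, _) = bilstm(C_p)
    f_m = h_n[0, 0]   # final hidden state of the forward direction
    b_n = h_n[1, 0]   # final hidden state of the backward direction
    return torch.cat([E_w, f_m, b_n])   # V = [E_w, f_m, b_n], dimension d + 2h

V = syntax_tree_embedding(torch.zeros(d), torch.zeros(1, l, d))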
S4: and splicing the respective semantic feature vectors of the two sentences with the feature vector of the syntax tree to respectively obtain the final semantic feature vectors of the two sentences.
S5: and calculating the semantic similarity of the two sentences based on the final semantic feature vectors of the two sentences. The method specifically comprises the following steps:
inputting the final semantic feature vectors of the two sentences into a final prediction layer to calculate the final semantic similarity;
as shown in fig. 5, the calculation process of the prediction layer includes: fusing the final semantic feature vectors of the two sentences, inputting the final semantic feature vectors into a multilayer perceptron with three hidden layers, wherein the hidden layers are respectively set as 256, 64 and 16 hidden layer units (the number of the hidden layer units can be determined as different values according to vectors after front calculation), each hidden layer of the multilayer perceptron uses a hyperbolic tangent function tanh as an activation function, and the calculation formula of the multilayer perceptron is as follows:
Figure BDA0003547138440000063
where s denotes the output of the multilayer perceptron, W1And W2Is a parameter, b1And b2As an offset, MAAnd MBRespectively representing final semantic feature vectors of two sentences, wherein sigma represents a tanh activation function;
and continuously sending the output of the multilayer perceptron to a full-connection layer, and using a sigmoid function as an activation function to obtain a final similarity in a range of [0,1], namely the semantic similarity of two sentences.
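A sketch of this prediction layer, assuming PyTorch and following the 256/64/16 hidden-unit embodiment; the input dimension in_dim is an assumption that depends on the preceding feature extraction:

import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        # Three tanh hidden layers (256, 64, 16 units), then a sigmoid output.
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_dim, 256), nn.Tanh(),
            nn.Linear(256, 64), nn.Tanh(),
            nn.Linear(64, 16), nn.Tanh(),
        )
        self.out = nn.Linear(16, 1)

    def forward(self, M_A: torch.Tensor, M_B: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([M_A, M_B], dim=-1)    # fuse the two sentence vectors
        s = self.mlp(fused)                      # multilayer perceptron output s
        return torch.sigmoid(self.out(s))        # similarity in [0, 1]

layer = PredictionLayer(in_dim=556)
sim = layer(torch.zeros(556), torch.zeros(556))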
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
Compared with traditional literal similarity detection, the method can mine the deep semantic information contained in sentence text through a deep neural network, and can produce more accurate results than literal methods when sentences are literally very similar but semantically different. By introducing the syntactic dependency tree, the algorithm treats the grammatical structure of a sentence as semantic information during calculation and uses grammatical information as features, thereby enriching the semantic information of the sentence. The invention can serve as an important auxiliary function of a plagiarism detection system, and also as the main calculation mode of some classification and question-answering tasks. By combining a semantic calculation model with grammar, the algorithm can extract the semantic features of text from multiple angles and at multiple levels; it addresses the insufficient Chinese support of the original BERT model by using a derivative model trained with whole-word masking, such as Chinese-BERT-wwm, and extracts the interaction information between texts with a multi-head attention mechanism. At the same time, a mature syntactic dependency tree representation is introduced to analyze the grammar of each sentence in depth and extract features from the grammatical angle. The extracted syntax tree, after word embedding and similarity-matrix calculation, is input into a bidirectional LSTM network, which captures the information in the syntax tree and integrates it into the final semantic features. These measures analyze sentence semantics more effectively and help a computer better understand sentences that are similar under grammatical transformations. Finally, the semantic features are input into a prediction layer consisting of a multilayer perceptron and a fully connected layer, which reduces their dimensionality and lets them interact; a sigmoid activation function then converts the similarity into a concrete value in the range [0, 1], which measures the semantic-level similarity between any two sentences.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the invention; those of ordinary skill in the art may make variations, modifications, substitutions, and alterations to the above embodiments within the scope of the present invention.

Claims (6)

1. A text semantic similarity calculation method combined with grammar is characterized by comprising the following steps:
acquiring two sentences whose semantic similarity is to be calculated;
extracting semantic feature vectors of the two sentences through a deep semantic interaction model;
constructing a syntactic dependency tree for each of the two sentences and performing structured embedding to obtain the syntax-tree feature vector of each sentence;
concatenating each sentence's semantic feature vector with its syntax-tree feature vector to obtain the final semantic feature vector of each sentence;
and calculating the semantic similarity of the two sentences based on the final semantic feature vectors of the two sentences.
2. The text semantic similarity calculation method combined with grammar according to claim 1, wherein the deep semantic interaction model is a BERT derivative model trained with whole-word masking.
3. The text semantic similarity calculation method combined with grammar according to claim 1, wherein the process of constructing the syntactic dependency trees of the two sentences includes:
performing syntactic analysis on each of the two sentences;
and using the dependency-tree structure defined by Stanford Dependencies to derive the syntactic dependency tree of each sentence from the parsing result.
4. The text semantic similarity calculation method combined with grammar according to any one of claims 1 to 3, wherein the syntactic dependency trees of the two sentences are structurally embedded to obtain the syntax-tree feature vectors of the two sentences, and the process includes:
for the syntactic dependency tree of each sentence, defining the syntax sequence C_p of a dependency-tree node p as all of its child nodes, ordered by the original order of the words in the sentence;
defining a maximum sequence length l;
inputting all elements of the syntax sequence C_p into a word embedding model to obtain their word embeddings; if the number of elements in C_p is less than l, padding the word embeddings with zero matrices up to the maximum sequence length; if the number of elements in C_p exceeds l, truncating the excess elements, keeping only the first l, and denoting the newly obtained syntax sequence Ĉ_p;
computing Ĉ_p for every word in each of the two sentences to obtain the syntax sequences of the two sentences, Ĉ_A = {Ĉ_a1, Ĉ_a2, …, Ĉ_an} and Ĉ_B = {Ĉ_b1, Ĉ_b2, …, Ĉ_bm};
and inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network to obtain the syntax-tree embedding of each word in each sentence, then concatenating these to obtain the syntax-tree feature vector of each sentence.
5. The text semantic similarity calculation method combined with grammar according to claim 4, wherein inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network to obtain the syntax-tree embedding of each word in each sentence, then concatenating these to obtain the syntax-tree feature vector of each sentence, specifically includes:
inputting the syntax sequences of the two sentences into a bidirectional LSTM neural network whose output at time t is
h_t = w_f · h_t^f + w_b · h_t^b + b_t
where h_t^f denotes the forward output of the bidirectional LSTM neural network at time t, h_t^b denotes the backward output at time t, w_f and w_b denote the hidden-layer weights of the forward LSTM and the backward LSTM, and b_t denotes a bias;
for a word p, taking the final-layer states of its syntax sequence in the forward and backward networks of the bidirectional LSTM neural network and constructing the syntax-tree embedding of p as V = [E_w, f_m, b_n], where E_w denotes the word embedding of the word p, f_m denotes the final-layer computation result of the forward network in the bidirectional LSTM network, and b_n denotes the final-layer computation result of the backward network;
and obtaining the syntax-tree embedding of every word in each sentence and concatenating them to obtain the syntax-tree feature vector of each sentence.
6. The text semantic similarity calculation method combined with grammar according to claim 1, wherein calculating the semantic similarity of the two sentences based on their final semantic feature vectors includes:
inputting the final semantic feature vectors of the two sentences into a final prediction layer to calculate the final semantic similarity;
the calculation of the prediction layer includes: fusing the final semantic feature vectors of the two sentences and inputting them into a multilayer perceptron in which every hidden layer uses the hyperbolic tangent function tanh as activation function, the multilayer perceptron being computed as
s = W_2 · σ(W_1 · [M_A; M_B] + b_1) + b_2
where s denotes the output of the multilayer perceptron, W_1 and W_2 are parameters, b_1 and b_2 are biases, M_A and M_B denote the final semantic feature vectors of the two sentences, and σ denotes the tanh activation function;
and feeding the output of the multilayer perceptron into a fully connected layer with a sigmoid activation function to obtain the final similarity in the range [0, 1], i.e., the semantic similarity of the two sentences.
CN202210252170.3A 2022-03-15 2022-03-15 Text semantic similarity calculation method combined with grammar Pending CN114580385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210252170.3A CN114580385A (en) 2022-03-15 2022-03-15 Text semantic similarity calculation method combined with grammar


Publications (1)

Publication Number Publication Date
CN114580385A true CN114580385A (en) 2022-06-03

Family

ID=81780657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210252170.3A Pending CN114580385A (en) 2022-03-15 2022-03-15 Text semantic similarity calculation method combined with grammar

Country Status (1)

Country Link
CN (1) CN114580385A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194614A (en) * 2023-11-02 2023-12-08 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium
CN117194614B (en) * 2023-11-02 2024-01-30 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium

Similar Documents

Publication Publication Date Title
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN112734881B (en) Text synthesized image method and system based on saliency scene graph analysis
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN110321563A (en) Text emotion analysis method based on mixing monitor model
CN113378547B (en) GCN-based Chinese complex sentence implicit relation analysis method and device
CN109992775A (en) A kind of text snippet generation method based on high-level semantics
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
Wu et al. Community answer generation based on knowledge graph
CN114547298A (en) Biomedical relation extraction method, device and medium based on combination of multi-head attention and graph convolution network and R-Drop mechanism
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112733547A (en) Chinese question semantic understanding method by utilizing semantic dependency analysis
CN111144410A (en) Cross-modal image semantic extraction method, system, device and medium
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN114742069A (en) Code similarity detection method and device
CN114580385A (en) Text semantic similarity calculation method combined with grammar
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
Fan et al. Knowledge base question answering via path matching
CN114091464A (en) High-universality many-to-many relation triple extraction method fusing five-dimensional features
CN113468875A (en) MNet method for semantic analysis of natural language interaction interface of SCADA system
CN112966502A (en) Electric power patent text entity relation extraction method based on long difficult sentence simplification

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination