CN115828931A - Chinese and English semantic similarity calculation method for paragraph-level text - Google Patents

Info

Publication number: CN115828931A
Application number: CN202310085688.7A
Authority: CN (China)
Prior art keywords: sentence, english, chinese, paragraph, node
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN115828931B
Inventors: 龙军, 唐自强, 杨柳, 黄金彩
Assignee (original and current): Central South University
Application filed by Central South University; priority to CN202310085688.7A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese and English semantic similarity calculation method for paragraph-level texts. Paragraph characterization vectors are extracted for the Chinese paragraph and the English paragraph respectively: the paragraph text is modeled at the three levels of subject words, sentences and paragraphs; information interaction is performed within and between these levels with a graph attention network; and the information of the three levels is then fused to obtain the paragraph characterization vector. The distance between the paragraph characterization vectors gives the semantic similarity of the Chinese paragraph and the English paragraph. By fusing information from the subject word, sentence and paragraph levels, the method obtains high-quality paragraph characterization vectors and achieves high-precision calculation of Chinese-English cross-language paragraph semantic similarity.

Description

Chinese and English semantic similarity calculation method for paragraph level text
Technical Field
The invention relates to the technical field of information, in particular to a Chinese and English semantic similarity calculation method for paragraph level texts.
Background
With growing cultural exchange across the world and resource sharing among the languages of different countries, cross-language scenarios are becoming more and more common, and the demand for cross-language applications is increasingly urgent. To overcome the resulting technical barriers and realize cross-language resource sharing, academia and industry have been actively exploring cross-language natural language processing technologies. Cross-language paragraph semantic similarity calculation, which studies how to calculate the semantic similarity of paragraph-level texts in different languages, is an important research topic and plays an important role in many cross-language applications and related fields. For example, the semantic similarity of application document abstracts in different languages can be calculated to detect and select similar applications; the semantic similarity of paper abstracts in different languages can be calculated to detect similar papers, enabling cross-language paper recommendation and cross-language plagiarism detection.
The precision of current calculation methods for Chinese-English cross-language paragraph semantic similarity needs improvement, mainly because of the following three defects:
(1) Cross-language paragraph semantic similarity calculation needs to overcome the barriers between languages. However, most current cross-language similarity detection relies on translation techniques, including dictionary-based, parallel-corpus-based and machine-translation-based methods. These approaches convert the cross-language problem into a monolingual (or pivot-language) similarity measurement problem. The conversion process is lossy, however: dictionaries, parallel corpora and machine translation are all imperfect and are likely to lose part of the deep information of the text.
(2) The negative samples in Chinese-English sentence representation alignment should not be semantically similar to the anchor sample. Chinese-English sentence representation alignment is trained on a Chinese-English parallel sentence pair data set; when the anchor sample is a Chinese sentence, its positive sample is the English sentence parallel to it, and negative samples must be constructed for the anchor. If the characterization distance between the anchor sample and a false negative sample is pushed apart, the alignment of sentence characterizations is disturbed and the performance of the Chinese-English sentence characterization model is harmed.
(3) The characterization of paragraph text should consider information at the word, sentence and paragraph levels simultaneously. Existing paragraph characterization methods, however, emphasize either sentence-level or word-level information, lack information interaction and fusion across the different levels of the paragraph, and cannot effectively characterize the paragraph text from all three levels of words, sentences and paragraphs.
Disclosure of Invention
The invention provides a Chinese and English semantic similarity calculation method for paragraph-level texts, aiming to solve the low precision of existing Chinese-English cross-language paragraph semantic similarity calculation methods.
In order to achieve the purpose, the invention adopts the following technical scheme.
A Chinese and English semantic similarity calculation method for paragraph-level text comprises the following steps:
the method for extracting paragraph characterization vectors of Chinese paragraphs and English paragraphs respectively comprises the following processes:
constructing a subject word node for each subject word in the Chinese paragraph/English paragraph, constructing a sentence node for each sentence in the Chinese paragraph/English paragraph, and constructing a global paragraph node for the Chinese paragraph/English paragraph;
extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes;
establishing an edge relation between sentence nodes and subject word nodes according to whether the sentence contains the subject word, establishing an edge relation between sentence nodes according to the context relation of the sentence in the corresponding paragraph, establishing an edge relation between subject word nodes according to the co-occurrence relation of the subject word in the sentence, and connecting the global paragraph nodes with each subject word node and the sentence nodes by one edge; each node establishes an edge connected with the node; modeling an edge relation between nodes;
inputting the initialized feature vectors of all nodes and the edge relation among the nodes into a graph attention network, and outputting the feature vectors of all nodes after information interaction;
respectively average-pooling the feature vectors of the subject word nodes and of the sentence nodes, then concatenating the pooled subject word feature vector, the pooled sentence feature vector and the feature vector of the global paragraph node and reducing the dimension to obtain the paragraph characterization vector of the Chinese/English paragraph;
and calculating the distance between the paragraph characterization vector of the Chinese paragraph and the paragraph characterization vector of the English paragraph to obtain the semantic similarity of the Chinese paragraph and the English paragraph.
Further, before performing paragraph characterization vector extraction on the chinese paragraph and the english paragraph, respectively, the method further includes:
and training a data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model, wherein the Chinese-English sentence representation alignment model is used for extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes.
Further, the training of the data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model includes:
selecting an anchor sample from the Chinese-English parallel sentence pair data set, wherein the sentence parallel to the anchor sample is its positive sample, and the sentences in the other Chinese-English parallel sentence pairs whose language differs from that of the anchor sample are its negative samples;
for each negative sample, if the semantic similarity between the negative sample and the anchor sample is greater than a set threshold, the negative sample is a false negative sample, and the weight of the false negative sample is assigned to be 0; otherwise, assigning the weight of the negative sample to be 1;
training the Chinese and English sentence representation alignment model by using a training set obtained after the negative sample is removed to obtain a Chinese and English sentence representation alignment model after training is completed; the Chinese and English sentence representation and alignment model comprises two feature extraction branches, wherein one branch comprises a Chinese sentence encoder and an average pooling layer, and the other branch comprises an English sentence encoder and an average pooling layer.
Further, for each negative sample, its semantic similarity to the anchor sample is calculated by:
calculating, with a monolingual sentence semantic similarity calculation model, a first semantic similarity between the anchor sample and the sentence parallel to the negative sample (in the language of the anchor sample) as the semantic similarity between the negative sample and the anchor sample;
or calculating, with the monolingual sentence semantic similarity calculation model, a second semantic similarity between the negative sample and the sentence parallel to the anchor sample (in the language of the negative sample) as the semantic similarity between the negative sample and the anchor sample;
or taking the maximum value of the first semantic similarity and the second semantic similarity as the semantic similarity between the negative sample and the anchor sample.
Further, when training the Chinese-English sentence representation alignment model, the objective function is expressed as follows:

$$\min \; \big(\mathcal{L}_{zh} + \mathcal{L}_{en}\big)$$

wherein $\mathcal{L}_{zh}$ and $\mathcal{L}_{en}$ are the alignment loss functions for Chinese-sentence and English-sentence anchor samples respectively, expressed as follows:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \alpha_{ij}\, e^{\operatorname{sim}(x_i,\,y_j)/\tau}}$$

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \beta_{ij}\, e^{\operatorname{sim}(x_j,\,y_i)/\tau}}$$

wherein $\operatorname{sim}(x, y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $\alpha_{ij}$ ($\beta_{ij}$) represents the weight of negative sample $y_j$ ($x_j$) for anchor sample $x_i$ ($y_i$); the subscripts $i, j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch.
Further, the constructing of a subject word node for each subject word in the Chinese/English paragraph and of a sentence node for each sentence in the Chinese/English paragraph includes:
extracting the keywords in the Chinese/English paragraph with the TF-IDF algorithm, taking the top $k$ keywords with the highest importance as the subject words of the Chinese/English paragraph, and constructing the corresponding subject word nodes $\{w_1, w_2, \dots, w_k\}$; recording, for each of the $k$ subject words, the sequence numbers of the sentences in which it appears and its word order within those sentences;
segmenting the Chinese/English paragraph into sentences and constructing a sentence node for each sentence, expressed as $\{s_1, s_2, \dots, s_c\}$, where $c$ is the number of sentences in the Chinese/English paragraph.
Further, the extracting of the initialized feature vectors of the subject term node, the sentence node, and the global paragraph node includes:
for each subject word node: according to the recorded sentence sequence numbers and word order, encoding the sentence containing the subject word with the Chinese/English sentence encoder of the Chinese-English sentence representation alignment model to obtain the feature vector of each word in the sentence, and extracting the feature vector of the subject word according to its word order in the sentence as the initialized feature vector of the subject word node; if the subject word appears multiple times in the Chinese/English paragraph, average-pooling the feature vectors of the subject word at each position to obtain the initialized feature vector of the subject word node;
for each sentence node, coding a sentence by using a Chinese sentence coder/English sentence coder in the Chinese and English sentence representation alignment model to obtain a feature vector of each word, and then performing average pooling to obtain the feature vector as an initialization feature vector of the sentence node;
and carrying out average pooling on the initialized feature vectors of all subject term nodes and sentence nodes to obtain the initialized feature vectors of the global paragraph nodes.
Further, the edge relationship between the nodes is determined by the following method:
if the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
if the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
if the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
the global paragraph nodes are connected with all sentence nodes and subject term nodes by establishing an edge;
each node establishes an edge connected with the node;
modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{if node } p \text{ and node } q \text{ are connected by an edge} \\ 0, & \text{otherwise} \end{cases}$$

where $e_{pq}$ represents the edge relation between node $p$ and node $q$.
Further, the inputting of the initialized feature vectors of all nodes and of the edge relations among the nodes into the graph attention network and the outputting of the feature vectors of the nodes after information interaction include:
for each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, \dots, h_n^{(l-1)}\}$ output by the previous layer serve as the input of the current layer, which outputs the feature vector of each node after information interaction, $\{h_1^{(l)}, \dots, h_n^{(l)}\}$, where $n$ is the total number of nodes; the input of the first-layer graph attention network is the initialized feature vectors of all nodes and the edge relations among the nodes; each layer of the graph attention network is constructed as follows:
a shared linear transformation $W$ is learned, and self-attention between nodes computes the weights of the edges between nodes; if node $p$ and node $q$ are connected by an edge, the weight of the edge relation $e_{pq}$ is calculated as

$$z_{pq} = \operatorname{sim}(W h_p, \; W h_q)$$

where $z_{pq}$ represents the importance of node $q$ to node $p$, and $\operatorname{sim}(\cdot,\cdot)$ represents a similarity function between two vectors;
the information of the neighbor nodes of each node is passed to the node, the neighbor nodes including the node itself, and the weights of the edge relations over the neighbor nodes of node $p$ are normalized with softmax:

$$a_{pq} = \operatorname{softmax}(z_{pq}) = \frac{e^{z_{pq}}}{\sum_{m \in \mathcal{N}_p} e^{z_{pm}}}$$

where $\mathcal{N}_p$ is the set of neighbor nodes of node $p$ and $m$ denotes a neighbor node;
the updated feature vector $h_p'$ of node $p$ is represented as follows:

$$h_p' = \sigma\Big(\sum_{q \in \mathcal{N}_p} a_{pq}\, W h_q\Big)$$

where $\sigma$ represents a non-linear transformation function;
the feature vectors finally output by each graph attention layer are input into the next graph attention layer, and the final feature vector of each node, $\{h_1, h_2, \dots, h_n\}$, is obtained after the output of the last graph attention layer.
Further, the paragraph characterization vector of the Chinese/English paragraph is obtained by the following method:
let $\{h_{w_1}, \dots, h_{w_k}\}$ denote the feature vectors of the subject word nodes, $\{h_{s_1}, \dots, h_{s_c}\}$ the feature vectors of the sentence nodes, and $h_g$ the feature vector of the global paragraph node, where $k$ is the total number of subject word nodes and $c$ is the total number of sentence nodes;
the feature vectors of the subject word nodes and of the sentence nodes are average-pooled respectively:

$$h_w = \operatorname{mean}(h_{w_1}, \dots, h_{w_k}), \qquad h_s = \operatorname{mean}(h_{s_1}, \dots, h_{s_c})$$

where $h_w$ and $h_s$ respectively represent the pooled subject word characterization vector and the pooled sentence characterization vector;
the pooled subject word characterization vector, the pooled sentence characterization vector and the feature vector of the global paragraph node are concatenated and compressed in dimension through two fully connected layers to obtain the paragraph characterization vector $t$ of the Chinese/English paragraph, expressed as follows:

$$t = W_2 \big( W_1 [h_w; h_s; h_g] + b_1 \big) + b_2$$

where $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent their bias coefficients.
The invention provides a Chinese and English semantic similarity calculation method for paragraph-level texts, together with a multi-level paragraph characterization method. The paragraph text is modeled at the three levels of subject words, sentences and paragraphs; information interaction is performed within and between these levels with a graph attention network; and the information of the three levels is fused to obtain a high-quality paragraph characterization vector. The semantic similarity of the Chinese paragraph and the English paragraph is obtained by calculating the distance between their paragraph characterization vectors, so the method achieves high-precision calculation of Chinese-English cross-language paragraph semantic similarity. It can be applied to the similarity calculation of application document abstracts in different languages to detect and select similar applications, and to the similarity calculation of paper abstracts in different languages, realizing similar-paper detection and cross-language paper recommendation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a general framework diagram of a chinese-english semantic similarity calculation method for paragraph-level text according to an embodiment of the present invention;
FIG. 2 is a flow chart of Chinese and English sentence characterization alignment for removing false negative samples according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a positive and negative sample construction process provided by an embodiment of the invention;
fig. 4 is a schematic diagram of a chinese-english sentence characterization alignment model framework provided in an embodiment of the present invention, where (a) and (b) are schematic diagrams of chinese-english sentence characterization alignment model frameworks where anchor samples are chinese sentences and english sentences, respectively;
fig. 5 is a diagram illustrating a result of aligning chinese and english sentence representations according to an embodiment of the present invention.
Mode for carrying out the invention
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a Chinese and English semantic similarity calculation method for paragraph-level text. First, a Chinese-English sentence representation alignment method that removes false negative samples aligns the sentence representations of different languages into a common representation space, where the similarity of sentences is obtained by calculating the distance between sentence vectors; this provides the word- and sentence-level representation basis for Chinese and English paragraph characterization. Second, a multi-level paragraph characterization method models the paragraph text at the three levels of subject words, sentences and paragraphs, performs information interaction within and between these levels with a graph attention network, and then fuses the information of the three levels to obtain a high-quality paragraph characterization vector; the semantic similarity of the Chinese paragraph and the English paragraph is obtained by calculating the distance between the paragraph characterization vectors.
As shown in fig. 1, the method for calculating Chinese and English semantic similarity for paragraph-level text provided by this embodiment includes three stages: Chinese-English sentence representation alignment with false negative samples removed; multi-level Chinese-English paragraph characterization based on the graph attention network; and Chinese-English cross-language paragraph semantic similarity calculation.
S1: First stage, Chinese-English sentence representation alignment with false negative samples removed.
The Chinese-English sentence representation alignment method that removes false negative samples uses a Chinese-English parallel sentence pair data set as training data. First, anchor samples are selected and the corresponding positive and negative samples are generated on the basis of the parallel sentence pairs; the negative samples are then screened, false negative samples are filtered out, and true negative samples are retained. After the Chinese-English sentence representation alignment model is constructed, it is trained with the anchor, positive and negative samples. After training, a Chinese-English sentence representation model is obtained that encodes Chinese and English sentences into a common semantic representation space, where the semantic similarity between sentences is obtained by calculating the distance between the characterization vectors of the Chinese sentence and the English sentence.
This embodiment uses a feature extraction method based on deep semantics: because a large number of Chinese-English parallel sentences are available as supervision, a supervised sentence representation alignment method can align the sentence semantics of different languages into a common characterization space, and the similarity of sentences in that space is obtained by calculating the distance between sentence vectors, without first translating into a single language. This avoids the semantic loss that translation deviations may cause in machine translation.
As shown in fig. 2, the Chinese-English sentence representation alignment with false negative samples removed includes the following steps.
S11: Select anchor samples and generate the corresponding positive and negative samples on the basis of the Chinese-English parallel sentence pair data set.
For an anchor sample that is a Chinese sentence, the positive sample is an English sentence with similar semantics and the negative samples are English sentences with dissimilar semantics. The quality of the positive and negative samples strongly influences the training of the Chinese-English sentence characterization alignment model, and poor samples reduce the characterization capability of the model. The only training data set currently available is a Chinese-English parallel sentence pair data set, in which each pair consists of a parallel Chinese sentence and English sentence. Suppose a training batch contains $N$ Chinese-English parallel sentence pairs $\{(x_i, y_i)\}_{i=1}^{N}$. For a Chinese sentence $x_i$, the positive sample is the English sentence $y_i$ parallel to it, and the negative samples are the English sentences $y_j$ ($j \neq i$) of the other parallel pairs in the same training batch. In the same way, when an English sentence $y_i$ serves as the anchor sample, its positive sample is the Chinese sentence $x_i$ parallel to it, and its negative samples are the Chinese sentences $x_j$ ($j \neq i$) of the other parallel pairs in the same training batch.
S12: Screen the negative samples, filter out the false negative samples, and retain the true negative samples.
A negative sample $y_j$ is required to be semantically dissimilar to the anchor sample $x_i$, but a negative sample constructed from another Chinese-English parallel pair in the same training batch does not necessarily satisfy this condition, so the negative samples must be screened. For a negative sample $y_j$, if its semantic similarity $\operatorname{sim}(x_i, y_j)$ to the anchor sample $x_i$ is greater than a set threshold $\delta$, the negative sample is semantically similar to the anchor sample and does not meet the requirement for a negative sample; that is, it is a false negative sample. If the characterization distance between the anchor sample and a false negative sample is pushed apart, the alignment of sentence characterizations is disturbed and the performance of the Chinese-English sentence characterization alignment model is harmed.
However, because the anchor sample $x_i$ and the negative sample $y_j$ are sentences in different languages, and existing Chinese-English cross-language sentence semantic similarity techniques are not yet mature (their precision is low), their similarity cannot be calculated directly; a method for calculating the similarity indirectly is therefore proposed. When the anchor sample is a Chinese sentence $x_i$, the Chinese sentence $x_j$ parallel to the English negative sample $y_j$ can be used: the similarity of $x_i$ and $x_j$ is calculated to obtain $\operatorname{sim}_{zh}(x_i, x_j)$. In the same way, the English sentence $y_i$ parallel to $x_i$ can be used: the similarity of $y_i$ and the English sentence $y_j$ is calculated to obtain $\operatorname{sim}_{en}(y_i, y_j)$. Here $\operatorname{sim}_{zh}$ and $\operatorname{sim}_{en}$ respectively denote monolingual Chinese sentence similarity calculation and monolingual English sentence similarity calculation. In a specific implementation, the semantic similarity between the anchor sample and a negative sample is calculated with a mature monolingual sentence semantic similarity model. For the Chinese sentence similarity, the Chinese sentences are encoded with the Chinese sentence encoder RoBERTa-wwm-ext (Chinese version), the word feature vectors obtained after encoding are average-pooled into the feature vector of each Chinese sentence, and the cosine similarity of the two Chinese sentence feature vectors gives the Chinese sentence similarity $\operatorname{sim}_{zh}$. For the English sentence similarity, the English sentences are encoded with the English sentence encoder bert-base-uncased (English version), the word feature vectors obtained after encoding are average-pooled into the feature vector of each English sentence, and the cosine similarity of the two English sentence feature vectors gives the English sentence similarity $\operatorname{sim}_{en}$.
A larger similarity between a negative sample and the anchor sample indicates a false negative sample. To improve the robustness of the false negative judgment, in a preferred embodiment of the invention the semantic similarity of the Chinese sentence $x_i$ and the English sentence $y_j$ is taken as the maximum of the two monolingual similarities:

$$\operatorname{sim}(x_i, y_j) = \max\big(\operatorname{sim}_{zh}(x_i, x_j), \; \operatorname{sim}_{en}(y_i, y_j)\big)$$

After the similarity is obtained, a threshold $\delta$ is set. If the similarity of a negative sample to the anchor sample satisfies $\operatorname{sim}(x_i, y_j) > \delta$, the negative sample is a false negative sample; it cannot be used to train the model, and its weight must be set to 0 to remove its influence on the model training. The weight assignment for negative samples is expressed as follows:

$$\alpha_{ij} = \begin{cases} 0, & \operatorname{sim}(x_i, y_j) > \delta \\ 1, & \text{otherwise} \end{cases}$$

where $\alpha_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$. Note that the above takes a Chinese sentence $x_i$ as the anchor sample for illustration; fig. 3 is a schematic diagram of the positive and negative sample construction process. In the same way, when an English sentence $y_i$ serves as the anchor sample, the screening of false negative samples follows the same principle, with only the processed objects adjusted accordingly.
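To make the screening step concrete, the following is a minimal PyTorch sketch of the negative-sample weight assignment. It assumes the monolingual sentence vectors have already been computed (e.g. by mean-pooling RoBERTa-wwm-ext and bert-base-uncased outputs); the function name, tensor layout and the threshold value delta=0.85 are illustrative assumptions, since the text only specifies "a set threshold".

```python
import torch
import torch.nn.functional as F

def negative_weights(zh_vecs: torch.Tensor, en_vecs: torch.Tensor, delta: float = 0.85) -> torch.Tensor:
    """alpha[i, j] = 0 if y_j is a false negative for anchor x_i, else 1.

    zh_vecs: (N, d) monolingual vectors of the Chinese sentences x_1..x_N.
    en_vecs: (N, d) monolingual vectors of the English sentences y_1..y_N.
    """
    # sim_zh(x_i, x_j) and sim_en(y_i, y_j) as full N x N cosine matrices
    sim_zh = F.cosine_similarity(zh_vecs.unsqueeze(1), zh_vecs.unsqueeze(0), dim=-1)
    sim_en = F.cosine_similarity(en_vecs.unsqueeze(1), en_vecs.unsqueeze(0), dim=-1)
    sim = torch.maximum(sim_zh, sim_en)   # robust indirect similarity sim(x_i, y_j)
    alpha = (sim <= delta).float()        # false negatives (sim > delta) get weight 0
    alpha.fill_diagonal_(1.0)             # the diagonal is the positive pair, not a negative
    return alpha
```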
S13: and constructing a Chinese and English sentence representation alignment model.
The Chinese-English sentence representation alignment model is shown in fig. 4, in which fig. 4 (a) is the schematic diagram with a Chinese sentence as the anchor sample and fig. 4 (b) the schematic diagram with an English sentence as the anchor sample. A Chinese sentence enters the Chinese sentence encoder to obtain the feature vectors of the words in the sentence, and the feature vector of the Chinese sentence is obtained through an average pooling layer; an English sentence enters the English sentence encoder to obtain the feature vectors of the words in the sentence, and the feature vector of the English sentence is obtained through an average pooling layer. The English sentence encoder adopts bert-base-uncased (English version), a trained monolingual English sentence encoder that can semantically represent English sentences; the Chinese sentence encoder adopts RoBERTa-wwm-ext (Chinese version), a trained monolingual Chinese sentence encoder that can semantically represent Chinese sentences.
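As an illustration of this dual-branch structure, one branch can be sketched with the Hugging Face transformers library as below; the hub id hfl/chinese-roberta-wwm-ext for RoBERTa-wwm-ext and the class name are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceBranch(nn.Module):
    """One feature extraction branch: a pretrained sentence encoder + mean pooling."""

    def __init__(self, model_name: str):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)

    def forward(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        token_vecs = self.encoder(**batch).last_hidden_state      # (B, L, d) word feature vectors
        mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding in the mean
        return (token_vecs * mask).sum(1) / mask.sum(1)           # (B, d) sentence feature vectors

zh_branch = SentenceBranch("hfl/chinese-roberta-wwm-ext")  # Chinese sentence encoder branch
en_branch = SentenceBranch("bert-base-uncased")            # English sentence encoder branch
```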
S14: and training Chinese and English sentences to represent an alignment model.
S14: Train the Chinese-English sentence representation alignment model.
The Chinese-English sentence characterization alignment model is trained with the anchor samples, positive samples and negative samples. During training, the loss function is a variant of the InfoNCE loss. When the anchor sample is a Chinese sentence $x_i$, the positive sample is the English sentence $y_i$ and the negative samples are the $N-1$ English sentences $y_j$ ($j \neq i$). The alignment loss function $\mathcal{L}_{zh}$ is:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \alpha_{ij}\, e^{\operatorname{sim}(x_i,\,y_j)/\tau}}$$

where $\operatorname{sim}(x, y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $\alpha_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$; the subscripts $i, j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch. The alignment loss is equivalent to an $N$-way softmax classification in which the positive sample is the correct class and the $N-1$ negative samples are incorrect classes. The training target of the model is to pull the vector distance between the anchor sample and the positive sample closer and push the vector distance between the anchor sample and the negative samples apart, so that the representations of semantically similar sentences are drawn together in the common semantic space while those of semantically dissimilar sentences are pushed apart, achieving alignment into the same characterization space.
Similarly, when the anchor sample is an English sentence $y_i$, the alignment loss function $\mathcal{L}_{en}$ is:

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \beta_{ij}\, e^{\operatorname{sim}(x_j,\,y_i)/\tau}}$$

where $\beta_{ij}$ represents the weight of negative sample $x_j$ for anchor sample $y_i$.
The objective function of the Chinese-English sentence representation alignment model is as follows:

$$\min \; \big(\mathcal{L}_{zh} + \mathcal{L}_{en}\big)$$
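As a concrete illustration, the weighted bidirectional objective above can be sketched in PyTorch as follows; it assumes mean-pooled sentence embeddings and the weight matrices from the false-negative screening step, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(anchor_emb, other_emb, neg_weights, tau=0.05):
    """Weighted InfoNCE loss for one alignment direction.

    anchor_emb:  (N, d) embeddings of the anchor sentences.
    other_emb:   (N, d) embeddings of the other language; row i is the positive.
    neg_weights: (N, N) weights; 0 for false negatives, 1 for true negatives.
    """
    sim = F.cosine_similarity(anchor_emb.unsqueeze(1), other_emb.unsqueeze(0), dim=-1)
    exp_sim = (sim / tau).exp()                       # e^{sim(., .)/tau}
    n = exp_sim.size(0)
    off_diag = 1.0 - torch.eye(n, device=exp_sim.device)
    pos = exp_sim.diagonal()                          # positive pair terms
    neg = (neg_weights * off_diag * exp_sim).sum(1)   # weighted negative terms
    return -(pos / (pos + neg)).log().mean()

# total objective: align in both directions (zh -> en and en -> zh)
# loss = alignment_loss(zh_emb, en_emb, alpha) + alignment_loss(en_emb, zh_emb, beta)
```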
The aim of the overall training is to minimize the distance between the vectors of semantically similar sentences and maximize the distance between the vectors of semantically dissimilar sentences, so as to align the sentence characterization spaces of the two languages.
The objective function includes two alignment losses, corresponding to alignment in two directions (Chinese-to-English and English-to-Chinese). As a simple example, suppose the following four Chinese-English parallel sentence pairs are used to train the model (the Chinese member of each pair is shown here in English translation):
(1) Chinese: "English is the most widely used language." / English: "English is the most widely spoken language."
(2) Chinese: "Chinese is the language used by the most people." / English: "Chinese is the most spoken language."
(3) Chinese: "There are about 6909 languages existing in the world." / English: "There are about 6909 languages in existence in the world."
(4) Chinese: "English is most widely used." / English: "English is the most widely used."
When the loss for parallel pair (1) ["English is the most widely used language." / "English is the most widely spoken language."] is calculated to train the model:
1. When "English is the most widely used language." is the anchor sample, the positive sample is "English is the most widely spoken language.", and the negative samples are "Chinese is the most spoken language.", "There are about 6909 languages in existence in the world." and "English is the most widely used.". Screening the negative samples determines that {"Chinese is the most spoken language.", "There are about 6909 languages in existence in the world."} are true negative samples, while "English is the most widely used." is semantically similar to the anchor and is therefore a false negative sample. In the common representation space, "English is the most widely used language." should be pulled closer to "English is the most widely spoken language." and pushed away from the two true negative samples, while the false negative sample is not pushed away.
2. When "English is the most widely spoken language." is the anchor sample, the positive sample is "English is the most widely used language.", and the negative samples are "Chinese is the language used by the most people.", "There are about 6909 languages existing in the world." and "English is most widely used.". Screening the negative samples determines that the first two are true negative samples, while "English is most widely used." is semantically similar to the anchor and is therefore a false negative sample. In the common representation space, "English is the most widely spoken language." should be pulled closer to "English is the most widely used language." and pushed away from the two true negative samples, while the false negative sample is not pushed away.
After such bidirectional training, the sentence closest to "English is the most widely used language." in the characterization space is "English is the most widely spoken language.", and vice versa. If only one direction were trained, the following situation could arise: "English is the most widely used language." is closest to "English is the most widely spoken language.", but other sentences are closer to "English is the most widely spoken language." than it is; this fails the purpose of sentence representation alignment, so bidirectional alignment is required.
Fig. 5 shows the result after Chinese-English sentence representation alignment (illustrated with the anchor sample "English is the most widely used language."). Before alignment, the English sentence semantic representation space and the Chinese sentence semantic representation space are two independent spaces, and the distance between the characterization vectors of a Chinese sentence and an English sentence cannot reflect their semantic similarity. After alignment, the characterization distance between the anchor sample and the positive sample is pulled closer while the characterization distances between the anchor sample and the negative samples are pushed apart; the English and Chinese sentence semantic characterization spaces are thus aligned into a common sentence semantic characterization space, in which the distance between the characterization vectors of a Chinese sentence and an English sentence reflects their semantic similarity.
S15: and obtaining a Chinese and English sentence representation model.
After the training of the Chinese-English sentence representation alignment model is completed, its Chinese sentence encoder and English sentence encoder can be used as the Chinese-English sentence representation model. Chinese sentences and English sentences are encoded separately to obtain the corresponding vectors; the resulting Chinese and English sentence vectors lie in a common sentence semantic representation space, and the semantic similarity between sentences is obtained by calculating the distance between the vectors.
S2: and in the second stage, the multi-level Chinese and English paragraph representation based on the graph attention network.
The multi-level Chinese and English paragraph characterization based on the graph attention network specifically comprises the following processes:
S21: Model the Chinese/English paragraph from the three levels of subject words, sentences and paragraphs (multi-level information modeling of the paragraph).
First, subject word nodes and sentence nodes are established from the subject words and sentences of the paragraph, and a global paragraph node is added. Edge relations between sentence nodes and subject word nodes are established according to whether the sentence contains the subject word; edge relations between sentence nodes and between subject word nodes are established according to, respectively, the context relation of the sentences within the paragraph and the co-occurrence relation of the subject words within sentences; and the global paragraph node is connected to each subject word node and each sentence node by an edge. In this way the paragraph is modeled, from the three levels of subject words, sentences and paragraph, into a paragraph graph that serves as the input for the next step of paragraph characterization. The specific steps are as follows:
S211: Extract the keywords in the Chinese/English paragraph with the TF-IDF algorithm; keywords are the words that express the central content of the paragraph. Take the top $k$ keywords with the highest importance as the subject words of the paragraph and construct the corresponding subject word nodes $\{w_1, w_2, \dots, w_k\}$. For each of the $k$ subject words, record the sequence numbers of the sentences in which it appears and its word order within those sentences.
S212: Segment the Chinese/English paragraph into sentences and construct a sentence node for each sentence in the paragraph, expressed as $\{s_1, s_2, \dots, s_c\}$, where $c$ is the number of sentences in the Chinese/English paragraph.
S213: Construct a global paragraph node $g$ for the Chinese/English paragraph.
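To make S211 concrete, the following is a minimal sketch using scikit-learn's TfidfVectorizer. The function name, the corpus argument used to fit the IDF statistics, and whitespace tokenization are assumptions (Chinese text would first need word segmentation before this applies).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_subject_words(paragraph_sentences, corpus, k=5):
    """S211 sketch: top-k TF-IDF keywords of a paragraph as its subject words,
    plus the sentence numbers and word orders where each one occurs."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)                                   # IDF statistics from a reference corpus
    scores = vectorizer.transform([" ".join(paragraph_sentences)]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    top = scores.argsort()[::-1][:k]
    subject_words = [vocab[i] for i in top if scores[i] > 0]
    occurrences = {                                          # subject word -> [(sentence no., word order)]
        w: [(si, wi)
            for si, sent in enumerate(paragraph_sentences)
            for wi, tok in enumerate(sent.lower().split())
            if tok == w]
        for w in subject_words
    }
    return subject_words, occurrences
```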
S214: Initialize the features of the nodes with the Chinese-English sentence representation model.
Suppose the constructed paragraph graph has $k$ subject word nodes, $c$ sentence nodes and 1 global node, $n = k + c + 1$ nodes in total, ordered as subject word nodes, sentence nodes, global node: $\{w_1, \dots, w_k, s_1, \dots, s_c, g\}$. The node features are initialized as follows:
For each subject word node $w_i$: the sentence containing the subject word is first input into the Chinese-English sentence representation model; after the sentence is encoded by the Chinese/English sentence encoder, a feature vector is obtained for each word, and the feature vector of the subject word is extracted according to the recorded word order as the initialization vector of the subject word node. If the subject word appears multiple times in the paragraph, the feature vectors of the subject word at each position are obtained by this method and then average-pooled to give the initialized feature of the subject word node.
For the $k$ subject word nodes this yields the feature vectors $\{h_{w_1}^{0}, \dots, h_{w_k}^{0}\}$. The obtained subject word features contain the context information of the sentence; since words in different sentences can have different semantics depending on context, this effectively addresses the problem of polysemy.
The initialized feature vector of a sentence node is the feature vector obtained by encoding the sentence with the Chinese-English sentence representation model and average pooling; for the $c$ sentence nodes this yields $\{h_{s_1}^{0}, \dots, h_{s_c}^{0}\}$.
The initialized feature vector of the global node, $h_g^{0}$, is obtained by average-pooling the initialized feature vectors of all the subject word nodes and sentence nodes.
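A minimal sketch of the S214 initialization follows, assuming the per-word encoder outputs of each sentence are already available and aligned to words (subword-to-word pooling is glossed over); the function name and input formats are assumptions.

```python
import torch

def init_node_features(word_vecs_per_sentence, sentence_vecs, occurrences):
    """S214 sketch: build the initial features of the n = k + c + 1 nodes.

    word_vecs_per_sentence: list of (L_i, d) tensors, per-word vectors of each sentence.
    sentence_vecs:          (c, d) mean-pooled sentence vectors.
    occurrences:            {subject word: [(sentence no., word order), ...]}.
    """
    subject_feats = [
        torch.stack([word_vecs_per_sentence[si][wi] for si, wi in occ]).mean(0)
        for occ in occurrences.values()          # average over every occurrence position
    ]
    hw = torch.stack(subject_feats)              # (k, d) subject word node features
    hg = torch.cat([hw, sentence_vecs]).mean(0, keepdim=True)  # (1, d) global node feature
    return torch.cat([hw, sentence_vecs, hg])    # (n, d), ordered: subject, sentence, global
```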
S215: establishing an edge relation between nodes by the following method:
(1) If the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
(2) If the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
(3) If the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
(4) The global paragraph nodes are connected with all sentence nodes and subject term nodes by establishing an edge;
(5) Each node establishes an edge to which it is connected.
S216: Model and represent the edge relations between nodes.
For node $p$ and node $q$, if there is an edge connecting them, they are each other's neighbor nodes (a node also counts as its own neighbor). The modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{node } p \text{ and node } q \text{ are neighbor nodes} \\ 0, & \text{otherwise} \end{cases}$$

where $e_{pq}$ represents the edge relation between node $p$ and node $q$.
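The edge rules of S215 and the relation matrix of S216 can be sketched together as follows; the input formats and the function name are assumptions.

```python
import torch

def build_edge_relations(k, c, subject_in_sentence, cooccurring_subjects):
    """Edge relation matrix e (n x n, 0/1) for the paragraph graph.

    Node order: k subject word nodes, then c sentence nodes, then the global node.
    subject_in_sentence:  iterable of (subject idx, sentence idx) pairs (rule 1).
    cooccurring_subjects: iterable of (subject idx, subject idx) pairs (rule 3).
    """
    n = k + c + 1
    e = torch.zeros(n, n)
    for a, s in subject_in_sentence:           # sentence contains subject word
        e[a, k + s] = e[k + s, a] = 1.0
    for s in range(c - 1):                     # adjacent sentences in the paragraph
        e[k + s, k + s + 1] = e[k + s + 1, k + s] = 1.0
    for a, b in cooccurring_subjects:          # subject words co-occurring in a sentence
        e[a, b] = e[b, a] = 1.0
    e[n - 1, :] = 1.0                          # global node connects to every node
    e[:, n - 1] = 1.0
    e.fill_diagonal_(1.0)                      # self-loops: each node is its own neighbour
    return e
```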
S22: information interaction and information fusion based on the graph attention network.
The paragraph graph is processed by a graph attention network, information is passed among all nodes, and the paragraph characterization vector of the Chinese/English paragraph is finally generated by fusing the information of the subject word, sentence and global paragraph levels.
For each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, \dots, h_n^{(l-1)}\}$ output by the previous layer serve as the input of the current layer, which outputs the feature vector of each node after information interaction, $\{h_1^{(l)}, \dots, h_n^{(l)}\}$, where $n$ is the total number of nodes. The input of the first-layer graph attention network is the initialized feature vectors of all nodes, $\{h_1^{0}, \dots, h_n^{0}\}$, together with the edge relations between the nodes. Each layer is constructed as follows:
To obtain sufficient expressive power, a shared linear transformation $W$ is learned to transform the input features of the nodes into deeper features. Self-attention is then applied between nodes to compute the weights of the edges between them: if node $p$ and node $q$ are connected by an edge, the weight of the edge relation $e_{pq}$ is calculated as

$$z_{pq} = \operatorname{sim}(W h_p, \; W h_q)$$

where $z_{pq}$ represents the importance of node $q$ to node $p$, and $\operatorname{sim}(\cdot,\cdot)$ represents a similarity function between two vectors; cosine similarity is used here. In the information passing process, the information of each node's neighbor nodes is passed to the node, the neighbor nodes including the node itself, and the weights of the edge relations over the neighbor nodes of node $p$ are normalized with softmax:

$$a_{pq} = \operatorname{softmax}(z_{pq}) = \frac{e^{z_{pq}}}{\sum_{m \in \mathcal{N}_p} e^{z_{pm}}}$$

where $\mathcal{N}_p$ is the set of neighbor nodes of node $p$ and $m$ denotes a neighbor node. With the normalized attention weights, the feature of node $p$ can be updated; the updated feature vector $h_p'$ of node $p$ is represented as follows:

$$h_p' = \sigma\Big(\sum_{q \in \mathcal{N}_p} a_{pq}\, W h_q\Big)$$

where $\sigma$ represents a non-linear transformation function.
The feature vectors finally output by each graph attention layer are input into the next graph attention layer, and the final feature vector of each node, $\{h_1, h_2, \dots, h_n\}$, is obtained after the output of the last graph attention layer.
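A minimal PyTorch sketch of one such layer is given below; the choice of ReLU for the non-linear function sigma and the class name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One layer: z_pq = cos(W h_p, W h_q), softmax over neighbours, weighted sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # shared linear transformation W

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        """h: (n, d) node features; e: (n, n) 0/1 edge relations incl. self-loops."""
        wh = self.W(h)                                                      # deeper features
        z = F.cosine_similarity(wh.unsqueeze(1), wh.unsqueeze(0), dim=-1)   # z_pq
        z = z.masked_fill(e == 0, float("-inf"))   # only neighbours take part
        a = torch.softmax(z, dim=1)                # a_pq, normalised per node p
        return torch.relu(a @ wh)                  # h'_p = sigma(sum_q a_pq W h_q)
```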
The final feature vectors of all nodes are fused to obtain the paragraph characterization vector of the Chinese/English paragraph. First, the feature vectors of the subject word nodes and of the sentence nodes are average-pooled respectively, expressed as follows:

$$h_w = \operatorname{mean}(h_{w_1}, \dots, h_{w_k}), \qquad h_s = \operatorname{mean}(h_{s_1}, \dots, h_{s_c})$$

where $h_w$ and $h_s$ respectively represent the pooled subject word characterization vector and the pooled sentence characterization vector. The pooled subject word characterization vector, the pooled sentence characterization vector and the feature vector of the global paragraph node are then concatenated, and the dimension is compressed through two fully connected layers to obtain the paragraph characterization vector $t$ of the Chinese/English paragraph, expressed as follows:

$$t = W_2 \big( W_1 [h_w; h_s; h_g] + b_1 \big) + b_2$$

where $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent their bias coefficients.
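The fusion step can be sketched as below; the class name and the hidden width of the first fully connected layer are assumptions, and the two linear layers follow the formula above.

```python
import torch
import torch.nn as nn

class ParagraphFusion(nn.Module):
    """t = W2 (W1 [h_w; h_s; h_g] + b1) + b2, after mean-pooling the node groups."""

    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(3 * dim, dim)     # first fully connected layer (W1, b1)
        self.fc2 = nn.Linear(dim, out_dim)     # second fully connected layer (W2, b2)

    def forward(self, h: torch.Tensor, k: int, c: int) -> torch.Tensor:
        """h: (k + c + 1, d) final node features, ordered subject / sentence / global."""
        h_w = h[:k].mean(dim=0)                # pooled subject word characterization vector
        h_s = h[k:k + c].mean(dim=0)           # pooled sentence characterization vector
        h_g = h[k + c]                         # global paragraph node feature vector
        return self.fc2(self.fc1(torch.cat([h_w, h_s, h_g], dim=-1)))
```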
S3: Third stage, Chinese-English cross-language paragraph semantic similarity calculation.
The distance between the paragraph characterization vector $t_{zh}$ of the Chinese paragraph and the paragraph characterization vector $t_{en}$ of the English paragraph is calculated to obtain the semantic similarity of the Chinese paragraph and the English paragraph:

$$\operatorname{sim}(t_{zh}, t_{en}) = \cos(t_{zh}, t_{en})$$

where $\cos(\cdot, \cdot)$ calculates the cosine similarity between the vectors.
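In code, this final step reduces to a cosine similarity between the two paragraph characterization vectors, for example:

```python
import torch.nn.functional as F

def paragraph_similarity(t_zh, t_en):
    """Semantic similarity of a Chinese and an English paragraph vector (cosine)."""
    return F.cosine_similarity(t_zh, t_en, dim=-1)
```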
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A Chinese and English semantic similarity calculation method for paragraph level text is characterized by comprising the following steps:
respectively extracting paragraph characterization vectors of Chinese paragraphs and English paragraphs, which comprises the following processes:
constructing a subject word node for each subject word in the Chinese paragraph/English paragraph, constructing a sentence node for each sentence in the Chinese paragraph/English paragraph, and constructing a global paragraph node for the Chinese paragraph/English paragraph;
extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes;
establishing an edge relation between sentence nodes and subject word nodes according to whether a sentence contains a subject word, establishing an edge relation between sentence nodes according to the context relation of the sentences in the corresponding paragraph, establishing an edge relation between subject word nodes according to the co-occurrence relation of the subject words in the sentences, and connecting the global paragraph node with each subject word node and each sentence node by an edge; each node establishes an edge connected with the node; modeling the edge relations between nodes;
inputting the initialized feature vectors of all nodes and the edge relation among the nodes into the attention network of the graph, and outputting the feature vectors of all the nodes after information interaction;
respectively average-pooling the feature vectors of the subject word nodes and of the sentence nodes, then concatenating the pooled subject word feature vector, the pooled sentence feature vector and the feature vector of the global paragraph node and reducing the dimension to obtain the paragraph characterization vector of the Chinese/English paragraph;
and calculating the distance between the paragraph characterization vector of the Chinese paragraph and the paragraph characterization vector of the English paragraph to obtain the semantic similarity of the Chinese paragraph and the English paragraph.
2. The method for calculating Chinese and English semantic similarity of paragraph-level text according to claim 1, wherein before performing paragraph characterization vector extraction on Chinese paragraphs and English paragraphs, respectively, the method further comprises:
and training a data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model, wherein the Chinese-English sentence representation alignment model is used for extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes.
3. The method for calculating Chinese-English semantic similarity for paragraph-level text according to claim 2, wherein the training of the data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model comprises:
selecting an anchor sample from the Chinese and English parallel sentence data set, wherein sentences parallel to the anchor sample are positive samples, and sentences in other Chinese and English parallel sentences, which are different from the anchor sample in language, are negative samples;
for each negative sample, if the semantic similarity between the negative sample and the anchor sample is greater than a set threshold, the negative sample is a false negative sample, and the weight of the false negative sample is assigned to be 0; otherwise, assigning the weight of the negative sample to be 1;
training the Chinese and English sentence representation alignment model with the training set obtained after removing the false negative samples, to obtain the trained Chinese and English sentence representation alignment model; the Chinese and English sentence representation alignment model comprises two feature extraction branches: one branch comprises a Chinese sentence encoder and an average pooling layer, and the other branch comprises an English sentence encoder and an average pooling layer.
4. The method for calculating Chinese and English semantic similarity for paragraph-level text according to claim 3, wherein the semantic similarity between each negative sample and the anchor sample is calculated as follows:
calculating a first semantic similarity between the sentence parallel to the negative sample and the anchor sample using a monolingual sentence semantic similarity calculation model, as the semantic similarity between the negative sample and the anchor sample;
or calculating a second semantic similarity between the negative sample and the sentence parallel to the anchor sample using the monolingual sentence semantic similarity calculation model, as the semantic similarity between the negative sample and the anchor sample;
or taking the maximum of the first semantic similarity and the second semantic similarity as the semantic similarity between the negative sample and the anchor sample.
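Illustrative sketch (not part of the claims): the false-negative weighting of claims 3 and 4 can be read as the rule below. The inputs (two monolingual similarity scores) and the threshold value are assumptions for illustration; the claims do not fix the threshold.

```python
# Sketch of claims 3-4: weight a negative sample 0 (false negative) when its
# estimated similarity to the anchor exceeds a threshold, else 1.
def negative_weight(sim_in_anchor_language: float,
                    sim_in_negative_language: float,
                    threshold: float = 0.85) -> int:
    # Third option of claim 4: take the maximum of the two monolingual
    # similarities as the negative/anchor similarity. 0.85 is illustrative.
    similarity = max(sim_in_anchor_language, sim_in_negative_language)
    return 0 if similarity > threshold else 1
```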
5. The method for calculating Chinese and English semantic similarity for paragraph-level text according to claim 3, wherein, when training the Chinese and English sentence representation alignment model, the objective function is expressed as follows:

$$\min \; \mathcal{L} = \mathcal{L}_{zh} + \mathcal{L}_{en}$$

wherein $\mathcal{L}_{zh}$ and $\mathcal{L}_{en}$ are the alignment loss functions for Chinese anchor samples and English anchor samples, respectively, expressed as follows:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)}{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)+\sum_{j\neq i} w_{ij}\exp\!\left(\mathrm{sim}(x_i,y_j)/\tau\right)}$$

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)}{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)+\sum_{j\neq i} w_{ji}\exp\!\left(\mathrm{sim}(x_j,y_i)/\tau\right)}$$

wherein $\mathrm{sim}(x,y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $w_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$; the subscripts $i,j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch.
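Illustrative sketch (not part of the claims): the objective reconstructed above is a weighted InfoNCE-style contrastive loss. The PyTorch code below implements that reading, assuming L2-normalized sentence embeddings so that sim(x, y) is a dot product; the temperature default is illustrative.

```python
# Weighted bidirectional InfoNCE, matching the reconstructed claim-5 loss.
import torch
import torch.nn.functional as F

def alignment_loss(zh: torch.Tensor, en: torch.Tensor,
                   w: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """zh, en: (N, d) embeddings of N Chinese-English parallel pairs;
    w: (N, N) 0/1 negative-sample weights (w[i][j] for anchor i, negative j)."""
    zh = F.normalize(zh, dim=-1)
    en = F.normalize(en, dim=-1)
    logits = zh @ en.t() / tau            # sim(x_i, y_j) / tau via dot product
    keep = w.clone().float()
    keep.fill_diagonal_(1.0)              # always keep the positive pair
    exp = torch.exp(logits) * keep        # false negatives contribute zero
    pos = torch.diagonal(exp)
    loss_zh = -torch.log(pos / exp.sum(dim=1)).mean()   # Chinese anchors
    loss_en = -torch.log(pos / exp.sum(dim=0)).mean()   # English anchors
    return loss_zh + loss_en
```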
6. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the constructing a subject word node for each subject word in the Chinese paragraph/English paragraph and a sentence node for each sentence in the Chinese paragraph/English paragraph comprises:
extracting keywords from the Chinese paragraph/English paragraph by the TF-IDF algorithm, taking the top $k$ keywords with the highest importance as the subject words of the Chinese paragraph/English paragraph, and correspondingly constructing subject word nodes $\{t_1, t_2, \ldots, t_k\}$; recording, for each of the $k$ subject words, the sequence number of each sentence in which it occurs and its word position within that sentence;
segmenting the Chinese paragraph/English paragraph into sentences, and constructing a sentence node for each sentence, the sentence nodes being expressed as $\{s_1, s_2, \ldots, s_c\}$, wherein $c$ is the number of sentences in the Chinese paragraph/English paragraph.
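Illustrative sketch (not part of the claims): top-k subject word selection per claim 6 using scikit-learn's TfidfVectorizer. Chinese text is assumed to be pre-segmented into space-separated tokens (e.g. with a tokenizer such as jieba); `k` and the corpus are illustrative.

```python
# TF-IDF keyword extraction: rank the paragraph's terms by TF-IDF score and
# keep the k highest-scoring ones as subject words.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_subject_words(paragraph: str, corpus: list[str], k: int = 5) -> list[str]:
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus + [paragraph])               # IDF from a background corpus
    scores = vectorizer.transform([paragraph]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(scores, vocab), reverse=True)  # highest TF-IDF first
    return [word for score, word in ranked[:k] if score > 0]
```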
7. The method for calculating Chinese-English semantic similarity of paragraph-level text according to any one of claims 2 to 5, wherein the extracting initialized feature vectors of subject term nodes, sentence nodes and global paragraph nodes comprises:
for each subject word node, according to the recorded sentence sequence numbers and word positions, encoding each sentence containing the subject word with the Chinese sentence encoder/English sentence encoder of the Chinese and English sentence representation alignment model to obtain the feature vector of each word in the sentence, and extracting the feature vector of the subject word according to its word position in the sentence as the initialized feature vector of the subject word node; if the subject word appears multiple times in the Chinese paragraph/English paragraph, performing average pooling on the feature vectors of the subject word at each position to obtain the initialized feature vector of the subject word node;
for each sentence node, encoding the sentence with the Chinese sentence encoder/English sentence encoder of the Chinese and English sentence representation alignment model to obtain the feature vector of each word, and then performing average pooling to obtain the initialized feature vector of the sentence node;
performing average pooling on the initialized feature vectors of all subject word nodes and sentence nodes to obtain the initialized feature vector of the global paragraph node.
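Illustrative sketch (not part of the claims): the three initialization rules of claim 7, written against pre-computed per-token encoder outputs. The sentence encoder itself is abstracted away; the tensor shapes are assumptions noted in the docstrings.

```python
import torch

def init_subject_word_node(token_vecs_per_occurrence: list[torch.Tensor],
                           positions: list[int]) -> torch.Tensor:
    """token_vecs_per_occurrence[i]: (seq_len_i, d) token vectors of the sentence
    containing the i-th occurrence; positions[i]: the subject word's index in
    that sentence. Multiple occurrences are average-pooled per the claim."""
    vecs = [tv[pos] for tv, pos in zip(token_vecs_per_occurrence, positions)]
    return torch.stack(vecs).mean(dim=0)

def init_sentence_node(token_vecs: torch.Tensor) -> torch.Tensor:
    """Mean-pool the token vectors of one sentence: (seq_len, d) -> (d,)."""
    return token_vecs.mean(dim=0)

def init_global_node(word_nodes: torch.Tensor, sent_nodes: torch.Tensor) -> torch.Tensor:
    """Average the initialized subject word and sentence node vectors."""
    return torch.cat([word_nodes, sent_nodes], dim=0).mean(dim=0)
```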
8. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the edge relation between nodes is determined by the following method:
if the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
if the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
if the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
the global paragraph node is connected to every sentence node and subject word node by an edge;
each node also establishes an edge connected to itself (a self-loop);
the modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{if an edge connects node } p \text{ and node } q \\ 0, & \text{otherwise} \end{cases}$$

wherein $e_{pq}$ represents the edge relation between node $p$ and node $q$.
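Illustrative sketch (not part of the claims): one way to materialize the binary edge relation of claim 8 as an adjacency matrix. The node ordering (subject words, then sentences, then the global node) is an assumption of this sketch.

```python
import numpy as np

def build_edges(word_in_sent: list[set[int]], k: int, c: int) -> np.ndarray:
    """word_in_sent[s]: indices of the subject words occurring in sentence s;
    k subject word nodes, c sentence nodes, plus one global paragraph node."""
    n = k + c + 1
    e = np.eye(n, dtype=int)              # self-loops: each node links to itself
    g = n - 1                             # index of the global paragraph node
    for s, words in enumerate(word_in_sent):
        for a in words:
            e[a, k + s] = e[k + s, a] = 1         # word-sentence containment
            for b in words:
                if a != b:
                    e[a, b] = e[b, a] = 1         # word-word co-occurrence
    for s in range(c - 1):
        e[k + s, k + s + 1] = e[k + s + 1, k + s] = 1  # adjacent sentences
    e[g, :] = e[:, g] = 1                 # global node connects to every node
    return e
```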
9. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the inputting the initialized feature vectors of all nodes and the edge relations between the nodes into a graph attention network and outputting the feature vectors of each node after information interaction comprises:
for each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, h_2^{(l-1)}, \ldots, h_n^{(l-1)}\}$ output by the previous layer are taken as the input of the current layer, and the feature vectors $\{h_1^{(l)}, h_2^{(l)}, \ldots, h_n^{(l)}\}$ of each node after information interaction are output, wherein $n$ is the total number of nodes; the input of the first layer is the initialized feature vectors of all nodes together with the edge relations between the nodes; each layer of the graph attention network is constructed as follows:
a shared linear transformation $W$ is learned, and self-attention between nodes is used to compute the weight of the edges between nodes; if an edge connects node $p$ and node $q$, the weight of the edge relation $e_{pq}$ is calculated as follows:

$$z_{pq} = \mathrm{sim}\!\left(W h_p, W h_q\right)$$

wherein $z_{pq}$ indicates the importance of node $q$ to node $p$, and $\mathrm{sim}(\cdot,\cdot)$ represents a similarity function between two vectors;
the information of the neighbor nodes of each node is propagated to that node, the neighbor nodes including the node itself; the weights of the edge relations between a node and its neighbor nodes are normalized by softmax:

$$\alpha_{pq} = \mathrm{softmax}\!\left(z_{pq}\right) = \frac{\exp\!\left(z_{pq}\right)}{\sum_{m\in\mathcal{N}_p}\exp\!\left(z_{pm}\right)}$$

wherein $\mathcal{N}_p$ is the set of neighbor nodes of node $p$, and $m$ indexes a node in $\mathcal{N}_p$;
the updated feature vector $h_p'$ of node $p$ is expressed as follows:

$$h_p' = \sigma\!\left(\sum_{m\in\mathcal{N}_p}\alpha_{pm} W h_m\right)$$

wherein $\sigma$ represents a nonlinear transformation function;
the feature vectors of each node output by each layer of the graph attention network are input into the next layer, and the final feature vector $\hat{h}_p$ of each node is obtained from the output of the last layer of the graph attention network.
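Illustrative sketch (not part of the claims): a single graph attention layer following claim 9, with dot product standing in for the unspecified similarity function and ReLU for the unspecified nonlinearity $\sigma$; both choices are assumptions.

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # shared linear transformation

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        """h: (n, d) node features; e: (n, n) 0/1 edge-relation tensor with
        self-loops, e.g. torch.from_numpy(build_edges(...)) from the claim-8 sketch."""
        wh = self.W(h)
        z = wh @ wh.t()                            # z_pq = sim(Wh_p, Wh_q)
        z = z.masked_fill(e == 0, float('-inf'))   # attend only over neighbors N_p
        alpha = torch.softmax(z, dim=-1)           # softmax normalization of weights
        return torch.relu(alpha @ wh)              # h'_p = sigma(sum alpha_pm W h_m)
```

Stacking several such layers and feeding each layer's output into the next reproduces the multi-layer interaction described in the claim.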
10. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the paragraph characterization vector of the Chinese paragraph/English paragraph is obtained as follows:
denoting the feature vectors of the subject word nodes by $\{t_1, t_2, \ldots, t_k\}$, the feature vectors of the sentence nodes by $\{s_1, s_2, \ldots, s_c\}$, and the feature vector of the global paragraph node by $h_g$, wherein $k$ is the total number of subject word nodes and $c$ is the total number of sentence nodes;
respectively performing average pooling on the feature vectors of the subject word nodes and of the sentence nodes, expressed as follows:

$$\bar{t} = \mathrm{MeanPooling}\!\left(t_1, t_2, \ldots, t_k\right)$$

$$\bar{s} = \mathrm{MeanPooling}\!\left(s_1, s_2, \ldots, s_c\right)$$

wherein $\bar{t}$ and $\bar{s}$ respectively represent the average-pooled characterization vector of the subject word nodes and of the sentence nodes;
concatenating the average-pooled subject word node characterization vector, the average-pooled sentence node characterization vector and the feature vector of the global paragraph node, and compressing the dimension through two fully connected layers to obtain the paragraph characterization vector $v$ of the Chinese paragraph/English paragraph, expressed as follows:

$$v = W_2\!\left(W_1\left[\bar{t}\,;\bar{s}\,;h_g\right] + b_1\right) + b_2$$

wherein $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent the bias coefficients of the two fully connected layers.
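Illustrative sketch (not part of the claims): the fusion step of claim 10. Only the structure follows the claim (mean pooling, concatenation, two fully connected layers with no intermediate nonlinearity, matching the reconstructed formula); the layer dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ParagraphFusion(nn.Module):
    """Concatenate pooled node vectors and compress through two FC layers."""
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(3 * dim, dim)      # weight W1, bias b1
        self.fc2 = nn.Linear(dim, out_dim)      # weight W2, bias b2

    def forward(self, word_nodes: torch.Tensor,
                sent_nodes: torch.Tensor, h_g: torch.Tensor) -> torch.Tensor:
        t_bar = word_nodes.mean(dim=0)          # average-pool subject word nodes
        s_bar = sent_nodes.mean(dim=0)          # average-pool sentence nodes
        v = torch.cat([t_bar, s_bar, h_g])      # [t_bar; s_bar; h_g]
        return self.fc2(self.fc1(v))            # two-layer dimension compression
```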
CN202310085688.7A 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text Active CN115828931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085688.7A CN115828931B (en) 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text

Publications (2)

Publication Number Publication Date
CN115828931A (en) 2023-03-21
CN115828931B (en) 2023-05-02

Family

ID=85520932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085688.7A Active CN115828931B (en) 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text

Country Status (1)

Country Link
CN (1) CN115828931B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332219A1 (en) * 2000-09-30 2010-12-30 Weiquan Liu Method and apparatus for determining text passage similarity
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN107862045A * 2017-11-07 2018-03-30 哈尔滨工程大学 A cross-language plagiarism detection method based on multiple features
CN108399165A * 2018-03-28 2018-08-14 广东技术师范学院 A keyword extraction method based on position weighting
CN109213995A * 2018-08-02 2019-01-15 哈尔滨工程大学 A cross-language text similarity assessment technique based on bilingual word embeddings
CN111967271A (en) * 2020-08-19 2020-11-20 北京大学 Analysis result generation method, device, equipment and readable storage medium
CN112818121A (en) * 2021-01-27 2021-05-18 润联软件系统(深圳)有限公司 Text classification method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236330A * 2023-11-16 2023-12-15 南京邮电大学 Mutual information and adversarial neural network based method for enhancing topic diversity
CN117236330B * 2023-11-16 2024-01-26 南京邮电大学 Mutual information and adversarial neural network based method for enhancing topic diversity

Also Published As

Publication number Publication date
CN115828931B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN109684648B (en) Multi-feature fusion automatic translation method for ancient and modern Chinese
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN107967262A A neural network based Mongolian-Chinese machine translation method
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
JP2023509031A Translation method, apparatus, device and computer program based on multimodal machine learning
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN115687571B Deep unsupervised cross-modal retrieval method based on modal fusion reconstruction hashing
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN116306652A Chinese named entity recognition model based on attention mechanism and BiLSTM
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN115828931A (en) Chinese and English semantic similarity calculation method for paragraph-level text
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN111428518B (en) Low-frequency word translation method and device
Ma et al. E2timt: Efficient and effective modal adapter for text image machine translation
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN111597810A (en) Semi-supervised decoupling named entity identification method
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement
CN114972907A Image semantic understanding and text generation based on reinforcement learning and contrastive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant