CN115828931A - Chinese and English semantic similarity calculation method for paragraph-level text - Google Patents

Info

Publication number: CN115828931A
Application number: CN202310085688.7A
Authority: CN (China)
Prior art keywords: sentence, english, chinese, paragraph, node
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN115828931B
Inventors: 龙军, 唐自强, 杨柳, 黄金彩
Assignee (original and current): Central South University
Application filed by Central South University; priority to CN202310085688.7A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese and English semantic similarity calculation method for paragraph-level texts. Paragraph characterization vectors are extracted for the Chinese paragraph and the English paragraph respectively: the paragraph text is modeled at the three levels of subject words, sentences and paragraphs; information interaction is performed within and between these levels with a graph attention network; and the information of the three levels is then fused to obtain the paragraph characterization vector. The distance between the paragraph characterization vectors gives the semantic similarity of the Chinese paragraph and the English paragraph. By fusing information from the subject word, sentence and paragraph levels, the method obtains high-quality paragraph characterization vectors and achieves high-precision calculation of Chinese-English cross-language paragraph semantic similarity.

Description

Chinese and English semantic similarity calculation method for paragraph level text
Technical Field
The invention relates to the technical field of information, in particular to a Chinese and English semantic similarity calculation method for paragraph level texts.
Background
With growing cultural exchange across the world and resource sharing among the languages of different countries, cross-language scenarios are becoming more and more common, and the demand for cross-language applications is increasingly urgent. To overcome the resulting technical barriers and realize cross-language resource sharing, academia and industry have been actively exploring cross-language natural language processing technologies. Cross-language paragraph semantic similarity calculation, which studies how to calculate the semantic similarity of paragraph-level texts in different languages, is an important research topic and plays an important role in many cross-language applications and related fields. For example, the semantic similarity of application document abstracts in different languages can be calculated to detect and select similar applications; the semantic similarity of paper abstracts in different languages can be calculated to detect similar papers, enabling cross-language paper recommendation and cross-language plagiarism detection.
The precision of current calculation methods for Chinese-English cross-language paragraph semantic similarity needs improvement, mainly because of the following three defects:
(1) Cross-language paragraph semantic similarity calculation needs to overcome the barriers between languages. However, most current cross-language similarity detection relies on translation techniques, including dictionary-based, parallel-corpus-based and machine-translation-based methods. These approaches convert the cross-language problem into a monolingual (or pivot-language) similarity measurement problem. The conversion process is lossy, however: dictionaries, parallel corpora and machine translation are all imperfect and are likely to lose part of the deep information of the text.
(2) The negative samples in Chinese-English sentence representation alignment should not be semantically similar to the anchor sample. Chinese-English sentence representation alignment is trained on a Chinese-English parallel sentence pair data set; when the anchor sample is a Chinese sentence, its positive sample is the English sentence parallel to it, and negative samples must be constructed for the anchor. If the characterization distance between the anchor sample and a false negative sample is pushed apart, the alignment of sentence characterizations is disturbed and the performance of the Chinese-English sentence characterization model is harmed.
(3) The characterization of paragraph text should consider information at the word, sentence and paragraph levels simultaneously. Existing paragraph characterization methods, however, emphasize either sentence-level or word-level information, lack information interaction and fusion across the different levels of the paragraph, and cannot effectively characterize the paragraph text from all three levels of words, sentences and paragraphs.
Disclosure of Invention
The invention provides a Chinese and English semantic similarity calculation method for paragraph-level texts, aiming to solve the low precision of existing Chinese-English cross-language paragraph semantic similarity calculation methods.
In order to achieve the purpose, the invention adopts the following technical scheme.
A Chinese and English semantic similarity calculation method for paragraph-level text comprises the following steps:
the method for extracting paragraph characterization vectors of Chinese paragraphs and English paragraphs respectively comprises the following processes:
constructing a subject word node for each subject word in the Chinese paragraph/English paragraph, constructing a sentence node for each sentence in the Chinese paragraph/English paragraph, and constructing a global paragraph node for the Chinese paragraph/English paragraph;
extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes;
establishing an edge relation between sentence nodes and subject word nodes according to whether the sentence contains the subject word, establishing an edge relation between sentence nodes according to the context relation of the sentence in the corresponding paragraph, establishing an edge relation between subject word nodes according to the co-occurrence relation of the subject word in the sentence, and connecting the global paragraph nodes with each subject word node and the sentence nodes by one edge; each node establishes an edge connected with the node; modeling an edge relation between nodes;
inputting the initialized feature vectors of all nodes and the edge relation among the nodes into a graph attention network, and outputting the feature vectors of all nodes after information interaction;
respectively average-pooling the feature vectors of the subject word nodes and of the sentence nodes, then concatenating the pooled subject word feature vector, the pooled sentence feature vector and the feature vector of the global paragraph node and reducing the dimension to obtain the paragraph characterization vector of the Chinese/English paragraph;
and calculating the distance between the paragraph characterization vector of the Chinese paragraph and the paragraph characterization vector of the English paragraph to obtain the semantic similarity of the Chinese paragraph and the English paragraph.
Further, before performing paragraph characterization vector extraction on the chinese paragraph and the english paragraph, respectively, the method further includes:
and training a data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model, wherein the Chinese-English sentence representation alignment model is used for extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes.
Further, the training of the data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model includes:
selecting an anchor sample from the Chinese-English parallel sentence pair data set, wherein the sentence parallel to the anchor sample is its positive sample, and the sentences in the other Chinese-English parallel sentence pairs whose language differs from that of the anchor sample are its negative samples;
for each negative sample, if the semantic similarity between the negative sample and the anchor sample is greater than a set threshold, the negative sample is a false negative sample, and the weight of the false negative sample is assigned to be 0; otherwise, assigning the weight of the negative sample to be 1;
training the Chinese and English sentence representation alignment model by using a training set obtained after the negative sample is removed to obtain a Chinese and English sentence representation alignment model after training is completed; the Chinese and English sentence representation and alignment model comprises two feature extraction branches, wherein one branch comprises a Chinese sentence encoder and an average pooling layer, and the other branch comprises an English sentence encoder and an average pooling layer.
Further, for each negative sample, its semantic similarity to the anchor sample is calculated by:
calculating, with a monolingual sentence semantic similarity calculation model, a first semantic similarity between the anchor sample and the sentence parallel to the negative sample (in the language of the anchor sample) as the semantic similarity between the negative sample and the anchor sample;
or calculating, with the monolingual sentence semantic similarity calculation model, a second semantic similarity between the negative sample and the sentence parallel to the anchor sample (in the language of the negative sample) as the semantic similarity between the negative sample and the anchor sample;
or taking the maximum value of the first semantic similarity and the second semantic similarity as the semantic similarity between the negative sample and the anchor sample.
Further, when training the Chinese-English sentence representation alignment model, the objective function is expressed as follows:

$$\min \; \big(\mathcal{L}_{zh} + \mathcal{L}_{en}\big)$$

wherein $\mathcal{L}_{zh}$ and $\mathcal{L}_{en}$ are the alignment loss functions for Chinese-sentence and English-sentence anchor samples respectively, expressed as follows:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \alpha_{ij}\, e^{\operatorname{sim}(x_i,\,y_j)/\tau}}$$

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \beta_{ij}\, e^{\operatorname{sim}(x_j,\,y_i)/\tau}}$$

wherein $\operatorname{sim}(x, y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $\alpha_{ij}$ ($\beta_{ij}$) represents the weight of negative sample $y_j$ ($x_j$) for anchor sample $x_i$ ($y_i$); the subscripts $i, j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch.
Further, the constructing of a subject word node for each subject word in the Chinese/English paragraph and of a sentence node for each sentence in the Chinese/English paragraph includes:
extracting the keywords in the Chinese/English paragraph with the TF-IDF algorithm, taking the top $k$ keywords with the highest importance as the subject words of the Chinese/English paragraph, and constructing the corresponding subject word nodes $\{w_1, w_2, \dots, w_k\}$; recording, for each of the $k$ subject words, the sequence numbers of the sentences in which it appears and its word order within those sentences;
segmenting the Chinese/English paragraph into sentences and constructing a sentence node for each sentence, expressed as $\{s_1, s_2, \dots, s_c\}$, where $c$ is the number of sentences in the Chinese/English paragraph.
Further, the extracting of the initialized feature vectors of the subject term node, the sentence node, and the global paragraph node includes:
for each subject word node: according to the recorded sentence sequence numbers and word order, encoding the sentence containing the subject word with the Chinese/English sentence encoder of the Chinese-English sentence representation alignment model to obtain the feature vector of each word in the sentence, and extracting the feature vector of the subject word according to its word order in the sentence as the initialized feature vector of the subject word node; if the subject word appears multiple times in the Chinese/English paragraph, average-pooling the feature vectors of the subject word at each position to obtain the initialized feature vector of the subject word node;
for each sentence node, coding a sentence by using a Chinese sentence coder/English sentence coder in the Chinese and English sentence representation alignment model to obtain a feature vector of each word, and then performing average pooling to obtain the feature vector as an initialization feature vector of the sentence node;
and carrying out average pooling on the initialized feature vectors of all subject term nodes and sentence nodes to obtain the initialized feature vectors of the global paragraph nodes.
Further, the edge relationship between the nodes is determined by the following method:
if the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
if the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
if the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
the global paragraph nodes are connected with all sentence nodes and subject term nodes by establishing an edge;
each node establishes an edge connected with the node;
modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{if node } p \text{ and node } q \text{ are connected by an edge} \\ 0, & \text{otherwise} \end{cases}$$

where $e_{pq}$ represents the edge relation between node $p$ and node $q$.
Further, the inputting of the initialized feature vectors of all nodes and of the edge relations among the nodes into the graph attention network and the outputting of the feature vectors of the nodes after information interaction include:
for each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, \dots, h_n^{(l-1)}\}$ output by the previous layer serve as the input of the current layer, which outputs the feature vector of each node after information interaction, $\{h_1^{(l)}, \dots, h_n^{(l)}\}$, where $n$ is the total number of nodes; the input of the first-layer graph attention network is the initialized feature vectors of all nodes and the edge relations among the nodes; each layer of the graph attention network is constructed as follows:
a shared linear transformation $W$ is learned, and self-attention between nodes computes the weights of the edges between nodes; if node $p$ and node $q$ are connected by an edge, the weight of the edge relation $e_{pq}$ is calculated as

$$z_{pq} = \operatorname{sim}(W h_p, \; W h_q)$$

where $z_{pq}$ represents the importance of node $q$ to node $p$, and $\operatorname{sim}(\cdot,\cdot)$ represents a similarity function between two vectors;
the information of the neighbor nodes of each node is passed to the node, the neighbor nodes including the node itself, and the weights of the edge relations over the neighbor nodes of node $p$ are normalized with softmax:

$$a_{pq} = \operatorname{softmax}(z_{pq}) = \frac{e^{z_{pq}}}{\sum_{m \in \mathcal{N}_p} e^{z_{pm}}}$$

where $\mathcal{N}_p$ is the set of neighbor nodes of node $p$ and $m$ denotes a neighbor node;
the updated feature vector $h_p'$ of node $p$ is represented as follows:

$$h_p' = \sigma\Big(\sum_{q \in \mathcal{N}_p} a_{pq}\, W h_q\Big)$$

where $\sigma$ represents a non-linear transformation function;
the feature vectors finally output by each graph attention layer are input into the next graph attention layer, and the final feature vector of each node, $\{h_1, h_2, \dots, h_n\}$, is obtained after the output of the last graph attention layer.
Further, the paragraph characterization vector of the Chinese/English paragraph is obtained by the following method:
let $\{h_{w_1}, \dots, h_{w_k}\}$ denote the feature vectors of the subject word nodes, $\{h_{s_1}, \dots, h_{s_c}\}$ the feature vectors of the sentence nodes, and $h_g$ the feature vector of the global paragraph node, where $k$ is the total number of subject word nodes and $c$ is the total number of sentence nodes;
the feature vectors of the subject word nodes and of the sentence nodes are average-pooled respectively:

$$h_w = \operatorname{mean}(h_{w_1}, \dots, h_{w_k}), \qquad h_s = \operatorname{mean}(h_{s_1}, \dots, h_{s_c})$$

where $h_w$ and $h_s$ respectively represent the pooled subject word characterization vector and the pooled sentence characterization vector;
the pooled subject word characterization vector, the pooled sentence characterization vector and the feature vector of the global paragraph node are concatenated and compressed in dimension through two fully connected layers to obtain the paragraph characterization vector $t$ of the Chinese/English paragraph, expressed as follows:

$$t = W_2 \big( W_1 [h_w; h_s; h_g] + b_1 \big) + b_2$$

where $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent their bias coefficients.
The invention provides a Chinese and English semantic similarity calculation method for paragraph-level texts, together with a multi-level paragraph characterization method. The paragraph text is modeled at the three levels of subject words, sentences and paragraphs; information interaction is performed within and between these levels with a graph attention network; and the information of the three levels is fused to obtain a high-quality paragraph characterization vector. The semantic similarity of the Chinese paragraph and the English paragraph is obtained by calculating the distance between their paragraph characterization vectors, so the method achieves high-precision calculation of Chinese-English cross-language paragraph semantic similarity. It can be applied to the similarity calculation of application document abstracts in different languages to detect and select similar applications, and to the similarity calculation of paper abstracts in different languages, realizing similar-paper detection and cross-language paper recommendation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a general framework diagram of a chinese-english semantic similarity calculation method for paragraph-level text according to an embodiment of the present invention;
FIG. 2 is a flow chart of Chinese and English sentence characterization alignment for removing false negative samples according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a positive and negative sample construction process provided by an embodiment of the invention;
fig. 4 is a schematic diagram of a chinese-english sentence characterization alignment model framework provided in an embodiment of the present invention, where (a) and (b) are schematic diagrams of chinese-english sentence characterization alignment model frameworks where anchor samples are chinese sentences and english sentences, respectively;
fig. 5 is a diagram illustrating a result of aligning chinese and english sentence representations according to an embodiment of the present invention.
Mode for carrying out the invention
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a Chinese and English semantic similarity calculation method for paragraph-level text. First, a Chinese-English sentence representation alignment method that removes false negative samples aligns the sentence representations of different languages into a common representation space, where the similarity of sentences is obtained by calculating the distance between sentence vectors; this provides the word- and sentence-level representation basis for Chinese and English paragraph characterization. Second, a multi-level paragraph characterization method models the paragraph text at the three levels of subject words, sentences and paragraphs, performs information interaction within and between these levels with a graph attention network, and then fuses the information of the three levels to obtain a high-quality paragraph characterization vector; the semantic similarity of the Chinese paragraph and the English paragraph is obtained by calculating the distance between the paragraph characterization vectors.
As shown in fig. 1, the method for calculating Chinese and English semantic similarity for paragraph-level text provided by this embodiment includes three stages: Chinese-English sentence representation alignment with false negative samples removed; multi-level Chinese-English paragraph characterization based on the graph attention network; and Chinese-English cross-language paragraph semantic similarity calculation.
S1: First stage, Chinese-English sentence representation alignment with false negative samples removed.
The Chinese-English sentence representation alignment method that removes false negative samples uses a Chinese-English parallel sentence pair data set as training data. First, anchor samples are selected and the corresponding positive and negative samples are generated on the basis of the parallel sentence pairs; the negative samples are then screened, false negative samples are filtered out, and true negative samples are retained. After the Chinese-English sentence representation alignment model is constructed, it is trained with the anchor, positive and negative samples. After training, a Chinese-English sentence representation model is obtained that encodes Chinese and English sentences into a common semantic representation space, where the semantic similarity between sentences is obtained by calculating the distance between the characterization vectors of the Chinese sentence and the English sentence.
This embodiment uses a feature extraction method based on deep semantics: because a large number of Chinese-English parallel sentences are available as supervision, a supervised sentence representation alignment method can align the sentence semantics of different languages into a common characterization space, and the similarity of sentences in that space is obtained by calculating the distance between sentence vectors, without first translating into a single language. This avoids the semantic loss that translation deviations may cause in machine translation.
As shown in fig. 2, the Chinese-English sentence representation alignment with false negative samples removed includes the following steps.
S11: Select anchor samples and generate the corresponding positive and negative samples on the basis of the Chinese-English parallel sentence pair data set.
For an anchor sample that is a Chinese sentence, the positive sample is an English sentence with similar semantics and the negative samples are English sentences with dissimilar semantics. The quality of the positive and negative samples strongly influences the training of the Chinese-English sentence characterization alignment model, and poor samples reduce the characterization capability of the model. The only training data set currently available is a Chinese-English parallel sentence pair data set, in which each pair consists of a parallel Chinese sentence and English sentence. Suppose a training batch contains $N$ Chinese-English parallel sentence pairs $\{(x_i, y_i)\}_{i=1}^{N}$. For a Chinese sentence $x_i$, the positive sample is the English sentence $y_i$ parallel to it, and the negative samples are the English sentences $y_j$ ($j \neq i$) of the other parallel pairs in the same training batch. In the same way, when an English sentence $y_i$ serves as the anchor sample, its positive sample is the Chinese sentence $x_i$ parallel to it, and its negative samples are the Chinese sentences $x_j$ ($j \neq i$) of the other parallel pairs in the same training batch.
S12: Screen the negative samples, filter out the false negative samples, and retain the true negative samples.
A negative sample $y_j$ is required to be semantically dissimilar to the anchor sample $x_i$, but a negative sample constructed from another Chinese-English parallel pair in the same training batch does not necessarily satisfy this condition, so the negative samples must be screened. For a negative sample $y_j$, if its semantic similarity $\operatorname{sim}(x_i, y_j)$ to the anchor sample $x_i$ is greater than a set threshold $\delta$, the negative sample is semantically similar to the anchor sample and does not meet the requirement for a negative sample; that is, it is a false negative sample. If the characterization distance between the anchor sample and a false negative sample is pushed apart, the alignment of sentence characterizations is disturbed and the performance of the Chinese-English sentence characterization alignment model is harmed.
However, because the anchor sample $x_i$ and the negative sample $y_j$ are sentences in different languages, and existing Chinese-English cross-language sentence semantic similarity techniques are not yet mature (their precision is low), their similarity cannot be calculated directly; a method for calculating the similarity indirectly is therefore proposed. When the anchor sample is a Chinese sentence $x_i$, the Chinese sentence $x_j$ parallel to the English negative sample $y_j$ can be used: the similarity of $x_i$ and $x_j$ is calculated to obtain $\operatorname{sim}_{zh}(x_i, x_j)$. In the same way, the English sentence $y_i$ parallel to $x_i$ can be used: the similarity of $y_i$ and the English sentence $y_j$ is calculated to obtain $\operatorname{sim}_{en}(y_i, y_j)$. Here $\operatorname{sim}_{zh}$ and $\operatorname{sim}_{en}$ respectively denote monolingual Chinese sentence similarity calculation and monolingual English sentence similarity calculation. In a specific implementation, the semantic similarity between the anchor sample and a negative sample is calculated with a mature monolingual sentence semantic similarity model. For the Chinese sentence similarity, the Chinese sentences are encoded with the Chinese sentence encoder RoBERTa-wwm-ext (Chinese version), the word feature vectors obtained after encoding are average-pooled into the feature vector of each Chinese sentence, and the cosine similarity of the two Chinese sentence feature vectors gives the Chinese sentence similarity $\operatorname{sim}_{zh}$. For the English sentence similarity, the English sentences are encoded with the English sentence encoder bert-base-uncased (English version), the word feature vectors obtained after encoding are average-pooled into the feature vector of each English sentence, and the cosine similarity of the two English sentence feature vectors gives the English sentence similarity $\operatorname{sim}_{en}$.
A larger similarity between a negative sample and the anchor sample indicates a false negative sample. To improve the robustness of the false negative judgment, in a preferred embodiment of the invention the semantic similarity of the Chinese sentence $x_i$ and the English sentence $y_j$ is taken as the maximum of the two monolingual similarities:

$$\operatorname{sim}(x_i, y_j) = \max\big(\operatorname{sim}_{zh}(x_i, x_j), \; \operatorname{sim}_{en}(y_i, y_j)\big)$$

After the similarity is obtained, a threshold $\delta$ is set. If the similarity of a negative sample to the anchor sample satisfies $\operatorname{sim}(x_i, y_j) > \delta$, the negative sample is a false negative sample; it cannot be used to train the model, and its weight must be set to 0 to remove its influence on the model training. The weight assignment for negative samples is expressed as follows:

$$\alpha_{ij} = \begin{cases} 0, & \operatorname{sim}(x_i, y_j) > \delta \\ 1, & \text{otherwise} \end{cases}$$

where $\alpha_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$. Note that the above takes a Chinese sentence $x_i$ as the anchor sample for illustration; fig. 3 is a schematic diagram of the positive and negative sample construction process. In the same way, when an English sentence $y_i$ serves as the anchor sample, the screening of false negative samples follows the same principle, with only the processed objects adjusted accordingly.
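To make the screening step concrete, the following is a minimal PyTorch sketch of the negative-sample weight assignment. It assumes the monolingual sentence vectors have already been computed (e.g. by mean-pooling RoBERTa-wwm-ext and bert-base-uncased outputs); the function name, tensor layout and the threshold value delta=0.85 are illustrative assumptions, since the text only specifies "a set threshold".

```python
import torch
import torch.nn.functional as F

def negative_weights(zh_vecs: torch.Tensor, en_vecs: torch.Tensor, delta: float = 0.85) -> torch.Tensor:
    """alpha[i, j] = 0 if y_j is a false negative for anchor x_i, else 1.

    zh_vecs: (N, d) monolingual vectors of the Chinese sentences x_1..x_N.
    en_vecs: (N, d) monolingual vectors of the English sentences y_1..y_N.
    """
    # sim_zh(x_i, x_j) and sim_en(y_i, y_j) as full N x N cosine matrices
    sim_zh = F.cosine_similarity(zh_vecs.unsqueeze(1), zh_vecs.unsqueeze(0), dim=-1)
    sim_en = F.cosine_similarity(en_vecs.unsqueeze(1), en_vecs.unsqueeze(0), dim=-1)
    sim = torch.maximum(sim_zh, sim_en)   # robust indirect similarity sim(x_i, y_j)
    alpha = (sim <= delta).float()        # false negatives (sim > delta) get weight 0
    alpha.fill_diagonal_(1.0)             # the diagonal is the positive pair, not a negative
    return alpha
```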
S13: and constructing a Chinese and English sentence representation alignment model.
The Chinese-English sentence representation alignment model is shown in fig. 4, in which fig. 4 (a) is the schematic diagram with a Chinese sentence as the anchor sample and fig. 4 (b) the schematic diagram with an English sentence as the anchor sample. A Chinese sentence enters the Chinese sentence encoder to obtain the feature vectors of the words in the sentence, and the feature vector of the Chinese sentence is obtained through an average pooling layer; an English sentence enters the English sentence encoder to obtain the feature vectors of the words in the sentence, and the feature vector of the English sentence is obtained through an average pooling layer. The English sentence encoder adopts bert-base-uncased (English version), a trained monolingual English sentence encoder that can semantically represent English sentences; the Chinese sentence encoder adopts RoBERTa-wwm-ext (Chinese version), a trained monolingual Chinese sentence encoder that can semantically represent Chinese sentences.
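As an illustration of this dual-branch structure, one branch can be sketched with the Hugging Face transformers library as below; the hub id hfl/chinese-roberta-wwm-ext for RoBERTa-wwm-ext and the class name are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceBranch(nn.Module):
    """One feature extraction branch: a pretrained sentence encoder + mean pooling."""

    def __init__(self, model_name: str):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)

    def forward(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        token_vecs = self.encoder(**batch).last_hidden_state      # (B, L, d) word feature vectors
        mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding in the mean
        return (token_vecs * mask).sum(1) / mask.sum(1)           # (B, d) sentence feature vectors

zh_branch = SentenceBranch("hfl/chinese-roberta-wwm-ext")  # Chinese sentence encoder branch
en_branch = SentenceBranch("bert-base-uncased")            # English sentence encoder branch
```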
S14: and training Chinese and English sentences to represent an alignment model.
S14: Train the Chinese-English sentence representation alignment model.
The Chinese-English sentence characterization alignment model is trained with the anchor samples, positive samples and negative samples. During training, the loss function is a variant of the InfoNCE loss. When the anchor sample is a Chinese sentence $x_i$, the positive sample is the English sentence $y_i$ and the negative samples are the $N-1$ English sentences $y_j$ ($j \neq i$). The alignment loss function $\mathcal{L}_{zh}$ is:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \alpha_{ij}\, e^{\operatorname{sim}(x_i,\,y_j)/\tau}}$$

where $\operatorname{sim}(x, y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $\alpha_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$; the subscripts $i, j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch. The alignment loss is equivalent to an $N$-way softmax classification in which the positive sample is the correct class and the $N-1$ negative samples are incorrect classes. The training target of the model is to pull the vector distance between the anchor sample and the positive sample closer and push the vector distance between the anchor sample and the negative samples apart, so that the representations of semantically similar sentences are drawn together in the common semantic space while those of semantically dissimilar sentences are pushed apart, achieving alignment into the same characterization space.
Similarly, when the anchor sample is an English sentence $y_i$, the alignment loss function $\mathcal{L}_{en}$ is:

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(x_i,\,y_i)/\tau}}{e^{\operatorname{sim}(x_i,\,y_i)/\tau} + \sum_{j \neq i} \beta_{ij}\, e^{\operatorname{sim}(x_j,\,y_i)/\tau}}$$

where $\beta_{ij}$ represents the weight of negative sample $x_j$ for anchor sample $y_i$.
The objective function of the Chinese-English sentence representation alignment model is as follows:

$$\min \; \big(\mathcal{L}_{zh} + \mathcal{L}_{en}\big)$$
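As a concrete illustration, the weighted bidirectional objective above can be sketched in PyTorch as follows; it assumes mean-pooled sentence embeddings and the weight matrices from the false-negative screening step, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(anchor_emb, other_emb, neg_weights, tau=0.05):
    """Weighted InfoNCE loss for one alignment direction.

    anchor_emb:  (N, d) embeddings of the anchor sentences.
    other_emb:   (N, d) embeddings of the other language; row i is the positive.
    neg_weights: (N, N) weights; 0 for false negatives, 1 for true negatives.
    """
    sim = F.cosine_similarity(anchor_emb.unsqueeze(1), other_emb.unsqueeze(0), dim=-1)
    exp_sim = (sim / tau).exp()                       # e^{sim(., .)/tau}
    n = exp_sim.size(0)
    off_diag = 1.0 - torch.eye(n, device=exp_sim.device)
    pos = exp_sim.diagonal()                          # positive pair terms
    neg = (neg_weights * off_diag * exp_sim).sum(1)   # weighted negative terms
    return -(pos / (pos + neg)).log().mean()

# total objective: align in both directions (zh -> en and en -> zh)
# loss = alignment_loss(zh_emb, en_emb, alpha) + alignment_loss(en_emb, zh_emb, beta)
```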
The aim of the overall training is to minimize the distance between the vectors of semantically similar sentences and maximize the distance between the vectors of semantically dissimilar sentences, so as to align the sentence characterization spaces of the two languages.
The objective function includes two alignment losses, corresponding to alignment in two directions (Chinese-to-English and English-to-Chinese). As a simple example, suppose the following four Chinese-English parallel sentence pairs are used to train the model (the Chinese member of each pair is shown here in English translation):
(1) Chinese: "English is the most widely used language." / English: "English is the most widely spoken language."
(2) Chinese: "Chinese is the language used by the most people." / English: "Chinese is the most spoken language."
(3) Chinese: "There are about 6909 languages existing in the world." / English: "There are about 6909 languages in existence in the world."
(4) Chinese: "English is most widely used." / English: "English is the most widely used."
When the loss for parallel pair (1) ["English is the most widely used language." / "English is the most widely spoken language."] is calculated to train the model:
1. When "English is the most widely used language." is the anchor sample, the positive sample is "English is the most widely spoken language.", and the negative samples are "Chinese is the most spoken language.", "There are about 6909 languages in existence in the world." and "English is the most widely used.". Screening the negative samples determines that {"Chinese is the most spoken language.", "There are about 6909 languages in existence in the world."} are true negative samples, while "English is the most widely used." is semantically similar to the anchor and is therefore a false negative sample. In the common representation space, "English is the most widely used language." should be pulled closer to "English is the most widely spoken language." and pushed away from the two true negative samples, while the false negative sample is not pushed away.
2. When "English is the most widely spoken language." is the anchor sample, the positive sample is "English is the most widely used language.", and the negative samples are "Chinese is the language used by the most people.", "There are about 6909 languages existing in the world." and "English is most widely used.". Screening the negative samples determines that the first two are true negative samples, while "English is most widely used." is semantically similar to the anchor and is therefore a false negative sample. In the common representation space, "English is the most widely spoken language." should be pulled closer to "English is the most widely used language." and pushed away from the two true negative samples, while the false negative sample is not pushed away.
After such bidirectional training, the sentence closest to "English is the most widely used language." in the characterization space is "English is the most widely spoken language.", and vice versa. If only one direction were trained, the following situation could arise: "English is the most widely used language." is closest to "English is the most widely spoken language.", but other sentences are closer to "English is the most widely spoken language." than it is; this fails the purpose of sentence representation alignment, so bidirectional alignment is required.
Fig. 5 shows the result after Chinese-English sentence representation alignment (illustrated with the anchor sample "English is the most widely used language."). Before alignment, the English sentence semantic representation space and the Chinese sentence semantic representation space are two independent spaces, and the distance between the characterization vectors of a Chinese sentence and an English sentence cannot reflect their semantic similarity. After alignment, the characterization distance between the anchor sample and the positive sample is pulled closer while the characterization distances between the anchor sample and the negative samples are pushed apart; the English and Chinese sentence semantic characterization spaces are thus aligned into a common sentence semantic characterization space, in which the distance between the characterization vectors of a Chinese sentence and an English sentence reflects their semantic similarity.
S15: and obtaining a Chinese and English sentence representation model.
After the training of the Chinese-English sentence representation alignment model is completed, its Chinese sentence encoder and English sentence encoder can be used as the Chinese-English sentence representation model. Chinese sentences and English sentences are encoded separately to obtain the corresponding vectors; the resulting Chinese and English sentence vectors lie in a common sentence semantic representation space, and the semantic similarity between sentences is obtained by calculating the distance between the vectors.
S2: and in the second stage, the multi-level Chinese and English paragraph representation based on the graph attention network.
The multi-level Chinese and English paragraph characterization based on the graph attention network specifically comprises the following processes:
S21: Model the Chinese/English paragraph from the three levels of subject words, sentences and paragraphs (multi-level information modeling of the paragraph).
First, subject word nodes and sentence nodes are established from the subject words and sentences of the paragraph, and a global paragraph node is added. Edge relations between sentence nodes and subject word nodes are established according to whether the sentence contains the subject word; edge relations between sentence nodes and between subject word nodes are established according to, respectively, the context relation of the sentences within the paragraph and the co-occurrence relation of the subject words within sentences; and the global paragraph node is connected to each subject word node and each sentence node by an edge. In this way the paragraph is modeled, from the three levels of subject words, sentences and paragraph, into a paragraph graph that serves as the input for the next step of paragraph characterization. The specific steps are as follows:
S211: Extract the keywords in the Chinese/English paragraph with the TF-IDF algorithm; keywords are the words that express the central content of the paragraph. Take the top $k$ keywords with the highest importance as the subject words of the paragraph and construct the corresponding subject word nodes $\{w_1, w_2, \dots, w_k\}$. For each of the $k$ subject words, record the sequence numbers of the sentences in which it appears and its word order within those sentences.
S212: Segment the Chinese/English paragraph into sentences and construct a sentence node for each sentence in the paragraph, expressed as $\{s_1, s_2, \dots, s_c\}$, where $c$ is the number of sentences in the Chinese/English paragraph.
S213: Construct a global paragraph node $g$ for the Chinese/English paragraph.
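To make S211 concrete, the following is a minimal sketch using scikit-learn's TfidfVectorizer. The function name, the corpus argument used to fit the IDF statistics, and whitespace tokenization are assumptions (Chinese text would first need word segmentation before this applies).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_subject_words(paragraph_sentences, corpus, k=5):
    """S211 sketch: top-k TF-IDF keywords of a paragraph as its subject words,
    plus the sentence numbers and word orders where each one occurs."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)                                   # IDF statistics from a reference corpus
    scores = vectorizer.transform([" ".join(paragraph_sentences)]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    top = scores.argsort()[::-1][:k]
    subject_words = [vocab[i] for i in top if scores[i] > 0]
    occurrences = {                                          # subject word -> [(sentence no., word order)]
        w: [(si, wi)
            for si, sent in enumerate(paragraph_sentences)
            for wi, tok in enumerate(sent.lower().split())
            if tok == w]
        for w in subject_words
    }
    return subject_words, occurrences
```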
S214: Initialize the features of the nodes with the Chinese-English sentence representation model.
Suppose the constructed paragraph graph has $k$ subject word nodes, $c$ sentence nodes and 1 global node, $n = k + c + 1$ nodes in total, ordered as subject word nodes, sentence nodes, global node: $\{w_1, \dots, w_k, s_1, \dots, s_c, g\}$. The node features are initialized as follows:
For each subject word node $w_i$: the sentence containing the subject word is first input into the Chinese-English sentence representation model; after the sentence is encoded by the Chinese/English sentence encoder, a feature vector is obtained for each word, and the feature vector of the subject word is extracted according to the recorded word order as the initialization vector of the subject word node. If the subject word appears multiple times in the paragraph, the feature vectors of the subject word at each position are obtained by this method and then average-pooled to give the initialized feature of the subject word node.
For the $k$ subject word nodes this yields the feature vectors $\{h_{w_1}^{0}, \dots, h_{w_k}^{0}\}$. The obtained subject word features contain the context information of the sentence; since words in different sentences can have different semantics depending on context, this effectively addresses the problem of polysemy.
The initialized feature vector of a sentence node is the feature vector obtained by encoding the sentence with the Chinese-English sentence representation model and average pooling; for the $c$ sentence nodes this yields $\{h_{s_1}^{0}, \dots, h_{s_c}^{0}\}$.
The initialized feature vector of the global node, $h_g^{0}$, is obtained by average-pooling the initialized feature vectors of all the subject word nodes and sentence nodes.
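A minimal sketch of the S214 initialization follows, assuming the per-word encoder outputs of each sentence are already available and aligned to words (subword-to-word pooling is glossed over); the function name and input formats are assumptions.

```python
import torch

def init_node_features(word_vecs_per_sentence, sentence_vecs, occurrences):
    """S214 sketch: build the initial features of the n = k + c + 1 nodes.

    word_vecs_per_sentence: list of (L_i, d) tensors, per-word vectors of each sentence.
    sentence_vecs:          (c, d) mean-pooled sentence vectors.
    occurrences:            {subject word: [(sentence no., word order), ...]}.
    """
    subject_feats = [
        torch.stack([word_vecs_per_sentence[si][wi] for si, wi in occ]).mean(0)
        for occ in occurrences.values()          # average over every occurrence position
    ]
    hw = torch.stack(subject_feats)              # (k, d) subject word node features
    hg = torch.cat([hw, sentence_vecs]).mean(0, keepdim=True)  # (1, d) global node feature
    return torch.cat([hw, sentence_vecs, hg])    # (n, d), ordered: subject, sentence, global
```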
S215: establishing an edge relation between nodes by the following method:
(1) If the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
(2) If the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
(3) If the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
(4) The global paragraph nodes are connected with all sentence nodes and subject term nodes by establishing an edge;
(5) Each node establishes an edge to which it is connected.
S216: Model and represent the edge relations between nodes.
For node $p$ and node $q$, if there is an edge connecting them, they are each other's neighbor nodes (a node also counts as its own neighbor). The modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{node } p \text{ and node } q \text{ are neighbor nodes} \\ 0, & \text{otherwise} \end{cases}$$

where $e_{pq}$ represents the edge relation between node $p$ and node $q$.
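The edge rules of S215 and the relation matrix of S216 can be sketched together as follows; the input formats and the function name are assumptions.

```python
import torch

def build_edge_relations(k, c, subject_in_sentence, cooccurring_subjects):
    """Edge relation matrix e (n x n, 0/1) for the paragraph graph.

    Node order: k subject word nodes, then c sentence nodes, then the global node.
    subject_in_sentence:  iterable of (subject idx, sentence idx) pairs (rule 1).
    cooccurring_subjects: iterable of (subject idx, subject idx) pairs (rule 3).
    """
    n = k + c + 1
    e = torch.zeros(n, n)
    for a, s in subject_in_sentence:           # sentence contains subject word
        e[a, k + s] = e[k + s, a] = 1.0
    for s in range(c - 1):                     # adjacent sentences in the paragraph
        e[k + s, k + s + 1] = e[k + s + 1, k + s] = 1.0
    for a, b in cooccurring_subjects:          # subject words co-occurring in a sentence
        e[a, b] = e[b, a] = 1.0
    e[n - 1, :] = 1.0                          # global node connects to every node
    e[:, n - 1] = 1.0
    e.fill_diagonal_(1.0)                      # self-loops: each node is its own neighbour
    return e
```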
S22: information interaction and information fusion based on the graph attention network.
The paragraph graph is processed by a graph attention network, information is passed among all nodes, and the paragraph characterization vector of the Chinese/English paragraph is finally generated by fusing the information of the subject word, sentence and global paragraph levels.
For each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, \dots, h_n^{(l-1)}\}$ output by the previous layer serve as the input of the current layer, which outputs the feature vector of each node after information interaction, $\{h_1^{(l)}, \dots, h_n^{(l)}\}$, where $n$ is the total number of nodes. The input of the first-layer graph attention network is the initialized feature vectors of all nodes, $\{h_1^{0}, \dots, h_n^{0}\}$, together with the edge relations between the nodes. Each layer is constructed as follows:
To obtain sufficient expressive power, a shared linear transformation $W$ is learned to transform the input features of the nodes into deeper features. Self-attention is then applied between nodes to compute the weights of the edges between them: if node $p$ and node $q$ are connected by an edge, the weight of the edge relation $e_{pq}$ is calculated as

$$z_{pq} = \operatorname{sim}(W h_p, \; W h_q)$$

where $z_{pq}$ represents the importance of node $q$ to node $p$, and $\operatorname{sim}(\cdot,\cdot)$ represents a similarity function between two vectors; cosine similarity is used here. In the information passing process, the information of each node's neighbor nodes is passed to the node, the neighbor nodes including the node itself, and the weights of the edge relations over the neighbor nodes of node $p$ are normalized with softmax:

$$a_{pq} = \operatorname{softmax}(z_{pq}) = \frac{e^{z_{pq}}}{\sum_{m \in \mathcal{N}_p} e^{z_{pm}}}$$

where $\mathcal{N}_p$ is the set of neighbor nodes of node $p$ and $m$ denotes a neighbor node. With the normalized attention weights, the feature of node $p$ can be updated; the updated feature vector $h_p'$ of node $p$ is represented as follows:

$$h_p' = \sigma\Big(\sum_{q \in \mathcal{N}_p} a_{pq}\, W h_q\Big)$$

where $\sigma$ represents a non-linear transformation function.
The feature vectors finally output by each graph attention layer are input into the next graph attention layer, and the final feature vector of each node, $\{h_1, h_2, \dots, h_n\}$, is obtained after the output of the last graph attention layer.
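A minimal PyTorch sketch of one such layer is given below; the choice of ReLU for the non-linear function sigma and the class name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One layer: z_pq = cos(W h_p, W h_q), softmax over neighbours, weighted sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # shared linear transformation W

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        """h: (n, d) node features; e: (n, n) 0/1 edge relations incl. self-loops."""
        wh = self.W(h)                                                      # deeper features
        z = F.cosine_similarity(wh.unsqueeze(1), wh.unsqueeze(0), dim=-1)   # z_pq
        z = z.masked_fill(e == 0, float("-inf"))   # only neighbours take part
        a = torch.softmax(z, dim=1)                # a_pq, normalised per node p
        return torch.relu(a @ wh)                  # h'_p = sigma(sum_q a_pq W h_q)
```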
The final feature vectors of all nodes are fused to obtain the paragraph characterization vector of the Chinese/English paragraph. First, the feature vectors of the subject word nodes and of the sentence nodes are average-pooled respectively, expressed as follows:

$$h_w = \operatorname{mean}(h_{w_1}, \dots, h_{w_k}), \qquad h_s = \operatorname{mean}(h_{s_1}, \dots, h_{s_c})$$

where $h_w$ and $h_s$ respectively represent the pooled subject word characterization vector and the pooled sentence characterization vector. The pooled subject word characterization vector, the pooled sentence characterization vector and the feature vector of the global paragraph node are then concatenated, and the dimension is compressed through two fully connected layers to obtain the paragraph characterization vector $t$ of the Chinese/English paragraph, expressed as follows:

$$t = W_2 \big( W_1 [h_w; h_s; h_g] + b_1 \big) + b_2$$

where $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent their bias coefficients.
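The fusion step can be sketched as below; the class name and the hidden width of the first fully connected layer are assumptions, and the two linear layers follow the formula above.

```python
import torch
import torch.nn as nn

class ParagraphFusion(nn.Module):
    """t = W2 (W1 [h_w; h_s; h_g] + b1) + b2, after mean-pooling the node groups."""

    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(3 * dim, dim)     # first fully connected layer (W1, b1)
        self.fc2 = nn.Linear(dim, out_dim)     # second fully connected layer (W2, b2)

    def forward(self, h: torch.Tensor, k: int, c: int) -> torch.Tensor:
        """h: (k + c + 1, d) final node features, ordered subject / sentence / global."""
        h_w = h[:k].mean(dim=0)                # pooled subject word characterization vector
        h_s = h[k:k + c].mean(dim=0)           # pooled sentence characterization vector
        h_g = h[k + c]                         # global paragraph node feature vector
        return self.fc2(self.fc1(torch.cat([h_w, h_s, h_g], dim=-1)))
```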
S3: Third stage, Chinese-English cross-language paragraph semantic similarity calculation.
The distance between the paragraph characterization vector $t_{zh}$ of the Chinese paragraph and the paragraph characterization vector $t_{en}$ of the English paragraph is calculated to obtain the semantic similarity of the Chinese paragraph and the English paragraph:

$$\operatorname{sim}(t_{zh}, t_{en}) = \cos(t_{zh}, t_{en})$$

where $\cos(\cdot, \cdot)$ calculates the cosine similarity between the vectors.
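In code, this final step reduces to a cosine similarity between the two paragraph characterization vectors, for example:

```python
import torch.nn.functional as F

def paragraph_similarity(t_zh, t_en):
    """Semantic similarity of a Chinese and an English paragraph vector (cosine)."""
    return F.cosine_similarity(t_zh, t_en, dim=-1)
```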
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A Chinese and English semantic similarity calculation method for paragraph level text is characterized by comprising the following steps:
respectively extracting paragraph characterization vectors of Chinese paragraphs and English paragraphs, which comprises the following processes:
constructing a subject word node for each subject word in the Chinese paragraph/English paragraph, constructing a sentence node for each sentence in the Chinese paragraph/English paragraph, and constructing a global paragraph node for the Chinese paragraph/English paragraph;
extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes;
establishing an edge relation between sentence nodes and subject word nodes according to whether a sentence contains a subject word, establishing an edge relation between sentence nodes according to the context relation of the sentences in the corresponding paragraph, establishing an edge relation between subject word nodes according to the co-occurrence relation of the subject words in the sentences, and connecting the global paragraph node with each subject word node and each sentence node by an edge; each node establishes an edge connected with the node; modeling the edge relations between nodes;
inputting the initialized feature vectors of all nodes and the edge relation among the nodes into the attention network of the graph, and outputting the feature vectors of all the nodes after information interaction;
respectively average-pooling the feature vectors of the subject word nodes and of the sentence nodes, then concatenating the pooled subject word feature vector, the pooled sentence feature vector and the feature vector of the global paragraph node and reducing the dimension to obtain the paragraph characterization vector of the Chinese/English paragraph;
and calculating the distance between the paragraph characterization vector of the Chinese paragraph and the paragraph characterization vector of the English paragraph to obtain the semantic similarity of the Chinese paragraph and the English paragraph.
2. The method for calculating Chinese and English semantic similarity of paragraph-level text according to claim 1, wherein before performing paragraph characterization vector extraction on Chinese paragraphs and English paragraphs, respectively, the method further comprises:
and training a data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model, wherein the Chinese-English sentence representation alignment model is used for extracting initialized feature vectors of subject word nodes, sentence nodes and global paragraph nodes.
3. The method for calculating Chinese-English semantic similarity for paragraph-level text according to claim 2, wherein the training of the data set based on Chinese-English parallel sentences to obtain a Chinese-English sentence representation alignment model comprises:
selecting an anchor sample from the Chinese and English parallel sentence data set, wherein sentences parallel to the anchor sample are positive samples, and sentences in other Chinese and English parallel sentences, which are different from the anchor sample in language, are negative samples;
for each negative sample, if the semantic similarity between the negative sample and the anchor sample is greater than a set threshold, the negative sample is a false negative sample, and the weight of the false negative sample is assigned to be 0; otherwise, assigning the weight of the negative sample to be 1;
training the Chinese and English sentence representation alignment model with the training set obtained after removing the false negative samples, to obtain the trained Chinese and English sentence representation alignment model; the Chinese and English sentence representation alignment model comprises two feature extraction branches: one branch comprises a Chinese sentence encoder and an average pooling layer, and the other branch comprises an English sentence encoder and an average pooling layer.
4. The method for calculating Chinese and English semantic similarity for paragraph-level text according to claim 3, wherein the semantic similarity between each negative sample and the anchor sample is calculated as follows:
calculating a first semantic similarity between the sentence parallel to the negative sample and the anchor sample using a monolingual sentence semantic similarity calculation model, as the semantic similarity between the negative sample and the anchor sample;
or calculating a second semantic similarity between the negative sample and the sentence parallel to the anchor sample using the monolingual sentence semantic similarity calculation model, as the semantic similarity between the negative sample and the anchor sample;
or taking the maximum of the first semantic similarity and the second semantic similarity as the semantic similarity between the negative sample and the anchor sample.
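Illustrative sketch (not part of the claims): the false-negative weighting of claims 3 and 4 can be read as the rule below. The inputs (two monolingual similarity scores) and the threshold value are assumptions for illustration; the claims do not fix the threshold.

```python
# Sketch of claims 3-4: weight a negative sample 0 (false negative) when its
# estimated similarity to the anchor exceeds a threshold, else 1.
def negative_weight(sim_in_anchor_language: float,
                    sim_in_negative_language: float,
                    threshold: float = 0.85) -> int:
    # Third option of claim 4: take the maximum of the two monolingual
    # similarities as the negative/anchor similarity. 0.85 is illustrative.
    similarity = max(sim_in_anchor_language, sim_in_negative_language)
    return 0 if similarity > threshold else 1
```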
5. The method for calculating Chinese and English semantic similarity for paragraph-level text according to claim 3, wherein, when training the Chinese and English sentence representation alignment model, the objective function is expressed as follows:

$$\min \; \mathcal{L} = \mathcal{L}_{zh} + \mathcal{L}_{en}$$

wherein $\mathcal{L}_{zh}$ and $\mathcal{L}_{en}$ are the alignment loss functions for Chinese anchor samples and English anchor samples, respectively, expressed as follows:

$$\mathcal{L}_{zh} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)}{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)+\sum_{j\neq i} w_{ij}\exp\!\left(\mathrm{sim}(x_i,y_j)/\tau\right)}$$

$$\mathcal{L}_{en} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)}{\exp\!\left(\mathrm{sim}(x_i,y_i)/\tau\right)+\sum_{j\neq i} w_{ji}\exp\!\left(\mathrm{sim}(x_j,y_i)/\tau\right)}$$

wherein $\mathrm{sim}(x,y)$ represents the vector similarity of Chinese sentence $x$ and English sentence $y$; $\tau$ represents a temperature hyperparameter; $w_{ij}$ represents the weight of negative sample $y_j$ for anchor sample $x_i$; the subscripts $i,j$ index the Chinese-English parallel sentence pairs, and $N$ is the total number of Chinese-English parallel sentence pairs in the training batch.
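Illustrative sketch (not part of the claims): the objective reconstructed above is a weighted InfoNCE-style contrastive loss. The PyTorch code below implements that reading, assuming L2-normalized sentence embeddings so that sim(x, y) is a dot product; the temperature default is illustrative.

```python
# Weighted bidirectional InfoNCE, matching the reconstructed claim-5 loss.
import torch
import torch.nn.functional as F

def alignment_loss(zh: torch.Tensor, en: torch.Tensor,
                   w: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """zh, en: (N, d) embeddings of N Chinese-English parallel pairs;
    w: (N, N) 0/1 negative-sample weights (w[i][j] for anchor i, negative j)."""
    zh = F.normalize(zh, dim=-1)
    en = F.normalize(en, dim=-1)
    logits = zh @ en.t() / tau            # sim(x_i, y_j) / tau via dot product
    keep = w.clone().float()
    keep.fill_diagonal_(1.0)              # always keep the positive pair
    exp = torch.exp(logits) * keep        # false negatives contribute zero
    pos = torch.diagonal(exp)
    loss_zh = -torch.log(pos / exp.sum(dim=1)).mean()   # Chinese anchors
    loss_en = -torch.log(pos / exp.sum(dim=0)).mean()   # English anchors
    return loss_zh + loss_en
```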
6. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the constructing a subject word node for each subject word in the Chinese paragraph/English paragraph and a sentence node for each sentence in the Chinese paragraph/English paragraph comprises:
extracting keywords from the Chinese paragraph/English paragraph by the TF-IDF algorithm, taking the top $k$ keywords with the highest importance as the subject words of the Chinese paragraph/English paragraph, and correspondingly constructing subject word nodes $\{t_1, t_2, \ldots, t_k\}$; recording, for each of the $k$ subject words, the sequence number of each sentence in which it occurs and its word position within that sentence;
segmenting the Chinese paragraph/English paragraph into sentences, and constructing a sentence node for each sentence, the sentence nodes being expressed as $\{s_1, s_2, \ldots, s_c\}$, wherein $c$ is the number of sentences in the Chinese paragraph/English paragraph.
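Illustrative sketch (not part of the claims): top-k subject word selection per claim 6 using scikit-learn's TfidfVectorizer. Chinese text is assumed to be pre-segmented into space-separated tokens (e.g. with a tokenizer such as jieba); `k` and the corpus are illustrative.

```python
# TF-IDF keyword extraction: rank the paragraph's terms by TF-IDF score and
# keep the k highest-scoring ones as subject words.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_subject_words(paragraph: str, corpus: list[str], k: int = 5) -> list[str]:
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus + [paragraph])               # IDF from a background corpus
    scores = vectorizer.transform([paragraph]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(scores, vocab), reverse=True)  # highest TF-IDF first
    return [word for score, word in ranked[:k] if score > 0]
```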
7. The method for calculating Chinese-English semantic similarity of paragraph-level text according to any one of claims 2 to 5, wherein the extracting initialized feature vectors of subject term nodes, sentence nodes and global paragraph nodes comprises:
for each subject word node, according to the recorded sentence sequence numbers and word positions, encoding each sentence containing the subject word with the Chinese sentence encoder/English sentence encoder of the Chinese and English sentence representation alignment model to obtain the feature vector of each word in the sentence, and extracting the feature vector of the subject word according to its word position in the sentence as the initialized feature vector of the subject word node; if the subject word appears multiple times in the Chinese paragraph/English paragraph, performing average pooling on the feature vectors of the subject word at each position to obtain the initialized feature vector of the subject word node;
for each sentence node, encoding the sentence with the Chinese sentence encoder/English sentence encoder of the Chinese and English sentence representation alignment model to obtain the feature vector of each word, and then performing average pooling to obtain the initialized feature vector of the sentence node;
performing average pooling on the initialized feature vectors of all subject word nodes and sentence nodes to obtain the initialized feature vector of the global paragraph node.
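Illustrative sketch (not part of the claims): the three initialization rules of claim 7, written against pre-computed per-token encoder outputs. The sentence encoder itself is abstracted away; the tensor shapes are assumptions noted in the docstrings.

```python
import torch

def init_subject_word_node(token_vecs_per_occurrence: list[torch.Tensor],
                           positions: list[int]) -> torch.Tensor:
    """token_vecs_per_occurrence[i]: (seq_len_i, d) token vectors of the sentence
    containing the i-th occurrence; positions[i]: the subject word's index in
    that sentence. Multiple occurrences are average-pooled per the claim."""
    vecs = [tv[pos] for tv, pos in zip(token_vecs_per_occurrence, positions)]
    return torch.stack(vecs).mean(dim=0)

def init_sentence_node(token_vecs: torch.Tensor) -> torch.Tensor:
    """Mean-pool the token vectors of one sentence: (seq_len, d) -> (d,)."""
    return token_vecs.mean(dim=0)

def init_global_node(word_nodes: torch.Tensor, sent_nodes: torch.Tensor) -> torch.Tensor:
    """Average the initialized subject word and sentence node vectors."""
    return torch.cat([word_nodes, sent_nodes], dim=0).mean(dim=0)
```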
8. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the edge relation between nodes is determined by the following method:
if the sentence A contains the subject word a, establishing an edge between the sentence node corresponding to the sentence A and the subject word node corresponding to the subject word a for connection;
if the sentence A and the sentence B have a context relationship in the Chinese paragraph/English paragraph, establishing an edge between the sentence node corresponding to the sentence A and the sentence node corresponding to the sentence B for connection;
if the subject word a and the subject word b have a co-occurrence relation in the sentences of the Chinese paragraphs/English paragraphs, establishing an edge between the corresponding subject word nodes for connection;
the global paragraph node is connected to every sentence node and subject word node by an edge;
each node also establishes an edge connected to itself (a self-loop);
the modeling of the edge relation between nodes is represented as follows:

$$e_{pq} = \begin{cases} 1, & \text{if an edge connects node } p \text{ and node } q \\ 0, & \text{otherwise} \end{cases}$$

wherein $e_{pq}$ represents the edge relation between node $p$ and node $q$.
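Illustrative sketch (not part of the claims): one way to materialize the binary edge relation of claim 8 as an adjacency matrix. The node ordering (subject words, then sentences, then the global node) is an assumption of this sketch.

```python
import numpy as np

def build_edges(word_in_sent: list[set[int]], k: int, c: int) -> np.ndarray:
    """word_in_sent[s]: indices of the subject words occurring in sentence s;
    k subject word nodes, c sentence nodes, plus one global paragraph node."""
    n = k + c + 1
    e = np.eye(n, dtype=int)              # self-loops: each node links to itself
    g = n - 1                             # index of the global paragraph node
    for s, words in enumerate(word_in_sent):
        for a in words:
            e[a, k + s] = e[k + s, a] = 1         # word-sentence containment
            for b in words:
                if a != b:
                    e[a, b] = e[b, a] = 1         # word-word co-occurrence
    for s in range(c - 1):
        e[k + s, k + s + 1] = e[k + s + 1, k + s] = 1  # adjacent sentences
    e[g, :] = e[:, g] = 1                 # global node connects to every node
    return e
```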
9. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the inputting the initialized feature vectors of all nodes and the edge relations between the nodes into a graph attention network and outputting the feature vectors of each node after information interaction comprises:
for each layer of the graph attention network, the node feature vectors $\{h_1^{(l-1)}, h_2^{(l-1)}, \ldots, h_n^{(l-1)}\}$ output by the previous layer are taken as the input of the current layer, and the feature vectors $\{h_1^{(l)}, h_2^{(l)}, \ldots, h_n^{(l)}\}$ of each node after information interaction are output, wherein $n$ is the total number of nodes; the input of the first layer is the initialized feature vectors of all nodes together with the edge relations between the nodes; each layer of the graph attention network is constructed as follows:
a shared linear transformation $W$ is learned, and self-attention between nodes is used to compute the weight of the edges between nodes; if an edge connects node $p$ and node $q$, the weight of the edge relation $e_{pq}$ is calculated as follows:

$$z_{pq} = \mathrm{sim}\!\left(W h_p, W h_q\right)$$

wherein $z_{pq}$ indicates the importance of node $q$ to node $p$, and $\mathrm{sim}(\cdot,\cdot)$ represents a similarity function between two vectors;
the information of the neighbor nodes of each node is propagated to that node, the neighbor nodes including the node itself; the weights of the edge relations between a node and its neighbor nodes are normalized by softmax:

$$\alpha_{pq} = \mathrm{softmax}\!\left(z_{pq}\right) = \frac{\exp\!\left(z_{pq}\right)}{\sum_{m\in\mathcal{N}_p}\exp\!\left(z_{pm}\right)}$$

wherein $\mathcal{N}_p$ is the set of neighbor nodes of node $p$, and $m$ indexes a node in $\mathcal{N}_p$;
the updated feature vector $h_p'$ of node $p$ is expressed as follows:

$$h_p' = \sigma\!\left(\sum_{m\in\mathcal{N}_p}\alpha_{pm} W h_m\right)$$

wherein $\sigma$ represents a nonlinear transformation function;
the feature vectors of each node output by each layer of the graph attention network are input into the next layer, and the final feature vector $\hat{h}_p$ of each node is obtained from the output of the last layer of the graph attention network.
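Illustrative sketch (not part of the claims): a single graph attention layer following claim 9, with dot product standing in for the unspecified similarity function and ReLU for the unspecified nonlinearity $\sigma$; both choices are assumptions.

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # shared linear transformation

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        """h: (n, d) node features; e: (n, n) 0/1 edge-relation tensor with
        self-loops, e.g. torch.from_numpy(build_edges(...)) from the claim-8 sketch."""
        wh = self.W(h)
        z = wh @ wh.t()                            # z_pq = sim(Wh_p, Wh_q)
        z = z.masked_fill(e == 0, float('-inf'))   # attend only over neighbors N_p
        alpha = torch.softmax(z, dim=-1)           # softmax normalization of weights
        return torch.relu(alpha @ wh)              # h'_p = sigma(sum alpha_pm W h_m)
```

Stacking several such layers and feeding each layer's output into the next reproduces the multi-layer interaction described in the claim.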
10. The method for calculating Chinese and English semantic similarity for paragraph-level text according to any one of claims 1 to 5, wherein the paragraph characterization vector of the Chinese paragraph/English paragraph is obtained as follows:
denoting the feature vectors of the subject word nodes by $\{t_1, t_2, \ldots, t_k\}$, the feature vectors of the sentence nodes by $\{s_1, s_2, \ldots, s_c\}$, and the feature vector of the global paragraph node by $h_g$, wherein $k$ is the total number of subject word nodes and $c$ is the total number of sentence nodes;
respectively performing average pooling on the feature vectors of the subject word nodes and of the sentence nodes, expressed as follows:

$$\bar{t} = \mathrm{MeanPooling}\!\left(t_1, t_2, \ldots, t_k\right)$$

$$\bar{s} = \mathrm{MeanPooling}\!\left(s_1, s_2, \ldots, s_c\right)$$

wherein $\bar{t}$ and $\bar{s}$ respectively represent the average-pooled characterization vector of the subject word nodes and of the sentence nodes;
concatenating the average-pooled subject word node characterization vector, the average-pooled sentence node characterization vector and the feature vector of the global paragraph node, and compressing the dimension through two fully connected layers to obtain the paragraph characterization vector $v$ of the Chinese paragraph/English paragraph, expressed as follows:

$$v = W_2\!\left(W_1\left[\bar{t}\,;\bar{s}\,;h_g\right] + b_1\right) + b_2$$

wherein $h_g$ represents the feature vector of the global paragraph node, $W_1$ and $W_2$ respectively represent the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ respectively represent the bias coefficients of the two fully connected layers.
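Illustrative sketch (not part of the claims): the fusion step of claim 10. Only the structure follows the claim (mean pooling, concatenation, two fully connected layers with no intermediate nonlinearity, matching the reconstructed formula); the layer dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ParagraphFusion(nn.Module):
    """Concatenate pooled node vectors and compress through two FC layers."""
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(3 * dim, dim)      # weight W1, bias b1
        self.fc2 = nn.Linear(dim, out_dim)      # weight W2, bias b2

    def forward(self, word_nodes: torch.Tensor,
                sent_nodes: torch.Tensor, h_g: torch.Tensor) -> torch.Tensor:
        t_bar = word_nodes.mean(dim=0)          # average-pool subject word nodes
        s_bar = sent_nodes.mean(dim=0)          # average-pool sentence nodes
        v = torch.cat([t_bar, s_bar, h_g])      # [t_bar; s_bar; h_g]
        return self.fc2(self.fc1(v))            # two-layer dimension compression
```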
CN202310085688.7A 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text Active CN115828931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085688.7A CN115828931B (en) 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text

Publications (2)

Publication Number Publication Date
CN115828931A (en) 2023-03-21
CN115828931B (en) 2023-05-02

Family

ID=85520932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085688.7A Active CN115828931B (en) 2023-02-09 2023-02-09 Chinese and English semantic similarity calculation method for paragraph level text

Country Status (1)

Country Link
CN (1) CN115828931B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332219A1 (en) * 2000-09-30 2010-12-30 Weiquan Liu Method and apparatus for determining text passage similarity
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN107862045A * 2017-11-07 2018-03-30 哈尔滨工程大学 A cross-language plagiarism detection method based on multiple features
CN108399165A * 2018-03-28 2018-08-14 广东技术师范学院 A keyword extraction method based on position weighting
CN109213995A * 2018-08-02 2019-01-15 哈尔滨工程大学 A cross-language text similarity assessment technique based on bilingual word embeddings
CN111967271A (en) * 2020-08-19 2020-11-20 北京大学 Analysis result generation method, device, equipment and readable storage medium
CN112818121A (en) * 2021-01-27 2021-05-18 润联软件系统(深圳)有限公司 Text classification method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236330A * 2023-11-16 2023-12-15 南京邮电大学 Mutual information and adversarial neural network based method for enhancing topic diversity
CN117236330B * 2023-11-16 2024-01-26 南京邮电大学 Mutual information and adversarial neural network based method for enhancing topic diversity

Also Published As

Publication number Publication date
CN115828931B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN109684648B (en) Multi-feature fusion automatic translation method for ancient and modern Chinese
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN107967262A A neural network based Mongolian-Chinese machine translation method
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
JP2023509031A Translation method, apparatus, device and computer program based on multimodal machine learning
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN115687571B Deep unsupervised cross-modal retrieval method based on modal fusion reconstruction hashing
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN116306652A Chinese named entity recognition model based on attention mechanism and BiLSTM
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN115828931A (en) Chinese and English semantic similarity calculation method for paragraph-level text
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN111428518B (en) Low-frequency word translation method and device
Ma et al. E2timt: Efficient and effective modal adapter for text image machine translation
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN111597810A (en) Semi-supervised decoupling named entity identification method
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement
CN114972907A Image semantic understanding and text generation based on reinforcement learning and contrastive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant