CN108509415B - Sentence similarity calculation method based on word order weighting - Google Patents


Info

Publication number
CN108509415B
CN108509415B (application CN201810217211.9A)
Authority
CN
China
Prior art keywords
word
corpus
sentence
sen1
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810217211.9A
Other languages
Chinese (zh)
Other versions
CN108509415A (en)
Inventor
王清琛
沈盛宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Yunwen Network Technology Co ltd
Original Assignee
Nanjing Yunwen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Yunwen Network Technology Co ltd filed Critical Nanjing Yunwen Network Technology Co ltd
Priority to CN201810217211.9A priority Critical patent/CN108509415B/en
Publication of CN108509415A publication Critical patent/CN108509415A/en
Application granted granted Critical
Publication of CN108509415B publication Critical patent/CN108509415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The invention provides a sentence similarity calculation method based on word order weighting. The method comprises the following steps: obtain a training corpus A with entries of the form <Label1_i, Sen1_i> and train word vector models for all words in corpus A; construct a test corpus B with entries of the form <Label2_j, Sen2_j> and obtain, by incremental training, word vector models for all words of the sentences Sen2_j in corpus B; using the word vector models obtained from corpus B, compute the sentence vectors SenVec1_i and SenVec2_j of the sentences Sen1_i and Sen2_j by word order weighting; compute one by one the similarity between a sentence Sen2_j and each sentence Sen1_i, and check whether the Label1_i of the most similar sentence Sen1_i is identical to Label2_j; if so, the result is correct, otherwise store <Sen1_i, Sen2_j> in a training corpus C; further process training corpus C to obtain a new word vector model for the next sentence similarity calculation. These steps improve the accuracy of sentence similarity calculation.

Description

Sentence similarity calculation method based on word order weighting
Technical Field
The invention relates to natural language processing within the field of computer technology, and in particular to a sentence similarity calculation method based on word order weighting.
Background
Sentence similarity calculation is a fundamental problem in natural language processing with wide application across the field. For example, in machine translation, text similarity is used to measure how well words in a text can substitute for one another; in FAQ question-answering systems, similarity drives question retrieval by scoring how well a user's question matches the knowledge in a knowledge base. Similarity calculation has therefore long been an important research topic.
In 2013, in the course of research on statistical language models, Google released Word2vec, an open-source tool for training word vectors. Given a corpus, Word2vec can quickly and effectively express each word as a vector through an optimized training model, providing a new tool for applied research in natural language processing.
Disclosure of Invention
The invention aims to overcome the problems in the prior art by providing a sentence similarity calculation method based on word order weighting.
To achieve the above object, the present invention provides a sentence similarity calculation method based on word order weighting, comprising the following steps:
1) Obtain a corpus A using a web crawler and add classification labels to all sentences in corpus A according to their semantics, yielding entries of the form <Label1_i, Sen1_i>, where Sen1_i is a single sentence in corpus A and Label1_i is its category label; then train word vector models for all words in corpus A with the Word2Vec algorithm;
2) Construct a test corpus B from the corpus A obtained in step 1), with entries of the form <Label2_j, Sen2_j>, where Label2_j is a category of corpus A and Sen2_j belongs to class Label2_j and is semantically similar to the corpus A sentences of that category; then, starting from the word vector model obtained in step 1), obtain word vector models for all words in corpus B by incremental training with the Word2Vec algorithm;
3) Take a pair <Label1_1, Sen1_1> from the corpus A of step 1), segment Sen1_1 into words, and obtain from the word vector model of step 2) the word vector V_1k of each segmented word, where k denotes the word's position in the sentence Sen1_1;
4) From the word vectors V_1k obtained in step 3) and each word's position in Sen1_1, calculate the word order weight value weight of each word, and multiply each word vector V_1k by its word order weight to obtain the weighted word vector V_1k';
5) From the weighted word vectors V_1k' obtained in step 4), compute the sentence vector SenVec1_1 of the sentence Sen1_1;
6) Repeat steps 3) to 5) to compute the sentence vectors SenVec1_i of all sentences in corpus A;
7) Repeat steps 3) to 5) to compute the sentence vectors SenVec2_j of all sentences in corpus B;
8) Using the sentence vectors obtained in steps 6) and 7), select each sentence Sen2_j in test corpus B with its sentence vector SenVec2_j in turn, compute its similarity to each sentence Sen1_i in corpus A, and take the sentence Sen1_i with the highest similarity; compare its Label1_i with Label2_j: if they are identical the result is correct, otherwise store the pair <Sen1_i, Sen2_j> in a training corpus C;
9) Label the training corpus C obtained in step 8) with SemEval-2017 similarity values, train an LSTM regression model to obtain new word vectors, update the word vector model of step 2) with the newly trained word vectors, and then execute step 3) for the next sentence similarity calculation.
Preferably, step 1) further comprises, before adding the classification labels:
removing redundant punctuation and web-page tags in corpus A by regular-expression matching, keeping only single sentences.
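The cleanup step above can be sketched in Python; the concrete regular expressions are assumptions, since the text only states that regular-expression matching removes web-page tags and redundant punctuation:

```python
import re

def clean_sentence(text):
    """Keep a single plain sentence: strip web-page tags and runs of
    redundant punctuation.  The exact patterns are illustrative; the
    patent only names the technique (regular-expression matching)."""
    text = re.sub(r"<[^>]+>", "", text)          # remove web-page tags
    text = re.sub(r"[.!?\u3002\uff01\uff1f]{2,}", "", text)  # drop repeated punctuation
    return text.strip()
```

For example, `clean_sentence("<p>How to apply for a refund???</p>")` yields `"How to apply for a refund"`.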
Preferably, in step 3) the HanLP open-source word segmenter is used to segment Sen1_1.
Preferably, the word order weight value weight in step 4) is calculated by the following formula:
[Formula image: word order weight value weight as a function of the word position k, the weighted start position Loc, and the constant λ]
where k represents the position of the word in the sentence, Loc denotes the weighted start position, and λ is a constant in the range 1 to 3.
Preferably, step 5) is calculated by the following formula:
SenVec_i = (1/n) Σ_{k=1}^{n} V'_ik
where n represents the total number of words in the sentence and V'_ik the weighted word vector of the kth word in the ith sentence.
Preferably, the similarity in step 8) is calculated by the following formula:
[Formula image: similarity between the sentence vectors SenVec1_i and SenVec2_j]
where SenVec1_i denotes the sentence vector of the sentence Sen1_i and SenVec2_j the sentence vector of the sentence Sen2_j.
The word order weighting in this method improves the accuracy of the sentence similarity task; in addition, incrementally training the word vectors in a supervised manner improves accuracy further.
Drawings
Fig. 1 is a flowchart of a method for calculating sentence similarity based on word order weighting according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described below completely with reference to the accompanying drawings. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In conventional sentence similarity methods, the sentence vector weighting schemes are complex and their effectiveness is uneven.
Fig. 1 is a flowchart of a sentence similarity calculation method based on word order weighting according to an embodiment of the present invention.
In step 101, a corpus A of 8,000 sentences is obtained with a web crawler; redundant punctuation and web-page tags are removed by regular-expression matching, and only single sentences are retained. The 8,000 sentences are then classified, giving the corpus A shown in the following table:
Label1_i | Single sentence Sen1_i
Refund shipping costs | How to apply for return freight
Refund application | How to apply for refund
Refund application | I want to refund
Application for change of goods | How to apply for changing goods
... | ...
Word vector models for all words in corpus A are then trained with the Word2Vec algorithm.
In step 102, a test corpus B of 30,000 sentences is constructed, with entries of the form <Label2_j, Sen2_j>, where Label2_j comes from the categories of corpus A and Sen2_j belongs to class Label2_j and is semantically similar to the corpus A sentences of that category. Word vector models for all words in corpus B are then obtained by incremental training with the Word2Vec algorithm, starting from the word vector model of step 101.
The sentence pattern in corpus B is shown in the following table:
Label2_j | Sen2_j
Refund shipping costs | How to withdraw the freight
Refund application | The clothes don't fit, can they be returned?
Application for change of goods | There's a problem with the clothes, please help me exchange them
... | ...
Training yields word vectors in the following format:
Word W | Word vector V
Word_1 | v_11, v_12 ... v_1d
... | ...
Word_n | v_n1, v_n2 ... v_nd
where n represents the total number of words and d the dimensionality of the word vectors.
In step 103, a sentence pair <Label1_1, Sen1_1> is selected from corpus A, in this embodiment the pair <Refund shipping costs, How to apply for return freight>. The sentence Sen1_1, "How to apply for return freight", is segmented into (how, apply, return, freight), and the corresponding word vectors are looked up in the word vector model trained in step 102. The word vectors in this embodiment have 200 dimensions; partial results are:
Word W | Word vector V_1k
How | -0.15166749, 0.10850359 ... -0.097950
Apply | 0.099820456, 0.11322714 ... 0.06855157
Return | 0.04588356, 0.08467035 ... -0.15038626
Freight | -0.010142227, -0.02377942 ... -0.09789387
In step 104, the word order weight of each word is calculated from its position using the following formula:
[Formula image: word order weight calculation]
where k represents the position of the word in the sentence, counting from 1; Loc denotes the weighted start position, usually set between 1 and 3 when the sentence is short (fewer than 6 words); and λ is a constant that can be set between 1 and 3. For the test sentence Sen1_1 "How to apply for return freight" in this embodiment, Loc is set to 1 and λ to 1.5; the word order weights are shown in the following table:
Word W | Word order weight
How | 0.8123090300973813
Apply | 1.0
Return | 1.3004891818915623
Freight | 1.6149794589701247
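The weight column above grows with word position. Since the patent's exact formula survives only as an image, the sketch below is a qualitative illustration rather than the patented formula; the function name and the expression `lam ** ((k - loc) / n)` are assumptions:

```python
def order_weight(k, n, loc=1, lam=1.5):
    """Illustrative word order weight for the word at 1-based position k
    in a sentence of n words.

    NOT the patent's formula (which is unrendered in the source); this
    only reproduces the qualitative behaviour of the worked example:
    weights increase with position, controlled by the start position
    Loc and a constant lambda chosen in the range 1 to 3.
    """
    return lam ** ((k - loc) / n)

# Weights for a 4-word sentence such as (how, apply, return, freight)
weights = [order_weight(k, 4) for k in range(1, 5)]
```

With `lam > 1` the weights increase strictly from position `loc` onward, mirroring the increasing values in the table.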
Multiplying the word vectors obtained in step 103 by the calculated word order weights gives the new weighted word vectors shown in the following table:
Word W | Weighted word vector V_1k'
How | 0.123200872, 0.088138446 ... 0.079565669
Apply | 0.099820456, 0.11322714 ... 0.06855157
Return | 0.059671073, 0.110112874 ... -0.195575704
Freight | -0.016379488, -0.038403275 ... -0.158096589
In step 105, the weighted word vectors V_1k' obtained in step 104 are added and averaged to obtain the sentence vector SenVec1_1 of the sentence "How to apply for return freight". The formula is:
SenVec_i = (1/n) Σ_{k=1}^{n} V'_ik
where n is the total number of words in the sentence and V'_ik is the weighted word vector of the kth word of the ith sentence. This yields:
Sentence Sen1_1 | SenVec1_1
How to apply for return freight | 0.066578228, 0.068268796 ... -0.051388763
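The weighting and averaging that produce such a sentence vector can be illustrated with toy numbers (2 dimensions instead of 200; all values below are made up for illustration, not taken from the embodiment):

```python
import numpy as np

# Toy 2-dimensional stand-ins for the word vectors of (how, apply, return, freight)
word_vectors = np.array([
    [-0.15,  0.10],
    [ 0.10,  0.11],
    [ 0.05,  0.08],
    [-0.01, -0.02],
])
order_weights = np.array([0.81, 1.00, 1.30, 1.61])  # word order weights from step 104

weighted = word_vectors * order_weights[:, None]    # V'_1k = weight_k * V_1k
sentence_vector = weighted.mean(axis=0)             # SenVec1_1 = (1/n) * sum_k V'_1k
```

The result is one vector of the same dimensionality as the word vectors, obtained by the "add and average" rule of step 105.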
In step 106, steps 103 to 105 are repeated to compute the sentence vectors of all 8,000 sentences in corpus A.
In step 107, steps 103 to 105 are repeated to compute the sentence vectors of all 30,000 sentences in corpus B.
In step 108, the sentence vectors of corpus A from step 106 are combined with those of corpus B from step 107. Each sentence Sen2_j in corpus B, with its sentence vector SenVec2_j, is selected in turn, and the similarity between Sen2_j and each sentence Sen1_i in corpus A is computed using the formula:
[Formula image: similarity between SenVec1_i and SenVec2_j]
where SenVec1_i is the sentence vector of a sentence in corpus A and SenVec2_j the sentence vector of the sentence Sen2_j in corpus B.
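Because the similarity formula itself is an image in the source, the sketch below assumes cosine similarity, the standard choice for comparing sentence vectors; treat that choice as an assumption rather than the patent's definition:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two sentence vectors.

    Assumption: the patent's similarity formula is not recoverable from
    the extracted text, so cosine similarity is used as the customary
    measure for sentence vectors.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical directions score 1.0 and orthogonal vectors score 0.0, matching the 0-to-~1 similarity values in the tables below.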
For example, the similarity between the corpus B sentence Sen2_j "How to withdraw the freight" and each sentence Sen1_i in corpus A is computed with the above formula; sorted by word-order-weighted similarity, the results are shown in the following table:
Similar sentence Sen1 | Category Label1 | Similarity (with word order weighting)
How to apply for return freight | Refund shipping costs | 0.9334955
How to apply for refund | Refund application | 0.8528203
How to apply for changing goods | Application for change of goods | 0.7556491
How to purchase freight insurance | Purchasing freight insurance | 0.6946413
In the table above, the most similar sentence is "How to apply for return freight", and its category "Refund shipping costs" matches the category label of "How to withdraw the freight".
Without word order weighting, the most similar sentences are as shown in the following table:
Similar sentence Sen1 | Category Label1 | Similarity (without word order weighting)
How to apply for refund | Refund application | 0.8910549
How to apply for return freight | Refund shipping costs | 0.8341876
How to apply for changing goods | Application for change of goods | 0.7948803
How to purchase freight insurance | Purchasing freight insurance | 0.7148501
Here the most similar sentence is "How to apply for refund", whose category label "Refund application" does not match the category label "Refund shipping costs" of "How to withdraw the freight". This shows that word order weighting is effective for the similarity task.
As another example, for the corpus B pair <How to fix order information filling errors, Telephone number filled in wrong>, the calculation results against the related sentences in corpus A are shown in the following table:
Similar sentence Sen1 | Category Label1 | Similarity (with word order weighting)
Return telephone number wrongly written | Return information filling error | 0.91249734
Information filled in wrong | How to fix order information filling errors | 0.8882772
Order number wrongly written | How to fix order information filling errors | 0.78467226
Address wrongly written | How to fix order information filling errors | 0.7377515
The Label2 of the sentence Sen2 "Telephone number filled in wrong" is "How to fix order information filling errors", which does not match the category Label1 "Return information filling error" of the most similar sentence "Return telephone number wrongly written"; the pair <Return telephone number wrongly written, Telephone number filled in wrong> is therefore stored in corpus C.
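Step 108's match-and-collect procedure can be sketched as follows; the (label, sentence, vector) triples and the helper names are hypothetical, since the patent describes the procedure but not a concrete data layout, and cosine similarity is an assumed choice:

```python
import numpy as np

def cosine(a, b):
    # Assumed similarity measure; the patent's formula is an unrendered image.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_and_collect(b_items, a_items):
    """For each (label, sentence, vector) triple from corpus B, find the
    most similar sentence in corpus A.  When the labels disagree, the
    pair <Sen1_i, Sen2_j> is stored for training corpus C."""
    correct, corpus_c = 0, []
    for label2, sen2, vec2 in b_items:
        best_label, best_sen, _ = max(a_items, key=lambda t: cosine(t[2], vec2))
        if best_label == label2:
            correct += 1
        else:
            corpus_c.append((best_sen, sen2))   # store <Sen1_i, Sen2_j>
    return correct, corpus_c
```

Running this over all of corpus B gives both the accuracy figure reported below and the mismatch pairs that seed corpus C.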
From the obtained similarities, the overall effect on test corpus B is shown in the following table:
[Table image: overall similarity accuracy on test corpus B]
The experiment shows that word order weighting improves the accuracy of the similarity calculation.
In step 109, the corpus C obtained in step 108 is labeled with similarity values following Daniel Cer et al., "SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation". The similarity value standard is illustrated in the following table:
Sentence one | Sentence two | Similarity value
He is bathing | He is in the bath | 5
Two people walking on the road | The two people walk hand in hand | 4
He is unsuspecting about her | He has little doubt about her | 3
The two build a nest together | The two walk into the nest together | 2
The girl likes listening to music | He is playing the piano | 1
A dog runs by itself | He is flying at high speed | 0
The pair obtained in step 108, <Return telephone number wrongly written, Telephone number filled in wrong>, can accordingly be labeled as <Return telephone number wrongly written, Telephone number filled in wrong, 4>.
After corpus C has been processed and labeled in this way, new word vectors are obtained by training an LSTM regression model; the word vector model of step 102 is updated with the newly trained word vectors, and step 103 is then executed again, so that the next sentence similarity calculation uses the updated model. The overall effect on test corpus B is shown in the following table:
[Table image: overall similarity accuracy on test corpus B after retraining]
according to the experimental result, the new word vector is trained by the method, and great help is provided for the task of sentence similarity calculation.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A sentence similarity calculation method based on word order weighting is characterized by comprising the following steps:
1) Obtain a corpus A using a web crawler and add classification labels to all sentences in corpus A according to their semantics, yielding entries of the form <Label1_i, Sen1_i>, where Sen1_i is the ith sentence in corpus A and Label1_i is its category label; then train word vector models for all words in corpus A with the Word2Vec algorithm;
2) Construct a test corpus B from the corpus A obtained in step 1), with entries of the form <Label2_j, Sen2_j>, where Label2_j is a category of corpus A and Sen2_j, the jth sentence in corpus B, belongs to class Label2_j and is semantically similar to the corpus A sentences of that category; then, starting from the word vector model obtained in step 1), obtain word vector models for all words in corpus B by incremental training with the Word2Vec algorithm;
3) Take a pair <Label1_1, Sen1_1> from the corpus A of step 1), segment Sen1_1 into words, and obtain from the word vector model of step 2) the word vector V_1k of each segmented word, where k denotes the word's position in the sentence Sen1_1;
4) From the word vectors V_1k obtained in step 3) and each word's position in Sen1_1, calculate the word order weight value weight of each word, and multiply each word vector V_1k by its word order weight to obtain the weighted word vector V_1k';
5) From the weighted word vectors V_1k' obtained in step 4), compute the sentence vector SenVec1_1 of the sentence Sen1_1;
6) Repeat steps 3) to 5) to compute the sentence vectors SenVec1_i of all sentences in corpus A;
7) Repeat steps 3) to 5) to compute the sentence vectors SenVec2_j of all sentences in corpus B;
8) Using the sentence vectors obtained in steps 6) and 7), select each sentence Sen2_j in test corpus B with its sentence vector SenVec2_j in turn, compute its similarity to each sentence Sen1_i in corpus A, and take the sentence Sen1_i with the highest similarity; compare its Label1_i with Label2_j: if they are identical the result is correct, otherwise store the pair <Sen1_i, Sen2_j> in a training corpus C;
9) Label the training corpus C obtained in step 8) with SemEval-2017 similarity values, train an LSTM regression model to obtain new word vectors, update the word vector model of step 2) with the newly trained word vectors, and then execute step 3) for the next sentence similarity calculation.
2. The method for calculating sentence similarity based on word order weighting according to claim 1, wherein step 1) further comprises, before adding the classification labels:
removing redundant punctuation and web-page tags in corpus A by regular-expression matching, keeping only single sentences.
3. The method for calculating sentence similarity based on word order weighting according to claim 1, wherein in step 3) the HanLP open-source word segmenter is used to segment Sen1_1.
4. The method for calculating sentence similarity based on word-order weighting according to claim 1, wherein the word-order weight in step 4) is calculated by the following formula:
[Formula image: word order weight value weight as a function of the word position k, the weighted start position Loc, and the constant λ]
where k represents the position of the word in the sentence; loc denotes the weighted start position, and λ is a constant with a value in the range of 1-3.
5. The method for calculating sentence similarity based on word order weighting according to claim 1, wherein step 5) is calculated by the following formula:
SenVec_i = (1/n) Σ_{k=1}^{n} V'_ik
where n represents the total number of words in the sentence Sen1_i and V'_ik the weighted word vector of its kth word.
6. The method for calculating sentence similarity based on word order weighting according to claim 1, wherein the similarity in step 8) is calculated by the following formula:
[Formula image: similarity between the sentence vectors SenVec1_i and SenVec2_j]
where SenVec1_i denotes the sentence vector of the sentence Sen1_i and SenVec2_j the sentence vector of the sentence Sen2_j.
CN201810217211.9A 2018-03-16 2018-03-16 Sentence similarity calculation method based on word order weighting Active CN108509415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810217211.9A CN108509415B (en) 2018-03-16 2018-03-16 Sentence similarity calculation method based on word order weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810217211.9A CN108509415B (en) 2018-03-16 2018-03-16 Sentence similarity calculation method based on word order weighting

Publications (2)

Publication Number Publication Date
CN108509415A CN108509415A (en) 2018-09-07
CN108509415B true CN108509415B (en) 2021-09-24

Family

ID=63376592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810217211.9A Active CN108509415B (en) 2018-03-16 2018-03-16 Sentence similarity calculation method based on word order weighting

Country Status (1)

Country Link
CN (1) CN108509415B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739956B (en) * 2018-11-08 2020-04-10 第四范式(北京)技术有限公司 Corpus cleaning method, apparatus, device and medium
CN109697286A (en) * 2018-12-18 2019-04-30 众安信息技术服务有限公司 A kind of diagnostic standardization method and device based on term vector
CN109710762B (en) * 2018-12-26 2023-08-01 南京云问网络技术有限公司 Short text clustering method integrating multiple feature weights
CN109766547B (en) * 2018-12-26 2022-10-18 重庆邮电大学 Sentence similarity calculation method
CN109902159A (en) * 2019-01-29 2019-06-18 华融融通(北京)科技有限公司 A kind of intelligent O&M statement similarity matching process based on natural language processing
CN110162627B (en) * 2019-04-28 2022-04-15 平安科技(深圳)有限公司 Data increment method and device, computer equipment and storage medium
CN113204612B (en) * 2021-04-24 2024-05-03 上海赛可出行科技服务有限公司 Priori knowledge-based network about vehicle similar address identification method
CN113535919B (en) * 2021-07-16 2022-11-08 北京元年科技股份有限公司 Data query method and device, computer equipment and storage medium
CN114048285A (en) * 2021-10-22 2022-02-15 盐城金堤科技有限公司 Fuzzy retrieval method, device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095188A (en) * 2015-08-14 2015-11-25 北京京东尚科信息技术有限公司 Sentence similarity computing method and device
CN106055673A (en) * 2016-06-06 2016-10-26 中国人民解放军国防科学技术大学 Chinese short-text sentiment classification method based on text characteristic insertion
CN106610950A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solution method
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824797B (en) * 2015-01-04 2019-11-12 华为技术有限公司 A kind of methods, devices and systems for evaluating semantic similarity

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095188A (en) * 2015-08-14 2015-11-25 北京京东尚科信息技术有限公司 Sentence similarity computing method and device
CN106055673A (en) * 2016-06-06 2016-10-26 中国人民解放军国防科学技术大学 Chinese short-text sentiment classification method based on text characteristic insertion
CN106610950A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solution method
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation"; Daniel Cer et al.; Proceedings of the 11th International Workshop on Semantic Evaluations; 2017-08-04; pp. 419-424 *
"A Study of a Sentence Similarity Algorithm Based on Vector Word Order"; Cheng Zhiqiang et al.; Computer Simulation; July 2014; Vol. 31, No. 7; pp. 1-14 *

Also Published As

Publication number Publication date
CN108509415A (en) 2018-09-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant