CN112733520B - Text similarity calculation method, system, corresponding equipment and storage medium

Text similarity calculation method, system, corresponding equipment and storage medium

Info

Publication number
CN112733520B
CN112733520B (application CN202011604778.5A)
Authority
CN
China
Prior art keywords: word, text, vector, vectors, similarity
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202011604778.5A
Other languages: Chinese (zh)
Other versions: CN112733520A (en)
Inventor
张俊锋
程煜华
黄俊杰
侯丹丹
翟文丽
Current Assignee: Wanghai Kangxin Beijing Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Wanghai Kangxin Beijing Technology Co., Ltd.
Priority date: 2020-12-30 (the priority date is an assumption and is not a legal conclusion) · Filing date: 2020-12-30 · Publication date: 2023-07-18
Application filed by Wanghai Kangxin Beijing Technology Co., Ltd.
Priority to CN202011604778.5A
Publication of CN112733520A
Application granted
Publication of CN112733520B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text similarity calculation method, a text similarity calculation system, corresponding equipment, and a storage medium. The method comprises the following steps: performing Chinese word segmentation on each text to generate a word sequence; generating a fixed-length word vector for each word of each word sequence using a trained word2vec model; generating a 2-gram phrase sequence based on each word sequence; generating a word vector for each 2-gram phrase; combining the word vectors of the words of a word sequence and the word vectors of the corresponding 2-gram phrases into a text vector; calculating the tf-idf value of each word and 2-gram phrase of the text vector as its weight to obtain a text space vector; and calculating the similarity between the respective texts based on the text space vectors. The invention can improve the matching accuracy of hyponyms, synonyms, and related words.

Description

Text similarity calculation method, system, corresponding equipment and storage medium
Technical Field
The present disclosure relates to the field of electronic digital data processing, and in particular, to a text similarity calculation method, a text similarity calculation system, a corresponding device, and a storage medium.
Background
Text similarity calculation is widely used in the computer field and has important applications in text retrieval, content recommendation, intelligent question answering, term labeling, advertisement delivery, and other technical fields. Commonly used text similarity calculation approaches include the following: 1) similarity calculation based on the literal word content of the text (the general steps are: word segmentation; tf-idf weighting; applying a distance formula such as cosine, Pearson, edit distance, or Jaccard); 2) word vector or sentence vector encoding models based on word embedding technology (the general steps are: word segmentation; generating word vectors with Word2Vec/GloVe; averaging the word vectors; applying a distance formula such as cosine or Euclidean).
These two families of algorithms and their variants have respective advantages and disadvantages and suit different application scenarios. Generally, algorithms based on the literal vocabulary of the text offer high feature discrimination but cannot understand semantics, and they ignore word order (for text, different orders often carry different meanings). Word embedding encodings, learned from large corpora, usually capture general grammatical structure and semantics and perform well in classification and clustering problems. However, word embedding is in essence a dimensionality reduction algorithm: the learned word vectors do not have very high feature discrimination, so this approach performs poorly in search and matching scenarios.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a text similarity calculation method, a text similarity calculation system, corresponding equipment and a storage medium, which can improve the matching accuracy of hyponyms, synonyms and related words.
In a first aspect of the present invention, there is provided a text similarity calculation method, the method comprising:
chinese word segmentation is carried out on each text, and a word sequence is generated;
generating a word vector of fixed length using a trained word2vec model for each word of each word sequence;
generating a 2-gram phrase sequence based on each word sequence;
generating a word vector of each 2-gram phrase sequence;
word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are respectively combined into text vectors;
calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain a text space vector;
similarity between the respective texts is calculated based on the text space vector.
In an embodiment, the word2vec model includes a Skip-gram model and a CBOW model, wherein a first word vector of a word is generated using the Skip-gram model, and a second word vector of the same word is generated using the CBOW model, wherein the first word vector and the second word vector are connected as the word vector of the word.
In an embodiment, generating a word vector for each 2-gram phrase includes: the word vectors of the two words constituting each 2-gram phrase are added as the word vector of the corresponding 2-gram phrase.
In an embodiment, calculating the similarity between the respective texts based on the text space vectors includes:
for each word vector of the first text space vector, calculating the Euclidean distance to each word vector of the second text space vector; determining the word vector with the smallest Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector; calculating the similarity between the two matched word vectors; and calculating the weighted similarity between the two matched word vectors from the weights of the two matched words and the similarity between the two word vectors;
and calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words.
In a second aspect of the present invention, there is provided a text similarity calculation system, the system comprising:
the word sequence generating module is used for carrying out Chinese word segmentation on each text to generate a word sequence;
a word-vector generation module for generating a word vector of a fixed length using a trained word2vec model for each word of each word sequence;
the phrase sequence generating module is used for respectively generating 2-gram phrase sequences based on each word sequence;
the phrase word vector generation module is used for generating word vectors of each 2-gram phrase sequence;
the text vector generation module is used for respectively combining word vectors of all words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module is used for calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain the text space vector;
and the text similarity calculation module is used for calculating the similarity between the corresponding texts based on the text space vector.
In a third aspect of the invention, there is provided a computer device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect of the invention or the functions of the system according to the second aspect of the invention.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method according to the first aspect of the present invention or performs the functions of the system according to the second aspect of the present invention.
The invention combines the advantages of similarity calculation based on the literal quantity of the text with those of encoding-vector similarity calculation based on a word embedding model: the word embedding model remedies the poor semantic descriptive power of literal text vectors, while the feature discrimination of the generated vectors is preserved. By constructing text space vectors from Word2Vec word embeddings and performing a comprehensive distance measurement weighted by tf-idf values over the word vectors, the invention solves the problem that hyponyms, synonyms, and related words cannot be identified, and thus improves text matching accuracy.
Other features and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention, which is to be read in connection with the accompanying drawings.
Drawings
FIG. 1 is a flow chart of an embodiment of a method according to the present invention;
fig. 2 is a block diagram of an embodiment of a system according to the present invention.
For the sake of clarity, the figures are schematic, simplified drawings that show only the details necessary for an understanding of the invention; other details are omitted.
Detailed Description
Embodiments and examples of the present invention will be described in detail below with reference to the accompanying drawings.
The scope of applicability of the present invention will become apparent from the detailed description given hereinafter. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only.
Fig. 1 shows a flowchart of a preferred embodiment of a text similarity calculation method according to the present invention. The method of this embodiment comprises:
in step S102, chinese segmentation is performed on each text, generating a word sequence. Each word after segmentation may contain 1 word, 2 words, or more than 3 words. The chinese word segmentation program may be any chinese word segmentation program known in the art. The similarity between the two texts "mixed lymphocyte culture" and "mixed leukocyte reaction" is taken as an example. For example, the text "mixed lymphocyte culture" can be divided into four words, mixed, lymphoid, cell, culture. The text of "mixed leukocyte reaction" can be divided into four words of mixed, leukocyte, cell, and reaction.
In step S104, for each word of each word sequence, a word vector of fixed length is generated using the trained word2vec model.
The word2vec model is trained and built on a relevant corpus. It converts Chinese words into numerical vectors and automatically learns related words, hyponyms, and synonyms. The training corpus may be chosen according to the application domain; for example, for the medical field, medical texts are used, including content collected from the Internet. The Chinese word segmentation program used in training should be the same as the one used for subsequent segmentation (i.e., the program used in step S102).
In an embodiment, the word2vec model is a combination of a CBOW model and a skip-gram model, the two models being trained separately on the same training corpus. Skip-gram is a model that uses a word as input to predict its surrounding context; CBOW is a model that uses the context of a word as input to predict the word itself. The combination of the two models has better spatial expressiveness and discrimination.
For example, word2vec is used to generate a word vector of, say, 400 dimensions for each segmented word: for the same word, a 200-dimensional vector may be generated with the trained CBOW model and another 200-dimensional vector with the trained skip-gram model, and the two vectors are then concatenated into a 400-dimensional vector that serves as the word vector of that word.
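A minimal sketch of this word-vector generation, assuming the gensim library (4.x signatures) and a toy corpus of pre-segmented word sequences; all hyperparameters other than sg and vector_size are illustrative, and a real model would be trained on a large domain corpus as described above.

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus of pre-segmented word sequences (a real corpus would be large).
    corpus = [["混合", "淋巴", "细胞", "培养"], ["混合", "白细胞", "反应"]]

    cbow = Word2Vec(corpus, vector_size=200, sg=0, min_count=1)      # CBOW model
    skipgram = Word2Vec(corpus, vector_size=200, sg=1, min_count=1)  # skip-gram model

    def word_vector(word: str) -> np.ndarray:
        """Concatenate the CBOW and skip-gram vectors into one 400-dim word vector."""
        return np.concatenate([cbow.wv[word], skipgram.wv[word]])

    assert word_vector("细胞").shape == (400,)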
In step S106, for each segmented word sequence, a 2-gram phrase sequence is generated, so that the order of the words constituting a sentence can also affect the similarity. For example, "mixed lymphocyte culture" forms the word sequence mixed, lymph, cell, culture after Chinese word segmentation, and 2-gram generation then yields three phrases: mixed lymph, lymphocyte, cell culture. "Mixed leukocyte reaction" forms the word sequence mixed, leukocyte, reaction after Chinese word segmentation, and 2-gram generation then yields two phrases: mixed leukocyte, leukocyte reaction.
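Step S106 reduces to pairing adjacent words, as in this short sketch (using the concatenation of the two surface forms as the phrase key is an assumption about representation, not something prescribed by the patent):

    def bigram_phrases(words: list[str]) -> list[str]:
        """Step S106: form a 2-gram phrase from each pair of adjacent words."""
        return [w1 + w2 for w1, w2 in zip(words, words[1:])]

    # ["混合", "淋巴", "细胞", "培养"] -> ["混合淋巴", "淋巴细胞", "细胞培养"]
    # (mixed lymph, lymphocyte, cell culture)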
In step S108, a word vector is generated for each 2-gram phrase of each 2-gram phrase sequence. In an embodiment, the word vectors of the two words constituting a 2-gram phrase are added to form the word vector of that phrase. For example, the word vector of the phrase "lymphocyte" is obtained by adding the word vector of "lymph" to the word vector of "cell". In another embodiment, a word2vec model trained with the 2-gram phrase sequences as training data may be used to generate the word vectors of the 2-gram phrases.
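The first embodiment of step S108 is then a single vector addition, sketched below with the word_vector function from the earlier sketch:

    import numpy as np

    def phrase_vector(w1: str, w2: str) -> np.ndarray:
        """Step S108: 2-gram phrase vector as the element-wise sum of its two word vectors."""
        return word_vector(w1) + word_vector(w2)

    # E.g. the vector of "淋巴细胞" (lymphocyte) = vector("淋巴") + vector("细胞").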
In step S110, word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are combined into text vectors, respectively.
In step S112, the tf-idf (term frequency-inverse document frequency) value of each word and 2-gram phrase of the text vector is calculated as its weight, yielding a text space vector whose entries comprise a word vector part and a weight part. For example: mixed: 0.12, lymph: 0.23, cell: 0.43, culture: 0.32, mixed lymph: 0.82, lymphocyte: 0.56, cell culture: 0.63, where the value after each word or phrase is its tf-idf weight. In this example, each entry of the text space vector also has a corresponding word vector of length 400.
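A sketch of the tf-idf weighting of step S112, computed by hand over a small collection of texts; the patent does not fix a particular tf-idf variant, so the smoothing below is one common convention and is an assumption:

    import math
    from collections import Counter

    def tf_idf_weights(doc_terms: list[str], all_docs: list[list[str]]) -> dict[str, float]:
        """Map each word/2-gram phrase of a document to its tf-idf weight."""
        tf = Counter(doc_terms)
        n_docs = len(all_docs)
        weights = {}
        for term, count in tf.items():
            df = sum(1 for doc in all_docs if term in doc)   # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf (assumption)
            weights[term] = (count / len(doc_terms)) * idf   # tf * idf
        return weights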
In step S114, the text similarity between a pair of texts is calculated based on their text space vectors. For two texts a and b, the Euclidean distance between each word vector in the text space vector of a and each word vector in the text space vector of b is calculated. For each word vector of a, the word vector of b with the smallest Euclidean distance is taken as its matched word vector, and the similarity between the two matched word vectors is calculated as 1 minus the Euclidean distance between them. Then, using the tf-idf value of the word/phrase in a and the tf-idf value of the word/phrase in b as weights, the weighted similarity of each matched pair is calculated as the similarity between the two matched word vectors multiplied by the weights of the two matched words/phrases. The final similarity between texts a and b is the accumulated weighted similarity of all matched word vectors divided by the accumulated product of the tf-idf weights of the matched words/phrases.
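The whole of step S114 can be sketched as follows. Each text space vector is represented as a mapping from a word/phrase to its (word vector, tf-idf weight) pair; the sketch assumes the word vectors are scaled so that Euclidean distances fall within [0, 1], since otherwise 1 minus the distance could be negative, a detail the patent leaves open:

    import numpy as np

    def text_similarity(a: dict[str, tuple[np.ndarray, float]],
                        b: dict[str, tuple[np.ndarray, float]]) -> float:
        """Step S114: weighted similarity between two text space vectors a and b."""
        weighted_sum = 0.0     # accumulated weighted similarities
        weight_products = 0.0  # accumulated products of matched tf-idf weights
        for vec_a, w_a in a.values():
            # Match vec_a to the word vector of b with the smallest Euclidean distance.
            dist, w_b = min((float(np.linalg.norm(vec_a - vec_b)), w_b)
                            for vec_b, w_b in b.values())
            similarity = 1.0 - dist  # similarity = 1 - Euclidean distance
            weighted_sum += w_a * w_b * similarity
            weight_products += w_a * w_b
        return weighted_sum / weight_products

Applied to the worked example below, a would contain seven entries (mixed, lymph, cell, culture, mixed lymph, lymphocyte, cell culture) and b five, and the function reproduces the accumulation in the closing calculation.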
For example, text a is "mixed lymphocyte culture" and text b is "mixed leukocyte reaction". Following the above steps, the text space vector of a includes word vectors for: mixed, lymph, cell, culture, mixed lymph, lymphocyte, cell culture; the text space vector of b includes word vectors for: mixed, leukocyte, reaction, mixed leukocyte, leukocyte reaction.
First, the Euclidean distances between the "mixed" word vector of a and all word vectors of b are calculated, and the word vector of b with the smallest Euclidean distance is selected as the match for the "mixed" word vector of a. The matched vector is of course the "mixed" word vector of b: the two word vectors are identical, so the Euclidean distance is 0 and the similarity is therefore 1 (the conversion formula being: similarity = 1 - Euclidean distance). The weighted similarity is the tf-idf value of "mixed" in a multiplied by the tf-idf value of "mixed" in b multiplied by the similarity between the two, i.e., 0.13 × 0.13 × 1 = 0.0169.
Next, the Euclidean distances between the "lymph" word vector of a and all word vectors of b are calculated, and the word vector of b with the smallest Euclidean distance is selected as the match for the "lymph" word vector of a. Here the calculation finds that the "leukocyte" word vector of b has the smallest Euclidean distance to the "lymph" word vector of a, namely 0.17, so the similarity is 0.83. The weighted similarity is the tf-idf value of "lymph" in a multiplied by the tf-idf value of "leukocyte" in b multiplied by the similarity between the two, i.e., 0.2 × 0.32 × 0.83 = 0.05312.
And so on, until all word vectors of a have been processed, giving the weighted similarities and weight products used in the calculation below.
Then the text similarity of a and b is calculated as the accumulated value of all weighted similarities divided by the accumulated value of the products of the weights of the matched words/phrases, i.e.,
(0.0169 + 0.05312 + 0.07182 + 0.0576 + 0.12288 + 0.04128 + 0.162196) / (0.13 × 0.13 + 0.2 × 0.32 + 0.42 × 0.38 + 0.24 × 0.24 + 0.4 × 0.32 + 0.15 × 0.43 + 0.41 × 0.43) = 0.525796 / 0.6669 ≈ 0.7884.
For texts a and b in this example, a similarity calculation based only on the literal text content, such as the Jaccard distance, yields a similarity of no more than 0.5, which is far from the actual situation (the two are in fact very similar). With the present similarity calculation algorithm, the word2vec model, having learned from a large medical corpus, assigns a high scene co-occurrence probability to "cell culture" and "cell reaction" and treats "lymphocyte" and "leukocyte" as a pair of synonyms. The final calculated similarity is close to 0.8, consistent with the actual situation, so the matching accuracy for texts, and especially for synonyms, hyponyms, and related words, is improved.
FIG. 2 shows a block diagram of a preferred embodiment of a text similarity calculation system according to the present invention, which comprises:
a word sequence generation module 202, configured to perform Chinese word segmentation on each text and generate a word sequence;
a word-vector generation module 204 for generating a word vector of fixed length using the trained word2vec model for each word of each word sequence;
a phrase sequence generating module 206, configured to generate a 2-gram phrase sequence based on each word sequence;
a phrase word vector generation module 208, configured to generate a word vector of each 2-gram phrase sequence;
a text vector generation module 210, configured to combine word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module 212 is configured to calculate tf-idf values of each word and 2-gram phrase of the text vector as weights, so as to obtain the text space vector;
the text similarity calculation module 214 is configured to calculate a similarity between the corresponding texts based on the text space vector.
In another embodiment, the present invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method embodiment or other corresponding method embodiments shown and described in connection with fig. 1 or implements the functions of the system embodiment or other corresponding system embodiment shown and described in connection with fig. 2, which are not described here again.
In another embodiment, the present invention provides a computer device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor, where the processor implements the steps of the method embodiment or other corresponding method embodiment shown and described in connection with fig. 1 or implements the functions of the system embodiment or other corresponding system embodiment shown and described in connection with fig. 2 when the processor executes the computer program, which is not described herein.
The various embodiments described herein, or particular features, structures, or characteristics thereof, may be combined as suitable in one or more embodiments of the invention. In addition, in some cases, the order of steps described in the flowcharts and/or flow-line processes may be modified as appropriate and need not be performed in exactly the order described. Additionally, various aspects of the invention may be implemented using software, hardware, firmware, or a combination thereof and/or other computer-implemented modules or devices that perform the described functions. A software implementation of the present invention may include executable code stored in a computer readable medium and executed by one or more processors. The computer-readable medium may include a computer hard drive, ROM, RAM, flash memory, a portable computer storage medium such as CD-ROM, DVD-ROM, flash drives and/or other devices having a Universal Serial Bus (USB) interface, and/or any other suitable tangible or non-transitory computer-readable medium or computer memory on which executable code may be stored and executed by a processor. The invention may be used in connection with any suitable operating system.
As used herein, the singular forms "a", "an" and "the" include plural referents (i.e., having the meaning of "at least one") unless otherwise indicated. It will be further understood that the terms "has," "comprises," "including" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
While the foregoing is directed to some preferred embodiments of the present invention, it should be emphasized that the present invention is not limited to these embodiments, but may be embodied in other forms within the scope of the inventive subject matter. Various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, and these changes or modifications still fall within the scope of the invention.

Claims (6)

1. A text similarity calculation method, the method comprising:
chinese word segmentation is carried out on each text, and a word sequence is generated;
generating a word vector of fixed length using a trained word2vec model for each word of each word sequence; the word2vec model comprises a Skip-gram model and a CBOW model, wherein the Skip-gram model is used for generating a first word vector of a word, the CBOW model is used for generating a second word vector of the same word, and the first word vector and the second word vector are connected to be used as the word vector of the word;
generating a 2-gram phrase sequence based on each word sequence;
generating a word vector of each 2-gram phrase sequence;
word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are respectively combined into text vectors;
calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain a text space vector;
for each word vector of the first text space vector, respectively calculating Euclidean distance between each word vector of the second text space vector, respectively determining the word vector with the minimum Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector, respectively calculating the similarity between the two matched word vectors, and respectively calculating the weighted similarity between the two matched word vectors according to the weight of the two matched words and the similarity between the two word vectors; wherein the weighted similarity is calculated as: the similarity between the two matched word vectors is multiplied by the weight of the two matched words;
and calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words, wherein the text similarity between the first text and the second text is calculated as follows: an accumulated value of all weighted similarities of the first and second text space vectors divided by an accumulated value of a product of weights of the matched words of the first and second text space vectors.
2. The method of claim 1, wherein generating a word vector for each 2-gram phrase comprises: the word vectors of the two words constituting each 2-gram phrase are added as the word vector of the corresponding 2-gram phrase.
3. The method of claim 1, wherein calculating the similarity between the two matched word vectors comprises calculating the similarity as: 1 minus the Euclidean distance between the two matched word vectors.
4. A text similarity calculation system, the system comprising:
the word sequence generating module is used for carrying out Chinese word segmentation on each text to generate a word sequence;
a word-vector generation module for generating a word vector of a fixed length using a trained word2vec model for each word of each word sequence; the word2vec model comprises a Skip-gram model and a CBOW model, wherein the Skip-gram model is used for generating a first word vector of a word, the CBOW model is used for generating a second word vector of the same word, and the first word vector and the second word vector are connected to be used as the word vector of the word;
the phrase sequence generating module is used for respectively generating 2-gram phrase sequences based on each word sequence;
the phrase word vector generation module is used for generating word vectors of each 2-gram phrase sequence;
the text vector generation module is used for respectively combining word vectors of all words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module is used for calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain the text space vector;
the text similarity calculation module is used for respectively calculating Euclidean distance between each word vector of the first text space vector and each word vector of the second text space vector, respectively determining the word vector with the minimum Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector, respectively calculating the similarity between the two matched word vectors, and respectively calculating the weighted similarity between the two matched word vectors according to the weight of the two matched words and the similarity between the two word vectors; calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words; wherein the weighted similarity is calculated as: the similarity between the two matched word vectors is multiplied by the weight of the two matched words; wherein the text similarity between the first and second text is calculated as: an accumulated value of all weighted similarities of the first and second text space vectors divided by an accumulated value of a product of weights of the matched words of the first and second text space vectors.
5. A computer device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to claim 1 when the computer program is executed by the processor.
6. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to claim 1.
CN202011604778.5A 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium Active CN112733520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011604778.5A CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011604778.5A CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112733520A CN112733520A (en) 2021-04-30
CN112733520B true CN112733520B (en) 2023-07-18

Family

ID=75610645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011604778.5A Active CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112733520B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988045B (en) * 2021-12-28 2022-04-12 浙江口碑网络技术有限公司 Text similarity determining method, text processing method, corresponding device and equipment


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107423284A (en) * 2017-06-14 2017-12-01 中国科学院自动化研究所 Merge the construction method and system of the sentence expression of Chinese language words internal structural information
CN109190117A (en) * 2018-08-10 2019-01-11 中国船舶重工集团公司第七〇九研究所 A kind of short text semantic similarity calculation method based on term vector
CN110209809A (en) * 2018-08-27 2019-09-06 腾讯科技(深圳)有限公司 Text Clustering Method and device, storage medium and electronic device
CN109508379A (en) * 2018-12-21 2019-03-22 上海文军信息技术有限公司 A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN109815493A (en) * 2019-01-09 2019-05-28 厦门大学 A kind of modeling method that the intelligence hip-hop music lyrics generate
CN110309505A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 A kind of data format self-analytic data method of word-based insertion semantic analysis
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111966831A (en) * 2020-08-18 2020-11-20 创新奇智(上海)科技有限公司 Model training method, text classification device and network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Text Classification Based on Semantic Similarity; Li Xiaojun; China Master's Theses Full-text Database, Information Science and Technology (No. 04); I138-3522 *
Research and Implementation of a Chinese Text Classification System Based on Semantic Similarity; Zhang Zhen; China Master's Theses Full-text Database, Information Science and Technology (No. 06); I138-245 *

Also Published As

Publication number Publication date
CN112733520A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
EP3958145A1 (en) Method and apparatus for semantic retrieval, device and storage medium
CN109325229B (en) Method for calculating text similarity by utilizing semantic information
CN112307763B (en) Term standardization method, system and corresponding equipment and storage medium
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
CN112347758A (en) Text abstract generation method and device, terminal equipment and storage medium
WO2021000391A1 (en) Text intelligent cleaning method and device, and computer-readable storage medium
CN110516040B (en) Method, device and computer storage medium for semantic similarity comparison between texts
CN116028722B (en) Post recommendation method and device based on word vector and computer equipment
CN114003682A (en) Text classification method, device, equipment and storage medium
CN113569011A (en) Training method, device and equipment of text matching model and storage medium
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN116483979A (en) Dialog model training method, device, equipment and medium based on artificial intelligence
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN112733520B (en) Text similarity calculation method, system, corresponding equipment and storage medium
CN109241272B (en) Chinese text abstract generation method, computer readable storage medium and computer equipment
Javanmardi et al. Caps captioning: a modern image captioning approach based on improved capsule network
CN117932000A (en) Long document dense retrieval method and system based on topic clustering global features
CN117828042A (en) Question and answer processing method, device, equipment and medium for financial service
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN110309278B (en) Keyword retrieval method, device, medium and electronic equipment
CN112182159A (en) Personalized retrieval type conversation method and system based on semantic representation
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN115658903B (en) Text classification method, model training method, related device and electronic equipment
Kang et al. A survey of image caption tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant