CN111144104B - Text similarity determination method, device and computer readable storage medium - Google Patents

Text similarity determination method, device and computer readable storage medium

Info

Publication number
CN111144104B
CN111144104B (application number CN201811297685.5A)
Authority
CN
China
Prior art keywords
text
words
similarity
word
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811297685.5A
Other languages
Chinese (zh)
Other versions
CN111144104A (en)
Inventor
路绪海
马怡安
黄挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN201811297685.5A priority Critical patent/CN111144104B/en
Publication of CN111144104A publication Critical patent/CN111144104A/en
Application granted granted Critical
Publication of CN111144104B publication Critical patent/CN111144104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The disclosure relates to a method and device for determining text similarity and to a computer-readable storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: calculating, using word vectors, the degree of correlation between each word in a first text and a second text as a first similarity, wherein the number of words in the first text is smaller than the number of words in the second text; selecting a corresponding number of words from the second text as target words according to the number of words in the first text; calculating, using the word vectors, the degree of correlation between each target word and the first text as a second similarity; and calculating a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text. This technical solution can improve the accuracy of text similarity.

Description

Text similarity determination method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method for determining text similarity, a device for determining text similarity, and a computer readable storage medium.
Background
In the field of artificial intelligence, text similarity calculation is a typical application of weak artificial intelligence and is the basis of interaction between a robot and a user. How to determine the similarity between texts is one of the actively researched directions in this field.
In the related art, words in a target text are compared with words in a comparison text one by one to determine text similarity.
Disclosure of Invention
The inventors of the present disclosure found that the above related art has the following problem: the accuracy of the determined text similarity is low and is greatly affected by text length, especially when the texts differ in length or one text contains the other.
In view of this, the present disclosure proposes a technical solution for determining text similarity, which can improve accuracy of text similarity.
According to some embodiments of the present disclosure, there is provided a method for determining text similarity, including: calculating, using word vectors, the degree of correlation between each word in a first text and a second text as a first similarity, wherein the number of words in the first text is smaller than the number of words in the second text; selecting a corresponding number of words from the second text as target words according to the number of words in the first text; calculating, using the word vectors, the degree of correlation between each target word and the first text as a second similarity; and calculating a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text.
In some embodiments, vector distances between the word vector of each word in the first text and the word vectors of the words in the second text are calculated; the smallest of these vector distances is taken as the correlation coefficient of the word; and the first similarity is determined according to a weighted sum of the correlation coefficients of all words in the first text.
In some embodiments, the difference between the number of words of the first text and the number of words of the second text is taken as a target number, and the target number of words are selected from the second text as the target words.
In some embodiments, the last N words of the second text are selected as the target words, where N is the target number.
In some embodiments, vector distances between the word vector of each target word in the second text and the word vectors of the words in the first text are calculated; the smallest of these vector distances is taken as the correlation coefficient of the target word; and the second similarity is determined according to a weighted sum of the correlation coefficients of all target words in the second text.
In some embodiments, the comprehensive similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
According to other embodiments of the present disclosure, there is provided a device for determining text similarity, including: a calculating unit configured to calculate, using word vectors, the degree of correlation between each word in a first text and a second text as a first similarity, where the number of words in the first text is smaller than the number of words in the second text, to calculate, using the word vectors, the degree of correlation between each target word and the first text as a second similarity, and to calculate a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text; and a selecting unit configured to select a corresponding number of words from the second text as the target words according to the number of words in the first text.
In some embodiments, the calculating unit calculates a vector distance between a word vector of a word in the first text and a word vector of each word in the second text, uses a minimum one of all the vector distances as a correlation coefficient of the word, and determines the first similarity according to a weighted sum of correlation coefficients of all the words in the first text.
In some embodiments, the selecting unit uses a difference value between the number of words of the first text and the number of words of the second text as a target number, and selects the target number of words in the second text as the target word.
In some embodiments, the selecting unit selects the last N words in the second text as the target words, where N is the target number.
In some embodiments, the calculating unit calculates a vector distance between a word vector of the target word in the second text and a word vector of each word in the first text, uses a minimum one of all the vector distances as a correlation coefficient of the target word, and determines the second similarity according to a weighted sum of correlation coefficients of all target words in the second text.
In some embodiments, the comprehensive similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
According to still further embodiments of the present disclosure, there is provided a text similarity determining apparatus, including: a memory; and a processor coupled to the memory, the processor configured to perform the method of determining text similarity in any of the embodiments described above based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of determining text similarity in any of the above embodiments.
In the above embodiments, not only is the degree of correlation between the words of the shorter text and the longer text considered; according to the length difference between the texts, the degree of correlation between a corresponding number of words of the longer text and the shorter text is also considered, and the resulting comprehensive similarity is adjusted according to the text length. This enhances the method's adaptability to text length and avoids unstable or inaccurate calculation caused by text length differences, thereby improving the accuracy of the text similarity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of some embodiments of a method of determining text similarity of the present disclosure;
FIG. 2 illustrates a flow chart of some embodiments of step 110 of FIG. 1;
FIG. 3 illustrates a flow chart of some embodiments of step 120 of FIG. 1;
FIG. 4 illustrates a block diagram of some embodiments of a determination apparatus of text similarity of the present disclosure;
FIG. 5 illustrates a block diagram of further embodiments of a text similarity determination apparatus of the present disclosure;
fig. 6 shows a block diagram of still further embodiments of a determination device of text similarity of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Fig. 1 illustrates a flow chart of some embodiments of a method of determining text similarity of the present disclosure.
As shown in fig. 1, the method includes: step 110, calculating a first similarity; step 120, selecting a target word; step 130, calculating a second similarity; and step 140, calculating the comprehensive similarity.
In step 110, the degree of correlation between each word in the first text and the second text is calculated as a first similarity by using word vectors, where the number of words in the first text is smaller than the number of words in the second text.
In some embodiments, word segmentation may be performed on the first text and the second text to obtain the words of each text, and a skip-gram model of Word2vec may then be used to calculate the word vector of each word. On the one hand, taking the shorter text as the processing object improves calculation efficiency; on the other hand, the degree of correlation between the words of the longer text and the shorter text can additionally be obtained in a subsequent step according to the text length difference, which improves the accuracy of the similarity.
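For illustration only, the following is a minimal sketch of this step, assuming Chinese text, the jieba segmenter, and gensim 4.x as the Word2vec implementation; the corpus, parameter values, and variable names are assumptions rather than the patented configuration.

```python
# Sketch: segment two texts and obtain skip-gram word vectors (assumed setup).
import jieba
from gensim.models import Word2Vec

corpus = ["今天天气很好", "今天天气不错，适合出门"]   # placeholder training corpus
sentences = [jieba.lcut(s) for s in corpus]            # word segmentation

# sg=1 selects the skip-gram training algorithm (gensim 4.x API)
model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

first_words = jieba.lcut("今天天气很好")                # shorter first text
second_words = jieba.lcut("今天天气不错，适合出门")      # longer second text
first_vecs = [model.wv[w] for w in first_words if w in model.wv]
second_vecs = [model.wv[w] for w in second_words if w in model.wv]
```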
In some embodiments, step 110 may be performed by the embodiment of fig. 2.
Fig. 2 illustrates a flow chart of some embodiments of step 110 of fig. 1.
As shown in fig. 2, step 110 includes: step 1110, calculating the distance of each vector; step 1120, determining a correlation coefficient; and step 1130, determining a first similarity.
In step 1110, vector distances between the word vectors of the words in the first text and the word vectors of the words in the second text are calculated. For example, the first text contains L words in total: w_{1,1}, w_{1,2}, …, w_{1,l}, …, w_{1,L}, with corresponding word vectors v_{1,1}, v_{1,2}, …, v_{1,l}, …, v_{1,L}; the second text contains M words in total: w_{2,1}, w_{2,2}, …, w_{2,m}, …, w_{2,M}, with corresponding word vectors v_{2,1}, v_{2,2}, …, v_{2,m}, …, v_{2,M}.
In some embodiments, the Euclidean distances d_{1,1}, d_{1,2}, …, d_{1,m}, …, d_{1,M} between the word vector v_{1,1} of w_{1,1} and v_{2,1}, v_{2,2}, …, v_{2,m}, …, v_{2,M} may be calculated respectively.
In step 1120, the smallest of all the vector distances is taken as the correlation coefficient of the word. For example, the correlation coefficient of w_{1,1} is d_1 = min(d_{1,1}, d_{1,2}, …, d_{1,m}, …, d_{1,M}). The correlation coefficients of the L words in the first text are thus obtained: d_1, d_2, …, d_l, …, d_L.
In step 1130, a first similarity is determined according to a weighted sum of the correlation coefficients of all words in the first text. For example, the first similarity may be determined as D_1 = d_1 + d_2 + … + d_L; the d_l may also be weighted before summation as required.
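A minimal sketch of steps 1110 to 1130 follows, assuming the word vectors are numpy arrays (as produced above) and using equal weights for the summation; the function name is illustrative.

```python
# Sketch: first similarity D1 = sum over words of the first text of the
# smallest Euclidean distance to any word vector of the second text.
import numpy as np

def first_similarity(first_vecs, second_vecs):
    coeffs = []
    for v1 in first_vecs:
        # distances from this word of the first text to every word of the second text
        dists = [np.linalg.norm(v1 - v2) for v2 in second_vecs]
        coeffs.append(min(dists))          # correlation coefficient of the word
    return float(np.sum(coeffs))           # D1 (equal weights)
```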
After the first similarity is determined, the comprehensive similarity may be calculated through the remaining steps in FIG. 1.
In step 120, according to the number of words in the first text, a corresponding number of words are selected as target words in the second text. For example, step 120 may be performed by the embodiment of fig. 3.
Fig. 3 illustrates a flow chart of some embodiments of step 120 of fig. 1.
As shown in fig. 3, step 120 includes: step 1210, determining a target number; step 1220, select the target word.
In step 1210, the difference between the number of words of the first text and the number of words of the second text is taken as the target number.
In step 1220, a target number of terms are selected as target terms in the second text. For example, the last N words may be selected in the second text as target words, N being the target number.
In some embodiments, if there are 10 words in the first text and 50 words in the second text, the 11th to 50th words of the second text may be selected as the target words. A corresponding number of target words may also be selected at random, as required.
In the case where the longer second text contains the words of the shorter first text but has a completely different meaning from the first text, target words selected in this way express the true meaning of the second text more accurately, which improves the accuracy of the determined text similarity.
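A minimal sketch of step 120 under the last-N selection option described above; the names are illustrative.

```python
# Sketch: select the last N = M - L words of the longer second text as target words.
def select_target_words(first_words, second_words):
    n = len(second_words) - len(first_words)      # target number N
    return second_words[-n:] if n > 0 else []     # last N words of the second text
```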
After the target words are selected, the comprehensive similarity can be calculated through the remaining steps in FIG. 1.
In step 130, the degree of relevance of each target word to the first text is calculated as a second similarity using the word vector. For example, a vector distance between a word vector of the target word in the second text and a word vector of each word in the first text is calculated. The smallest one of all vector distances is taken as the correlation coefficient of the target word. And determining the second similarity according to the weighted sum of the correlation coefficients of all target words in the second text.
In some embodiments, N words are selected from the second text as target words, where N = M − L. The N target words may be the (L+1)-th to (L+N)-th words of the second text: w_{2,L+1}, w_{2,L+2}, …, w_{2,L+N}, with corresponding word vectors v_{2,L+1}, v_{2,L+2}, …, v_{2,L+N}. The Euclidean distances d_{L+1,1}, d_{L+1,2}, …, d_{L+1,L} between the word vector v_{2,L+1} of w_{2,L+1} and v_{1,1}, v_{1,2}, …, v_{1,l}, …, v_{1,L} are calculated respectively.
The smallest of all the vector distances may be taken as the correlation coefficient of the target word. For example, the correlation coefficient of w_{2,L+1} is d_{L+1} = min(d_{L+1,1}, d_{L+1,2}, …, d_{L+1,L}). The correlation coefficients of the N target words in the second text are thus obtained: d_{L+1}, d_{L+2}, …, d_{L+N}.
The second similarity is determined according to a weighted sum of the correlation coefficients of all target words in the second text. For example, the second similarity may be determined as D_2 = d_{L+1} + d_{L+2} + … + d_{L+N}; the d_{L+n} may also be weighted before summation as required.
In step 140, the comprehensive similarity of the first text and the second text is calculated according to the first similarity, the second similarity, and the length of the second text. For example, the comprehensive similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
In some embodiments, where the relatively long second text contains M words in total, the comprehensive similarity S may be determined according to the following formula:
S = (D_1 + D_2) / M
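Combining the sketches above, a minimal end-to-end version of step 140 might look as follows; wv stands for any word-to-vector mapping (e.g. model.wv), and all names are assumptions. Note that, since D1 and D2 are built from distances in this sketch, a smaller S indicates closer texts.

```python
# Sketch: comprehensive similarity S = (D1 + D2) / M, with M the word count
# of the longer second text.
def comprehensive_similarity(first_words, second_words, wv):
    first_vecs = [wv[w] for w in first_words if w in wv]
    second_vecs = [wv[w] for w in second_words if w in wv]
    target_words = select_target_words(first_words, second_words)
    target_vecs = [wv[w] for w in target_words if w in wv]
    d1 = first_similarity(first_vecs, second_vecs)
    d2 = second_similarity(target_vecs, first_vecs)
    return (d1 + d2) / len(second_words)          # divide by M
```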
in some embodiments, a synonym table and a paraphrasing table may be preset, synonyms and paraphrasing in the first text and the second text are determined according to the synonym table and the paraphrasing table, and the distance between the synonym and the paraphrasing vector is set to 0.
In the above embodiments, not only is the degree of correlation between the words of the shorter text and the longer text considered; according to the length difference between the texts, the degree of correlation between a corresponding number of words of the longer text and the shorter text is also considered, and the resulting comprehensive similarity is adjusted according to the text length. This enhances the method's adaptability to text length and avoids unstable or inaccurate calculation caused by text length differences, thereby improving the accuracy of the text similarity.
Fig. 4 illustrates a block diagram of some embodiments of a determination apparatus of text similarity of the present disclosure.
As shown in fig. 4, the text similarity determining device 4 includes a calculating unit 41 and a selecting unit 42.
The calculation unit 41 calculates, as the first similarity, the degree of correlation of each word in the first text with the second text using the word vector, the number of words in the first text being smaller than the number of words in the second text.
In some embodiments, the calculation unit 41 calculates a vector distance between a word vector of words in the first text and a word vector of words in the second text. The calculation unit 41 takes the smallest one of all vector distances as the correlation coefficient of the word. The calculation unit 41 determines the first similarity from a weighted sum of the correlation coefficients of all the words in the first text.
The selecting unit 42 selects a corresponding number of words in the second text as target words according to the number of words in the first text.
In some embodiments, the selection unit 42 takes the difference between the number of words of the first text and the number of words of the second text as the target number N, and selects the target number of words in the second text as the target words. For example, the selection unit 42 selects the last N words in the second text as target words.
The calculation unit 41 calculates the degree of correlation of each target word with the first text as the second degree of similarity using the word vector. For example, the calculation unit 41 calculates a vector distance between a word vector of a target word in the second text and a word vector of each word in the first text, takes the smallest one of all vector distances as a correlation coefficient of the target word, and determines a second similarity from a weighted sum of correlation coefficients of all target words in the second text.
The calculation unit 41 calculates the comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text. For example, the comprehensive similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
In the above embodiments, not only is the degree of correlation between the words of the shorter text and the longer text considered; according to the length difference between the texts, the degree of correlation between a corresponding number of words of the longer text and the shorter text is also considered, and the resulting comprehensive similarity is adjusted according to the text length. This enhances the device's adaptability to text length and avoids unstable or inaccurate calculation caused by text length differences, thereby improving the accuracy of the text similarity.
Fig. 5 shows a block diagram of further embodiments of a device for determining text similarity of the present disclosure.
As shown in fig. 5, the text similarity determining device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to perform the method of determining text similarity in any of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Fig. 6 shows a block diagram of still further embodiments of a determination device of text similarity of the present disclosure.
As shown in fig. 6, the text similarity determining device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the method of determining text similarity in any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The text similarity determining device 6 may further include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650 and the memory 610 and processor 620 may be connected by, for example, a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. Network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards, U-discs, and the like.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Heretofore, a determination method of text similarity, a determination apparatus of text similarity, and a computer-readable storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (12)

1. A method for determining text similarity comprises the following steps:
calculating the degree of correlation between each word in a first text and a second text by using word vectors to serve as a first similarity, wherein the number of words in the first text is smaller than that of words in the second text;
selecting a corresponding number of words from the second text as target words according to the number of words of the first text;
calculating the correlation degree of each target word and the first text by using the word vector as a second similarity;
calculating the comprehensive similarity of the first text and the second text according to the first similarity, the second similarity and the length of the second text;
the selecting the corresponding number of words in the second text as the target words according to the number of words in the first text comprises:
taking the difference value of the word number of the first text and the word number of the second text as a target number;
and selecting the words of the target number from the second text as the target words.
2. The determining method according to claim 1, wherein the calculating, using the word vector, a degree of relevance of each word in the first text to the second text as the first similarity includes:
calculating vector distances between word vectors of words in the first text and word vectors of words in the second text;
taking the smallest one of all the vector distances as a correlation coefficient of the word;
and determining the first similarity according to the weighted sum of the correlation coefficients of all words in the first text.
3. The determining method of claim 1, wherein the selecting the target number of words in the second text as the target words comprises:
and selecting the last N words in the second text as the target words, wherein N is the target number.
4. The determining method of claim 1, wherein the calculating, using the word vector, a degree of relevance of each target word to the first text as a second degree of similarity includes:
calculating vector distances between word vectors of target words in the second text and word vectors of words in the first text;
taking the smallest one of all the vector distances as a correlation coefficient of the target word;
and determining the second similarity according to the weighted sum of the correlation coefficients of all target words in the second text.
5. The method for determining according to any one of claims 1 to 4, wherein,
the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
6. A text similarity determining device includes:
a calculating unit configured to calculate, using word vectors, the degree of correlation between each word in a first text and a second text as a first similarity, wherein the number of words in the first text is smaller than the number of words in the second text, to calculate, using the word vectors, the degree of correlation between each target word and the first text as a second similarity, and to calculate a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text; and
a selecting unit configured to select a corresponding number of words from the second text as the target words according to the number of words in the first text, by taking the difference between the number of words of the first text and the number of words of the second text as a target number and selecting the target number of words from the second text as the target words.
7. The determining apparatus according to claim 6, wherein,
the calculating unit calculates a vector distance between a word vector of a word in the first text and a word vector of each word in the second text, takes the smallest one of all the vector distances as a correlation coefficient of the word, and determines the first similarity according to a weighted sum of the correlation coefficients of all the words in the first text.
8. The determining apparatus according to claim 6, wherein,
and the selecting unit selects the last N words in the second text as the target words, wherein N is the target number.
9. The determining apparatus according to claim 6, wherein,
the calculating unit calculates a vector distance between a word vector of the target word in the second text and a word vector of each word in the first text, takes the smallest one of all the vector distances as a correlation coefficient of the target word, and determines the second similarity according to a weighted sum of the correlation coefficients of all the target words in the second text.
10. The determining apparatus according to any one of claims 6 to 9, wherein,
the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
11. A text similarity determining device includes:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of determining text similarity of any of claims 1-5 based on instructions stored in the memory.
12. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of determining text similarity according to any of claims 1-5.
CN201811297685.5A 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium Active CN111144104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811297685.5A CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811297685.5A CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111144104A CN111144104A (en) 2020-05-12
CN111144104B true CN111144104B (en) 2023-06-20

Family

ID=70515097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811297685.5A Active CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111144104B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8606779B2 (en) * 2006-09-14 2013-12-10 Nec Corporation Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
US9311390B2 (en) * 2008-01-29 2016-04-12 Educational Testing Service System and method for handling the confounding effect of document length on vector-based similarity scores
JP4735726B2 (en) * 2009-02-18 2011-07-27 ソニー株式会社 Information processing apparatus and method, and program
US10176251B2 (en) * 2015-08-31 2019-01-08 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
CN105955948B (en) * 2016-04-22 2018-07-24 武汉大学 A kind of short text theme modeling method based on semanteme of word similarity
CN106776559B (en) * 2016-12-14 2020-08-11 东软集团股份有限公司 Text semantic similarity calculation method and device
CN106708804A (en) * 2016-12-27 2017-05-24 努比亚技术有限公司 Method and device for generating word vectors
CN106980870B (en) * 2016-12-30 2020-07-28 中国银联股份有限公司 Method for calculating text matching degree between short texts
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device of the text similarity analysis based on Spark-Streaming

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Text classification based on the hidden Markov model and semantic fusion; 高知新 (Gao Zhixin); 《计算机应用与软件》 (Computer Applications and Software); full text *

Also Published As

Publication number Publication date
CN111144104A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
WO2018120889A1 (en) Input sentence error correction method and device, electronic device, and medium
CN110036399B (en) Neural network data entry system
CN108710613B (en) Text similarity obtaining method, terminal device and medium
CN108132931B (en) Text semantic matching method and device
JP2019526142A (en) Search term error correction method and apparatus
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
TW201835818A (en) Implementing neural networks in fixed point arithmetic computing systems
WO2018121531A1 (en) Method and apparatus for generating test case script
CN112580324B (en) Text error correction method, device, electronic equipment and storage medium
US20170004820A1 (en) Method for building a speech feature library, and method, apparatus, device, and computer readable storage media for speech synthesis
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
US20180246856A1 (en) Analysis method and analysis device
GB2575580A (en) Supporting interactive text mining process with natural language dialog
JP6237632B2 (en) Text information monitoring dictionary creation device, text information monitoring dictionary creation method, and text information monitoring dictionary creation program
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
CN112990625A (en) Method and device for allocating annotation tasks and server
CN114463551A (en) Image processing method, image processing device, storage medium and electronic equipment
CN111144104B (en) Text similarity determination method, device and computer readable storage medium
US11783129B2 (en) Interactive control system, interactive control method, and computer program product
JP2019074982A (en) Information search device, search processing method, and program
JP6427480B2 (en) IMAGE SEARCH DEVICE, METHOD, AND PROGRAM
CN111026879B (en) Multi-dimensional value-oriented intent-oriented object-oriented numerical calculation method
JPWO2019106758A1 (en) Language processing apparatus, language processing system, and language processing method
CN112966513A (en) Method and apparatus for entity linking
CN110633474B (en) Mathematical formula identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200512

Assignee: Tianyiyun Technology Co.,Ltd.

Assignor: CHINA TELECOM Corp.,Ltd.

Contract record no.: X2024110000020

Denomination of invention: Method, device, and computer-readable storage medium for determining text similarity

Granted publication date: 20230620

License type: Common License

Record date: 20240315