CN111144104A - Text similarity determination method and device and computer readable storage medium

Text similarity determination method and device and computer readable storage medium

Info

Publication number
CN111144104A
CN111144104A
Authority
CN
China
Prior art keywords
text
words
similarity
word
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811297685.5A
Other languages
Chinese (zh)
Other versions
CN111144104B (en)
Inventor
路绪海
马怡安
黄挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN201811297685.5A priority Critical patent/CN111144104B/en
Publication of CN111144104A publication Critical patent/CN111144104A/en
Application granted granted Critical
Publication of CN111144104B publication Critical patent/CN111144104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The disclosure relates to a text similarity determination method and device and a computer readable storage medium, and relates to the technical field of artificial intelligence. The method comprises the following steps: calculating, using word vectors, the degree of correlation of each word in a first text with a second text as a first similarity, wherein the number of words in the first text is smaller than the number of words in the second text; selecting a corresponding number of words from the second text as target words according to the number of words in the first text; calculating, using the word vectors, the degree of correlation of each target word with the first text as a second similarity; and calculating a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text. The technical solution of the present disclosure can improve the accuracy of text similarity.

Description

Text similarity determination method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a computer-readable storage medium for determining text similarity.
Background
In the field of artificial intelligence, the calculation of text similarity is a typical application of weak artificial intelligence and is the basis for interaction between a robot and a user. How to determine the similarity of texts is one of the popular research directions in the field.
In the related art, words in the target text are compared with words in the comparison text one by one to determine the text similarity.
Disclosure of Invention
The inventors of the present disclosure found that the above related art has the following problem: the method is strongly affected by text length, and in particular when the text lengths differ or one text contains the other, the accuracy of the determined text similarity is low.
In view of this, the present disclosure provides a technical solution for determining text similarity, which can improve the accuracy of text similarity.
According to some embodiments of the present disclosure, there is provided a text similarity determination method including: calculating the correlation degree of each word in the first text and the second text by using a word vector to serve as a first similarity, wherein the number of the words in the first text is smaller than that of the words in the second text; according to the number of the words of the first text, selecting a corresponding number of words from the second text as target words; calculating the degree of correlation between each target word and the first text by using the word vector to serve as a second similarity; and calculating the comprehensive similarity of the first text and the second text according to the first similarity, the second similarity and the length of the second text.
In some embodiments, a vector distance between a word vector of a word in the first text and a word vector of each word in the second text is calculated; taking the minimum one of all the vector distances as a correlation coefficient of the word; and determining the first similarity according to the weighted sum of the correlation coefficients of all the words in the first text.
In some embodiments, a difference between the number of words of the first text and the number of words of the second text is taken as a target number; selecting the target number of words in the second text as the target words.
In some embodiments, the last N words are selected as the target words in the second text, where N is the target number.
In some embodiments, calculating a vector distance between a word vector of a target word in the second text and a word vector of each word in the first text; taking the minimum one of all the vector distances as a correlation coefficient of the target word; and determining the second similarity according to the weighted sum of the correlation coefficients of all target words in the second text.
In some embodiments, the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with a length of the second text.
According to other embodiments of the present disclosure, there is provided a text similarity determination device, including: a calculation unit configured to calculate, using word vectors, the degree of correlation of each word in a first text with a second text as a first similarity, wherein the number of words in the first text is smaller than the number of words in the second text, to calculate, using the word vectors, the degree of correlation of each target word with the first text as a second similarity, and to calculate a comprehensive similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text; and a selecting unit configured to select a corresponding number of words in the second text as the target words according to the number of words in the first text.
In some embodiments, the calculation unit calculates a vector distance between a word vector of a word in the first text and a word vector of each word in the second text, takes the smallest one of all the vector distances as a correlation coefficient of the word, and determines the first similarity according to a weighted sum of the correlation coefficients of all the words in the first text.
In some embodiments, the selecting unit takes the difference between the number of words in the first text and the number of words in the second text as a target number, and selects the target number of words in the second text as the target words.
In some embodiments, the selecting unit selects the last N words in the second text as the target words, where N is the target number.
In some embodiments, the calculation unit calculates a vector distance between a word vector of a target word in the second text and a word vector of each word in the first text, takes the smallest one of all the vector distances as a correlation coefficient of the target word, and determines the second similarity according to a weighted sum of the correlation coefficients of all the target words in the second text.
In some embodiments, the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with a length of the second text.
According to still other embodiments of the present disclosure, there is provided a device for determining text similarity, including: a memory; and a processor coupled to the memory, the processor configured to perform the method for determining text similarity in any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of determining text similarity in any of the above embodiments.
In the above embodiments, not only is the degree of correlation of the words of the shorter text with the longer text considered, but also, according to the length difference between the texts, the degree of correlation of a corresponding number of words of the longer text with the shorter text, so as to obtain a comprehensive similarity that is then adjusted according to the text length. This enhances the adaptability of the method to text length, avoids unstable or inaccurate results caused by differences in text length, and improves the accuracy of the text similarity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a text similarity determination method of the present disclosure;
FIG. 2 illustrates a flow diagram of some embodiments of step 110 of FIG. 1;
FIG. 3 illustrates a flow diagram of some embodiments of step 120 of FIG. 1;
FIG. 4 illustrates a block diagram of some embodiments of a text similarity determination apparatus of the present disclosure;
FIG. 5 shows a block diagram of further embodiments of a text similarity determination apparatus of the present disclosure;
FIG. 6 shows a block diagram of further embodiments of the text similarity determination apparatus of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 illustrates a flow diagram of some embodiments of a text similarity determination method of the present disclosure.
As shown in fig. 1, the method includes: step 110, calculating a first similarity; step 120, selecting a target word; step 130, calculating a second similarity; and step 140, calculating the comprehensive similarity.
In step 110, the word vectors are used to calculate the degree of correlation between each word in the first text and the second text as a first similarity, and the number of words in the first text is smaller than that in the second text.
In some embodiments, word segmentation may be performed on the first text and the second text to obtain the words of each text, and the word vector of each word may then be calculated using the skip-gram model of Word2vec. On the one hand, taking the shorter text as the processing object improves calculation efficiency; on the other hand, the degree of correlation of the words of the longer text with the shorter text can be further obtained in subsequent steps according to the difference in text length, thereby improving the accuracy of the similarity.
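As an illustration of this preprocessing, the following is a minimal sketch assuming the jieba tokenizer and the gensim implementation of Word2vec; the text contents, parameter values, and variable names are hypothetical and not part of the disclosure.

```python
# Minimal preprocessing sketch (assumes jieba and gensim are installed).
# In practice the Word2vec model would be trained on a large corpus or
# loaded from disk; training on two sentences here is for illustration only.
import jieba
from gensim.models import Word2Vec

first_text = "..."   # the shorter text (placeholder)
second_text = "..."  # the longer text (placeholder)

# Word segmentation of both texts.
words_1 = list(jieba.cut(first_text))
words_2 = list(jieba.cut(second_text))

# Skip-gram Word2vec model (sg=1 selects skip-gram).
model = Word2Vec(sentences=[words_1, words_2], vector_size=100,
                 window=5, min_count=1, sg=1)

# Word vectors for each word of the two texts.
vectors_1 = [model.wv[w] for w in words_1]
vectors_2 = [model.wv[w] for w in words_2]
```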
In some embodiments, step 110 may be performed by the embodiment in fig. 2.
FIG. 2 illustrates a flow diagram of some embodiments of step 110 of FIG. 1.
As shown in fig. 2, step 110 includes: step 1110, calculating each vector distance; step 1120, determining a correlation coefficient; and step 1130, determining the first similarity.
In step 1110, the vector distance between the word vector of a word in the first text and the word vector of each word in the second text is calculated. For example, the first text contains $L$ words in total: $w_{1,1}, w_{1,2}, \ldots, w_{1,l}, \ldots, w_{1,L}$, with corresponding word vectors $v_{1,1}, v_{1,2}, \ldots, v_{1,l}, \ldots, v_{1,L}$; the second text contains $M$ words in total: $w_{2,1}, w_{2,2}, \ldots, w_{2,m}, \ldots, w_{2,M}$, with corresponding word vectors $v_{2,1}, v_{2,2}, \ldots, v_{2,m}, \ldots, v_{2,M}$.
In some embodiments, the Euclidean distances $d_{1,1}, d_{1,2}, \ldots, d_{1,m}, \ldots, d_{1,M}$ between the word vector $v_{1,1}$ of $w_{1,1}$ and each of $v_{2,1}, v_{2,2}, \ldots, v_{2,m}, \ldots, v_{2,M}$ may be calculated separately.
In step 1120, the smallest of all the vector distances is taken as the correlation coefficient of the word. For example, the correlation coefficient of $w_{1,1}$ is $d_1 = \min(d_{1,1}, d_{1,2}, \ldots, d_{1,m}, \ldots, d_{1,M})$. In the same way, the correlation coefficients of all $L$ words in the first text can be obtained: $d_1, d_2, \ldots, d_l, \ldots, d_L$.
In step 1130, the first similarity is determined based on a weighted sum of the correlation coefficients of all words in the first text. For example, the first similarity may be determined as

$D_1 = \sum_{l=1}^{L} d_l$

The coefficients $d_l$ may also be weighted as required before being summed.
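A sketch of steps 1110 through 1130 is given below, assuming equal weights (a plain sum of the correlation coefficients); the helper name and the use of NumPy are illustrative and continue the hypothetical variables from the preprocessing sketch above.

```python
import numpy as np

def correlation_sum(query_vectors, reference_vectors):
    """Sum of correlation coefficients: for each query vector, the
    correlation coefficient is the minimum Euclidean distance to any
    reference vector (equal weights assumed)."""
    total = 0.0
    for v in query_vectors:
        distances = [np.linalg.norm(np.asarray(v) - np.asarray(r))
                     for r in reference_vectors]
        total += min(distances)
    return total

# First similarity D1: every word of the shorter first text is compared
# against all words of the longer second text.
D1 = correlation_sum(vectors_1, vectors_2)
```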
After the first similarity is determined, the overall similarity may be calculated by the remaining steps in fig. 1.
In step 120, according to the number of words in the first text, a corresponding number of words are selected as target words in the second text. Step 120 may be performed, for example, by the embodiment in fig. 3.
Fig. 3 illustrates a flow diagram of some embodiments of step 120 of fig. 1.
As shown in fig. 3, step 120 includes: step 1210, determining a target number; step 1220, select the target word.
In step 1210, the difference between the number of words of the first text and the number of words of the second text is taken as a target number.
In step 1220, a target number of words are selected in the second text as target words. For example, the last N words may be selected as target words in the second text, where N is the target number.
In some embodiments, the first text contains 10 words and the second text contains 50 words, and the 11th to 50th words of the second text may be selected as the target words. Alternatively, a corresponding number of target words may be selected randomly as required.
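A minimal sketch of steps 1210 and 1220 (selecting the last N words of the second text), continuing the hypothetical variables above:

```python
# Target number N: difference between the word counts (step 1210).
N = len(words_2) - len(words_1)

# Select the last N words of the second text as target words (step 1220).
target_words = words_2[-N:]
target_vectors = vectors_2[-N:]
```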
In the case where the longer second text contains the words of the shorter first text but its meaning is completely different from that of the first text, the selected target words express the true meaning of the second text more accurately, which improves the accuracy of the determined text similarity.
After the target word is selected, the comprehensive similarity can be calculated through the other steps in fig. 1.
In step 130, the degree of correlation of each target word with the first text is calculated as the second similarity using the word vectors. For example, the vector distance between the word vector of a target word in the second text and the word vector of each word in the first text is calculated, the smallest of all these vector distances is taken as the correlation coefficient of the target word, and the second similarity is determined according to a weighted sum of the correlation coefficients of all target words in the second text.
In some embodiments, $N$ words in the second text are selected as target words, where $N = M - L$. The $N$ target words may be the $(L+1)$-th to $(L+N)$-th words of the second text: $w_{2,L+1}, w_{2,L+2}, \ldots, w_{2,L+N}$, with corresponding word vectors $v_{2,L+1}, v_{2,L+2}, \ldots, v_{2,L+N}$. The Euclidean distances $d_{L+1,1}, d_{L+1,2}, \ldots, d_{L+1,L}$ between the word vector $v_{2,L+1}$ of $w_{2,L+1}$ and each of $v_{1,1}, v_{1,2}, \ldots, v_{1,l}, \ldots, v_{1,L}$ are calculated separately.
The smallest of all the vector distances may be taken as the correlation coefficient of the target word. For example, the correlation coefficient of $w_{2,L+1}$ is $d_{L+1} = \min(d_{L+1,1}, d_{L+1,2}, \ldots, d_{L+1,L})$. In the same way, the correlation coefficients of all $N$ target words in the second text can be obtained: $d_{L+1}, d_{L+2}, \ldots, d_{L+N}$.
The second similarity is then determined according to a weighted sum of the correlation coefficients of all target words in the second text. For example, the second similarity may be determined as

$D_2 = \sum_{n=1}^{N} d_{L+n}$

The coefficients $d_{L+n}$ may also be weighted as required before being summed.
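Continuing the sketch, the second similarity can be computed with the same hypothetical helper, this time comparing each target word of the second text against the first text:

```python
# Second similarity D2: every target word of the longer second text is
# compared against all words of the shorter first text.
D2 = correlation_sum(target_vectors, vectors_1)
```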
In step 140, a comprehensive similarity of the first text and the second text is calculated according to the first similarity, the second similarity and the length of the second text. For example, the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
In some embodiments, where the relatively long second text contains M words in total, the overall similarity S may be determined according to the following formula:
$S = (D_1 + D_2) / M$
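In the sketch above, this formula reduces to a single line; note that, in this sketch, a smaller value of S indicates texts that are closer, since the underlying correlation coefficients are Euclidean distances.

```python
# Composite similarity S = (D1 + D2) / M, where M is the word count of
# the longer second text; smaller S means the texts are closer, because
# the underlying coefficients are distances.
M = len(words_2)
S = (D1 + D2) / M
```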
in some embodiments, the synonym table and the near-sense table may be preset, the synonyms and the near-senses in the first text and the second text are determined according to the synonym table and the near-sense table, and the word vector distance of the synonyms and the near-senses is set to 0.
In the above embodiments, not only is the degree of correlation of the words of the shorter text with the longer text considered, but also, according to the length difference between the texts, the degree of correlation of a corresponding number of words of the longer text with the shorter text, so as to obtain a comprehensive similarity that is then adjusted according to the text length. This enhances the adaptability of the method to text length, avoids unstable or inaccurate results caused by differences in text length, and improves the accuracy of the text similarity.
Fig. 4 illustrates a block diagram of some embodiments of a text similarity determination apparatus of the present disclosure.
As shown in fig. 4, the text similarity determination device 4 includes a calculation unit 41 and a selection unit 42.
The calculation unit 41 calculates the degree of correlation of each word in the first text with the second text as the first similarity using the word vector, the number of words in the first text being smaller than the number of words in the second text.
In some embodiments, the calculation unit 41 calculates a vector distance between the word vector of the word in the first text and the word vector of each word in the second text. The calculation unit 41 takes the smallest one of all vector distances as the correlation coefficient for the word. The calculation unit 41 determines the first similarity from a weighted sum of the correlation coefficients of all words in the first text.
The selecting unit 42 selects a corresponding number of words in the second text as target words according to the number of words in the first text.
In some embodiments, the selecting unit 42 uses the difference between the number of words in the first text and the number of words in the second text as the target number N, and selects a target number of words in the second text as the target words. For example, the selecting unit 42 selects the last N words in the second text as the target words.
The calculation unit 41 calculates the degree of correlation of each target word with the first text as the second similarity using the word vector. For example, the calculation unit 41 calculates a vector distance between the word vector of the target word in the second text and the word vector of each word in the first text, takes the smallest one of all the vector distances as the correlation coefficient of the target word, and determines the second similarity degree according to a weighted sum of the correlation coefficients of all the target words in the second text.
The calculation unit 41 calculates the integrated similarity of the first text and the second text according to the first similarity, the second similarity, and the length of the second text. For example, the integrated similarity is positively correlated with a weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
In the above embodiments, not only is the degree of correlation of the words of the shorter text with the longer text considered, but also, according to the length difference between the texts, the degree of correlation of a corresponding number of words of the longer text with the shorter text, so as to obtain a comprehensive similarity that is then adjusted according to the text length. This enhances the adaptability of the method to text length, avoids unstable or inaccurate results caused by differences in text length, and improves the accuracy of the text similarity.
Fig. 5 shows a block diagram of further embodiments of the text similarity determination apparatus of the present disclosure.
As shown in fig. 5, the text similarity determination device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute the text similarity determination method in any one embodiment of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of the text similarity determination apparatus of the present disclosure.
As shown in fig. 6, the text similarity determination device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, wherein the processor 620 is configured to execute the text similarity determination method in any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
The text similarity determination apparatus 6 may further include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, as well as the memory 610 and the processor 620, may be connected to one another through a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
So far, the text similarity determination method, the text similarity determination apparatus, and the computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (14)

1. A text similarity determination method comprises the following steps:
calculating the correlation degree of each word in the first text and the second text by using a word vector to serve as a first similarity, wherein the number of the words in the first text is smaller than that of the words in the second text;
according to the number of the words of the first text, selecting a corresponding number of words from the second text as target words;
calculating the degree of correlation between each target word and the first text by using the word vector to serve as a second similarity;
and calculating the comprehensive similarity of the first text and the second text according to the first similarity, the second similarity and the length of the second text.
2. The determination method according to claim 1, wherein the calculating, as the first similarity, a degree of correlation of each word in the first text with the second text using the word vector comprises:
calculating a vector distance between a word vector of a word in the first text and a word vector of each word in the second text;
taking the minimum one of all the vector distances as a correlation coefficient of the word;
and determining the first similarity according to the weighted sum of the correlation coefficients of all the words in the first text.
3. The determination method according to claim 1, wherein the selecting a corresponding number of words as target words in the second text according to the number of words in the first text comprises:
taking a difference value between the number of words of the first text and the number of words of the second text as a target number;
selecting the target number of words in the second text as the target words.
4. The determination method of claim 3, wherein the selecting the target number of words as the target words in the second text comprises:
and selecting the last N words in the second text as the target words, wherein N is the target number.
5. The determination method according to claim 1, wherein the calculating, as the second similarity, the degree of correlation of each target word with the first text using the word vector comprises:
calculating a vector distance between a word vector of a target word in the second text and a word vector of each word in the first text;
taking the minimum one of all the vector distances as a correlation coefficient of the target word;
and determining the second similarity according to the weighted sum of the correlation coefficients of all target words in the second text.
6. The determination method according to any one of claims 1 to 5,
the integrated similarity is positively correlated with the weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
7. A device for determining text similarity, comprising:
the calculation unit is used for calculating the correlation degree of each word in a first text and a second text as a first similarity by using a word vector, wherein the number of the words in the first text is smaller than that of the words in the second text, calculating the correlation degree of each target word and the first text as a second similarity by using the word vector, and calculating the comprehensive similarity of the first text and the second text according to the first similarity, the second similarity and the length of the second text;
and the selecting unit is used for selecting a corresponding number of words in the second text as the target words according to the number of words in the first text.
8. The determination apparatus according to claim 7,
the calculation unit calculates vector distances between word vectors of words in the first text and word vectors of words in the second text, takes the smallest one of all the vector distances as a correlation coefficient of the word, and determines the first similarity according to a weighted sum of the correlation coefficients of all the words in the first text.
9. The determination apparatus according to claim 7,
the selecting unit takes the difference value between the number of the words of the first text and the number of the words of the second text as a target number, and selects the words of the target number as the target words in the second text.
10. The determination apparatus according to claim 9,
the selecting unit selects the last N words in the second text as the target words, wherein N is the target number.
11. The determination apparatus according to claim 7,
the calculation unit calculates vector distances between word vectors of target words in the second text and word vectors of words in the first text, takes the smallest one of all the vector distances as a correlation coefficient of the target words, and determines the second similarity according to a weighted sum of the correlation coefficients of all the target words in the second text.
12. The determination apparatus according to any one of claims 7 to 11,
the integrated similarity is positively correlated with the weighted sum of the first similarity and the second similarity, and negatively correlated with the length of the second text.
13. A device for determining text similarity, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of determining text similarity of any of claims 1-6 based on instructions stored in the memory.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of determining text similarity according to any one of claims 1 to 6.
CN201811297685.5A 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium Active CN111144104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811297685.5A CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811297685.5A CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111144104A true CN111144104A (en) 2020-05-12
CN111144104B CN111144104B (en) 2023-06-20

Family

ID=70515097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811297685.5A Active CN111144104B (en) 2018-11-02 2018-11-02 Text similarity determination method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111144104B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
US20100023505A1 (en) * 2006-09-14 2010-01-28 Nec Corporation Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
CN101808210A (en) * 2009-02-18 2010-08-18 索尼公司 Messaging device, information processing method and program
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
US20170060995A1 (en) * 2015-08-31 2017-03-02 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
CN106708804A (en) * 2016-12-27 2017-05-24 努比亚技术有限公司 Method and device for generating word vectors
CN106776559A (en) * 2016-12-14 2017-05-31 东软集团股份有限公司 The method and device of text semantic Similarity Measure
CN106980870A (en) * 2016-12-30 2017-07-25 中国银联股份有限公司 Text matches degree computational methods between short text
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device of the text similarity analysis based on Spark-Streaming

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100023505A1 (en) * 2006-09-14 2010-01-28 Nec Corporation Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
CN101808210A (en) * 2009-02-18 2010-08-18 索尼公司 Messaging device, information processing method and program
US20170060995A1 (en) * 2015-08-31 2017-03-02 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN106776559A (en) * 2016-12-14 2017-05-31 东软集团股份有限公司 The method and device of text semantic Similarity Measure
CN106708804A (en) * 2016-12-27 2017-05-24 努比亚技术有限公司 Method and device for generating word vectors
CN106980870A (en) * 2016-12-30 2017-07-25 中国银联股份有限公司 Text matches degree computational methods between short text
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device of the text similarity analysis based on Spark-Streaming

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO-MAN ZHONG: "Event-Based Text Similarity Computing", 《2009 INTERNATIONAL CONFERENCE ON MANAGEMENT AND SERVICE SCIENCE》 *
陈贤武: "Research on an Automatic Grading Model for Subjective Test Questions Based on Sentence Similarity", Engineering Journal of Wuhan University, no. 07 *
高知新: "Text Classification Based on Hidden Markov Model and Semantic Fusion", Computer Applications and Software *

Also Published As

Publication number Publication date
CN111144104B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
TWI664540B (en) Search word error correction method and device, and weighted edit distance calculation method and device
WO2018120889A1 (en) Input sentence error correction method and device, electronic device, and medium
CN108132931B (en) Text semantic matching method and device
CN108345580B (en) Word vector processing method and device
CN106484777B (en) Multimedia data processing method and device
US10585583B2 (en) Method, device, and terminal apparatus for text input
US20110299737A1 (en) Vision-based hand movement recognition system and method thereof
US10311295B2 (en) Heuristic finger detection method based on depth image
CN109522564B (en) Voice translation method and device
WO2018121531A1 (en) Method and apparatus for generating test case script
WO2018166343A1 (en) Data fusion method and device, storage medium and electronic device
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
CN108460098B (en) Information recommendation method and device and computer equipment
CN108596079B (en) Gesture recognition method and device and electronic equipment
CN112580324B (en) Text error correction method, device, electronic equipment and storage medium
US20210124976A1 (en) Apparatus and method for calculating similarity of images
GB2575580A (en) Supporting interactive text mining process with natural language dialog
JP6589639B2 (en) Search system, search method and program
US11783129B2 (en) Interactive control system, interactive control method, and computer program product
CN111144104B (en) Text similarity determination method, device and computer readable storage medium
CN112115715A (en) Natural language text processing method and device, storage medium and electronic equipment
JPWO2019106758A1 (en) Language processing apparatus, language processing system, and language processing method
JP6427480B2 (en) IMAGE SEARCH DEVICE, METHOD, AND PROGRAM
KR102433384B1 (en) Apparatus and method for processing texture image
CN110428814B (en) Voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200512

Assignee: Tianyiyun Technology Co.,Ltd.

Assignor: CHINA TELECOM Corp.,Ltd.

Contract record no.: X2024110000020

Denomination of invention: Method, device, and computer-readable storage medium for determining text similarity

Granted publication date: 20230620

License type: Common License

Record date: 20240315

EE01 Entry into force of recordation of patent licensing contract