CN114757299A

CN114757299A - Text similarity judgment method and device and storage medium

Info

Publication number: CN114757299A
Application number: CN202210469090.3A
Authority: CN
Inventors: 蔡娟; 张国超; 张术芬; 李俊杰; 翁冠; 张俊
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-07-15

Abstract

The method extracts paragraph features from the paragraphs of two texts whose similarity is to be judged, and determines a first similarity between paragraphs of the two texts based on the paragraph features. If the first similarity between two paragraphs is greater than a threshold value, the method further extracts keywords of the paragraphs and determines a second similarity of the paragraphs based on the keywords. The text similarity of the two texts is then determined according to the first similarity and the second similarity. This addresses the poor adaptability of existing text similarity judgment and improves its accuracy when, for example, text paragraph positions are exchanged or sentence patterns are converted.

Description

Text similarity judgment method and device and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining text similarity, and a storage medium.

Background

With the rapid development of the internet and the arrival of the big data era, the volume of text data is increasing exponentially and texts are quoted and reused in various forms, which places higher requirements on the identification of similar content and on the accuracy of similarity judgment.

In the related art, text similarity judgment refers to measurement of similarity between two texts, and the text similarity judgment has wide application in multiple fields. For example, in information retrieval, similar words can be identified by using similarity, and the recall rate is improved. Existing text similarity determination typically analyzes similarity using sentences in paragraphs in the text.

However, existing text similarity judgment adapts poorly to changes such as text paragraph position changes and sentence pattern conversion, so the accuracy of text similarity judgment in these cases is low.

Disclosure of Invention

The application provides a text similarity judgment method, a text similarity judgment device and a storage medium, which are used for solving the problems that the existing text similarity judgment is poor in adaptability and low in accuracy of text similarity judgment.

In a first aspect, an embodiment of the present application provides a method for determining text similarity, including:

determining a first text and a second text for text similarity judgment, and extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text;

determining a first similarity between a paragraph i of the first text and a paragraph j of the second text based on paragraph features of the paragraph i and the paragraph j, wherein the paragraph i is any one paragraph in the first text, i = 1, 2, …, m, m is an integer determined according to the total number of paragraphs in the first text, the paragraph j is any one paragraph in the second text, j = 1, 2, …, n, and n is an integer determined according to the total number of paragraphs in the second text;

if the first similarity is larger than a first preset threshold value, extracting keywords of the paragraph i and the paragraph j respectively;

determining a second similarity of the paragraph i and the paragraph j based on the keywords of the paragraph i and the paragraph j;

and determining the text similarity of the first text and the second text according to the first similarity and the second similarity.

In one possible implementation, the determining a first similarity of the paragraph i and the paragraph j based on the paragraph features of the paragraph i of the first text and the paragraph features of the paragraph j of the second text includes:

performing word segmentation processing on the paragraph features of the paragraph i and the paragraph features of the paragraph j respectively to obtain a first cluster and a second cluster;

and calculating the intersection ratio of the first cluster and the second cluster, and determining the first similarity of the paragraph i and the paragraph j based on the intersection ratio.

In a possible implementation manner, before the calculating the intersection ratio of the first cluster and the second cluster, the method further includes:

determining a non-intersecting vocabulary for the paragraph i and the paragraph j according to the first cluster and the second cluster;

the calculating the intersection ratio of the first cluster and the second cluster comprises:

and if the number of the negative words in the non-intersection vocabulary of the paragraph i and the paragraph j is an even number, calculating the intersection ratio of the first cluster and the second cluster.

In a possible implementation manner, before the extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text, the method further includes:

extracting text features of the first text and text features of the second text respectively;

determining a third similarity of the first text and the second text based on the text features of the first text and the text features of the second text;

the extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text comprises:

if the third similarity is larger than a second preset threshold, extracting paragraph features of each paragraph of the first text and paragraph features of each paragraph of the second text.

In a possible implementation manner, the determining the text similarity between the first text and the second text according to the first similarity and the second similarity includes:

and determining the text similarity of the first text and the second text according to the first similarity, the second similarity and the third similarity.

In a possible implementation manner, the determining the text similarity between the first text and the second text according to the first similarity, the second similarity, and the third similarity includes:

according to the first text and the second text, respectively determining a first coefficient corresponding to the first similarity, a second coefficient corresponding to the second similarity and a third coefficient corresponding to the third similarity;

obtaining paragraph similarity of the first text and the second text based on the first similarity, the first coefficient, the second similarity, the second coefficient, the third similarity and the third coefficient;

and determining the text similarity of the first text and the second text according to the paragraph similarity of the first text and the second text.

In one possible implementation manner, the determining the text similarity between the first text and the second text according to the paragraph similarity between the first text and the second text includes:

determining paragraph weights corresponding to the paragraph similarity according to the paragraphs of the first text and the paragraphs of the second text;

and determining the text similarity between the first text and the second text based on the paragraph similarity and the paragraph weight corresponding to the paragraph similarity.

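In a possible implementation manner, before the extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text, the method further includes:

judging whether the type of the first text is a preset text type or not, and judging whether the type of the second text is the preset text type or not;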

if the type of the first text is the preset text type and the type of the second text is the preset text type, extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text.

In a second aspect, an embodiment of the present application provides a text similarity determination apparatus, including:

the first feature extraction module is used for determining a first text and a second text for text similarity judgment and extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text;

a first similarity determining module, configured to determine a first similarity between a paragraph i of the first text and a paragraph j of the second text based on paragraph features of the paragraph i and paragraph features of the paragraph j, where the paragraph i is any one of the paragraphs in the first text, i = 1, 2, …, m, m is an integer determined according to the total number of paragraphs in the first text, the paragraph j is any one of the paragraphs in the second text, j = 1, 2, …, n, and n is an integer determined according to the total number of paragraphs in the second text;

the second feature extraction module is used for respectively extracting the keywords of the paragraph i and the keywords of the paragraph j if the first similarity is greater than a first preset threshold value;

a second similarity determining module, configured to determine a second similarity between the paragraph i and the paragraph j based on the keywords of the paragraph i and the paragraph j;

and the text similarity judging module is used for determining the text similarity between the first text and the second text according to the first similarity and the second similarity.

In a possible implementation manner, the first similarity determining module is specifically configured to:

In a possible implementation manner, the first feature extraction module is specifically configured to:

In a possible implementation manner, the text similarity determination module is specifically configured to:

In a possible implementation manner, the text similarity determining module is specifically configured to:

In a third aspect, an embodiment of the present application provides a text similarity determining apparatus, including:

a processor;

a memory; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method according to the first aspect.

In a fifth aspect, the present application provides a computer program product, which includes computer instructions for executing the method of the first aspect by a processor.

According to the text similarity judging method, the text similarity judging device and the storage medium provided by the application, paragraph features of the paragraphs of two texts for text similarity judgment are extracted, and a first similarity between the paragraphs of the two texts is determined based on the paragraph features. If the first similarity between paragraphs is larger than a threshold value, keywords of the paragraphs are further extracted, and a second similarity of the paragraphs is determined based on the keywords. The text similarity of the two texts is then determined according to the first similarity and the second similarity. This solves the problem that existing text similarity judgment has poor adaptability, and improves text similarity judgment accuracy when text paragraph positions are exchanged or sentence patterns are converted.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art may obtain other drawings from these drawings without inventive labor.

Fig. 1 is a schematic diagram of a text similarity determination system according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a text similarity determination method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of another text similarity determination method according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of another text similarity determination method according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a text similarity determination apparatus according to an embodiment of the present application;

Fig. 6 is a schematic diagram of a basic hardware architecture of a text similarity determination device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In addition, in the technical solution of the application, the collection, storage, use, processing, transmission, provision, disclosure and other handling of related information such as financial data or user data all comply with the relevant laws and regulations and do not violate public order and good morals.

In the related art, text similarity judgment refers to measurement of similarity between two texts, and the text similarity judgment has wide application in multiple fields. In information retrieval, similar words can be identified by using similarity, so that the recall rate is improved; in the automatic question-answering scene, the similarity can be used for calculating the matching degree between a user's natural-language question and the questions in the corpus, and the answer corresponding to the question with the highest matching degree is returned as the response. Existing text similarity judgment generally analyzes similarity using the sentences in each paragraph of the text. However, existing text similarity judgment adapts poorly to changes such as text paragraph position changes and sentence pattern conversion, so the accuracy of text similarity judgment in these cases is low.

In order to solve the above problem, an embodiment of the present application provides a text similarity determination method. The method extracts paragraph features of the paragraphs of two texts and determines a first similarity between the paragraphs based on the paragraph features, then extracts keywords of each paragraph based on the first similarity between the paragraphs and determines a second similarity of each paragraph based on the keywords. The text similarity of the two texts is determined according to the first similarity and the second similarity, which solves the problem of poor adaptability of existing text similarity determination and improves the accuracy of text similarity determination.

Optionally, the text similarity determination method provided by the present application may be applied to the text similarity determination system shown in Fig. 1. As shown in Fig. 1, the system may include a receiving device 101, a processing device 102, and a display device 103.

In a specific implementation process, the receiving device 101 may be an input/output interface, or may be a communication interface, and may be configured to receive a text for performing text similarity determination.

The processing device 102 may obtain a text for text similarity determination through the receiving device 101, further extract paragraph features of paragraphs of the text, determine a first similarity between paragraphs of the text based on the paragraph features, extract keywords of the paragraphs based on the first similarity between the paragraphs, and determine a second similarity of the paragraphs based on the keywords, so as to determine the text similarity between the texts according to the first similarity and the second similarity, thereby improving the accuracy of text similarity determination.

The display device 103 may be configured to display the first similarity, the second similarity, the text similarity, and the like.

The display device may also be a touch display screen for receiving user instructions while displaying the above-mentioned content to enable interaction with a user.

It should be understood that the processing device may be implemented by a processor reading instructions in a memory and executing the instructions, or may be implemented by a chip circuit.

The system described above is only exemplary; in a specific implementation, it can be configured according to application requirements.

It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation to the text similarity determination system architecture. In other possible embodiments of the present application, the architecture may include more or fewer components than those shown in the drawings, or combine some components, or split some components, or arrange different components, which may be determined according to an actual application scenario and is not limited herein. The components shown in fig. 1 may be implemented in hardware, software, or a combination of software and hardware.

In addition, the system architecture described in the embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not limit the technical solution provided in the embodiment of the present application. A person skilled in the art will appreciate that, as the system architecture evolves and new service scenarios appear, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.

The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 2 is a schematic flowchart of a text similarity determination method according to an embodiment of the present application, where an execution subject in this embodiment may be the processing device in fig. 1, and a specific execution subject may be determined according to an actual application scenario, which is not limited in this embodiment of the present application. As shown in fig. 2, the text similarity determination method provided in the embodiment of the present application may include the following steps:

S201: determining a first text and a second text for text similarity judgment, and extracting paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text.

Here, after determining the first text and the second text for text similarity determination, the processing device may determine whether the type of the first text is a preset text type, and determine whether the type of the second text is the preset text type, and if the type of the first text is the preset text type and the type of the second text is the preset text type, extract paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text.

The preset text type can be determined according to actual conditions, for example, a txt file. In this embodiment of the application, the processing device first determines the type of a text for text similarity determination. If the type of the text is the preset text type, the processing device directly performs the subsequent steps; if the type of the text is not the preset text type, the processing device converts the text into the preset text type and then performs the subsequent steps, so as to facilitate subsequent text processing.

Optionally, when the processing device extracts the paragraph features of each paragraph of the first text and the paragraph features of each paragraph of the second text, the processing device may extract the paragraph features by using a pre-trained paragraph feature extraction model. Here, the paragraph feature extraction model is used to extract paragraph features of each paragraph of text. The paragraph features can be understood as paragraph abstract, paragraph main content, etc.

S202: determining a first similarity between a paragraph i of the first text and a paragraph j of the second text based on paragraph features of the paragraph i and paragraph features of the paragraph j, wherein the paragraph i is any one of the paragraphs in the first text, i = 1, 2, …, m, m is an integer determined according to the total number of paragraphs in the first text, the paragraph j is any one of the paragraphs in the second text, j = 1, 2, …, n, and n is an integer determined according to the total number of paragraphs in the second text.

For example, the processing device may perform word segmentation processing on the paragraph feature of the paragraph i and the paragraph feature of the paragraph j respectively to obtain a first cluster and a second cluster, further calculate an intersection ratio of the first cluster and the second cluster, and determine a first similarity between the paragraph i and the paragraph j based on the intersection ratio.

For example, the first cluster and the second cluster are denoted as $E_1$ and $E_2$, respectively. When calculating the intersection ratio of the first cluster and the second cluster, the processing device may calculate the intersection of $E_1$ and $E_2$, denoted as $E_\cap$, calculate the union of $E_1$ and $E_2$, denoted as $E_\cup$, and then calculate the intersection ratio (intersection over union) of the two clusters as $sim = |E_\cap| / |E_\cup|$, where $|\cdot|$ denotes the number of words in a set.

Let $E_1 = \{A_1, A_2, A_3, \ldots\}$ and $E_2 = \{B_1, B_2, B_3, \ldots\}$, where the total number of words in $E_1$ is $\mathrm{Count}_A$ and the total number of words in $E_2$ is $\mathrm{Count}_B$. To calculate the intersection of $E_1$ and $E_2$, each word that exists in both sets at the same time is added to the intersection $E_\cap$, and the number of words in the intersection is increased by 1. The union $E_\cup$ consists of the words left after de-duplicating the two sets $E_1$ and $E_2$. The intersection ratio is the ratio of the number of words in the intersection to the number of words in the union.

For example, if

$E_1$ = { "friend", "Xiaoming", "Xiaohong", "become" } and

$E_2$ = { "friend", "Xiaoming", "Xiaohua", "also" }, then

$E_\cup$ = { "Xiaoming", "Xiaohong", "Xiaohua", "become", "also", "friend" }, which contains 6 words, and $E_\cap$ = { "Xiaoming", "friend" }, which contains 2 words, so the intersection ratio of the two clusters is 2/6 = 1/3.

After the intersection ratio is calculated, the processing device may use the intersection ratio as the first similarity between paragraph i and paragraph j.
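The intersection ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function name `intersection_ratio` is hypothetical, the clusters are assumed to be already segmented into words, and returning 0.0 for two empty clusters is an assumption the text does not address.

```python
def intersection_ratio(first_cluster, second_cluster):
    """Return |E1 ∩ E2| / |E1 ∪ E2| for two collections of words."""
    e1, e2 = set(first_cluster), set(second_cluster)
    union = e1 | e2
    if not union:
        # Assumption: two empty clusters are treated as not similar.
        return 0.0
    return len(e1 & e2) / len(union)


# Clusters mirroring the worked example: 2 shared words, 6 distinct words.
E1 = ["friend", "Xiaoming", "Xiaohong", "become"]
E2 = ["friend", "Xiaoming", "Xiaohua", "also"]

# The ratio is used directly as the first similarity of the two paragraphs.
first_similarity = intersection_ratio(E1, E2)
print(first_similarity)
```

Because the clusters are converted to sets, repeated words within one paragraph count once, which matches the de-duplicated union described in the text.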

Before calculating the intersection ratio of the first cluster and the second cluster, the processing device further determines the non-intersecting vocabulary of the paragraph i and the paragraph j according to the first cluster and the second cluster, and calculates the intersection ratio of the first cluster and the second cluster if the number of the negative vocabularies in the non-intersecting vocabulary of the paragraph i and the paragraph j is an even number.

Here, the processing device takes into account that negative words may occur in the non-intersecting vocabulary between the clusters. If the total number of negative words in the non-intersecting vocabulary of the paragraph i and the paragraph j is an even number, the paragraph i and the paragraph j are highly likely to be similar. In this case, the processing device further calculates the intersection ratio of the first cluster and the second cluster and determines the first similarity between the paragraph i and the paragraph j based on the intersection ratio, so that subsequent text similarity determination is performed according to the first similarity.
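The negative-word check can be sketched as a gate in front of the ratio calculation. The word list `NEGATIVE_WORDS` and the function names below are hypothetical, since the text does not enumerate the negative words. The text also describes only the even case, so returning `None` for an odd count is an assumption made for this sketch.

```python
# Hypothetical list of negative words; the source does not specify one.
NEGATIVE_WORDS = {"not", "no", "never", "without"}


def non_intersecting_words(first_cluster, second_cluster):
    """Words that appear in exactly one of the two clusters."""
    return set(first_cluster) ^ set(second_cluster)


def gated_intersection_ratio(first_cluster, second_cluster,
                             negative_words=NEGATIVE_WORDS):
    """Compute the intersection ratio only if the number of negative
    words in the non-intersecting vocabulary is even (zero included)."""
    e1, e2 = set(first_cluster), set(second_cluster)
    negatives = [w for w in non_intersecting_words(e1, e2)
                 if w in negative_words]
    if len(negatives) % 2 != 0:
        # Assumption: with an odd count the ratio is not computed.
        return None
    union = e1 | e2
    return len(e1 & e2) / len(union) if union else 0.0


# One unmatched negative word: odd count, so the ratio is skipped.
print(gated_intersection_ratio(["he", "is", "happy"],
                               ["he", "is", "not", "happy"]))
# Two unmatched negative words: even count, so the ratio is computed.
print(gated_intersection_ratio(["he", "is", "not", "sad"],
                               ["he", "is", "never", "sad"]))
```

A negative word that appears in both clusters is part of the intersection, so it does not affect the count.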

S203: and if the first similarity is greater than a first preset threshold value, extracting keywords of the paragraph i and the paragraph j respectively.

Here, if the first similarity is greater than a first preset threshold, it indicates that there is a high possibility that the paragraph i and the paragraph j are similar, and in order to further improve the accuracy of determining the similarity of the subsequent text, the processing device further extracts the keywords of the paragraph i and the paragraph j when the first similarity is greater than the first preset threshold. The first preset threshold may be determined according to actual conditions, such as 1/2.

Optionally, when extracting the keywords of the paragraphs i and j, the processing apparatus may extract the keywords by using a keyword extraction model trained in advance. Here, the keyword extraction model is used to extract keywords of each paragraph of text. The keyword may be determined according to actual conditions, for example, a word with a high association degree with the paragraph content.

S204: determining a second similarity of paragraph i and paragraph j based on the keywords of paragraph i and paragraph j.

For example, the processing device may perform word segmentation processing on the keywords of the paragraph i and the keywords of the paragraph j respectively to obtain a third cluster and a fourth cluster, further calculate an intersection ratio of the third cluster and the fourth cluster, and determine a second similarity between the paragraph i and the paragraph j based on the intersection ratio. For example, the processing device uses the intersection ratio as the second similarity between paragraph i and paragraph j.

Before calculating the intersection ratio of the third cluster and the fourth cluster, the processing device may further determine, according to the third cluster and the fourth cluster, the keyword non-intersecting vocabulary of the paragraph i and the paragraph j, and if the number of negative words in the keyword non-intersecting vocabulary of the paragraph i and the paragraph j is an even number, calculate the intersection ratio of the third cluster and the fourth cluster.

Here, the processing device takes into account that negative words may occur in the non-intersecting vocabulary between the clusters. If the total number of negative words in the non-intersecting vocabulary of the paragraph i and the paragraph j is an even number, the paragraph i and the paragraph j are highly likely to be similar. In this case, the processing device further calculates the intersection ratio of the third cluster and the fourth cluster and determines the second similarity of the paragraph i and the paragraph j based on the intersection ratio, so that subsequent text similarity determination is performed according to the second similarity.

S205: and determining the text similarity between the first text and the second text according to the first similarity and the second similarity.

In this embodiment, the processing device may determine, according to the first text and the second text, a first coefficient corresponding to the first similarity and a second coefficient corresponding to the second similarity, and further obtain a paragraph similarity between the first text and the second text based on the first similarity, the first coefficient, the second similarity, and the second coefficient, so as to determine, according to the paragraph similarity, a text similarity between the first text and the second text. For example, the processing device multiplies the first similarity, the first coefficient, the second similarity, and the second coefficient, and obtains the paragraph similarity between the first text and the second text based on the multiplication result.

The processing device may determine a paragraph weight corresponding to the paragraph similarity according to each paragraph of the first text and each paragraph of the second text, and then determine the text similarity between the first text and the second text based on the paragraph similarity and the paragraph weight. For example, the processing device multiplies the paragraph similarity by the paragraph weight, and determines the text similarity between the first text and the second text based on the multiplication result.
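The combination step can be sketched as follows. The text says only that the similarities are multiplied by their coefficients and that paragraph similarities are multiplied by paragraph weights. This sketch assumes one possible reading, in which each product is formed separately and the products are summed. The function names and all numeric values are hypothetical.

```python
def paragraph_similarity(first_sim, second_sim, first_coef, second_coef):
    # Assumption: each similarity is scaled by its own coefficient and
    # the two products are added; the source does not fix the formula.
    return first_sim * first_coef + second_sim * second_coef


def text_similarity(paragraph_sims, paragraph_weights):
    # Assumption: a weighted sum over the compared paragraph pairs.
    if len(paragraph_sims) != len(paragraph_weights):
        raise ValueError("one weight is needed per paragraph similarity")
    return sum(s * w for s, w in zip(paragraph_sims, paragraph_weights))


# Illustrative values only.
p1 = paragraph_similarity(first_sim=0.6, second_sim=0.5,
                          first_coef=0.4, second_coef=0.6)
p2 = paragraph_similarity(first_sim=0.8, second_sim=1.0,
                          first_coef=0.4, second_coef=0.6)
overall = text_similarity([p1, p2], [0.5, 0.5])
print(p1, p2, overall)
```

With coefficients and weights that each sum to 1, the result stays in the range 0 to 1, which keeps it comparable to the individual intersection ratios.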

In this embodiment of the application, the paragraph features of the paragraphs of the two texts for text similarity judgment are extracted, and the first similarity between the paragraphs of the two texts is determined based on the paragraph features. If the first similarity between paragraphs is larger than a threshold value, the keywords of the paragraphs are further extracted, and the second similarity of the paragraphs is determined based on the keywords. The text similarity of the two texts is then determined according to the first similarity and the second similarity, which solves the problem of poor adaptability of existing text similarity judgment and improves text similarity judgment accuracy when text paragraph positions are exchanged or sentence patterns are converted.
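Steps S201 to S204 can be sketched end to end as a pairwise comparison of paragraphs. The text relies on pre-trained extraction models that are not specified, so `toy_features` and `toy_keywords` below are hypothetical stand-ins that simply split on whitespace and drop a few stop words. The function names are illustrative, and the threshold of 1/2 follows the example value given for the first preset threshold.

```python
def iou(a, b):
    """Intersection ratio of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_paragraphs(first_text, second_text, extract_features,
                       extract_keywords, first_threshold=0.5):
    """Map (i, j) to (first_similarity, second_similarity) for every
    paragraph pair whose first similarity exceeds the threshold.
    Indices i and j are 1-based, as in the text."""
    results = {}
    for i, p in enumerate(first_text, start=1):
        for j, q in enumerate(second_text, start=1):
            first_sim = iou(extract_features(p), extract_features(q))
            if first_sim > first_threshold:
                # Keywords are extracted only for promising pairs.
                second_sim = iou(extract_keywords(p), extract_keywords(q))
                results[(i, j)] = (first_sim, second_sim)
    return results


# Hypothetical stand-ins for the pre-trained extraction models.
STOP_WORDS = {"the", "on", "a", "in"}


def toy_features(paragraph):
    return paragraph.lower().split()


def toy_keywords(paragraph):
    return [w for w in paragraph.lower().split() if w not in STOP_WORDS]


# The second text contains the same material in swapped paragraph order.
text_a = ["the cat sat on the mat", "rates rose sharply in march"]
text_b = ["rates rose sharply in april", "the cat sat on a mat"]
matches = compare_paragraphs(text_a, text_b, toy_features, toy_keywords)
print(matches)
```

Because every paragraph i is compared with every paragraph j, the swapped paragraphs are still matched, which is the paragraph-position case the method is meant to handle.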

In addition, before extracting the paragraph features of each paragraph of the first text and the paragraph features of each paragraph of the second text, the processing device may further extract the text features of the first text and the text features of the second text respectively, determine a third similarity between the first text and the second text based on the text features, and perform the step of extracting the paragraph features of each paragraph of the first text and the paragraph features of each paragraph of the second text when the third similarity is greater than a second preset threshold, so as to further improve the accuracy of the subsequent text similarity judgment. When subsequently determining the text similarity between the first text and the second text, the processing device may consider the third similarity in addition to the first similarity and the second similarity, that is, determine the text similarity between the first text and the second text according to the first similarity, the second similarity and the third similarity, thereby improving the accuracy of text similarity determination. Fig. 3 is a schematic flowchart of another text similarity determination method provided in an embodiment of the present application. As shown in Fig. 3, the method may include:

S301: determining a first text and a second text for text similarity judgment, and respectively extracting text features of the first text and the second text.

Here, the processing device may extract the text features of the first text and the text features of the second text by using a pre-trained text feature extraction model. The text feature extraction model is used for extracting the text features of a text. The text features can be understood as the text abstract, the main content of the text, and the like.

S302: determining a third similarity of the first text and the second text based on the text features of the first text and the text features of the second text.

For example, the processing device may perform word segmentation on the text features of the first text and the text features of the second text respectively to obtain a fifth cluster and a sixth cluster, further calculate the intersection ratio (intersection over union) of the fifth cluster and the sixth cluster, and determine a third similarity between the first text and the second text based on the intersection ratio.

Before calculating the intersection ratio of the fifth cluster and the sixth cluster, the processing device may further determine the full-text non-intersecting words of the first text and the second text according to the fifth cluster and the sixth cluster, and calculate the intersection ratio of the fifth cluster and the sixth cluster if the number of negative words in the full-text non-intersecting words is an even number.

Here, the processing device takes into account that negative words may occur among the non-intersecting words of the two clusters. If the number of negative words in the full-text non-intersecting words of the first text and the second text is an even number, the first text is highly likely to be similar to the second text. In this case, the processing device calculates the intersection ratio of the fifth cluster and the sixth cluster and determines the third similarity of the first text and the second text based on the intersection ratio, so that the subsequent text similarity determination can be performed according to the third similarity.
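As a non-limiting illustration, the negative-word parity check and the intersection ratio described above can be sketched as follows. The function names, the `NEGATION_WORDS` lexicon and the choice to return 0.0 when the number of negative words is odd are assumptions made for this sketch; the description only states that the intersection ratio is calculated when the number is even.

```python
from typing import Iterable, Set

# Hypothetical negation lexicon; the description does not specify one.
NEGATION_WORDS: Set[str] = {"not", "no", "never", "without"}


def non_intersecting_words(a: Set[str], b: Set[str]) -> Set[str]:
    """Words that appear in exactly one of the two clusters."""
    return a ^ b


def intersection_ratio(a: Set[str], b: Set[str]) -> float:
    """Intersection over union of two word clusters (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gated_similarity(words_a: Iterable[str], words_b: Iterable[str],
                     negations: Set[str] = NEGATION_WORDS) -> float:
    """Intersection ratio, computed only when the negation count is even."""
    a, b = set(words_a), set(words_b)
    # Clusters are sets, so this counts distinct negative words.
    negation_count = len(non_intersecting_words(a, b) & negations)
    if negation_count % 2 != 0:
        # Assumption: an odd count is treated as "not similar".
        return 0.0
    return intersection_ratio(a, b)
```

With this sketch, the clusters for "loan is approved" and "loan is not approved" differ by one negative word and are therefore not scored as similar, even though most of their words overlap.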

S303: if the third similarity is greater than a second preset threshold, paragraph features of each paragraph of the first text and paragraph features of each paragraph of the second text are extracted.

Here, a third similarity greater than the second preset threshold indicates that the first text and the second text are very similar as a whole. In this case, in order to further improve the accuracy of the subsequent text similarity judgment, the processing device goes on to extract the paragraph features of each paragraph of the first text and the paragraph features of each paragraph of the second text.

S304: determining a first similarity between a paragraph i of the first text and a paragraph j of the second text based on paragraph features of the paragraph i and paragraph features of the paragraph j, wherein the paragraph i is any one of the paragraphs in the first text, i = 1, 2, …, m, m is an integer determined according to the total number of paragraphs in the first text, the paragraph j is any one of the paragraphs in the second text, j = 1, 2, …, n, and n is an integer determined according to the total number of paragraphs in the second text.

S305: if the first similarity is greater than a first preset threshold value, extracting keywords of the paragraph i and the paragraph j respectively.

S306: determining a second similarity of paragraph i and paragraph j based on the keywords of paragraph i and paragraph j.

The implementation of steps S304-S306 is the same as the implementation of steps S202-S204, and is not described herein again.
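As a non-limiting illustration, steps S304 to S306 can be sketched as a loop over all paragraph pairs. The names `compare_paragraphs`, `extract_features` and `extract_keywords` are hypothetical; the two extractor callables stand in for the pre-trained extraction models, whose interfaces are not given in the description, and the negative-word check is omitted here for brevity.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical extractor type: maps a paragraph to a word cluster.
Extractor = Callable[[str], Set[str]]


def intersection_ratio(a: Set[str], b: Set[str]) -> float:
    """Intersection over union of two word clusters (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_paragraphs(
    paragraphs_1: List[str],
    paragraphs_2: List[str],
    extract_features: Extractor,
    extract_keywords: Extractor,
    first_threshold: float,
) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Return {(i, j): (first_similarity, second_similarity)}.

    Only pairs whose first similarity exceeds the threshold are kept,
    and keywords are extracted only for those pairs.
    """
    results: Dict[Tuple[int, int], Tuple[float, float]] = {}
    for i, p_i in enumerate(paragraphs_1, start=1):
        for j, p_j in enumerate(paragraphs_2, start=1):
            first = intersection_ratio(extract_features(p_i), extract_features(p_j))
            if first > first_threshold:
                second = intersection_ratio(extract_keywords(p_i), extract_keywords(p_j))
                results[(i, j)] = (first, second)
    return results
```

Because every paragraph i is compared with every paragraph j, a paragraph that has merely moved to a different position in the second text is still matched with its counterpart.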

S307: determining the text similarity of the first text and the second text according to the first similarity, the second similarity and the third similarity.

For example, the processing device may determine a first coefficient corresponding to the first similarity, a second coefficient corresponding to the second similarity, and a third coefficient corresponding to the third similarity according to the first text and the second text, respectively, further obtain a paragraph similarity between the first text and the second text based on the first similarity, the first coefficient, the second similarity, the second coefficient, the third similarity, and the third coefficient, and determine the text similarity between the first text and the second text according to the paragraph similarity. For example, the processing device multiplies the first similarity, the first coefficient, the second similarity, the second coefficient, the third similarity, and the third coefficient, and obtains the paragraph similarity between the first text and the second text based on the multiplication result.

Here, the first coefficient may be determined according to the number of the negative words in the non-intersecting words corresponding to the paragraph, the second coefficient may be determined according to the number of the negative words in the non-intersecting words corresponding to the paragraph keywords, and the third coefficient may be determined according to the number of the negative words in the full-text non-intersecting words.

For example, the paragraph similarity between the first text and the second text may be determined according to the expression:

$$S_{\text{paragraph}} = S_{\text{third}} \cdot \left(2 - 2^{\,N_{\text{full}} \bmod 2}\right) \cdot S_{\text{first}} \cdot \left(2 - 2^{\,N_{\text{para}} \bmod 2}\right) \cdot S_{\text{second}} \cdot \left(2 - 2^{\,N_{\text{kw}} \bmod 2}\right)$$

wherein $S_{\text{paragraph}}$ represents the above paragraph similarity; $S_{\text{first}}$ represents the first similarity, and $2 - 2^{\,N_{\text{para}} \bmod 2}$ represents the first coefficient, $N_{\text{para}}$ being the number of negative words in the paragraph non-intersecting words; $S_{\text{second}}$ represents the second similarity, and $2 - 2^{\,N_{\text{kw}} \bmod 2}$ represents the second coefficient, $N_{\text{kw}}$ being the number of negative words in the non-intersecting words of the paragraph keywords; $S_{\text{third}}$ represents the third similarity, and $2 - 2^{\,N_{\text{full}} \bmod 2}$ represents the third coefficient, $N_{\text{full}}$ being the number of negative words in the full-text non-intersecting words.
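As a non-limiting illustration, the expression above can be evaluated as follows. Each coefficient has the form 2 - 2^(count % 2), which equals 1 when the number of negative words is even and 0 when it is odd, so any odd count drives the paragraph similarity to zero. The function and parameter names are hypothetical.

```python
def parity_coefficient(negation_count: int) -> int:
    """2 - 2**(n % 2): 1 when n is even, 0 when n is odd."""
    return 2 - 2 ** (negation_count % 2)


def paragraph_similarity(
    first_similarity: float,
    second_similarity: float,
    third_similarity: float,
    paragraph_negations: int,   # negative words in paragraph non-intersecting words
    keyword_negations: int,     # negative words in keyword non-intersecting words
    full_text_negations: int,   # negative words in full-text non-intersecting words
) -> float:
    """Product of the three similarities and their parity coefficients."""
    return (
        third_similarity * parity_coefficient(full_text_negations)
        * first_similarity * parity_coefficient(paragraph_negations)
        * second_similarity * parity_coefficient(keyword_negations)
    )
```

For instance, similarities of 0.8, 0.5 and 0.9 with even negative-word counts give a product of about 0.36, while a single odd count at any level gives 0.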

Optionally, the processing device may determine a paragraph weight corresponding to the paragraph similarity according to each paragraph of the first text and each paragraph of the second text, so that the text similarity between the first text and the second text is determined based on the paragraph similarity and the paragraph weight. For example, the processing device multiplies the paragraph similarity by the paragraph weight, and determines the text similarity between the first text and the second text based on the multiplication result.
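As a non-limiting illustration, the weighting step can be sketched as below. The description does not specify how the paragraph weight is chosen or how the products are combined; weighting each paragraph of the first text by its share of the word count, keeping the best-matching paragraph j for each paragraph i, and summing the products are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def length_weights(paragraphs: List[str]) -> List[float]:
    """Assumed weighting: each paragraph's share of the total word count."""
    lengths = [len(p.split()) for p in paragraphs]
    total = sum(lengths)
    return [n / total if total else 0.0 for n in lengths]


def text_similarity(
    pair_similarity: Dict[Tuple[int, int], float],
    weights: List[float],
) -> float:
    """Sum of weight * paragraph similarity over paragraphs of the first text.

    pair_similarity maps 1-based (i, j) to the paragraph similarity;
    for each i, only the best-matching j is used (an assumption).
    """
    best: Dict[int, float] = {}
    for (i, _j), similarity in pair_similarity.items():
        best[i] = max(best.get(i, 0.0), similarity)
    return sum(weights[i - 1] * s for i, s in best.items())
```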

In the embodiment of the present application, before the paragraph features of each paragraph of the first text and of the second text are extracted, the text features of the first text and the text features of the second text are extracted, a third similarity between the first text and the second text is determined based on the text features, and the paragraph feature extraction step is performed only when the third similarity is greater than a second preset threshold, so as to further improve the accuracy of the subsequent text similarity determination. In addition, when the text similarity between the first text and the second text is subsequently determined, the third similarity may be considered in addition to the first similarity and the second similarity, that is, the text similarity between the first text and the second text is determined according to the first similarity, the second similarity, and the third similarity. This also improves the accuracy of text similarity determination and solves the problem of poor adaptability of the existing text similarity determination.

Fig. 4 shows another schematic diagram of determining text similarity; content that is the same as or similar to that of the embodiments of fig. 2 and fig. 3 is described above and is not repeated here. As shown in fig. 4, after determining the first text and the second text for text similarity determination, the processing device first determines whether the type of the first text and the type of the second text are each a preset text type. If both are of the preset text type, the subsequent steps are executed directly; otherwise, the processing device performs text preprocessing to convert any text whose type does not meet the requirement into the preset text type. Then, the processing device may extract text features of the first text and text features of the second text, respectively, determine a third similarity between the first text and the second text based on the text features of the first text and the text features of the second text, and further extract paragraph features of the paragraphs of the first text and paragraph features of the paragraphs of the second text if the third similarity is greater than a second preset threshold. Further, the processing device determines a first similarity between a paragraph i and a paragraph j based on a paragraph feature of the paragraph i in the first text and a paragraph feature of the paragraph j in the second text, where the paragraph i is any one of the paragraphs in the first text, i = 1, 2, …, m, m is an integer determined according to the total number of paragraphs in the first text, the paragraph j is any one of the paragraphs in the second text, j = 1, 2, …, n, and n is an integer determined according to the total number of paragraphs in the second text.
If the first similarity is greater than the first preset threshold, the processing device may further extract keywords of the paragraph i and the paragraph j, determine a second similarity between the paragraph i and the paragraph j based on the keywords of the paragraph i and the paragraph j, and determine a text similarity between the first text and the second text according to the first similarity, the second similarity, and the third similarity.
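As a non-limiting illustration, the overall flow of fig. 4 can be sketched as a type check followed by a gated two-stage comparison. All names are hypothetical. Treating `str` as the preset text type, decoding `bytes` as the preprocessing step, and returning `None` when the third similarity does not exceed the second preset threshold are assumptions; the description does not state what happens in that case.

```python
from typing import Callable, Optional, Union


def to_preset_type(text: Union[str, bytes]) -> str:
    """Assumed preprocessing: convert the input into the preset type (str)."""
    if isinstance(text, str):
        return text  # already the preset type, no preprocessing needed
    if isinstance(text, bytes):
        return text.decode("utf-8")
    raise TypeError(f"unsupported text type: {type(text).__name__}")


def judge_texts(
    text_1: Union[str, bytes],
    text_2: Union[str, bytes],
    third_similarity_fn: Callable[[str, str], float],
    paragraph_stage_fn: Callable[[str, str, float], float],
    second_threshold: float,
) -> Optional[float]:
    """Run the full-text stage, then the paragraph stage only if it passes."""
    t1, t2 = to_preset_type(text_1), to_preset_type(text_2)
    third = third_similarity_fn(t1, t2)
    if third <= second_threshold:
        # Assumption: stop early without a paragraph-level result.
        return None
    # The paragraph stage combines first, second and third similarity.
    return paragraph_stage_fn(t1, t2, third)
```

Passing the two stages in as callables keeps the sketch independent of any particular feature extraction model.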

Taking the determination of the first similarity as an example, when determining the first similarity of the paragraph i and the paragraph j, the processing device may perform word segmentation processing on the paragraph feature of the paragraph i and the paragraph feature of the paragraph j respectively to obtain a first cluster and a second cluster, further calculate an intersection ratio of the first cluster and the second cluster, and determine the first similarity of the paragraph i and the paragraph j based on the intersection ratio.

Here, before calculating the intersection ratio of the first cluster and the second cluster, the processing apparatus may further determine, according to the first cluster and the second cluster, a non-intersecting vocabulary of the paragraph i and the paragraph j, and calculate the intersection ratio of the first cluster and the second cluster if the number of negative vocabularies in the non-intersecting vocabulary of the paragraph i and the paragraph j is an even number.

In summary, compared with the prior art, the embodiment of the present application extracts the text features of the first text and the text features of the second text, determines the third similarity between the first text and the second text based on the text features, extracts the paragraph features of each paragraph of the two texts based on the third similarity, determines the first similarity between the paragraphs of the two texts, extracts the keywords of the paragraphs based on the first similarity, and determines the second similarity of the paragraphs based on the keywords. The text similarity of the two texts is thereby determined according to the first similarity, the second similarity and the third similarity. This solves the problem of poor adaptability of the existing text similarity judgment and improves the accuracy of text similarity judgment in cases such as exchanged text paragraph positions, sentence pattern conversion, and negative words appearing in the non-intersecting words.

Fig. 5 is a schematic structural diagram of a text similarity determination apparatus according to an embodiment of the present application, which corresponds to the text similarity determination method according to the foregoing embodiment. For convenience of explanation, only portions related to the embodiments of the present application are shown. The text similarity determination apparatus 50 includes: a first feature extraction module 501, a first similarity determination module 502, a second feature extraction module 503, a second similarity determination module 504, and a text similarity determination module 505. The text similarity determination apparatus may be the processing device itself, or a chip or an integrated circuit that realizes the functions of the processing device. It should be noted here that the division into the first feature extraction module, the first similarity determination module, the second feature extraction module, the second similarity determination module, and the text similarity determination module is only a division of logical functions, and the modules may be physically integrated or independent.

The first feature extraction module 501 is configured to determine a first text and a second text for text similarity determination, and extract paragraph features of paragraphs of the first text and paragraph features of paragraphs of the second text.

A first similarity determining module 502, configured to determine a first similarity between a paragraph i in the first text and a paragraph j in the second text based on a paragraph feature of the paragraph i and a paragraph feature of the paragraph j, where the paragraph i is any one paragraph in the first text, i = 1, 2, …, m, m is an integer determined according to a total number of paragraphs in the first text, the paragraph j is any one paragraph in the second text, j = 1, 2, …, n, and n is an integer determined according to a total number of paragraphs in the second text.

A second feature extraction module 503, configured to extract the keywords of the paragraph i and the paragraph j respectively if the first similarity is greater than a first preset threshold.

A second similarity determining module 504, configured to determine a second similarity between the paragraph i and the paragraph j based on the keywords of the paragraph i and the paragraph j.

A text similarity determining module 505, configured to determine a text similarity between the first text and the second text according to the first similarity and the second similarity.

In a possible implementation manner, the first similarity determining module 502 is specifically configured to:

In a possible implementation manner, the first feature extraction module 501 is specifically configured to:

In a possible implementation manner, the text similarity determining module 505 is specifically configured to:

The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.

Optionally, fig. 6 schematically shows a possible basic hardware architecture of the text similarity determination device according to the present application.

Referring to fig. 6, the text similarity determination apparatus includes at least one processor 601 and a communication interface 603. Further optionally, a memory 602 and a bus 604 may also be included.

In the text similarity determination device, the number of the processors 601 may be one or more, and fig. 6 only illustrates one of the processors 601. Alternatively, the processor 601 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Digital Signal Processor (DSP). If the text similarity determination apparatus has a plurality of processors 601, the types of the plurality of processors 601 may be different, or may be the same. Alternatively, the plurality of processors 601 of the text similarity determination device may also be integrated into a multi-core processor.

Memory 602 stores computer instructions and data; the memory 602 may store computer instructions and data required to implement the text similarity determination method provided herein, for example, the memory 602 stores instructions for implementing the steps of the text similarity determination method. The memory 602 may be any one or any combination of the following storage media: nonvolatile Memory (e.g., Read-Only Memory (ROM), Solid State Disk (SSD), Hard Disk Drive (HDD), optical disc), and volatile Memory.

The communication interface 603 may provide information input/output for the at least one processor. The communication interface 603 may also include any one or any combination of the following devices having a network access function: a network interface (e.g., an ethernet interface), a wireless network card, and the like.

Optionally, the communication interface 603 may also be used for data communication between the text similarity determination device and other computing devices or terminals.

Further alternatively, fig. 6 shows the bus 604 as a thick line. The bus 604 may connect the processor 601 with the memory 602 and the communication interface 603. Thus, via bus 604, processor 601 may access memory 602 and may also interact with other computing devices or terminals using communication interface 603.

In the present application, the text similarity determination device executes the computer instructions in the memory 602, so that the text similarity determination device implements the text similarity determination method provided in the present application, or so that the text similarity determination device deploys the text similarity determination apparatus described above.

In terms of logical function division, as shown in fig. 6, for example, the memory 602 may include a first feature extraction module 501, a first similarity determination module 502, a second feature extraction module 503, a second similarity determination module 504, and a text similarity determination module 505. Here, "include" only means that the instructions stored in the memory, when executed, may implement the functions of the first feature extraction module, the first similarity determination module, the second feature extraction module, the second similarity determination module, and the text similarity determination module, respectively, and does not limit the physical structure.

The present application provides a computer-readable storage medium storing computer instructions, the computer instructions instructing a computing device to execute the text similarity determination method provided in the present application.

The present application provides a computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, perform the text similarity determination method provided in the present application.

The present application provides a chip comprising at least one processor and a communication interface providing information input and/or output for the at least one processor. Further, the chip may also include at least one memory for storing computer instructions. The at least one processor is used for calling and running the computer instructions to execute the text similarity judging method provided by the application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

Claims

1. A text similarity judging method is characterized by comprising the following steps:

2. The method of claim 1, wherein determining a first similarity of paragraph i to paragraph j of the first text based on paragraph features of paragraph i and paragraph features of paragraph j of the second text comprises:

3. The method of claim 2, further comprising, prior to said computing the intersection ratio of the first cluster and the second cluster:

the calculating the intersection ratio of the first cluster and the second cluster comprises:

4. The method according to any one of claims 1 to 3, wherein before the extracting paragraph features of each paragraph of the first text and paragraph features of each paragraph of the second text, further comprising:

5. The method of claim 4, wherein determining the text similarity between the first text and the second text according to the first similarity and the second similarity comprises:

6. The method of claim 5, wherein determining the text similarity between the first text and the second text according to the first similarity, the second similarity, and the third similarity comprises:

7. The method of claim 6, wherein determining the text similarity between the first text and the second text according to the paragraph similarity between the first text and the second text comprises:

8. The method according to any one of claims 1 to 3, wherein before the extracting paragraph features of each paragraph of the first text and paragraph features of each paragraph of the second text, further comprising:

9. A text similarity determination device, comprising:

a first similarity determination module, configured to determine a first similarity between a paragraph i in the first text and a paragraph j in the second text based on a paragraph feature of the paragraph i and a paragraph feature of the paragraph j, where the paragraph i is any one paragraph in the first text, i = 1, 2, …, m, m is an integer determined according to a total number of paragraphs in the first text, the paragraph j is any one paragraph in the second text, j = 1, 2, …, n, and n is an integer determined according to a total number of paragraphs in the second text;

the second feature extraction module is used for respectively extracting the keywords of the paragraph i and the paragraph j if the first similarity is greater than a first preset threshold value;

10. The apparatus of claim 9, wherein the first similarity determination module is specifically configured to:

11. The apparatus of claim 10, wherein the first similarity determination module is specifically configured to:

12. The apparatus according to any one of claims 9 to 11, wherein the first feature extraction module is specifically configured to:

13. A text similarity determination device characterized by comprising:

a processor;

a memory; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-8.

14. A computer-readable storage medium, characterized in that it stores a computer program that causes a server to execute the method of any one of claims 1-8.

15. A computer program product comprising computer instructions for executing the method of any one of claims 1 to 8 by a processor.