CN111737997A

CN111737997A - Text similarity determination method, text similarity determination equipment and storage medium

Info

Publication number: CN111737997A
Application number: CN202010559737.2A
Authority: CN
Inventors: 刘桂鹏; 陈运文; 桂洪冠; 谭新; 纪达麒; 连明杰
Original assignee: Datagrand Tech Inc
Current assignee: Datagrand Tech Inc
Priority date: 2020-06-18
Filing date: 2020-06-18
Publication date: 2020-10-02

Abstract

The embodiment of the invention discloses a text similarity determination method, text similarity determination equipment and a storage medium, wherein the method comprises the following steps: performing text preprocessing on two preset texts to be subjected to text similarity comparison; respectively extracting keywords of the two preprocessed texts; expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors; based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix; extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity; and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity. According to the technical scheme of the embodiment of the invention, the accuracy and the recall rate of the text similarity calculation method are improved.

Description

Text similarity determination method, text similarity determination equipment and storage medium

Technical Field

The embodiment of the invention relates to a computer text information processing technology, in particular to a text similarity determining method, text similarity determining equipment and a storage medium.

Background

The text similarity calculation is an important algorithm in text mining and is a link for linking basic research such as text modeling and representation and upper-layer application research of potential information of texts.

In the prior art, a Vector Space Model (VSM) is the most common text representation method, and text similarity is described through the VSM and then measured by using a similarity coefficient, a similarity distance and the like, so that the text similarity is relatively intuitive. The method has the difficulty that a vector space model is constructed, the term frequency-inverse document frequency (TF-IDF) is the most widely used weight calculation method, but the TF-IDF method ignores the relation between words, and important information is often lost in the task of calculating text similarity. The TextRank method constructs a network through adjacent relations between words, is similar to the PageRank algorithm, iteratively calculates the rank value of each node, then obtains the keywords of the text through ranking the rank values, and considers the relations between the words on the basis of considering the word frequency. Cosine similarity is an important method for calculating text similarity, and the aim is to calculate an included angle between vectors after text is vectorized through a vector space model, wherein the larger the cosine value of the included angle is, the smaller the included angle between two vectors is, the higher the similarity between two texts is. As can be seen from the above description, the method for calculating text similarity based on TextRank algorithm and VSM focuses more on using the same words appearing in the text as reference indexes, and although the position relationship between words is considered, the semantic association between words is not deeply considered, in real life, the method for calculating text similarity only considering word frequency and ignoring semantic information between words is useless, for example, when facing two articles that also describe natural language processing technology, it cannot distinguish the similarity between the article describing information extraction technology and the article describing entity extraction technology.

The text similarity calculation method in the prior art has low accuracy and recall rate and cannot meet the requirements of practical application.

Disclosure of Invention

The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, and aims to improve the accuracy and the recall rate of a text similarity calculation method.

In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:

performing text preprocessing on two preset texts to be subjected to text similarity comparison;

respectively extracting keywords of the two preprocessed texts;

expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;

based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;

extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;

and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.

In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, including:

the text preprocessing module is used for performing text preprocessing on two preset texts to be subjected to text similarity comparison;

the keyword extraction module is used for respectively extracting keywords of the two pre-set texts after text preprocessing;

the text vector similarity determining module is used for representing the two preset texts by a vector space model according to the vocabulary weight of the keyword and determining the text vector similarity;

the similarity matrix determining module is used for carrying out semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;

the text semantic similarity determining module is used for extracting word similarity values of the two preset texts according to the similarity matrix and determining text semantic similarity;

and the mixed text similarity determining module is used for determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.

In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:

one or more processors;

a memory for storing one or more programs,

when executed by the one or more processors, cause the one or more processors to implement a text similarity determination method as provided by any of the embodiments of the invention.

In a fourth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the text similarity determination method according to any of the embodiments of the present invention.

The embodiment of the invention calculates the text similarity by fusing the vector space model and the word semantics, solves the problem of low accuracy and recall rate of the text similarity calculation method, and realizes the effect of improving the accuracy and recall rate of the text similarity calculation method.

Drawings

Fig. 1 is a flowchart of a text similarity determining method according to a first embodiment of the present invention;

fig. 2 is a flowchart of a text similarity determination method in the second embodiment of the present invention;

fig. 3 is a schematic structural diagram of a text similarity determination apparatus in a third embodiment of the present invention;

fig. 4 is a schematic structural diagram of an apparatus in the fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a text similarity determining method according to an embodiment of the present invention, where the present embodiment is applicable to a case of calculating text similarity of a preset text, and the method may be executed by a text similarity determining apparatus, where the apparatus may be implemented by hardware and/or software, and the method specifically includes the following steps:

step 110, performing text preprocessing on two preset texts to be subjected to text similarity comparison;

the preset text is a Chinese text, wherein the sentence consists of words and phrases, the text preprocessing is required to be carried out on the preset text, and the text preprocessing operation comprises word segmentation, part of speech tagging and word stop operation on the preset text. Only words of a specified part of speech, such as nouns, verbs, adjectives, etc., are retained.

Step 120, extracting keywords of the two pre-set texts after text preprocessing respectively;

for the preset text which completes the text preprocessing, respective keywords need to be extracted, so that the subsequent text similarity calculation can be performed according to the keywords. Optionally, keywords of two pre-set texts after text pre-processing are respectively extracted through a TextRank algorithm.

Step 130, representing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;

the word weight of the keyword of each preset text is obtained through a TextRank algorithm, the feature word weight corresponding to each preset text is calculated respectively, the text vector of each preset text is constructed through a vector space model, and the text vector similarity of the two preset texts is calculated according to the text vectors of the two preset texts.

140, performing semantic similarity calculation on words of two preset texts based on a preset semantic network to obtain a similarity matrix;

the preset semantic network may be the known network (the name of the internet is HowNet), and the known network is a common knowledge base which uses concepts represented by words of chinese and english as description objects and discloses relationships between the concepts and attributes of the concepts as basic contents. HowNet takes the idea of a reduced theory, and considers that vocabulary/word meaning can be described by a smaller semantic unit. This semantic unit is called "Sememe" (Sememe), which is, as the name implies, atomic semantics, the most basic, smallest semantic unit that is not amenable to subdivision. In the process of continuous labeling, HowNet gradually constructs a set of fine semantic system. HowNet accumulates semantic information labeled with hundreds of thousands of words/senses based on this sense system. Complex semantic relations such as host, modifier, belong and the like are marked among the sememes, so that semantic information of the word senses can be accurately represented. Two main attributes are defined in the Homing network: "concepts", which are abstract profiles of what words are meant, a single word may correspond to one or more "concepts" depending on the context of the word; a "semantic" is a unit of meaning that is the smallest fundamental unit used to characterize a "concept" in a word. The sememes form a tree hierarchy, and the similarity calculation between the sememes is performed on the basis of the hierarchy. And taking the key words of each preset text as characteristic items of the preset text, and constructing a similarity matrix according to the similarity between each characteristic item in one preset text and each characteristic item in the other preset text.

Step 150, extracting word similarity values of two preset texts according to the similarity matrix, and determining text semantic similarity;

and step 160, determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.

Whether the two preset texts are similar or not can be judged according to the similarity of the mixed texts of the two preset texts, the similarity of the mixed texts can be compared with a preset threshold, when the similarity of the mixed texts is larger than the preset threshold, the two preset texts are considered to be similar, and otherwise, the two texts are judged to be dissimilar.

According to the technical scheme, the text similarity is calculated by fusing the vector space model and the word semantics, the problem that the accuracy and the recall rate of the text similarity calculation method are not high is solved, and the effect of improving the accuracy and the recall rate of the text similarity calculation method is achieved.

Example two

Fig. 2 is a flowchart of a text similarity determining method according to a second embodiment of the present invention, where the technical solution of this embodiment is further refined on the basis of the above technical solution, and specifically includes:

step 201, performing text preprocessing on two preset texts to be subjected to text similarity comparison;

step 202, extracting keywords of two preset texts after text preprocessing respectively;

the TextRank algorithm is used to extract keywords from two preset texts, for example, in this embodiment, 20 keywords are reserved in the two preset texts.

Step 203, taking a union set of keyword sets of two preset texts as feature words of a vector space model; wherein the feature word is expressed as F ═ { F ═ F₁，f₂，...，f_nThe keyword sets of the two preset texts are respectively represented as

And

n is speciallyThe number of tokens;

step 204, carrying out first characteristic words F on a first text of two preset texts₁The weight calculation method comprises the following steps:

wherein f is_1iIs the ith feature word of the first text,

is the set of keyword weight values of the first text, m is the number of keywords of the first text, ω_1jIs f_1iIn that

Weight value of the corresponding keyword in (1), w_1iIs f_1iIf the weighted value is greater than the weighted value, the text vector expression of the first text is T₁＝[w₁₁，w₁₂，...，w_1n]；

Step 205, for the second feature word F of the second text of the two preset texts₂The weight calculation method comprises the following steps:

wherein f is_2iIs the ith feature word of the second text,

is the set of keyword weight values of the second text, m is the number of keywords of the second text, ω_2jIs f_2iIn that

Weight value of the corresponding keyword in (1), w_2iIs f_2iIf the weight value is greater than the threshold value, the text vector expression of the second text is T₂＝[w₂₁，w₂₂，...，w_2n]。

Step 206, the text vector similarity calculation formula is:

wherein, T₁、T₂The first text and the second text are in turn. n denotes the feature word vector dimension, i.e. the number of elements of the keyword set of two predetermined texts, w_1i、w_2iIs the feature word weight.

Step 207, based on the known network, the calculation formula of the similarity of the sememes is as follows:

wherein S is₁、S₂The expression of two of the sense-elements,

is two sememes S₁、S₂The depth of the hierarchical tree in which each is located; distance (S)₁，S₂) α is a variable parameter, set to 0.5 in this example;

step 208, calculating the similarity of the concept of the fictitious word in the same way as the calculation formula of the similarity of the sememes;

the similarity of the real word concepts is calculated as follows:

wherein C is₁、C₂Representing two concepts β_i(1. ltoreq. i.ltoreq.4) is an adjustable parameter of β₁+β₂+β₃+β₄1 and β₁≥β₂≥β₃≥β₄(ii) a According to the difference in the sense, Sim₁(S₁，S₂) Is the similarity of the first independent primitive description, Sim₂(S₁，S₂) Is the similarity of other independent sense description formulas, Sim₃(S₁，S₂) Are similarities of relational primitive descriptionsDegree, Sim₄(S₁，S₂) Is the similarity of the symbolic primitive description;

step 209, assume W₁And W₂Are two Chinese words, W₁There are l concepts: c₁₁，C₁₂，...，C_1l；W₂There are k concepts: c₂₁，C₂₂，...，C_2k(ii) a Then W is₁And W₂Can pass through C_1iAnd C_2jThe maximum value of similarity in all combinations represents, i.e.:

Sim(W₁，W₂)＝MAX(Sim(C_1i，C_2j))，

wherein, i is 1, 2,. and l; j ═ 1, 2,. k;

step 210, the feature items of the two preset texts are respectively the keywords W of the two preset texts,

the similarity matrix M is then as follows:

wherein, Sim (W)_1x，W_2y) And m is the similarity value between the xth characteristic item in the first text and the yth characteristic item in the second text, and m is the number of the keywords extracted from the texts.

Step 211, taking the largest one of the similarity matrixes as Max (i), recording the largest one, and deleting the row and the column which the Max (i) belongs to from the similarity matrix; repeating the above process until no element exists in the matrix, and calculating the semantic similarity S of the text_HowNetThe calculation formula is as follows:

wherein m is the number of keywords extracted from the text, T₁、T₂The first text and the second text are in turn.

Step 212, the calculation formula of the similarity S of the mixed text is as follows:

S＝γ*S_VSM(T₁，T₂)+(1-γ)*S_HowNet(T₁，T₂)，

wherein gamma is a variable parameter, and gamma is more than or equal to 0 and less than or equal to 1.

The text similarity determining method provided by the embodiment is a text similarity calculating method fusing a vector space model and word semantics, comprehensively considers the position relation of key words in a text and the semantic relation among words, extracts text keywords on the basis of a TextRank algorithm, comprehensively calculates the semantic relation among the keywords by combining the vector space model and a HowNet model, pays more importance on the semantic correlation among the texts than a general text similarity calculating method, considers the difference among different semantic correlation calculating methods instead of simply considering the coincidence degree of words appearing at high frequency or calculating the semantic correlation in a single mode, and can well meet the requirements of practical application.

EXAMPLE III

Fig. 3 is a schematic structural diagram of a text similarity determining apparatus according to a third embodiment of the present invention, where the text similarity determining apparatus includes:

the text preprocessing module 310 is configured to perform text preprocessing on two preset texts to be subjected to text similarity comparison;

the keyword extraction module 320 is configured to extract keywords of two preset texts after text preprocessing respectively;

the text vector similarity determining module 330 is configured to represent the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determine a text vector similarity;

the similarity matrix determination module 340 is configured to perform semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;

the text semantic similarity determining module 350 is configured to extract word similarity values of two preset texts according to the similarity matrix, and determine text semantic similarity;

and a mixed text similarity determining module 360, configured to determine a mixed text similarity between the two preset texts according to the text vector similarity and the text semantic similarity.

Optionally, the text preprocessing module 310 is specifically configured to:

and performing word segmentation, part of speech tagging and word stop operation on the preset text.

Optionally, the keyword extraction module 320 is specifically configured to:

and respectively extracting keywords of the two preset texts after the texts are preprocessed through a TextRank algorithm.

Optionally, the text vector similarity determining module 330 is specifically configured to:

taking a union set of keyword sets of two preset texts as feature words of a vector space model; wherein the feature word is expressed as F ═ { F ═ F₁，f₂，...，f_nThe keyword sets of the two preset texts are respectively represented as

And

n is the number of the feature words;

for the first character word F of the first text of the two preset texts₁The weight calculation method comprises the following steps:

wherein f is_1iIs the ith feature word of the first text,

set of keyword weight values for a first textM is the number of keywords of the first text, ω_1jIs f_1iIn that

For the second characteristic word F of the second text of the two preset texts₂The weight calculation method comprises the following steps:

wherein f is_2iIs the ith feature word of the second text,

Optionally, the text vector similarity determining module 330 is further specifically configured to:

the text vector similarity calculation formula is as follows:

wherein, T₁、T₂The first text and the second text are in turn.

Optionally, the similarity matrix determining module 340 is specifically configured to:

based on the known network, the calculation formula of the similarity of the sememes is as follows:

wherein S is₁、S₂The expression of two of the sense-elements,

is two sememes S₁、S₂The depth of the hierarchical tree in which each is located; distance (S)₁，S₂) The path length of two sememes in the hierarchical tree, α is a variable parameter;

calculating the similarity of the concept of the fictitious word in the same way as the calculation formula of the similarity of the sememes;

the similarity of the real word concepts is calculated as follows:

wherein C is₁、C₂Representing two concepts β_i(1. ltoreq. i.ltoreq.4) is an adjustable parameter of β₁+β₂+β₃+β₄1 and β₁≥β₂≥β₃≥β₄(ii) a According to the difference in the sense, Sim₁(S₁，S₂) Is the similarity of the first independent primitive description, Sim₂(S₁，S₂) Is the similarity of other independent sense description formulas, Sim₃(S₁，S₂) Is the similarity of the relational semantic description, Sim₄(S₁，S₂) Is the similarity of the symbolic primitive description;

suppose W₁And W₂Are two Chinese words, W₁There are l concepts: c₁₁，C₁₂，...，C_1l；W₂There are k concepts: c₂₁，C₂₂，...，C_2k(ii) a Then W is₁And W₂Can pass through C_1iAnd C_2jThe maximum value of similarity in all combinations represents, i.e.:

Sim(W₁，W₂)＝MAX(Sim(C_1i，C_2j))，

wherein, i is 1, 2,. and l; j ═ 1, 2,. k;

the feature items of the two preset texts are respectively keywords W of the two preset texts,

the similarity matrix M is then as follows:

Optionally, the text semantic similarity determining module 350 is specifically configured to:

taking the largest one of the similarity matrixes as Max (i), recording the largest one, and deleting the row and the column which the Max (i) belongs to from the similarity matrixes; repeating the above process until no element exists in the matrix, and calculating the semantic similarity S of the text_HowNetThe calculation formula is as follows:

Optionally, the mixed text similarity determining module 360 is specifically configured to:

the calculation formula of the similarity S of the mixed text is as follows:

S＝γ*S_VSM(T₁，T₂)+(1-γ)*S_HowNet(T₁，T₂)，

The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.

Example four

Fig. 4 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the apparatus includes a processor 410, a memory 420, an input device 430, and an output device 440; the number of the processors 410 in the device may be one or more, and one processor 410 is taken as an example in fig. 4; the processor 410, the memory 420, the input device 430 and the output device 440 in the apparatus may be connected by a bus or other means, for example, in fig. 4.

The memory 420 serves as a computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, the text preprocessing module 310, the keyword extraction module 320, the text vector similarity determination module 330, the similarity matrix determination module 340, the text semantic similarity determination module 350, and the mixed text similarity determination module 360 in the text similarity determination device). The processor 410 executes various functional applications of the device and data processing by executing software programs, instructions, and modules stored in the memory 420, that is, implements the text similarity determination method described above.

The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input means 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus. The output device 440 may include a display device such as a display screen.

EXAMPLE five

An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a text similarity determination method, and the method includes:

respectively extracting keywords of the two preprocessed texts;

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A text similarity determination method is characterized by comprising the following steps:

respectively extracting keywords of the two preprocessed texts;

2. The method according to claim 1, wherein the text preprocessing is performed on two preset texts to be subjected to text similarity comparison, and comprises:

3. The method according to claim 1, wherein the extracting keywords of the two pre-set texts after text pre-processing respectively comprises:

and respectively extracting keywords of the two pre-set texts after text preprocessing by a TextRank algorithm.

4. The method according to claim 1, wherein said representing two of said predetermined texts in a vector space model according to the vocabulary weight of said keyword comprises:

taking a union set of keyword sets of the two preset texts as feature words of the vector space model; wherein the feature word is represented as F ═ { F₁，f₂，...，f_nAnd the keyword sets of the two preset texts are respectively represented as

And

n is the number of the feature words;

for the first text of the two preset textsCharacteristic word F₁The weight calculation method comprises the following steps:

wherein f is_1iFor the ith said feature word of said first text,

a set of keyword weight values for the first text, m being the number of keywords of the first text, ω_1jIs f_1iIn that

Weight value of the corresponding keyword in (1), w_1iIs f_1iIf the weight value is greater than the threshold value, the text vector expression of the first text is T₁＝[w₁₁，w₁₂，...，w_1n]；

wherein f is_2iFor the ith feature word of the second text,

a set of keyword weight values for the second text, m being the number of keywords of the second text, ω_2jIs f_2iIn that

5. The method of claim 4, wherein determining the text vector similarity comprises:

the text vector similarity calculation formula is as follows:

wherein, T₁、T₂The first text and the second text are in turn.

6. The method according to claim 5, wherein the semantic similarity calculation is performed on words of two preset texts based on a preset semantic network to obtain a similarity matrix, and the similarity matrix comprises:

wherein S is₁、S₂The expression of two of the sense-elements,

the similarity of the real word concepts is calculated as follows:

Sim(W₁，W₂)＝MAX(Sim(C_1i，C_2j))，

wherein, i is 1, 2,. and l; j ═ 1, 2,. k;

then the similarity matrix

The following were used:

wherein, Sim (W)_1x，W_2y) And m is the similarity value between the xth characteristic item in the first text and the yth characteristic item in the second text, and m is the number of keywords extracted from the texts.

7. The method according to claim 6, wherein the extracting word similarity values of two preset texts according to the similarity matrix to determine semantic similarity of the texts comprises:

8. The method of claim 7, wherein determining a mixed text similarity of two preset texts according to the text vector similarity and the text semantic similarity comprises:

the calculation formula of the similarity S of the mixed text is as follows:

S＝γ*S_VSM(T₁，T₂)+(1-γ)*S_HowNet(T₁，T₂)，

9. An apparatus, characterized in that the apparatus comprises:

one or more processors;

a memory for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method of any one of claims 1-8.

10. A storage medium containing computer-executable instructions for performing the text similarity determination method according to any one of claims 1 to 8 when executed by a computer processor.