CN111737997A - Text similarity determination method, text similarity determination equipment and storage medium - Google Patents

Text similarity determination method, text similarity determination equipment and storage medium

Info

Publication number
CN111737997A
CN111737997A (application number CN202010559737.2A)
Authority
CN
China
Prior art keywords
text
similarity
texts
preset
semantic
Prior art date
Legal status
Pending
Application number
CN202010559737.2A
Other languages
Chinese (zh)
Inventor
刘桂鹏
陈运文
桂洪冠
谭新
纪达麒
连明杰
Current Assignee
Datagrand Tech Inc
Original Assignee
Datagrand Tech Inc
Priority date
Filing date
Publication date
Application filed by Datagrand Tech Inc
Priority to CN202010559737.2A
Publication of CN111737997A

Classifications

    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F 40/00: Handling natural language data; G06F 40/20: Natural language analysis; G06F 40/279: Recognition of textual entities)
    • G06F 18/22: Matching criteria, e.g. proximity measures (G06F 18/00: Pattern recognition; G06F 18/20: Analysing)
    • G06F 40/205: Parsing (G06F 40/00: Handling natural language data; G06F 40/20: Natural language analysis)
    • G06F 40/30: Semantic analysis (G06F 40/00: Handling natural language data)

Abstract

The embodiments of the invention disclose a text similarity determination method, text similarity determination equipment and a storage medium, wherein the method comprises the following steps: performing text preprocessing on two preset texts whose text similarity is to be compared; extracting keywords from each of the two preprocessed texts; representing the two preset texts with a vector space model according to the vocabulary weights of the keywords, and determining a text vector similarity; performing semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix; extracting word similarity values of the two preset texts from the similarity matrix, and determining a text semantic similarity; and determining a mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity. The technical scheme of the embodiments of the invention improves both the accuracy and the recall rate of the text similarity calculation method.

Description

Text similarity determination method, text similarity determination equipment and storage medium
Technical Field
The embodiment of the invention relates to a computer text information processing technology, in particular to a text similarity determining method, text similarity determining equipment and a storage medium.
Background
Text similarity calculation is an important algorithm in text mining; it links basic research, such as text modeling and representation, with higher-level applications that mine the latent information of texts.
In the prior art, the Vector Space Model (VSM) is the most common text representation method: text is described through the VSM and similarity is then measured with a similarity coefficient, a similarity distance and the like, which is relatively intuitive. The difficulty of this approach lies in constructing the vector space model. Term frequency-inverse document frequency (TF-IDF) is the most widely used weight calculation method, but it ignores the relations between words, so important information is often lost in text similarity tasks. The TextRank method builds a network from the adjacency relations between words and, similarly to the PageRank algorithm, iteratively calculates a rank value for each node; the keywords of a text are then obtained by ranking these values, so the relations between words are considered in addition to word frequency. Cosine similarity is an important method for calculating text similarity: after the texts are vectorized through the vector space model, the angle between the vectors is computed, and the larger the cosine of the angle, the smaller the angle between the two vectors and the higher the similarity between the two texts. As can be seen from the above, text similarity methods based on the TextRank algorithm and the VSM mainly take the words shared by the texts as the reference index; although the positional relations between words are considered, the semantic associations between words are not considered in depth. In practice, a text similarity method that only considers word frequency and ignores the semantic information between words is inadequate: for example, given two articles that both describe natural language processing technology, it cannot distinguish how similar an article describing information extraction technology is to an article describing entity extraction technology.
The text similarity calculation method in the prior art has low accuracy and recall rate and cannot meet the requirements of practical application.
Disclosure of Invention
The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, and aims to improve the accuracy and the recall rate of a text similarity calculation method.
In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, including:
the text preprocessing module is used for performing text preprocessing on two preset texts to be subjected to text similarity comparison;
the keyword extraction module is used for respectively extracting keywords of the two pre-set texts after text preprocessing;
the text vector similarity determining module is used for representing the two preset texts by a vector space model according to the vocabulary weight of the keyword and determining the text vector similarity;
the similarity matrix determining module is used for carrying out semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;
the text semantic similarity determining module is used for extracting word similarity values of the two preset texts according to the similarity matrix and determining text semantic similarity;
and the mixed text similarity determining module is used for determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a text similarity determination method as provided by any of the embodiments of the invention.
In a fourth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the text similarity determination method according to any of the embodiments of the present invention.
The embodiment of the invention calculates the text similarity by fusing the vector space model and the word semantics, solves the problem of low accuracy and recall rate of the text similarity calculation method, and realizes the effect of improving the accuracy and recall rate of the text similarity calculation method.
Drawings
Fig. 1 is a flowchart of a text similarity determining method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a text similarity determination method in the second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text similarity determination apparatus in a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a text similarity determining method according to an embodiment of the present invention, where the present embodiment is applicable to a case of calculating text similarity of a preset text, and the method may be executed by a text similarity determining apparatus, where the apparatus may be implemented by hardware and/or software, and the method specifically includes the following steps:
step 110, performing text preprocessing on two preset texts to be subjected to text similarity comparison;
the preset text is a Chinese text, wherein the sentence consists of words and phrases, the text preprocessing is required to be carried out on the preset text, and the text preprocessing operation comprises word segmentation, part of speech tagging and word stop operation on the preset text. Only words of a specified part of speech, such as nouns, verbs, adjectives, etc., are retained.
Step 120, extracting keywords of the two pre-set texts after text preprocessing respectively;
for the preset text which completes the text preprocessing, respective keywords need to be extracted, so that the subsequent text similarity calculation can be performed according to the keywords. Optionally, keywords of two pre-set texts after text pre-processing are respectively extracted through a TextRank algorithm.
Step 130, representing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
the word weight of the keyword of each preset text is obtained through a TextRank algorithm, the feature word weight corresponding to each preset text is calculated respectively, the text vector of each preset text is constructed through a vector space model, and the text vector similarity of the two preset texts is calculated according to the text vectors of the two preset texts.
140, performing semantic similarity calculation on words of two preset texts based on a preset semantic network to obtain a similarity matrix;
the preset semantic network may be the known network (the name of the internet is HowNet), and the known network is a common knowledge base which uses concepts represented by words of chinese and english as description objects and discloses relationships between the concepts and attributes of the concepts as basic contents. HowNet takes the idea of a reduced theory, and considers that vocabulary/word meaning can be described by a smaller semantic unit. This semantic unit is called "Sememe" (Sememe), which is, as the name implies, atomic semantics, the most basic, smallest semantic unit that is not amenable to subdivision. In the process of continuous labeling, HowNet gradually constructs a set of fine semantic system. HowNet accumulates semantic information labeled with hundreds of thousands of words/senses based on this sense system. Complex semantic relations such as host, modifier, belong and the like are marked among the sememes, so that semantic information of the word senses can be accurately represented. Two main attributes are defined in the Homing network: "concepts", which are abstract profiles of what words are meant, a single word may correspond to one or more "concepts" depending on the context of the word; a "semantic" is a unit of meaning that is the smallest fundamental unit used to characterize a "concept" in a word. The sememes form a tree hierarchy, and the similarity calculation between the sememes is performed on the basis of the hierarchy. And taking the key words of each preset text as characteristic items of the preset text, and constructing a similarity matrix according to the similarity between each characteristic item in one preset text and each characteristic item in the other preset text.
Step 150, extracting word similarity values of two preset texts according to the similarity matrix, and determining text semantic similarity;
and step 160, determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
Whether the two preset texts are similar can be judged from their mixed text similarity: the mixed text similarity is compared with a preset threshold, and when it is greater than the preset threshold the two preset texts are considered similar; otherwise the two texts are judged to be dissimilar.
According to the technical scheme, the text similarity is calculated by fusing the vector space model and the word semantics, the problem that the accuracy and the recall rate of the text similarity calculation method are not high is solved, and the effect of improving the accuracy and the recall rate of the text similarity calculation method is achieved.
Example two
Fig. 2 is a flowchart of a text similarity determining method according to a second embodiment of the present invention, where the technical solution of this embodiment is further refined on the basis of the above technical solution, and specifically includes:
step 201, performing text preprocessing on two preset texts to be subjected to text similarity comparison;
step 202, extracting keywords of two preset texts after text preprocessing respectively;
the TextRank algorithm is used to extract keywords from two preset texts, for example, in this embodiment, 20 keywords are reserved in the two preset texts.
Step 203, taking the union of the keyword sets of the two preset texts as the feature words of the vector space model; the feature words are expressed as F = {f_1, f_2, ..., f_n}, and n is the number of feature words;
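A minimal sketch of step 203 follows, assuming the keyword sets are given as dictionaries mapping each keyword to its TextRank weight (an assumed data layout, reused in the later sketches).

```python
def build_feature_words(kw1, kw2):
    """Feature words F: the union of the two keyword sets, in a fixed order."""
    return sorted(set(kw1) | set(kw2))
```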
Step 204, for the first text of the two preset texts, the weight of each feature word is calculated as follows:

w_1i = ω_1j, if the feature word f_1i matches the j-th keyword of the first text; w_1i = 0, otherwise,

wherein f_1i is the i-th feature word, the keyword weight values of the first text are ω_11, ω_12, ..., ω_1m, m is the number of keywords of the first text, ω_1j is the weight value of the keyword corresponding to f_1i, and w_1i is the resulting weight value of f_1i; the text vector of the first text is then T_1 = [w_11, w_12, ..., w_1n];
Step 205, for the second text of the two preset texts, the weight of each feature word is calculated as follows:

w_2i = ω_2j, if the feature word f_2i matches the j-th keyword of the second text; w_2i = 0, otherwise,

wherein f_2i is the i-th feature word, the keyword weight values of the second text are ω_21, ω_22, ..., ω_2m, m is the number of keywords of the second text, ω_2j is the weight value of the keyword corresponding to f_2i, and w_2i is the resulting weight value of f_2i; the text vector of the second text is then T_2 = [w_21, w_22, ..., w_2n].
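Steps 204 and 205 reduce to the same lookup, sketched below under the assumed dictionary layout: a feature word keeps its TextRank weight if it is a keyword of the text, and gets weight 0 otherwise.

```python
def text_vector(feature_words, kw_weights):
    """Build T = [w_1, ..., w_n]: w_i is the TextRank weight if f_i is a keyword, else 0."""
    return [kw_weights.get(f, 0.0) for f in feature_words]
```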
Step 206, the text vector similarity is calculated as:

S_VSM(T_1, T_2) = ( Σ_{i=1}^{n} w_1i · w_2i ) / ( √(Σ_{i=1}^{n} w_1i^2) · √(Σ_{i=1}^{n} w_2i^2) ),

wherein T_1 and T_2 are the first text and the second text in turn, n denotes the feature word vector dimension, i.e. the number of elements in the union of the keyword sets of the two preset texts, and w_1i, w_2i are the feature word weights;
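The cosine similarity of step 206 can be sketched as follows (plain Python, no external dependencies):

```python
import math

def cosine_similarity(t1, t2):
    """S_VSM(T1, T2): cosine of the angle between the two text vectors."""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm = math.sqrt(sum(a * a for a in t1)) * math.sqrt(sum(b * b for b in t2))
    return dot / norm if norm else 0.0
```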
Step 207, based on HowNet, the sememe similarity is calculated as:

Sim(S_1, S_2) = α · (l_1 + l_2) / ( Distance(S_1, S_2) + α · (l_1 + l_2) ),

wherein S_1 and S_2 denote two sememes, l_1 and l_2 are the depths of the hierarchy-tree nodes at which S_1 and S_2 are respectively located, Distance(S_1, S_2) is the path length between the two sememes in the hierarchy tree, and α is an adjustable parameter, set to 0.5 in this embodiment;
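A sketch of the sememe similarity of step 207 is given below. The exact formula cannot be fully recovered from this text, so the sketch implements one common HowNet-style variant combining path length and node depths; the `hierarchy` object with `path_length` and `depth` methods is an assumed interface, not a real library API.

```python
def sememe_similarity(s1, s2, hierarchy, alpha=0.5):
    """Similarity of two sememes from their path length and depths in the sememe tree.
    hierarchy.path_length(s1, s2) and hierarchy.depth(s) are assumed helpers."""
    depth_sum = hierarchy.depth(s1) + hierarchy.depth(s2)
    distance = hierarchy.path_length(s1, s2)
    return alpha * depth_sum / (distance + alpha * depth_sum)
```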
Step 208, the similarity of function word concepts is calculated in the same way as the sememe similarity; the similarity of content word concepts is calculated as follows:

Sim(C_1, C_2) = Σ_{i=1}^{4} β_i · Π_{j=1}^{i} Sim_j(S_1, S_2),

wherein C_1 and C_2 denote two concepts, β_i (1 ≤ i ≤ 4) are adjustable parameters with β_1 + β_2 + β_3 + β_4 = 1 and β_1 ≥ β_2 ≥ β_3 ≥ β_4; according to the different sememe descriptions, Sim_1(S_1, S_2) is the similarity of the first basic sememe descriptions, Sim_2(S_1, S_2) is the similarity of the other basic sememe descriptions, Sim_3(S_1, S_2) is the similarity of the relation sememe descriptions, and Sim_4(S_1, S_2) is the similarity of the relation symbol descriptions;
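Step 208 combines the four partial similarities with decreasing weights; a sketch follows, where the concrete β values are illustrative (they only need to sum to 1 and be non-increasing).

```python
def concept_similarity(partial_sims, betas=(0.5, 0.2, 0.17, 0.13)):
    """Sim(C1, C2) = sum_i beta_i * prod_{j<=i} Sim_j, with the four partial
    similarities Sim_1..Sim_4 passed in as partial_sims."""
    total, product = 0.0, 1.0
    for beta, sim in zip(betas, partial_sims):
        product *= sim            # cumulative product Sim_1 * ... * Sim_i
        total += beta * product
    return total
```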
Step 209, assuming that W_1 and W_2 are two Chinese words, W_1 has l concepts C_11, C_12, ..., C_1l and W_2 has k concepts C_21, C_22, ..., C_2k, the similarity of W_1 and W_2 can be represented by the maximum similarity over all combinations of C_1i and C_2j, i.e.:

Sim(W_1, W_2) = MAX( Sim(C_1i, C_2j) ),

wherein i = 1, 2, ..., l and j = 1, 2, ..., k;
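Step 209 is a maximum over all concept pairs; a minimal sketch, assuming `concept_sim` is a callable such as the one above applied to two concepts:

```python
def word_similarity(concepts1, concepts2, concept_sim):
    """Sim(W1, W2): the largest similarity over all combinations of concepts."""
    return max(concept_sim(c1, c2) for c1 in concepts1 for c2 in concepts2)
```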
Step 210, the feature items of the two preset texts are their respective keywords, namely W_11, W_12, ..., W_1m for the first text and W_21, W_22, ..., W_2m for the second text; the similarity matrix M is then as follows:

M = [ Sim(W_11, W_21)  Sim(W_11, W_22)  ...  Sim(W_11, W_2m)
      Sim(W_12, W_21)  Sim(W_12, W_22)  ...  Sim(W_12, W_2m)
      ...
      Sim(W_1m, W_21)  Sim(W_1m, W_22)  ...  Sim(W_1m, W_2m) ],

wherein Sim(W_1x, W_2y) is the similarity value between the x-th feature item of the first text and the y-th feature item of the second text, and m is the number of keywords extracted from each text;
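The similarity matrix of step 210 is then a straightforward nested loop; `word_sim` is an assumed callable implementing Sim(W1, W2) as above:

```python
def similarity_matrix(keywords1, keywords2, word_sim):
    """M[x][y] = Sim(W_1x, W_2y) for the keywords of the first and second text."""
    return [[word_sim(w1, w2) for w2 in keywords2] for w1 in keywords1]
```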
Step 211, taking the largest element of the similarity matrix, denoted Max(i), recording it, and deleting the row and the column to which Max(i) belongs from the similarity matrix; this process is repeated until no element remains in the matrix, and the text semantic similarity S_HowNet is then calculated as:

S_HowNet(T_1, T_2) = ( Σ_{i=1}^{m} Max(i) ) / m,

wherein m is the number of keywords extracted from each text, and T_1, T_2 are the first text and the second text in turn;
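A sketch of the greedy extraction of step 211: repeatedly pick the largest remaining entry, delete its row and column, and average the picked maxima (for an m x m matrix the divisor equals m, matching the formula above).

```python
def semantic_similarity(matrix):
    """S_HowNet: average of the maxima picked while shrinking the similarity matrix."""
    m = [row[:] for row in matrix]            # work on a copy
    picked = []
    while m and m[0]:
        x, y = max(((i, j) for i in range(len(m)) for j in range(len(m[0]))),
                   key=lambda ij: m[ij[0]][ij[1]])
        picked.append(m[x][y])
        del m[x]                              # delete the row of the maximum
        for row in m:
            del row[y]                        # delete its column
    return sum(picked) / len(picked) if picked else 0.0
```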
Step 212, the mixed text similarity S is calculated as:

S = γ · S_VSM(T_1, T_2) + (1 - γ) · S_HowNet(T_1, T_2),

wherein γ is an adjustable parameter and 0 ≤ γ ≤ 1.
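Step 212 is a convex combination of the two similarities; in the sketch below the default γ = 0.6 is purely illustrative, since the embodiment only requires 0 ≤ γ ≤ 1.

```python
def mixed_similarity(s_vsm, s_hownet, gamma=0.6):
    """S = gamma * S_VSM + (1 - gamma) * S_HowNet, with 0 <= gamma <= 1."""
    return gamma * s_vsm + (1 - gamma) * s_hownet
```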
The text similarity determining method provided by this embodiment fuses a vector space model with word semantics. It considers both the positional relations of key vocabulary within a text and the semantic relations between words: text keywords are extracted on the basis of the TextRank algorithm, and the semantic relations between the keywords are then computed by combining the vector space model with HowNet. Compared with common text similarity calculation methods, it attaches more importance to the semantic correlation between texts, and it takes into account the differences between semantic correlation measures instead of simply considering the overlap of high-frequency words or computing semantic correlation in a single way, so it can meet the requirements of practical applications well.
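Putting the steps of this embodiment together, an end-to-end sketch could look as follows; it reuses the illustrative helpers sketched above, and `hownet_word_sim` stands for an assumed HowNet-backed callable implementing Sim(W1, W2) from steps 207 to 209.

```python
def mixed_text_similarity(text1, text2, hownet_word_sim, gamma=0.6):
    """End-to-end sketch of the method of this embodiment (illustrative only)."""
    # Steps 201-202: keyword extraction (POS filtering is folded into TextRank here).
    kw1 = dict(extract_keywords(text1))
    kw2 = dict(extract_keywords(text2))
    # Steps 203-206: vector space model similarity.
    features = build_feature_words(kw1, kw2)
    s_vsm = cosine_similarity(text_vector(features, kw1),
                              text_vector(features, kw2))
    # Steps 207-211: HowNet-based semantic similarity over the keyword lists.
    matrix = similarity_matrix(list(kw1), list(kw2), hownet_word_sim)
    s_hownet = semantic_similarity(matrix)
    # Step 212: mixed similarity.
    return mixed_similarity(s_vsm, s_hownet, gamma)
```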
EXAMPLE III
Fig. 3 is a schematic structural diagram of a text similarity determining apparatus according to a third embodiment of the present invention, where the text similarity determining apparatus includes:
the text preprocessing module 310 is configured to perform text preprocessing on two preset texts to be subjected to text similarity comparison;
the keyword extraction module 320 is configured to extract keywords of two preset texts after text preprocessing respectively;
the text vector similarity determining module 330 is configured to represent the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determine a text vector similarity;
the similarity matrix determination module 340 is configured to perform semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;
the text semantic similarity determining module 350 is configured to extract word similarity values of two preset texts according to the similarity matrix, and determine text semantic similarity;
and a mixed text similarity determining module 360, configured to determine a mixed text similarity between the two preset texts according to the text vector similarity and the text semantic similarity.
According to the technical scheme, the text similarity is calculated by fusing the vector space model and the word semantics, the problem that the accuracy and the recall rate of the text similarity calculation method are not high is solved, and the effect of improving the accuracy and the recall rate of the text similarity calculation method is achieved.
Optionally, the text preprocessing module 310 is specifically configured to:
performing word segmentation, part-of-speech tagging and stop-word removal on the preset texts.
Optionally, the keyword extraction module 320 is specifically configured to:
and respectively extracting keywords of the two preset texts after the texts are preprocessed through a TextRank algorithm.
Optionally, the text vector similarity determining module 330 is specifically configured to:
taking the union of the keyword sets of the two preset texts as the feature words of the vector space model; the feature words are expressed as F = {f_1, f_2, ..., f_n}, and n is the number of feature words;
for the first text of the two preset texts, calculating the weight of each feature word as follows: w_1i = ω_1j, if the feature word f_1i matches the j-th keyword of the first text; w_1i = 0, otherwise, wherein f_1i is the i-th feature word, the keyword weight values of the first text are ω_11, ω_12, ..., ω_1m, m is the number of keywords of the first text, ω_1j is the weight value of the keyword corresponding to f_1i, and w_1i is the resulting weight value of f_1i; the text vector of the first text is then T_1 = [w_11, w_12, ..., w_1n];
for the second text of the two preset texts, calculating the weight of each feature word as follows: w_2i = ω_2j, if the feature word f_2i matches the j-th keyword of the second text; w_2i = 0, otherwise, wherein f_2i is the i-th feature word, the keyword weight values of the second text are ω_21, ω_22, ..., ω_2m, m is the number of keywords of the second text, ω_2j is the weight value of the keyword corresponding to f_2i, and w_2i is the resulting weight value of f_2i; the text vector of the second text is then T_2 = [w_21, w_22, ..., w_2n].
Optionally, the text vector similarity determining module 330 is further specifically configured to:
the text vector similarity is calculated as: S_VSM(T_1, T_2) = ( Σ_{i=1}^{n} w_1i · w_2i ) / ( √(Σ_{i=1}^{n} w_1i^2) · √(Σ_{i=1}^{n} w_2i^2) ), wherein T_1 and T_2 are the first text and the second text in turn.
Optionally, the similarity matrix determining module 340 is specifically configured to:
based on HowNet, the sememe similarity is calculated as: Sim(S_1, S_2) = α · (l_1 + l_2) / ( Distance(S_1, S_2) + α · (l_1 + l_2) ), wherein S_1 and S_2 denote two sememes, l_1 and l_2 are the depths of the hierarchy-tree nodes at which S_1 and S_2 are respectively located, Distance(S_1, S_2) is the path length between the two sememes in the hierarchy tree, and α is an adjustable parameter;
the similarity of function word concepts is calculated in the same way as the sememe similarity; the similarity of content word concepts is calculated as follows: Sim(C_1, C_2) = Σ_{i=1}^{4} β_i · Π_{j=1}^{i} Sim_j(S_1, S_2), wherein C_1 and C_2 denote two concepts, β_i (1 ≤ i ≤ 4) are adjustable parameters with β_1 + β_2 + β_3 + β_4 = 1 and β_1 ≥ β_2 ≥ β_3 ≥ β_4; according to the different sememe descriptions, Sim_1(S_1, S_2) is the similarity of the first basic sememe descriptions, Sim_2(S_1, S_2) is the similarity of the other basic sememe descriptions, Sim_3(S_1, S_2) is the similarity of the relation sememe descriptions, and Sim_4(S_1, S_2) is the similarity of the relation symbol descriptions;
assuming that W_1 and W_2 are two Chinese words, W_1 has l concepts C_11, C_12, ..., C_1l and W_2 has k concepts C_21, C_22, ..., C_2k, the similarity of W_1 and W_2 can be represented by the maximum similarity over all combinations of C_1i and C_2j, i.e.: Sim(W_1, W_2) = MAX( Sim(C_1i, C_2j) ), wherein i = 1, 2, ..., l and j = 1, 2, ..., k;
the feature items of the two preset texts are their respective keywords, namely W_11, W_12, ..., W_1m for the first text and W_21, W_22, ..., W_2m for the second text; the similarity matrix M is then as follows:

M = [ Sim(W_11, W_21)  Sim(W_11, W_22)  ...  Sim(W_11, W_2m)
      Sim(W_12, W_21)  Sim(W_12, W_22)  ...  Sim(W_12, W_2m)
      ...
      Sim(W_1m, W_21)  Sim(W_1m, W_22)  ...  Sim(W_1m, W_2m) ],

wherein Sim(W_1x, W_2y) is the similarity value between the x-th feature item of the first text and the y-th feature item of the second text, and m is the number of keywords extracted from each text.
Optionally, the text semantic similarity determining module 350 is specifically configured to:
taking the largest element of the similarity matrix, denoted Max(i), recording it, and deleting the row and the column to which Max(i) belongs from the similarity matrix; repeating this process until no element remains in the matrix, and then calculating the text semantic similarity S_HowNet as: S_HowNet(T_1, T_2) = ( Σ_{i=1}^{m} Max(i) ) / m, wherein m is the number of keywords extracted from each text, and T_1, T_2 are the first text and the second text in turn.
Optionally, the mixed text similarity determining module 360 is specifically configured to:
the mixed text similarity S is calculated as: S = γ · S_VSM(T_1, T_2) + (1 - γ) · S_HowNet(T_1, T_2), wherein γ is an adjustable parameter and 0 ≤ γ ≤ 1.
The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
Example four
Fig. 4 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the apparatus includes a processor 410, a memory 420, an input device 430, and an output device 440; the number of the processors 410 in the device may be one or more, and one processor 410 is taken as an example in fig. 4; the processor 410, the memory 420, the input device 430 and the output device 440 in the apparatus may be connected by a bus or other means, for example, in fig. 4.
The memory 420 serves as a computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, the text preprocessing module 310, the keyword extraction module 320, the text vector similarity determination module 330, the similarity matrix determination module 340, the text semantic similarity determination module 350, and the mixed text similarity determination module 360 in the text similarity determination device). The processor 410 executes various functional applications of the device and data processing by executing software programs, instructions, and modules stored in the memory 420, that is, implements the text similarity determination method described above.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus. The output device 440 may include a display device such as a display screen.
EXAMPLE five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a text similarity determination method, and the method includes:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A text similarity determination method is characterized by comprising the following steps:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
2. The method according to claim 1, wherein the text preprocessing is performed on two preset texts to be subjected to text similarity comparison, and comprises:
performing word segmentation, part-of-speech tagging and stop-word removal on the preset texts.
3. The method according to claim 1, wherein the extracting keywords of the two pre-set texts after text pre-processing respectively comprises:
and respectively extracting keywords of the two pre-set texts after text preprocessing by a TextRank algorithm.
4. The method according to claim 1, wherein said representing two of said predetermined texts in a vector space model according to the vocabulary weight of said keyword comprises:
taking the union of the keyword sets of the two preset texts as the feature words of the vector space model; wherein the feature words are expressed as F = {f_1, f_2, ..., f_n}, and n is the number of feature words;
for the first text of the two preset texts, calculating the weight of each feature word as follows: w_1i = ω_1j, if the feature word f_1i matches the j-th keyword of the first text; w_1i = 0, otherwise, wherein f_1i is the i-th feature word, the keyword weight values of the first text are ω_11, ω_12, ..., ω_1m, m is the number of keywords of the first text, ω_1j is the weight value of the keyword corresponding to f_1i, and w_1i is the resulting weight value of f_1i; the text vector of the first text is then T_1 = [w_11, w_12, ..., w_1n];
for the second text of the two preset texts, calculating the weight of each feature word as follows: w_2i = ω_2j, if the feature word f_2i matches the j-th keyword of the second text; w_2i = 0, otherwise, wherein f_2i is the i-th feature word, the keyword weight values of the second text are ω_21, ω_22, ..., ω_2m, m is the number of keywords of the second text, ω_2j is the weight value of the keyword corresponding to f_2i, and w_2i is the resulting weight value of f_2i; the text vector of the second text is then T_2 = [w_21, w_22, ..., w_2n].
5. The method of claim 4, wherein determining the text vector similarity comprises:
the text vector similarity is calculated as: S_VSM(T_1, T_2) = ( Σ_{i=1}^{n} w_1i · w_2i ) / ( √(Σ_{i=1}^{n} w_1i^2) · √(Σ_{i=1}^{n} w_2i^2) ), wherein T_1 and T_2 are the first text and the second text in turn.
6. The method according to claim 5, wherein the semantic similarity calculation is performed on words of two preset texts based on a preset semantic network to obtain a similarity matrix, and the similarity matrix comprises:
based on HowNet, the sememe similarity is calculated as: Sim(S_1, S_2) = α · (l_1 + l_2) / ( Distance(S_1, S_2) + α · (l_1 + l_2) ), wherein S_1 and S_2 denote two sememes, l_1 and l_2 are the depths of the hierarchy-tree nodes at which S_1 and S_2 are respectively located, Distance(S_1, S_2) is the path length between the two sememes in the hierarchy tree, and α is an adjustable parameter;
calculating the similarity of function word concepts in the same way as the sememe similarity; calculating the similarity of content word concepts as follows: Sim(C_1, C_2) = Σ_{i=1}^{4} β_i · Π_{j=1}^{i} Sim_j(S_1, S_2), wherein C_1 and C_2 denote two concepts, β_i (1 ≤ i ≤ 4) are adjustable parameters with β_1 + β_2 + β_3 + β_4 = 1 and β_1 ≥ β_2 ≥ β_3 ≥ β_4; according to the different sememe descriptions, Sim_1(S_1, S_2) is the similarity of the first basic sememe descriptions, Sim_2(S_1, S_2) is the similarity of the other basic sememe descriptions, Sim_3(S_1, S_2) is the similarity of the relation sememe descriptions, and Sim_4(S_1, S_2) is the similarity of the relation symbol descriptions;
assuming that W_1 and W_2 are two Chinese words, W_1 has l concepts C_11, C_12, ..., C_1l and W_2 has k concepts C_21, C_22, ..., C_2k, the similarity of W_1 and W_2 can be represented by the maximum similarity over all combinations of C_1i and C_2j, i.e.: Sim(W_1, W_2) = MAX( Sim(C_1i, C_2j) ), wherein i = 1, 2, ..., l and j = 1, 2, ..., k;
the feature items of the two preset texts are their respective keywords, namely W_11, W_12, ..., W_1m for the first text and W_21, W_22, ..., W_2m for the second text; the similarity matrix M is then as follows:

M = [ Sim(W_11, W_21)  Sim(W_11, W_22)  ...  Sim(W_11, W_2m)
      Sim(W_12, W_21)  Sim(W_12, W_22)  ...  Sim(W_12, W_2m)
      ...
      Sim(W_1m, W_21)  Sim(W_1m, W_22)  ...  Sim(W_1m, W_2m) ],

wherein Sim(W_1x, W_2y) is the similarity value between the x-th feature item of the first text and the y-th feature item of the second text, and m is the number of keywords extracted from each text.
7. The method according to claim 6, wherein the extracting word similarity values of two preset texts according to the similarity matrix to determine semantic similarity of the texts comprises:
taking the largest element of the similarity matrix, denoted Max(i), recording it, and deleting the row and the column to which Max(i) belongs from the similarity matrix; repeating this process until no element remains in the matrix, and then calculating the text semantic similarity S_HowNet as: S_HowNet(T_1, T_2) = ( Σ_{i=1}^{m} Max(i) ) / m, wherein m is the number of keywords extracted from each text, and T_1, T_2 are the first text and the second text in turn.
8. The method of claim 7, wherein determining a mixed text similarity of two preset texts according to the text vector similarity and the text semantic similarity comprises:
the mixed text similarity S is calculated as: S = γ · S_VSM(T_1, T_2) + (1 - γ) · S_HowNet(T_1, T_2), wherein γ is an adjustable parameter and 0 ≤ γ ≤ 1.
9. An apparatus, characterized in that the apparatus comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method of any one of claims 1-8.
10. A storage medium containing computer-executable instructions for performing the text similarity determination method according to any one of claims 1 to 8 when executed by a computer processor.
CN202010559737.2A 2020-06-18 2020-06-18 Text similarity determination method, text similarity determination equipment and storage medium Pending CN111737997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010559737.2A CN111737997A (en) 2020-06-18 2020-06-18 Text similarity determination method, text similarity determination equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010559737.2A CN111737997A (en) 2020-06-18 2020-06-18 Text similarity determination method, text similarity determination equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111737997A (en) 2020-10-02

Family

ID=72649742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010559737.2A Pending CN111737997A (en) 2020-06-18 2020-06-18 Text similarity determination method, text similarity determination equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111737997A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214558A (en) * 2020-11-18 2021-01-12 国家计算机网络与信息安全管理中心 Theme correlation degree judging method and device
CN112364620A (en) * 2020-11-06 2021-02-12 中国平安人寿保险股份有限公司 Text similarity judgment method and device and computer equipment
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN113837772A (en) * 2021-09-24 2021-12-24 支付宝(杭州)信息技术有限公司 Method, device and equipment for auditing marketing information
CN115688771A (en) * 2023-01-05 2023-02-03 京华信息科技股份有限公司 Document content comparison performance improving method and system
CN117273015A (en) * 2023-11-22 2023-12-22 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
JP2019086412A (en) * 2017-11-07 2019-06-06 大日本印刷株式会社 Inspection system, inspection method and method for manufacturing inspection system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
JP2019086412A (en) * 2017-11-07 2019-06-06 大日本印刷株式会社 Inspection system, inspection method and method for manufacturing inspection system
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANSHIKA PAL: "Effective Focused Crawling Based on Content and Link Structure Analysis", (IJCSIS) International Journal of Computer Science and Information Security, vol. 2, no. 1, pages 1-5 *
冯高磊 (Feng Gaolei), 高嵩峰 (Gao Songfeng): "基于向量空间模型结合语义的文本相似度算法" (A text similarity algorithm based on the vector space model combined with semantics), 现代电子技术 (Modern Electronics Technique), no. 11, pages 157-161 *
李周平 (Li Zhouping): "网络数据爬取与分析实务" (Practical Web Data Crawling and Analysis), vol. 1, 北京理工大学出版社 (Beijing Institute of Technology Press), pages 173-174 *
黄承慧 (Huang Chenghui): "一种结合词项语义信息和TF-IDF方法的文本相似度量方法" (A text similarity measure combining term semantic information with the TF-IDF method), 计算机学报 (Chinese Journal of Computers), no. 05, pages 98-106 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364620A (en) * 2020-11-06 2021-02-12 中国平安人寿保险股份有限公司 Text similarity judgment method and device and computer equipment
CN112364620B (en) * 2020-11-06 2024-04-05 中国平安人寿保险股份有限公司 Text similarity judging method and device and computer equipment
CN112214558A (en) * 2020-11-18 2021-01-12 国家计算机网络与信息安全管理中心 Theme correlation degree judging method and device
CN112214558B (en) * 2020-11-18 2023-08-15 国家计算机网络与信息安全管理中心 Theme relevance discriminating method and device
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN112364947B (en) * 2021-01-14 2021-06-29 北京育学园健康管理中心有限公司 Text similarity calculation method and device
CN113837772A (en) * 2021-09-24 2021-12-24 支付宝(杭州)信息技术有限公司 Method, device and equipment for auditing marketing information
CN115688771A (en) * 2023-01-05 2023-02-03 京华信息科技股份有限公司 Document content comparison performance improving method and system
CN115688771B (en) * 2023-01-05 2023-03-21 京华信息科技股份有限公司 Document content comparison performance improving method and system
CN117273015A (en) * 2023-11-22 2023-12-22 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis
CN117273015B (en) * 2023-11-22 2024-02-13 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis

Similar Documents

Publication Publication Date Title
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN107168954B (en) Text keyword generation method and device, electronic equipment and readable storage medium
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
CN105095204B (en) The acquisition methods and device of synonym
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
Nayak et al. Survey on pre-processing techniques for text mining
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
US10482146B2 (en) Systems and methods for automatic customization of content filtering
KR20170004154A (en) Method and system for automatically summarizing documents to images and providing the image-based contents
CN111539197A (en) Text matching method and device, computer system and readable storage medium
US20200073890A1 (en) Intelligent search platforms
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN114357117A (en) Transaction information query method and device, computer equipment and storage medium
CN110717038A (en) Object classification method and device
Kokane et al. Word sense disambiguation: a supervised semantic similarity based complex network approach
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN106951511A (en) A kind of Text Clustering Method and device
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN111985217B (en) Keyword extraction method, computing device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination