CN111737997A - Text similarity determination method, text similarity determination equipment and storage medium - Google Patents
- Publication number
- CN111737997A (application CN202010559737.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- texts
- preset
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The embodiment of the invention discloses a text similarity determination method, text similarity determination equipment and a storage medium, wherein the method comprises the following steps: performing text preprocessing on two preset texts to be subjected to text similarity comparison; respectively extracting keywords of the two preprocessed texts; expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors; based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix; extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity; and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity. According to the technical scheme of the embodiment of the invention, the accuracy and the recall rate of the text similarity calculation method are improved.
Description
Technical Field
The embodiment of the invention relates to a computer text information processing technology, in particular to a text similarity determining method, text similarity determining equipment and a storage medium.
Background
Text similarity calculation is an important algorithm in text mining; it links basic research, such as text modeling and representation, with upper-layer application research on the latent information in texts.
In the prior art, the Vector Space Model (VSM) is the most common text representation method: texts are represented with the VSM and their similarity is then measured with a similarity coefficient, a similarity distance or the like, which is relatively intuitive. The difficulty of this approach lies in constructing the vector space model. Term frequency-inverse document frequency (TF-IDF) is the most widely used weight calculation method, but it ignores the relations between words, so important information is often lost when calculating text similarity. The TextRank method, similar to the PageRank algorithm, builds a network from the adjacency relations between words, iteratively calculates the rank value of each node, and then obtains the keywords of the text by sorting the rank values; it thus considers the relations between words in addition to word frequency. Cosine similarity is an important method for calculating text similarity: after the texts are vectorized with a vector space model, the angle between the vectors is calculated, and the larger the cosine of that angle, the smaller the angle between the two vectors and the higher the similarity between the two texts.
As can be seen from the above, text similarity calculation based on the TextRank algorithm and the VSM mainly uses the same words appearing in the texts as the reference index. Although the positional relations between words are considered, the semantic associations between words are not considered in depth. In practice, a method that considers only word frequency and ignores the semantic information between words is inadequate: for example, given two articles that both describe natural language processing technology, it cannot recognize the similarity between an article describing information extraction technology and an article describing entity extraction technology.
The text similarity calculation method in the prior art has low accuracy and recall rate and cannot meet the requirements of practical application.
Disclosure of Invention
The embodiment of the invention provides a text similarity determination method, a text similarity determination device, text similarity determination equipment and a storage medium, and aims to improve the accuracy and the recall rate of a text similarity calculation method.
In a first aspect, an embodiment of the present invention provides a method for determining text similarity, including:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
In a second aspect, an embodiment of the present invention further provides a text similarity determining apparatus, including:
the text preprocessing module is used for performing text preprocessing on two preset texts to be subjected to text similarity comparison;
the keyword extraction module is used for respectively extracting keywords of the two pre-set texts after text preprocessing;
the text vector similarity determining module is used for representing the two preset texts by a vector space model according to the vocabulary weight of the keyword and determining the text vector similarity;
the similarity matrix determining module is used for carrying out semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;
the text semantic similarity determining module is used for extracting word similarity values of the two preset texts according to the similarity matrix and determining text semantic similarity;
and the mixed text similarity determining module is used for determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method provided by any of the embodiments of the invention.
In a fourth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the text similarity determination method according to any of the embodiments of the present invention.
The embodiment of the invention calculates the text similarity by fusing the vector space model and the word semantics, solves the problem of low accuracy and recall rate of the text similarity calculation method, and realizes the effect of improving the accuracy and recall rate of the text similarity calculation method.
Drawings
Fig. 1 is a flowchart of a text similarity determining method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a text similarity determination method in the second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a text similarity determination apparatus in a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a text similarity determining method according to an embodiment of the present invention, where the present embodiment is applicable to a case of calculating text similarity of a preset text, and the method may be executed by a text similarity determining apparatus, where the apparatus may be implemented by hardware and/or software, and the method specifically includes the following steps:
The preset text is a Chinese text whose sentences consist of words and phrases, so text preprocessing is required. The preprocessing comprises word segmentation, part-of-speech tagging and stop-word removal on the preset text. Only words of specified parts of speech, such as nouns, verbs and adjectives, are retained.
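The filtering part of this preprocessing can be sketched as follows. This is a minimal illustration, assuming the text has already been segmented and part-of-speech tagged by an external tool (the source does not name one); the function name `filter_tokens`, the tag letters in `KEEP_POS` and the contents of `STOP_WORDS` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag set: "n" noun, "v" verb, "a" adjective (assumed, not from the source).
KEEP_POS: Set[str] = {"n", "v", "a"}
# Hypothetical stop-word list, for illustration only.
STOP_WORDS: Set[str] = {"的", "了", "是"}


def filter_tokens(tagged: Iterable[Tuple[str, str]],
                  keep_pos: Set[str] = KEEP_POS,
                  stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop stop words and keep only words whose POS tag is in keep_pos."""
    return [word for word, pos in tagged
            if word not in stop_words and pos in keep_pos]


# Already segmented and tagged input: (word, part-of-speech tag) pairs.
tagged_text = [("文本", "n"), ("的", "u"), ("相似", "a"), ("是", "v"), ("计算", "v")]
print(filter_tokens(tagged_text))
```

Keeping the filter separate from segmentation and tagging means any segmenter that yields (word, tag) pairs can feed it.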
For the preset texts that have completed text preprocessing, their respective keywords need to be extracted so that the subsequent text similarity calculation can be performed on the keywords. Optionally, the keywords of the two preprocessed preset texts are extracted separately with the TextRank algorithm.
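A minimal sketch of TextRank-style keyword extraction is given below for orientation. It is not the patent's implementation: the co-occurrence window, the damping factor of 0.85, the fixed iteration count and the name `textrank_keywords` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def textrank_keywords(words: List[str], top_k: int = 20, window: int = 5,
                      damping: float = 0.85, iterations: int = 50
                      ) -> List[Tuple[str, float]]:
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Undirected weighted graph: each word is linked to the (window - 1)
    # words that follow it in the preprocessed word sequence.
    graph: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                graph[w][u] += 1.0
                graph[u][w] += 1.0
    nodes = sorted(graph)
    if not nodes:
        return []
    score = {n: 1.0 for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbour m passes on its score in proportion to edge weight.
            rank = sum(graph[m][n] / out_weight[m] * score[m] for m in graph[n])
            new_score[n] = (1 - damping) + damping * rank
        score = new_score
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


demo = textrank_keywords(["a", "b", "a", "c", "a", "d"], top_k=2, window=2)
print(demo)
```

The returned scores can serve as the keyword weights used when the text vectors are built.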
The vocabulary weights of the keywords of each preset text are obtained with the TextRank algorithm, and the feature word weights corresponding to each preset text are calculated. A text vector is then constructed for each preset text with the vector space model, and the text vector similarity of the two preset texts is calculated from the two text vectors.
Step 140, performing semantic similarity calculation on words of the two preset texts based on a preset semantic network to obtain a similarity matrix;
The preset semantic network may be HowNet, a common-sense knowledge base that takes the concepts represented by Chinese and English words as its description objects and takes revealing the relations between concepts, and between the attributes of concepts, as its basic content. HowNet follows the idea of reductionism and holds that the meaning of a word or word sense can be described by smaller semantic units. Such a semantic unit is called a "sememe"; as the name implies, it is atomic semantics, the most basic and smallest semantic unit that cannot be divided further. Through continuous annotation, HowNet has gradually built a fine-grained sememe system, and on the basis of this system it has accumulated semantic annotations for hundreds of thousands of words and word senses. Complex semantic relations such as host, modifier and belong are marked between the sememes, so that the semantic information of word senses can be represented accurately. HowNet defines two main notions: a "concept" is an abstract summary of the meaning of a word, and a single word may correspond to one or more concepts depending on its context; a "sememe" is the smallest basic unit of meaning used to characterize a concept. The sememes form a tree hierarchy, and the similarity between sememes is calculated on the basis of this hierarchy. The keywords of each preset text are taken as the feature items of that text, and the similarity matrix is constructed from the similarity between each feature item of one preset text and each feature item of the other.
Step 160, determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
Whether the two preset texts are similar can be judged from their mixed text similarity: the mixed text similarity is compared with a preset threshold, and the two preset texts are considered similar when it is greater than the threshold and dissimilar otherwise.
According to the technical scheme, the text similarity is calculated by fusing the vector space model and the word semantics, the problem that the accuracy and the recall rate of the text similarity calculation method are not high is solved, and the effect of improving the accuracy and the recall rate of the text similarity calculation method is achieved.
Example two
Fig. 2 is a flowchart of a text similarity determining method according to a second embodiment of the present invention, where the technical solution of this embodiment is further refined on the basis of the above technical solution, and specifically includes:
The TextRank algorithm is used to extract keywords from the two preset texts; in this embodiment, for example, 20 keywords are retained for the two preset texts.
where $f_{1i}$ is the $i$-th feature word of the first text, $m$ is the number of keywords of the first text, $\omega_{1j}$ is the weight value of the keyword corresponding to $f_{1i}$ in the set of keyword weight values of the first text, and $w_{1i}$ is the weight value of $f_{1i}$; the text vector of the first text is then expressed as $T_1 = [w_{11}, w_{12}, \ldots, w_{1n}]$;
where $f_{2i}$ is the $i$-th feature word of the second text, $m$ is the number of keywords of the second text, $\omega_{2j}$ is the weight value of the keyword corresponding to $f_{2i}$ in the set of keyword weight values of the second text, and $w_{2i}$ is the weight value of $f_{2i}$; the text vector of the second text is then expressed as $T_2 = [w_{21}, w_{22}, \ldots, w_{2n}]$.
where $T_1$ and $T_2$ are the first text and the second text, respectively; $n$ denotes the dimension of the feature word vector, i.e. the number of elements in the union of the keyword sets of the two preset texts; and $w_{1i}$ and $w_{2i}$ are the feature word weights.
where $S_1$ and $S_2$ denote two sememes; the depth terms are the depths of the two sememes $S_1$ and $S_2$ in the hierarchical tree in which each is located; $\mathrm{distance}(S_1, S_2)$ is the path length between the two sememes in the hierarchical tree; and $\alpha$ is a variable parameter, set to 0.5 in this embodiment;
The similarity of content-word (real-word) concepts is calculated as follows:
where $C_1$ and $C_2$ denote two concepts; $\beta_i$ ($1 \le i \le 4$) are adjustable parameters satisfying $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$ and $\beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4$; and, according to the type of sememe, $\mathrm{Sim}_1(S_1, S_2)$ is the similarity of the first independent sememe descriptions, $\mathrm{Sim}_2(S_1, S_2)$ is the similarity of the other independent sememe descriptions, $\mathrm{Sim}_3(S_1, S_2)$ is the similarity of the relational sememe descriptions, and $\mathrm{Sim}_4(S_1, S_2)$ is the similarity of the symbolic sememe descriptions;
$\mathrm{Sim}(W_1, W_2) = \max_{i,j} \mathrm{Sim}(C_{1i}, C_{2j})$,
where $i = 1, 2, \ldots, l$ and $j = 1, 2, \ldots, k$;
where $\mathrm{Sim}(W_{1x}, W_{2y})$ is the similarity value between the $x$-th feature item of the first text and the $y$-th feature item of the second text, and $m$ is the number of keywords extracted from the texts.
where $m$ is the number of keywords extracted from the text, and $T_1$ and $T_2$ are the first text and the second text, respectively.
$S = \gamma \cdot S_{VSM}(T_1, T_2) + (1 - \gamma) \cdot S_{HowNet}(T_1, T_2)$,
where $\gamma$ is a variable parameter with $0 \le \gamma \le 1$.
The text similarity determination method provided by this embodiment fuses a vector space model with word semantics and considers both the positional relations of keywords in the text and the semantic relations among words. It extracts text keywords with the TextRank algorithm and calculates the semantic relations among the keywords by combining the vector space model with the HowNet model. Compared with general text similarity calculation methods, it places more weight on the semantic correlation between texts and takes the differences among semantic correlation calculation methods into account, rather than simply considering the overlap of high-frequency words or calculating semantic correlation in a single way, and it can therefore meet the requirements of practical application well.
Example three
Fig. 3 is a schematic structural diagram of a text similarity determining apparatus according to a third embodiment of the present invention, where the text similarity determining apparatus includes:
the text preprocessing module 310 is configured to perform text preprocessing on two preset texts to be subjected to text similarity comparison;
the keyword extraction module 320 is configured to extract keywords of two preset texts after text preprocessing respectively;
the text vector similarity determining module 330 is configured to represent the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determine a text vector similarity;
the similarity matrix determination module 340 is configured to perform semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix;
the text semantic similarity determining module 350 is configured to extract word similarity values of two preset texts according to the similarity matrix, and determine text semantic similarity;
and a mixed text similarity determining module 360, configured to determine a mixed text similarity between the two preset texts according to the text vector similarity and the text semantic similarity.
According to the technical scheme, the text similarity is calculated by fusing the vector space model and the word semantics, the problem that the accuracy and the recall rate of the text similarity calculation method are not high is solved, and the effect of improving the accuracy and the recall rate of the text similarity calculation method is achieved.
Optionally, the text preprocessing module 310 is specifically configured to:
performing word segmentation, part-of-speech tagging and stop-word removal on the preset text.
Optionally, the keyword extraction module 320 is specifically configured to:
extracting, with the TextRank algorithm, the keywords of the two preset texts after text preprocessing, respectively.
Optionally, the text vector similarity determining module 330 is specifically configured to:
taking the union of the keyword sets of the two preset texts as the feature words of the vector space model, where the feature words are expressed as $F = \{f_1, f_2, \ldots, f_n\}$, the keyword sets of the two preset texts are each represented separately, and $n$ is the number of feature words;
for the feature words $F_1$ of the first of the two preset texts, the weights are calculated as follows:
where $f_{1i}$ is the $i$-th feature word of the first text, $m$ is the number of keywords of the first text, $\omega_{1j}$ is the weight value of the keyword corresponding to $f_{1i}$ in the set of keyword weight values of the first text, and $w_{1i}$ is the weight value of $f_{1i}$; the text vector of the first text is then expressed as $T_1 = [w_{11}, w_{12}, \ldots, w_{1n}]$;
for the feature words $F_2$ of the second of the two preset texts, the weights are calculated as follows:
where $f_{2i}$ is the $i$-th feature word of the second text, $m$ is the number of keywords of the second text, $\omega_{2j}$ is the weight value of the keyword corresponding to $f_{2i}$ in the set of keyword weight values of the second text, and $w_{2i}$ is the weight value of $f_{2i}$; the text vector of the second text is then expressed as $T_2 = [w_{21}, w_{22}, \ldots, w_{2n}]$.
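The construction of the two text vectors over the union of the keyword sets can be sketched as follows. The weight formula itself did not survive in this text, so the sketch assumes that a feature word takes its keyword weight when it is a keyword of the text and 0 otherwise; that rule, the alphabetical feature ordering and the name `build_text_vectors` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def build_text_vectors(kw1: Dict[str, float], kw2: Dict[str, float]
                       ) -> Tuple[List[str], List[float], List[float]]:
    """Return the feature words F and the text vectors T1 and T2.

    kw1 and kw2 map each keyword of a text to its weight (for example a
    TextRank score).
    """
    # F is the union of the two keyword sets; sorting fixes the dimension order.
    features = sorted(set(kw1) | set(kw2))
    # ASSUMPTION: a feature word absent from a text's keyword set gets weight 0.
    t1 = [kw1.get(f, 0.0) for f in features]
    t2 = [kw2.get(f, 0.0) for f in features]
    return features, t1, t2


features, t1, t2 = build_text_vectors({"a": 0.5, "b": 0.3}, {"b": 0.4, "c": 0.6})
print(features, t1, t2)
```

Both vectors share the same dimension n, which is what allows them to be compared element by element afterwards.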
Optionally, the text vector similarity determining module 330 is further specifically configured to:
the text vector similarity is calculated as follows:
where $T_1$ and $T_2$ are the first text and the second text, respectively.
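The text vector similarity formula is missing from this text. The sketch below assumes it is the cosine similarity described in the Background, i.e. the dot product of the two text vectors divided by the product of their Euclidean norms; the name `cosine_similarity` and the zero-vector convention are illustrative choices, not taken from the source.

```python
import math
from typing import Sequence


def cosine_similarity(t1: Sequence[float], t2: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length text vectors."""
    if len(t1) != len(t2):
        raise ValueError("text vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(t1, t2))
    norm1 = math.sqrt(sum(a * a for a in t1))
    norm2 = math.sqrt(sum(b * b for b in t2))
    # ASSUMPTION: an all-zero vector is treated as having similarity 0.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


print(cosine_similarity([0.5, 0.3, 0.0], [0.0, 0.4, 0.6]))
```

Because keyword weights are non-negative, the result lies between 0 and 1, which keeps it on the same scale as the semantic similarity it is later combined with.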
Optionally, the similarity matrix determining module 340 is specifically configured to:
based on HowNet, the sememe similarity is calculated as follows:
where $S_1$ and $S_2$ denote two sememes; the depth terms are the depths of the two sememes $S_1$ and $S_2$ in the hierarchical tree in which each is located; $\mathrm{distance}(S_1, S_2)$ is the path length between the two sememes in the hierarchical tree; and $\alpha$ is a variable parameter;
the similarity of function-word concepts is calculated with the same formula as the sememe similarity;
the similarity of content-word (real-word) concepts is calculated as follows:
where $C_1$ and $C_2$ denote two concepts; $\beta_i$ ($1 \le i \le 4$) are adjustable parameters satisfying $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$ and $\beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4$; and, according to the type of sememe, $\mathrm{Sim}_1(S_1, S_2)$ is the similarity of the first independent sememe descriptions, $\mathrm{Sim}_2(S_1, S_2)$ is the similarity of the other independent sememe descriptions, $\mathrm{Sim}_3(S_1, S_2)$ is the similarity of the relational sememe descriptions, and $\mathrm{Sim}_4(S_1, S_2)$ is the similarity of the symbolic sememe descriptions;
suppose $W_1$ and $W_2$ are two Chinese words, where $W_1$ has $l$ concepts $C_{11}, C_{12}, \ldots, C_{1l}$ and $W_2$ has $k$ concepts $C_{21}, C_{22}, \ldots, C_{2k}$; then the similarity of $W_1$ and $W_2$ can be represented by the maximum similarity over all combinations of $C_{1i}$ and $C_{2j}$, i.e.:
$\mathrm{Sim}(W_1, W_2) = \max_{i,j} \mathrm{Sim}(C_{1i}, C_{2j})$,
where $i = 1, 2, \ldots, l$ and $j = 1, 2, \ldots, k$;
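The rule that word similarity is the maximum concept similarity over all concept pairs can be sketched as follows. The concept similarity function is passed in as a parameter because its formula is not reproduced here; `word_similarity`, the toy lookup table and the concept labels are hypothetical.

```python
from itertools import product
from typing import Callable, Hashable, Sequence


def word_similarity(concepts1: Sequence[Hashable],
                    concepts2: Sequence[Hashable],
                    concept_sim: Callable[[Hashable, Hashable], float]) -> float:
    """Sim(W1, W2): maximum of Sim(C1i, C2j) over all concept pairs."""
    # ASSUMPTION: a word with no known concept has similarity 0 to any word.
    if not concepts1 or not concepts2:
        return 0.0
    return max(concept_sim(c1, c2) for c1, c2 in product(concepts1, concepts2))


# Toy concept similarities, standing in for values computed from HowNet.
toy_table = {("c11", "c21"): 0.2, ("c11", "c22"): 0.9,
             ("c12", "c21"): 0.4, ("c12", "c22"): 0.1}
best = word_similarity(["c11", "c12"], ["c21", "c22"],
                       lambda a, b: toy_table[(a, b)])
print(best)
```

Taking the maximum means that two polysemous words count as similar whenever any pair of their senses is similar.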
the feature items of the two preset texts are the keywords $W$ of the two preset texts, respectively, and the similarity matrix $M$ is then as follows:
where $\mathrm{Sim}(W_{1x}, W_{2y})$ is the similarity value between the $x$-th feature item of the first text and the $y$-th feature item of the second text, and $m$ is the number of keywords extracted from the texts.
Optionally, the text semantic similarity determining module 350 is specifically configured to:
taking the largest element of the similarity matrix as $\mathrm{Max}(i)$, recording it, and deleting from the similarity matrix the row and the column to which $\mathrm{Max}(i)$ belongs; repeating this process until no element remains in the matrix; and then calculating the text semantic similarity $S_{HowNet}$ according to the following formula:
where $m$ is the number of keywords extracted from the text, and $T_1$ and $T_2$ are the first text and the second text, respectively.
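The extraction procedure described above can be sketched as a greedy loop. The final formula for the text semantic similarity is missing from this text, so the sketch assumes it is the mean of the recorded maxima over the $m$ keywords; that averaging step and both function names are assumptions.

```python
from typing import List, Optional


def greedy_max_values(matrix: List[List[float]]) -> List[float]:
    """Repeatedly record the largest remaining element, then drop its row and column."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []
    picked: List[float] = []
    while rows and cols:
        r, c = max(((r, c) for r in rows for c in cols),
                   key=lambda rc: matrix[rc[0]][rc[1]])
        picked.append(matrix[r][c])
        rows.remove(r)
        cols.remove(c)
    return picked


def text_semantic_similarity(matrix: List[List[float]],
                             m: Optional[int] = None) -> float:
    """ASSUMPTION: the semantic similarity is the sum of the recorded maxima divided by m."""
    if m is None:
        m = len(matrix)
    if m == 0:
        return 0.0
    return sum(greedy_max_values(matrix)) / m


M = [[0.9, 0.8, 0.1],
     [0.7, 0.6, 0.2],
     [0.3, 0.4, 0.5]]
print(greedy_max_values(M), text_semantic_similarity(M))
```

Deleting the row and column after each pick pairs every keyword with at most one keyword of the other text, so one strongly matching word cannot be counted several times.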
Optionally, the mixed text similarity determining module 360 is specifically configured to:
the calculation formula of the similarity S of the mixed text is as follows:
$S = \gamma \cdot S_{VSM}(T_1, T_2) + (1 - \gamma) \cdot S_{HowNet}(T_1, T_2)$,
where $\gamma$ is a variable parameter with $0 \le \gamma \le 1$.
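The weighted combination above, together with the threshold comparison described in Example one, can be sketched as follows. The default values of `gamma` and `threshold` are hypothetical; the source fixes neither.

```python
def hybrid_similarity(s_vsm: float, s_hownet: float, gamma: float = 0.5) -> float:
    """S = gamma * S_VSM + (1 - gamma) * S_HowNet, with 0 <= gamma <= 1."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * s_vsm + (1.0 - gamma) * s_hownet


def is_similar(score: float, threshold: float = 0.6) -> bool:
    """Two texts are judged similar when the mixed similarity exceeds the threshold."""
    # The default threshold of 0.6 is a placeholder, not a value from the source.
    return score > threshold


s = hybrid_similarity(0.8, 0.4, gamma=0.25)
print(s, is_similar(s))
```

Setting `gamma` to 1 reduces the measure to the pure vector space similarity, and setting it to 0 reduces it to the pure HowNet-based semantic similarity.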
The text similarity determining device provided by the embodiment of the invention can execute the text similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
Example four
Fig. 4 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention. As shown in Fig. 4, the apparatus includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the apparatus may be one or more, and one processor 410 is taken as an example in Fig. 4. The processor 410, the memory 420, the input device 430 and the output device 440 in the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 4.
The memory 420 serves as a computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text similarity determination method in the embodiment of the present invention (for example, the text preprocessing module 310, the keyword extraction module 320, the text vector similarity determination module 330, the similarity matrix determination module 340, the text semantic similarity determination module 350, and the mixed text similarity determination module 360 in the text similarity determination device). The processor 410 executes various functional applications of the device and data processing by executing software programs, instructions, and modules stored in the memory 420, that is, implements the text similarity determination method described above.
The memory 420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 440 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a text similarity determination method, and the method includes:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the text similarity determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the text similarity determining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A text similarity determination method is characterized by comprising the following steps:
performing text preprocessing on two preset texts to be subjected to text similarity comparison;
respectively extracting keywords of the two preprocessed texts;
expressing the two preset texts by a vector space model according to the vocabulary weight of the keyword, and determining the similarity of text vectors;
based on a preset semantic network, performing semantic similarity calculation on words of the two preset texts to obtain a similarity matrix;
extracting word similarity values of the two preset texts according to the similarity matrix, and determining text semantic similarity;
and determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity.
2. The method according to claim 1, wherein performing text preprocessing on the two preset texts to be subjected to text similarity comparison comprises:
performing word segmentation, part-of-speech tagging and stop-word removal on the preset texts.
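The three preprocessing operations of claim 2 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and `STOP_WORDS` and `POS_LEXICON` are hypothetical toy data rather than resources named in the source.

```python
# Hypothetical toy resources; a real system would use a proper segmenter,
# a trained part-of-speech tagger and a full stop-word list.
STOP_WORDS = {"the", "a", "of", "is", "and"}
POS_LEXICON = {"cat": "n", "dog": "n", "chases": "v"}


def segment(text):
    """Word segmentation (assumption: whitespace-delimited input)."""
    return text.lower().split()


def tag(words):
    """Part-of-speech tagging via a toy lexicon; unknown words get 'x'."""
    return [(word, POS_LEXICON.get(word, "x")) for word in words]


def remove_stop_words(tagged_words):
    """Stop-word removal on (word, tag) pairs."""
    return [(word, pos) for word, pos in tagged_words if word not in STOP_WORDS]


def preprocess(text):
    """Run segmentation, tagging and stop-word removal in that order."""
    return remove_stop_words(tag(segment(text)))


if __name__ == "__main__":
    print(preprocess("The dog chases a cat"))
```

The order of the three stages follows the wording of the claim; whether tagging results are later used to filter keyword candidates is not stated there and is left out of the sketch.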
3. The method according to claim 1, wherein respectively extracting keywords of the two preset texts after text preprocessing comprises:
extracting the keywords of each of the two preprocessed preset texts by a TextRank algorithm.
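As a rough illustration of keyword extraction with TextRank, the following Python sketch ranks words on an undirected co-occurrence graph. The window size, damping factor, iteration count and the function name `textrank_keywords` are illustrative assumptions, since the claim names only the algorithm and not its parameters.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50, top_k=5):
    """Return up to top_k (word, score) pairs ranked by a TextRank-style score.

    words: the preprocessed word sequence of one text.
    """
    # Build an undirected co-occurrence graph: two distinct words are linked
    # when they appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                neighbors[word].add(words[j])
                neighbors[words[j]].add(word)

    # Iterate the PageRank-style update on the unweighted graph.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            updated[word] = (1.0 - damping) + damping * rank
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = ["text", "similarity", "text", "vector", "text", "semantic"]
    for word, score in textrank_keywords(sample, window=1, top_k=3):
        print(word, round(score, 3))
```

The scores returned here could serve as the keyword weight values used when building the text vectors, although the source does not state how those weights are normalized.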
4. The method according to claim 1, wherein representing the two preset texts by a vector space model according to the vocabulary weights of the keywords comprises:
taking the union of the keyword sets of the two preset texts as the feature words of the vector space model, wherein the feature words are denoted $F = \{f_1, f_2, \ldots, f_n\}$ and $n$ is the number of feature words;
for the feature words $F_1$ of the first text of the two preset texts, the weights are determined from the keyword weight values of the first text,
wherein $f_{1i}$ is the $i$-th feature word of the first text, $m$ is the number of keywords of the first text, $\omega_{1j}$ is the weight value of the keyword corresponding to $f_{1i}$ in the keyword set of the first text, and $w_{1i}$ is the weight value of $f_{1i}$; the text vector of the first text is then expressed as $T_1 = [w_{11}, w_{12}, \ldots, w_{1n}]$;
for the feature words $F_2$ of the second text of the two preset texts, the weights are determined from the keyword weight values of the second text,
wherein $f_{2i}$ is the $i$-th feature word of the second text, $m$ is the number of keywords of the second text, $\omega_{2j}$ is the weight value of the keyword corresponding to $f_{2i}$ in the keyword set of the second text, and $w_{2i}$ is the weight value of $f_{2i}$; the text vector of the second text is then expressed as $T_2 = [w_{21}, w_{22}, \ldots, w_{2n}]$.
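The construction of the two text vectors can be sketched in Python as follows. Two points are assumptions rather than statements of the source: a feature word that is not a keyword of a given text is given weight 0, and the text vector similarity is computed as cosine similarity, a common choice for vector space models. The function names are hypothetical.

```python
import math


def build_text_vectors(keywords1, keywords2):
    """Build T1 and T2 over the union of two keyword sets.

    keywords1, keywords2: dicts mapping each keyword to its weight value.
    Returns (features, t1, t2), where features is the ordered feature list F.
    """
    features = sorted(set(keywords1) | set(keywords2))
    # Assumption: a feature word absent from a text's keywords has weight 0.
    t1 = [keywords1.get(feature, 0.0) for feature in features]
    t2 = [keywords2.get(feature, 0.0) for feature in features]
    return features, t1, t2


def cosine_similarity(t1, t2):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm1 = math.sqrt(sum(a * a for a in t1))
    norm2 = math.sqrt(sum(b * b for b in t2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    first = {"cat": 0.6, "dog": 0.4}
    second = {"dog": 0.5, "fish": 0.5}
    features, t1, t2 = build_text_vectors(first, second)
    print(features, t1, t2, cosine_similarity(t1, t2))
```

Sorting the union gives both vectors the same, reproducible feature order, which is all the comparison needs; any fixed order would do.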
6. The method according to claim 5, wherein performing semantic similarity calculation on the words of the two preset texts based on a preset semantic network to obtain a similarity matrix comprises:
based on HowNet, the similarity of two sememes is calculated from the following quantities:
$S_1$ and $S_2$ denote the two sememes; the depths are those of the hierarchical trees in which $S_1$ and $S_2$ are respectively located; $\mathrm{Distance}(S_1, S_2)$ is the path length between the two sememes in the hierarchical tree; and $\alpha$ is a variable parameter;
the similarity of function-word concepts is calculated in the same way as the similarity of sememes;
the similarity of content-word concepts is calculated from the following quantities:
$C_1$ and $C_2$ denote two concepts; $\beta_i$ ($1 \le i \le 4$) are adjustable parameters satisfying $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$ and $\beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4$; according to the type of sememe description, $\mathrm{Sim}_1(S_1, S_2)$ is the similarity of the first independent sememe descriptions, $\mathrm{Sim}_2(S_1, S_2)$ is the similarity of the other independent sememe descriptions, $\mathrm{Sim}_3(S_1, S_2)$ is the similarity of the relational sememe descriptions, and $\mathrm{Sim}_4(S_1, S_2)$ is the similarity of the symbolic sememe descriptions;
suppose $W_1$ and $W_2$ are two Chinese words, where $W_1$ has $l$ concepts $C_{11}, C_{12}, \ldots, C_{1l}$ and $W_2$ has $k$ concepts $C_{21}, C_{22}, \ldots, C_{2k}$; then the similarity of $W_1$ and $W_2$ can be represented by the maximum similarity over all combinations of $C_{1i}$ and $C_{2j}$, i.e.:
$\mathrm{Sim}(W_1, W_2) = \max_{i,j} \mathrm{Sim}(C_{1i}, C_{2j})$,
wherein $i = 1, 2, \ldots, l$ and $j = 1, 2, \ldots, k$;
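the feature items of the two preset texts are the respective keywords $W$ of the two preset texts, and the similarity matrix is then as follows:
$$\begin{bmatrix} \mathrm{Sim}(W_{11}, W_{21}) & \mathrm{Sim}(W_{11}, W_{22}) & \cdots & \mathrm{Sim}(W_{11}, W_{2m}) \\ \mathrm{Sim}(W_{12}, W_{21}) & \mathrm{Sim}(W_{12}, W_{22}) & \cdots & \mathrm{Sim}(W_{12}, W_{2m}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Sim}(W_{1m}, W_{21}) & \mathrm{Sim}(W_{1m}, W_{22}) & \cdots & \mathrm{Sim}(W_{1m}, W_{2m}) \end{bmatrix}$$
wherein $\mathrm{Sim}(W_{1x}, W_{2y})$ is the similarity value between the $x$-th feature item in the first text and the $y$-th feature item in the second text, and $m$ is the number of keywords extracted from each text.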
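The step from concept similarity to word similarity, and from word similarity to the similarity matrix, can be sketched in Python as below. The concept inventory `CONCEPTS`, the lookup table `CONCEPT_SIM` and the fallback value 0.1 are hypothetical toy data standing in for a HowNet-based concept similarity; only the maximum over concept pairs and the row/column layout of the matrix come from the claim.

```python
# Hypothetical toy data: concepts per word, and similarity of concept pairs.
CONCEPTS = {
    "bank": ["bank_finance", "bank_river"],
    "shore": ["shore_land"],
    "money": ["money_cash"],
}
CONCEPT_SIM = {
    ("bank_river", "shore_land"): 0.8,
    ("bank_finance", "money_cash"): 0.7,
}


def toy_concept_similarity(c1, c2):
    """Stand-in for a HowNet concept similarity: symmetric table lookup."""
    if c1 == c2:
        return 1.0
    return CONCEPT_SIM.get((c1, c2), CONCEPT_SIM.get((c2, c1), 0.1))


def word_similarity(w1, w2, concepts, concept_sim):
    """Sim(W1, W2): maximum similarity over all concept pairs of the two words."""
    return max(concept_sim(c1, c2) for c1 in concepts[w1] for c2 in concepts[w2])


def similarity_matrix(keywords1, keywords2, concepts, concept_sim):
    """Entry [x][y] is Sim(W1x, W2y) for the x-th and y-th keywords of each text."""
    return [
        [word_similarity(w1, w2, concepts, concept_sim) for w2 in keywords2]
        for w1 in keywords1
    ]


if __name__ == "__main__":
    matrix = similarity_matrix(
        ["bank", "money"], ["shore", "money"], CONCEPTS, toy_concept_similarity
    )
    for row in matrix:
        print(row)
```

Passing the concept similarity in as a function keeps the matrix construction independent of how sememe and concept similarities are actually computed.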
7. The method according to claim 6, wherein extracting word similarity values of the two preset texts according to the similarity matrix and determining the text semantic similarity comprises:
taking the largest element of the similarity matrix, recording it as Max(i), and deleting the row and the column to which Max(i) belongs from the similarity matrix; repeating this process until no element remains in the matrix; and calculating the text semantic similarity $S_{HowNet}$ from the recorded values Max(i),
wherein $m$ is the number of keywords extracted from the text, and $T_1$ and $T_2$ are the first text and the second text, respectively.
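The repeated extraction of the largest matrix element can be sketched in Python as follows. The final aggregation is an assumption: the sketch averages the recorded maxima over $m$, which fits the symbols defined in the claim but is not confirmed by the source text. The function name is hypothetical.

```python
def semantic_similarity(matrix):
    """Greedy row/column elimination over a similarity matrix.

    matrix: list of rows, matrix[x][y] = Sim(W1x, W2y).
    Assumption: the result is the mean of the recorded maxima over m rows.
    """
    m = len(matrix)
    if m == 0:
        return 0.0
    # Track the remaining rows and columns instead of physically deleting
    # them; the effect is the same as removing the row and column of Max(i).
    rows = set(range(m))
    cols = set(range(len(matrix[0])))
    maxima = []
    while rows and cols:
        value, r, c = max((matrix[r][c], r, c) for r in rows for c in cols)
        maxima.append(value)
        rows.remove(r)
        cols.remove(c)
    return sum(maxima) / m


if __name__ == "__main__":
    example = [
        [0.9, 0.2],
        [0.8, 0.4],
    ]
    print(semantic_similarity(example))
```

In the example, 0.9 is recorded first; this removes the first row and first column, so 0.8 can no longer be chosen and 0.4 is recorded second. Each keyword is therefore matched at most once.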
8. The method of claim 7, wherein determining the mixed text similarity of the two preset texts according to the text vector similarity and the text semantic similarity comprises:
the mixed text similarity $S$ is calculated as follows:
$S = \gamma \cdot S_{VSM}(T_1, T_2) + (1 - \gamma) \cdot S_{HowNet}(T_1, T_2)$,
wherein $\gamma$ is a variable parameter and $0 \le \gamma \le 1$.
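The weighted combination of claim 8 is simple enough to state directly in Python. The function name `mixed_similarity` and the default value of `gamma` are illustrative assumptions; the source only requires that the parameter lie between 0 and 1.

```python
def mixed_similarity(s_vsm, s_hownet, gamma=0.5):
    """Combine vector similarity and semantic similarity.

    S = gamma * S_VSM + (1 - gamma) * S_HowNet, with 0 <= gamma <= 1.
    The default gamma of 0.5 is an arbitrary illustrative choice.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * s_vsm + (1.0 - gamma) * s_hownet


if __name__ == "__main__":
    # gamma = 1 keeps only the vector similarity; gamma = 0 keeps only
    # the semantic similarity.
    print(mixed_similarity(0.4, 0.8, gamma=0.25))
```

Because both inputs lie in the same range, the result stays within that range for any admissible `gamma`.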
9. An apparatus, characterized in that the apparatus comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text similarity determination method of any one of claims 1-8.
10. A storage medium containing computer-executable instructions for performing the text similarity determination method according to any one of claims 1 to 8 when executed by a computer processor.
Priority Applications (1)
Application Number | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202010559737.2A | CN111737997A (en) | 2020-06-18 | 2020-06-18 | Text similarity determination method, text similarity determination equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111737997A (en) | 2020-10-02 |
Family
ID=72649742
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010559737.2A | Pending | CN111737997A (en) | 2020-06-18 | 2020-06-18 | Text similarity determination method, text similarity determination equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111737997A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017084267A1 (en) * | 2015-11-18 | 2017-05-26 | 乐视控股(北京)有限公司 | Method and device for keyphrase extraction |
CN108536677A (en) * | 2018-04-09 | 2018-09-14 | 北京信息科技大学 | A kind of patent text similarity calculating method |
JP2019086412A (en) * | 2017-11-07 | 2019-06-06 | 大日本印刷株式会社 | Inspection system, inspection method and method for manufacturing inspection system |
Non-Patent Citations (5)
Title |
---|
ANSHIKA PAL: "Effective Focused Crawling Based on Content and Link Structure Analysis", 《(IJCSIS) INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND INFORMATION SECURITY》, vol. 2, no. 1, pages 1 - 5 * |
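冯高磊: "Text similarity algorithm based on a vector space model combined with semantics", 《现代电子技术》 (Modern Electronics Technique), no. 11, pages 157 - 161 * |
冯高磊;高嵩峰;: "Text similarity algorithm based on a vector space model combined with semantics", 现代电子技术 (Modern Electronics Technique), no. 11, pages 157 - 161 * |
李周平: "Practice of Network Data Crawling and Analysis", vol. 1, 北京理工大学出版社 (Beijing Institute of Technology Press), pages: 173 - 174 * |
黄承慧: "A text similarity measurement method combining term semantic information and the TF-IDF method", 计算机学报 (Chinese Journal of Computers), no. 05, pages 98 - 106 * |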
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364620A (en) * | 2020-11-06 | 2021-02-12 | 中国平安人寿保险股份有限公司 | Text similarity judgment method and device and computer equipment |
CN112364620B (en) * | 2020-11-06 | 2024-04-05 | 中国平安人寿保险股份有限公司 | Text similarity judging method and device and computer equipment |
CN112214558A (en) * | 2020-11-18 | 2021-01-12 | 国家计算机网络与信息安全管理中心 | Theme correlation degree judging method and device |
CN112214558B (en) * | 2020-11-18 | 2023-08-15 | 国家计算机网络与信息安全管理中心 | Theme relevance discriminating method and device |
CN112364947A (en) * | 2021-01-14 | 2021-02-12 | 北京崔玉涛儿童健康管理中心有限公司 | Text similarity calculation method and device |
CN112364947B (en) * | 2021-01-14 | 2021-06-29 | 北京育学园健康管理中心有限公司 | Text similarity calculation method and device |
CN113837772A (en) * | 2021-09-24 | 2021-12-24 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for auditing marketing information |
CN115688771A (en) * | 2023-01-05 | 2023-02-03 | 京华信息科技股份有限公司 | Document content comparison performance improving method and system |
CN115688771B (en) * | 2023-01-05 | 2023-03-21 | 京华信息科技股份有限公司 | Document content comparison performance improving method and system |
CN117273015A (en) * | 2023-11-22 | 2023-12-22 | 湖南省水运建设投资集团有限公司 | Electronic file archiving and classifying method for semantic analysis |
CN117273015B (en) * | 2023-11-22 | 2024-02-13 | 湖南省水运建设投资集团有限公司 | Electronic file archiving and classifying method for semantic analysis |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11775760B2 (en) | Man-machine conversation method, electronic device, and computer-readable medium | |
US11017178B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN107168954B (en) | Text keyword generation method and device, electronic equipment and readable storage medium | |
CN111737997A (en) | Text similarity determination method, text similarity determination equipment and storage medium | |
CN105095204B (en) | The acquisition methods and device of synonym | |
WO2021068339A1 (en) | Text classification method and device, and computer readable storage medium | |
Nayak et al. | Survey on pre-processing techniques for text mining | |
CN106960030B (en) | Information pushing method and device based on artificial intelligence | |
CN111190997B (en) | Question-answering system implementation method using neural network and machine learning ordering algorithm | |
CN111797214A (en) | FAQ database-based problem screening method and device, computer equipment and medium | |
US10482146B2 (en) | Systems and methods for automatic customization of content filtering | |
KR20170004154A (en) | Method and system for automatically summarizing documents to images and providing the image-based contents | |
CN111539197A (en) | Text matching method and device, computer system and readable storage medium | |
US20200073890A1 (en) | Intelligent search platforms | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN110879834A (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN114357117A (en) | Transaction information query method and device, computer equipment and storage medium | |
CN110717038A (en) | Object classification method and device | |
Kokane et al. | Word sense disambiguation: a supervised semantic similarity based complex network approach | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
CN117076636A (en) | Information query method, system and equipment for intelligent customer service | |
CN106951511A (en) | A kind of Text Clustering Method and device | |
Ronghui et al. | Application of Improved Convolutional Neural Network in Text Classification. | |
CN111985217B (en) | Keyword extraction method, computing device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |