CN109684629B - Method and device for calculating similarity between texts, storage medium and electronic equipment - Google Patents
- Publication number
- CN109684629B (application CN201811420108.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- participle
- word
- word segmentation
- information transfer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The disclosure relates to a method and a device for calculating similarity between texts, a storage medium and an electronic device. The method comprises the following steps: performing word segmentation and stop word filtering processing on a first text and a second text whose similarity is to be calculated, and obtaining a first word segmentation set which does not contain repeated participles and corresponds to the first text and a second word segmentation set which does not contain repeated participles and corresponds to the second text; determining a semantic information transfer cost between the first text and the second text according to the information quantity that each participle in the first participle set and the second participle set carries in its text and the word embedding vector corresponding to each participle; and determining the similarity between the first text and the second text according to the semantic information transfer cost. In this way, the semantic influence that each word and its context have on the text is fully considered, and the similarity is calculated on a basis closer to the semantics of the text, so that the calculated similarity is more accurate.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for calculating similarity between texts, a storage medium, and an electronic device.
Background
In the prior art, a structured processing mode is generally adopted when calculating the similarity between texts. The text is first structured, for example, two pieces of text are converted into vectors, such as a word-based one-hot representation or a vectorized representation based on accumulated word embeddings. Similarity is then calculated on the structured result, for example as the Euclidean distance between the text vectors or the cosine of the angle between them. However, the semantics of a word in a text are influenced by the context of the word's position in the text, which in turn influences the semantic conversion between words in the texts. Such similarity calculation methods focus on the text as a whole and do not consider the semantic correspondence between the words of the two texts, so the semantic information of each text is incomplete during calculation, and the calculated text similarity is not accurate enough.
Disclosure of Invention
The purpose of the present disclosure is to provide a method, an apparatus, a storage medium, and an electronic device for calculating inter-text similarity, so as to calculate inter-text similarity more accurately.
In order to achieve the above object, according to a first aspect of the present disclosure, there is provided an inter-text similarity calculation method including:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
and determining the similarity between the first text and the second text according to the semantic information transfer cost.
Optionally, the determining, according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle, a semantic information transfer cost between the first text and the second text includes:
respectively determining information transfer amounts corresponding to the word segments in the first word segment set when the word segments are transferred to the word segments in the second word segment set according to the information amount carried by each word segment in the text;
according to the word embedding vector corresponding to each participle, respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set;
and determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i, \qquad \sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the i-th participle in the first participle set and the j-th participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set;
the determining of the semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount includes:
calculating the semantic information transfer cost according to the following formula (2):
$$G = \min \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the i-th participle in the first participle set and the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the determining, according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle, a semantic information transfer cost between the first text and the second text, includes:
calculating the semantic information transfer cost according to the following formula (3):
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set, $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
Optionally, the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$SIM = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to a second aspect of the present disclosure, there is provided an inter-text similarity calculation apparatus including:
the processing module is used for performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text;
the first determining module is used for determining semantic information transfer cost between the first text and the second text according to the information quantity carried by each participle in the first participle set and the second participle set in the text where the participle is located and a word embedding vector corresponding to each participle;
and the second determining module is used for determining the similarity between the first text and the second text according to the semantic information transfer cost.
Optionally, the first determining module includes:
the first determining submodule is used for respectively determining corresponding information transfer quantity when each participle in the first participle set is transferred to each participle in the second participle set according to the information quantity carried by each participle in the text;
the second determining submodule is used for respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
and the third determining submodule is used for determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i, \qquad \sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the i-th participle in the first participle set and the j-th participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set;
the third determining submodule is used for calculating the semantic information transfer cost according to the following formula (2):
$$G = \min \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the i-th participle in the first participle set and the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the first determining module is configured to calculate the semantic information transfer cost according to the following formula (3):
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set, $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
Optionally, the second determining module is configured to determine the similarity SIM according to the following formula (4):
$$SIM = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method provided by the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
According to the above technical scheme, for two texts whose similarity is to be calculated, word segmentation and stop word filtering processing is first carried out, and two word segmentation sets which contain no repeated participles and respectively correspond to the two texts are obtained from the processing results. The semantic information transfer cost between the two texts is then determined according to the information quantity that each participle carries in its text and the word embedding vector corresponding to each participle, and the similarity between the two texts is determined according to the semantic information transfer cost. The semantic information transfer cost between texts thus serves as the standard for measuring their similarity. The semantic influence that each word and its context have on the text is fully considered, the similarity is calculated on a basis closer to the semantics of the text, and the calculated similarity is more accurate.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flowchart of an inter-text similarity calculation method provided according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an exemplary implementation of the step of determining semantic information transfer cost between the first text and the second text according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle in the inter-text similarity calculation method provided by the present disclosure;
fig. 3 is a block diagram of an inter-text similarity calculation apparatus provided according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of an inter-text similarity calculation method provided according to an embodiment of the present disclosure. As shown in fig. 1, the method may include the following steps.
In step 11, for the first text and the second text with similarity to be calculated, performing word segmentation and stop word filtering processing, and obtaining a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text, which do not contain repeated words, according to the processing result.
For a first text and a second text whose similarity is to be calculated, word segmentation processing and stop word filtering processing are first carried out on each of them. Stop words carry no practical meaning and contribute nothing to the actual similarity between texts, so, to prevent them from influencing the subsequent similarity calculation, stop word filtering is performed after word segmentation to delete them. For example, suppose that after word segmentation the participles corresponding to the first text D1 are v1, v7, v1, v8, v2, v9, and v3, and the participles corresponding to the second text D2 are v4, v5, v7, v4, and v6, where v7, v8, and v9 are stop words. After stop word filtering, the processing result corresponding to the first text D1 is v1, v1, v2, and v3, and the processing result corresponding to the second text D2 is v4, v5, v4, and v6.
And after the word segmentation and stop word filtering processing, obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which corresponds to the second text according to the processing result. For example, as for the processing results of D1 and D2 in the above example, the first set of words corresponding to the first text D1 is { v1, v2, v3}, and the second set of words corresponding to the second text D2 is { v4, v5, v6}.
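The preprocessing in step 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` is hypothetical, and whitespace splitting stands in for a real word segmenter, which the disclosure does not specify. The occurrence counts are kept alongside the duplicate-free set because later steps use them.

```python
from collections import Counter

def preprocess(text, stop_words, segment=str.split):
    """Segment a text, drop stop words, return (duplicate-free participles, counts).

    `segment` is an assumed placeholder; a real system would plug in a
    language-specific word segmenter here.
    """
    tokens = [tok for tok in segment(text) if tok not in stop_words]
    counts = Counter(tokens)
    # Counter keeps first-occurrence order, so its keys form the duplicate-free set.
    unique = list(counts)
    return unique, counts

stop_words = {"v7", "v8", "v9"}
set1, counts1 = preprocess("v1 v7 v1 v8 v2 v9 v3", stop_words)
set2, counts2 = preprocess("v4 v5 v7 v4 v6", stop_words)
print(set1, set2)
```

Applied to the D1 and D2 example, the two returned lists correspond to the first and second participle sets.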
In step 12, semantic information transfer cost between the first text and the second text is determined according to the information amount carried in the text by each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle.
Calculating the similarity between texts can be regarded as calculating the similarity of their semantics. From another perspective, it suffices to determine how much cost must be paid to convert the semantic information contained in one text into the semantic information contained in the other text, that is, the cost of transferring the semantic information of one text to that of the other, referred to as the semantic information transfer cost. The semantic information transfer cost between texts can be determined from the information quantity that each participle carries in its text and the word embedding vector corresponding to each participle.
First, the information quantity that each participle in the first participle set and the second participle set carries in its text is determined. The information quantity a participle carries in the text can be reflected by how the participle compares, in the text, with all the participles, where all the participles refers to all the participles contained in the participle set to which the participle belongs. For example, the information quantity may be reflected by the proportion of the participle's occurrences in the text relative to the occurrences of all those participles. The larger the proportion, the more information the participle carries in the text; the smaller the proportion, the less information it carries.
The word embedding vector corresponding to a participle may be determined by a word embedding algorithm, which maps each participle in the first participle set and the second participle set into the same word embedding space. Illustratively, the word embedding model used by the word embedding algorithm may be an existing word embedding model. As another example, it may be a new, targeted word embedding model obtained by training an existing word embedding model on existing data, for example a model obtained by training Word2vec.
In step 13, the similarity between the first text and the second text is determined according to the semantic information transfer cost.
As described above, the semantic information transfer cost between the first text and the second text may reflect the cost required for transferring the semantic information of the first text to the semantic information of the second text. If the semantic information transfer cost between the first text and the second text is higher, the similarity between the first text and the second text is lower; and if the semantic information transfer cost between the first text and the second text is smaller, the similarity between the first text and the second text is higher. Therefore, according to the semantic information transfer cost between the first text and the second text determined in step 12, the similarity between the first text and the second text can be further determined. Illustratively, a function for obtaining the similarity between texts may be preset, so that the similarity between texts and the semantic information transfer cost between texts satisfy a negative correlation relationship.
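One function with the required negative correlation is the exponential mapping of formula (4), SIM = e^(-G). The sketch below is a minimal illustration; the function name `similarity_from_cost` is an assumption, and the cost values 0.48 and 0.78 are used only as sample inputs.

```python
import math

def similarity_from_cost(cost):
    """Map a non-negative transfer cost G to a similarity in (0, 1]."""
    return math.exp(-cost)

# A lower transfer cost yields a higher similarity.
low_cost_sim = similarity_from_cost(0.48)
high_cost_sim = similarity_from_cost(0.78)
print(low_cost_sim > high_cost_sim)
```

A cost of zero maps to a similarity of 1, and the similarity approaches 0 as the cost grows.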
According to this scheme, for two texts whose similarity is to be calculated, word segmentation and stop word filtering processing is first carried out, and two word segmentation sets which contain no repeated participles and respectively correspond to the two texts are obtained from the processing results. The semantic information transfer cost between the two texts is then determined according to the information quantity that each participle carries in its text and the word embedding vector corresponding to each participle, and the similarity between the two texts is determined according to the semantic information transfer cost. The semantic information transfer cost between texts thus serves as the standard for measuring their similarity. The semantic influence that each word and its context have on the text is fully considered, the similarity is calculated on a basis closer to the semantics of the text, and the calculated similarity is more accurate.
In order to make those skilled in the art more understand the technical solutions provided by the embodiments of the present invention, the following detailed descriptions are provided for the corresponding steps in the foregoing.
First, a method for determining the amount of information carried in a text where a word is located is exemplified. In one possible implementation, the amount of information carried by the segmented word in the text may be determined by the number of occurrences of the segmented word in the text.
Illustratively, the information amount $wp_k$ carried in the text by the k-th participle in a participle set w can be determined by the following formula:
$$wp_k = \frac{c_k}{\sum_{t=1}^{l} c_t}$$
wherein $c_k$ is the number of times the k-th participle appears in the text, l is the total number of participles contained in the participle set w, and $\sum_{t=1}^{l} c_t$ represents the sum of the numbers of times the participles in the participle set w appear in the text.
With this method, the information quantity a participle carries in the text is determined by the number of occurrences of the participle in the text and the sum of the numbers of occurrences in the text of all participles in the participle set to which it belongs. Because the information quantity a participle carries in the text is related to how often it appears there, this calculation determines the information quantity simply and conveniently.
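As a small worked illustration of the occurrence-based information amount, the sketch below divides each participle's count by the total count over its participle set. The function name `information_amounts` is hypothetical, and the input assumes that v1 occurs twice after stop word filtering, as in the segmentation of D1 given earlier.

```python
from collections import Counter

def information_amounts(tokens):
    """Return {participle: occurrences / total occurrences} for filtered tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Filtered participles of D1, assuming v1 appears twice.
wp = information_amounts(["v1", "v1", "v2", "v3"])
print(wp)
```

The resulting amounts always sum to 1, which is what allows them to be matched against the amounts of the other text in the following steps.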
Next, the determination in step 12 of the semantic information transfer cost between the first text and the second text, according to the information amount that each participle in the first participle set and the second participle set carries in its text and the word embedding vector corresponding to each participle, is described by way of example.
In one possible embodiment, step 12 may include the following steps, as shown in FIG. 2.
In step 21, according to the information amount carried by each participle in the text, the information transfer amount corresponding to each participle in the first participle set when each participle is transferred to each participle in the second participle set is respectively determined.
Exemplarily, the information transfer amount $T_{ij}$ when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set may satisfy the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i, \qquad \sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set.
Since the information amount of a participle is limited, once the information amount a participle carries in its text is known, the total information transferred from that participle to the other text should equal the information amount it carries in its text, which is the condition $\sum_{j=1}^{n} T_{ij} = wp_i$. Similarly, the total information transferred to a participle from the other text, i.e. the amount of information that participle can receive, should equal the information amount it carries in its own text, which is the condition $\sum_{i=1}^{m} T_{ij} = wq_j$.
According to the above conditions, at least one group of information transfer amounts meeting the conditions can be determined, that is, the information transfer amounts corresponding to the transfer of each participle in the first participle set to each participle in the second participle set. For example, if the total number of participles in the first participle set is 3 and the total number of participles in the second participle set is 2, then a group of information transfer amounts satisfying the above conditions includes $T_{11}$, $T_{21}$, $T_{31}$, $T_{12}$, $T_{22}$ and $T_{32}$.
In step 22, according to the word embedding vector corresponding to each participle, the semantic transfer amount between each participle in the first participle set and each participle in the second participle set is respectively determined.
Through a word embedding algorithm, word embedding vectors of each participle in the same word embedding space can be determined, and the spatial distance between two word embedding vectors in the space can reflect the semantic similarity of words corresponding to the two word embedding vectors. The closer the spatial distance of two word embedding vectors is, the more similar the semantics of the corresponding words of the two word embedding vectors can be considered.
For example, the spatial distance between two word embedding vectors may be taken as the semantic transfer amount between the words corresponding to the two vectors. Therefore, the semantic transfer amount $S_{ij}$ between the i-th participle in the first participle set and the j-th participle in the second participle set can be calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set.
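The pairwise semantic transfer amounts of formula (1) can be computed as a matrix of Euclidean distances. The sketch below is an illustration only: the function name `semantic_transfer_matrix` is hypothetical, and the two-dimensional vectors are made-up stand-ins for real word embedding vectors, which would normally come from a trained model.

```python
import math

def semantic_transfer_matrix(vectors_first, vectors_second):
    """S[i][j] = Euclidean distance between X_i (first set) and X_j (second set)."""
    return [[math.dist(x_i, x_j) for x_j in vectors_second]
            for x_i in vectors_first]

# Made-up embedding vectors, for illustration only.
X_first = [(0.0, 0.0), (3.0, 4.0)]
X_second = [(0.0, 0.0), (6.0, 8.0)]
S = semantic_transfer_matrix(X_first, X_second)
print(S)
```

Identical vectors give a semantic transfer amount of zero, and the amount grows as the vectors move apart in the embedding space.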
In step 23, semantic information transfer costs between the first text and the second text are determined according to the information transfer amount and the semantic transfer amount.
The semantic information transfer cost between the first text and the second text can be determined directly from the determined information transfer amounts and semantic transfer amounts. Illustratively, the semantic information transfer cost may be calculated according to the following formula (2):
$$G = \min \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the i-th participle in the first participle set and the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
As mentioned above, for the ith participle in the first participle set and the jth participle in the second participle set, since the word embedding vectors corresponding to the two participles are fixed, the semantic shift amount S is fixed ij The value is a constant value after the calculation of the formula (1). Aiming at the letter of the two participlesAmount of information transfer T ij Only the conditions mentioned above need to be satisfiedAndthat is, there may be a plurality of sets of information transfer amounts T satisfying the condition ij . For example, if the first participle set contains 3 participles, the second participle set contains 2 participles, and wp 1 Is 0.25,wp 2 Is 0.5,wp 3 Is 0.25,wq 1 Is 0.5,wq 2 0.5, the information transfer amount should satisfy the following condition:
$T_{11} + T_{12} = wp_1 = 0.25$;
$T_{21} + T_{22} = wp_2 = 0.5$;
$T_{31} + T_{32} = wp_3 = 0.25$;
$T_{11} + T_{21} + T_{31} = wq_1 = 0.5$;
$T_{12} + T_{22} + T_{32} = wq_2 = 0.5$.
Multiple sets of $T_{11}$, $T_{12}$, $T_{21}$, $T_{22}$, $T_{31}$, $T_{32}$ that meet these conditions can be obtained through calculation. For example, one qualifying set of information transfer amounts is: $T_{11} = 0.125$, $T_{12} = 0.125$, $T_{21} = 0.25$, $T_{22} = 0.25$, $T_{31} = 0.125$, $T_{32} = 0.125$. Another qualifying set is: $T_{11} = 0.05$, $T_{12} = 0.2$, $T_{21} = 0.25$, $T_{22} = 0.25$, $T_{31} = 0.2$, $T_{32} = 0.05$. Other qualifying sets also exist and are not listed one by one. Multiple different sets of information transfer amounts can therefore be selected: for example, all of the qualifying sets, or several sets chosen from among them. After the sets are selected, each is substituted into formula (2), yielding a value of G for each set, and the minimum of the calculated results is determined as the final semantic information transfer cost. To obtain a more accurate semantic information transfer cost, as many sets of information transfer amounts as possible should be used in the calculation.
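The two example sets above can be checked mechanically. The following sketch verifies that each row of a candidate set sums to the corresponding $wp_i$ and each column sums to the corresponding $wq_j$. The function name and the tolerance are assumptions made for illustration, and the third candidate is a deliberately invalid set added here to show a failing case.

```python
import math

def satisfies_conditions(T, wp, wq, tol=1e-9):
    """Return True if row i of T sums to wp[i] and column j of T sums to wq[j]."""
    # Reject candidates whose shape does not match the two weight lists.
    if len(T) != len(wp) or any(len(row) != len(wq) for row in T):
        return False
    rows_ok = all(math.isclose(sum(row), w, abs_tol=tol)
                  for row, w in zip(T, wp))
    cols_ok = all(math.isclose(sum(row[j] for row in T), w, abs_tol=tol)
                  for j, w in enumerate(wq))
    return rows_ok and cols_ok

wp = [0.25, 0.5, 0.25]   # information amounts in the first text
wq = [0.5, 0.5]          # information amounts in the second text

plan_a = [[0.125, 0.125], [0.25, 0.25], [0.125, 0.125]]
plan_b = [[0.05, 0.2], [0.25, 0.25], [0.2, 0.05]]
# Hypothetical invalid set: rows are correct, but column sums are 1.0 and 0.0.
plan_bad = [[0.25, 0.0], [0.5, 0.0], [0.25, 0.0]]

results = [satisfies_conditions(p, wp, wq) for p in (plan_a, plan_b, plan_bad)]
```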
For example, if the total number of participles in the first participle set is 1 and the total number of participles in the second participle set is 2, the information transfer amounts are $T_{11}$ and $T_{12}$, the semantic transfer amounts are $S_{11}$ and $S_{12}$, and then $G = T_{11} S_{11} + T_{12} S_{12}$. Different sets of $T_{11}$ and $T_{12}$ are selected and substituted, and the minimum of the results is determined as the semantic information transfer cost. For example, if two different sets of $T_{11}$ and $T_{12}$ are selected and the resulting values of G are 0.78 and 0.48, respectively, the final G is 0.48.
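The selection procedure described above can be sketched as follows: each candidate set of information transfer amounts is substituted into the cost expression, and the smallest result is kept. The sketch assumes that the cost is the sum of $T_{ij} S_{ij}$ over all pairs; the semantic transfer amounts in `S` are hypothetical values chosen for illustration, and the function names are not from the disclosure.

```python
def transfer_cost(T, S):
    """Sum of T[i][j] * S[i][j] over all pairs (assumed form of the cost)."""
    return sum(t * s
               for t_row, s_row in zip(T, S)
               for t, s in zip(t_row, s_row))

def min_cost_over_candidates(candidates, S):
    """Evaluate every candidate set and return (lowest cost, index of that set)."""
    costs = [transfer_cost(T, S) for T in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return costs[best], best

# Hypothetical semantic transfer amounts for a 3 x 2 case.
S = [[1.0, 2.0], [2.0, 1.0], [1.0, 3.0]]

# The two qualifying sets of information transfer amounts from the text.
plan_a = [[0.125, 0.125], [0.25, 0.25], [0.125, 0.125]]
plan_b = [[0.05, 0.2], [0.25, 0.25], [0.2, 0.05]]

G, best_index = min_cost_over_candidates([plan_a, plan_b], S)
```

Taking the minimum over a finite list of candidates gives only an upper bound on the true minimum, which is why the text recommends using as many candidate sets as possible. Minimizing a linear cost under these row-sum and column-sum conditions is a linear program, so a general-purpose linear programming solver could be used instead of enumeration; that substitution is a suggestion here, not something the disclosure states.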
With this method, the information transfer amount for transferring each participle from one text to the other is determined, the semantic transfer amount between the participles of the two texts is determined, and the semantic information transfer cost between the two texts is then determined by combining the two. Because both the information transfer amount and the semantic transfer amount are taken into account, the semantic information transfer cost between the texts can be determined more accurately, which provides a more accurate basis for the subsequent text similarity calculation.
In another possible embodiment, formula (2) may be solved appropriately to obtain a simpler calculation. For example, the following solving process may be referred to for the solution of formula (2):
As described above, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set.
Therefore, in correspondence with the above solution, in combination with the case where equation (2) takes the minimum value for the solution, step 12 may include the following steps:
calculating the semantic information transfer cost according to the following formula (3):
where m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set, $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
With this method, the formula is solved and transformed so that the semantic information transfer cost can be calculated from only the information amount carried by each participle in its text and the word embedding vector corresponding to each participle. No other intermediate data needs to be calculated, which simplifies the solution and saves computing space.
The following is an example of determining the similarity between the first text and the second text according to the semantic information transfer cost in step 13.
In one possible embodiment, step 13 may include the steps of:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to this formula, the higher the semantic information transfer cost between texts, the lower the similarity between them; the lower the semantic information transfer cost, the higher the similarity.
In this way, the semantic information transfer cost obtained between the texts is converted into a similarity between the texts, and the calculation is simple and convenient.
In other embodiments, the base e in formula (4) may be replaced by another base, such as $e^2$ or $e^3$, depending on the specific application scenario; the present disclosure is not limited in this respect.
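The conversion from cost to similarity can be sketched in a few lines. The `base` parameter reflects the statement that another base may be used; the function name, the argument checks, and the restriction to bases greater than 1 (so that a higher cost always yields a lower similarity) are assumptions added for illustration.

```python
import math

def similarity(cost, base=math.e):
    """Convert a semantic information transfer cost G into a similarity base**(-G)."""
    if cost < 0:
        raise ValueError("transfer cost is expected to be non-negative")
    if base <= 1:
        raise ValueError("base must be greater than 1 so similarity decreases with cost")
    return base ** (-cost)

sim_identical = similarity(0.0)      # zero cost maps to the maximum similarity
sim_low_cost = similarity(0.48)
sim_high_cost = similarity(0.78)
sim_other_base = similarity(0.48, base=math.e ** 2)
```

For non-negative costs the result lies in the interval (0, 1], with identical texts (zero cost) mapping to 1.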
Fig. 3 is a block diagram of an inter-text similarity calculation apparatus provided according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 includes:
the processing module 31 is configured to perform word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text, which do not contain repeated word segmentation, according to a processing result;
a first determining module 32, configured to determine semantic information transfer cost between the first text and the second text according to an information amount carried in a text where each participle in the first participle set and the second participle set is located, and a word embedding vector corresponding to each participle;
a second determining module 33, configured to determine, according to the semantic information transfer cost, a similarity between the first text and the second text.
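The role of the processing module 31 can be sketched as follows. This is a simplified illustration only: whitespace splitting stands in for a real word segmentation tool, and the stop word list is a hypothetical placeholder; neither is specified by the disclosure.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "of", "is"})

def segment(text):
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def participle_set(text, stop_words=STOP_WORDS):
    """Return (filtered tokens, participle set without repeated participles).

    The filtered token list keeps repeats so that occurrence counts remain
    available; the participle set keeps first-occurrence order.
    """
    tokens = [t for t in segment(text) if t not in stop_words]
    unique = list(dict.fromkeys(tokens))  # removes repeats, preserves order
    return tokens, unique

tokens, first_set = participle_set("The cat chased the cat")
```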
Optionally, the first determining module 32 includes:
the first determining submodule is used for determining, according to the information amount carried by each participle in the text where it is located, the information transfer amount corresponding to the transfer of each participle in the first participle set to each participle in the second participle set;
the second determining submodule is used for respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
and the third determining submodule is used for determining the semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i, \qquad \sum_{i=1}^{m} T_{ij} = wq_j$$
where m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the i-th participle in the first participle set and the j-th participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
where $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set;
the third determining submodule is used for calculating the semantic information transfer cost according to the following formula (2):
$$G = \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
where m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the i-th participle in the first participle set and the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the first determining module 32 is configured to calculate the semantic information transfer cost according to the following formula (3):
where m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set, $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
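One way to derive the information amount from occurrence counts is sketched below. It assumes that the information amount is the participle's share of all occurrences remaining after stop word filtering, so that the amounts in one text sum to 1; this normalization is an assumption chosen to match example values such as 0.25, 0.5, 0.25, and the function name is hypothetical.

```python
from collections import Counter

def information_amounts(filtered_tokens):
    """Map each distinct participle to its relative frequency in the text.

    `filtered_tokens` is the token list after segmentation and stop word
    filtering, with repeated participles still present.
    """
    counts = Counter(filtered_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no participles after filtering")
    return {word: count / total for word, count in counts.items()}

# Hypothetical filtered tokens: "rise" occurs twice, the others once.
wp = information_amounts(["price", "rise", "rise", "market"])
```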
Optionally, the second determining module 33 is configured to determine the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, electronic device 1900 may be provided as a server. Referring to fig. 4, an electronic device 1900 includes a processor 1922, which may be one or more in number, and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processor 1922 may be configured to execute the computer program to perform the above-described inter-text similarity calculation method.
Additionally, the electronic device 1900 may also include a power component 1926 and a communication component 1950. The power component 1926 may be configured to perform power management for the electronic device 1900, and the communication component 1950 may be configured to enable communication for the electronic device 1900, e.g., wired or wireless communication. In addition, the electronic device 1900 may also include input/output (I/O) interfaces 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, etc.
In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions which, when executed by a processor, implement the steps of the above-described inter-text similarity calculation method. For example, the computer readable storage medium may be the above-mentioned memory 1932 including program instructions executable by the processor 1922 of the electronic device 1900 to perform the above-mentioned inter-text similarity calculation method.
The preferred embodiments of the present disclosure are described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.
Claims (7)
1. A method for calculating inter-text similarity, the method comprising:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated words and corresponds to the first text and a second word segmentation set which corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
determining the similarity between the first text and the second text according to the semantic information transfer cost;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text of each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle, including:
respectively determining information transfer amounts corresponding to the transfer of each participle in the first participle set to each participle in the second participle set according to the information amount carried by each participle in the text;
respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount;
the information transfer amount $T_{ij}$ when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i, \qquad \sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the i-th participle in the first participle set and the j-th participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set;
determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount, wherein the determining semantic information transfer cost comprises:
calculating the semantic information transfer cost according to the following formula (2):
$$G = \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the i-th participle in the first participle set is transferred to the j-th participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the i-th participle in the first participle set and the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text;
the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
2. The method according to claim 1, wherein the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
3. A method of calculating inter-text similarity, the method comprising:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
determining the similarity between the first text and the second text according to the semantic information transfer cost;
the determining semantic information transfer cost between the first text and the second text according to the information amount carried by each participle in the first participle set and the second participle set in the text where each participle is located and the word embedding vector corresponding to each participle comprises:
calculating the semantic information transfer cost according to the following formula (3):
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the i-th participle in the first participle set, $wq_j$ is the information amount carried in the second text by the j-th participle in the second participle set, $X_i$ is the word embedding vector corresponding to the i-th participle in the first participle set, $X_j$ is the word embedding vector corresponding to the j-th participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text;
the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
4. The method according to claim 3, wherein the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
5. An inter-text similarity calculation apparatus that performs the inter-text similarity calculation method according to claim 1 or 3, the apparatus comprising:
the processing module is used for performing word segmentation and stop word filtering processing on a first text and a second text of which the similarity is to be calculated, and obtaining a first word segmentation set which does not contain repeated words and corresponds to the first text and a second word segmentation set which does not contain repeated words and corresponds to the second text according to a processing result;
the first determining module is used for determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text of each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle;
and the second determining module is used for determining the similarity between the first text and the second text according to the semantic information transfer cost.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
7. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811420108.0A CN109684629B (en) | 2018-11-26 | 2018-11-26 | Method and device for calculating similarity between texts, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109684629A | 2019-04-26 |
CN109684629B | 2022-12-16 |
Family
ID=66185547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811420108.0A Active CN109684629B (en) | 2018-11-26 | 2018-11-26 | Method and device for calculating similarity between texts, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109684629B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348022A (en) * | 2019-07-18 | 2019-10-18 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus of similarity analysis, storage medium and electronic equipment |
CN110597980B (en) * | 2019-09-12 | 2021-04-30 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN110704621B (en) * | 2019-09-25 | 2023-04-21 | 北京大米科技有限公司 | Text processing method and device, storage medium and electronic equipment |
CN111160028B (en) * | 2019-12-31 | 2023-05-16 | 东软集团股份有限公司 | Method, device, storage medium and equipment for judging semantic similarity of two texts |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103377239A (en) * | 2012-04-26 | 2013-10-30 | 腾讯科技(深圳)有限公司 | Method and device for calculating inter-textual similarity |
CN103838789A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Text similarity computing method |
CN104462323A (en) * | 2014-12-02 | 2015-03-25 | 百度在线网络技术(北京)有限公司 | Semantic similarity computing method, search result processing method and search result processing device |
CN106776503A (en) * | 2016-12-22 | 2017-05-31 | 东软集团股份有限公司 | The determination method and device of text semantic similarity |
CN108182222A (en) * | 2017-12-26 | 2018-06-19 | 东软集团股份有限公司 | A kind of text matching technique and device |
CN108595706A (en) * | 2018-05-10 | 2018-09-28 | 中国科学院信息工程研究所 | A kind of document semantic representation method, file classification method and device based on theme part of speech similitude |
CN108763333A (en) * | 2018-05-11 | 2018-11-06 | 北京航空航天大学 | A kind of event collection of illustrative plates construction method based on Social Media |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9430463B2 (en) * | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
Non-Patent Citations (1)
Title |
---|
《基于词汇语义信息的文本相似度计算》;谷重阳;《计算机应用研究》;20180228;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN109684629A (en) | 2019-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109684629B (en) | Method and device for calculating similarity between texts, storage medium and electronic equipment | |
CN111950638B (en) | Image classification method and device based on model distillation and electronic equipment | |
CN108804641B (en) | Text similarity calculation method, device, equipment and storage medium | |
CN109816039B (en) | Cross-modal information retrieval method and device and storage medium | |
CN111898643B (en) | Semantic matching method and device | |
CN110298035B (en) | Word vector definition method, device, equipment and storage medium based on artificial intelligence | |
CN111028006B (en) | Service delivery auxiliary method, service delivery method and related device | |
CN110874528B (en) | Text similarity obtaining method and device | |
CN109948140B (en) | Word vector embedding method and device | |
CN110083834B (en) | Semantic matching model training method and device, electronic equipment and storage medium | |
WO2021027125A1 (en) | Sequence labeling method and apparatus, computer device and storage medium | |
CN110941951B (en) | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment | |
CN113128419B (en) | Obstacle recognition method and device, electronic equipment and storage medium | |
CN112183111A (en) | Long text semantic similarity matching method and device, electronic equipment and storage medium | |
CN111078639A (en) | Data standardization method and device and electronic equipment | |
CN110197213B (en) | Image matching method, device and equipment based on neural network | |
US20160188680A1 (en) | Electronic device and information searching method for the electronic device | |
CN114511083A (en) | Model training method and device, storage medium and electronic device | |
CN111027316A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN113934848A (en) | Data classification method and device and electronic equipment | |
US20160247045A1 (en) | Constructing and using support vector machines | |
CN111428125A (en) | Sorting method and device, electronic equipment and readable storage medium | |
CN110674388A (en) | Mapping method and device for push item, storage medium and terminal equipment | |
CN110287943B (en) | Image object recognition method and device, electronic equipment and storage medium | |
CN113962221A (en) | Text abstract extraction method and device, terminal equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |