CN109684629B - Method and device for calculating similarity between texts, storage medium and electronic equipment - Google Patents

Method and device for calculating similarity between texts, storage medium and electronic equipment

Info

Publication number
CN109684629B
CN109684629B CN201811420108.0A CN201811420108A CN109684629B CN 109684629 B CN109684629 B CN 109684629B CN 201811420108 A CN201811420108 A CN 201811420108A CN 109684629 B CN109684629 B CN 109684629B
Authority
CN
China
Prior art keywords
text
participle
word
word segmentation
information transfer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811420108.0A
Other languages
Chinese (zh)
Other versions
CN109684629A (en)
Inventor
董超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201811420108.0A
Publication of CN109684629A
Application granted
Publication of CN109684629B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The disclosure relates to a method and a device for calculating similarity between texts, a storage medium and an electronic device. The method comprises the following steps: performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text; determining semantic information transfer cost between the first text and the second text according to the information quantity carried by each participle in the first participle set and the second participle set in the text and the word embedding vector corresponding to each participle; and determining the similarity between the first text and the second text according to the semantic information transfer cost. Therefore, the semantic influence of each word and each word context in the text on the text is fully considered, and the similarity calculation basis is closer to the semantic of the text, so that the calculated similarity is more accurate.

Description

Method and device for calculating similarity between texts, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for calculating similarity between texts, a storage medium, and an electronic device.
Background
In the prior art, the similarity between texts is generally calculated in a structured manner. The texts are first structured, for example by converting two pieces of text into vectors such as a word-based one-hot representation or a vectorized representation obtained by accumulating word embeddings. Similarity is then calculated on the structured result, for example as the Euclidean distance or the cosine of the angle between the text vectors. However, the semantics of a word are influenced by the context of its position in the text, which in turn affects the semantic conversion between words in the texts. The above similarity calculation methods focus on the text as a whole and do not consider the semantic correspondence and conversion relationships between words in the texts, so the semantic information of each text is incomplete during calculation and the calculated text similarity is not accurate enough.
Disclosure of Invention
The purpose of the present disclosure is to provide a method, an apparatus, a storage medium, and an electronic device for calculating inter-text similarity, so as to calculate inter-text similarity more accurately.
In order to achieve the above object, according to a first aspect of the present disclosure, there is provided an inter-text similarity calculation method including:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
and determining the similarity between the first text and the second text according to the semantic information transfer cost.
Optionally, the determining, according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle, a semantic information transfer cost between the first text and the second text includes:
respectively determining information transfer amounts corresponding to the word segments in the first word segment set when the word segments are transferred to the word segments in the second word segment set according to the information amount carried by each word segment in the text;
according to the word embedding vector corresponding to each participle, respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set;
and determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the ith participle in the first participle set is transferred to the jth participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the ith participle in the first participle set and the jth participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \tag{1}$$
wherein $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set;
the determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount includes:
calculating the semantic information transfer cost according to the following formula (2):
$$G = \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \tag{2}$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the ith participle in the first participle set is transferred to the jth participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the ith participle in the first participle set and the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the determining, according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle, a semantic information transfer cost between the first text and the second text, includes:
calculating the semantic information transfer cost according to the following formula (3):
Figure BDA0001880378890000032
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set, $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
Optionally, the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$SIM = e^{-G} \tag{4}$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to a second aspect of the present disclosure, there is provided an inter-text similarity calculation apparatus including:
the processing module is used for performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text;
the first determining module is used for determining semantic information transfer cost between the first text and the second text according to the information quantity carried by each participle in the first participle set and the second participle set in the text where the participle is located and a word embedding vector corresponding to each participle;
and the second determining module is used for determining the similarity between the first text and the second text according to the semantic information transfer cost.
Optionally, the first determining module includes:
the first determining submodule is used for respectively determining corresponding information transfer quantity when each participle in the first participle set is transferred to each participle in the second participle set according to the information quantity carried by each participle in the text;
the second determining submodule is used for respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
and the third determining submodule is used for determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the ith participle in the first participle set is transferred to the jth participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the ith participle in the first participle set and the jth participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \tag{1}$$
wherein $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set;
the third determining submodule is used for calculating the semantic information transfer cost according to the following formula (2):
$$G = \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} S_{ij} \tag{2}$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the ith participle in the first participle set is transferred to the jth participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the ith participle in the first participle set and the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the first determining module is configured to calculate the semantic information transfer cost according to the following formula (3):
Figure BDA0001880378890000054
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set, $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
Optionally, the second determining module is configured to determine the similarity SIM according to the following formula (4):
$$SIM = e^{-G} \tag{4}$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method provided by the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
According to the above technical solution, for two texts whose similarity is to be calculated, word segmentation and stop word filtering are first performed, and two participle sets that contain no repeated participles and respectively correspond to the two texts are obtained from the processing results. The semantic information transfer cost between the two texts is then determined according to the information amount that each participle carries in its text and the word embedding vector corresponding to each participle, and the similarity between the two texts is determined according to the semantic information transfer cost. The semantic information transfer cost between texts thus serves as the standard for measuring the similarity between them. Because the semantic influence of each word and its context on the text is fully considered, the basis of the similarity calculation is closer to the semantics of the texts, and the calculated similarity is more accurate.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flowchart of an inter-text similarity calculation method provided according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an exemplary implementation of the step of determining semantic information transfer cost between the first text and the second text according to an information amount carried in a text where each participle in the first participle set and the second participle set is located and a word embedding vector corresponding to each participle in the inter-text similarity calculation method provided by the present disclosure;
fig. 3 is a block diagram of an inter-text similarity calculation apparatus provided according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of an inter-text similarity calculation method provided according to an embodiment of the present disclosure. As shown in fig. 1, the method may include the following steps.
In step 11, for the first text and the second text with similarity to be calculated, performing word segmentation and stop word filtering processing, and obtaining a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text, which do not contain repeated words, according to the processing result.
For a first text and a second text whose similarity is to be calculated, word segmentation processing and stop word filtering processing are first performed on each of them. Because stop words have no practical meaning and contribute nothing to the actual similarity between texts, stop word filtering is performed after word segmentation to delete the stop words and prevent them from influencing the subsequent similarity calculation. For example, suppose that after word segmentation the participles corresponding to the first text D1 are v1, v7, v1, v8, v2, v9, and v3, and the participles corresponding to the second text D2 are v4, v5, v7, v4, and v6, where v7, v8, and v9 are stop words. After stop word filtering, the processing result corresponding to the first text D1 is v1, v1, v2, and v3, and the processing result corresponding to the second text D2 is v4, v5, v4, and v6.
And after the word segmentation and stop word filtering processing, obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which corresponds to the second text according to the processing result. For example, as for the processing results of D1 and D2 in the above example, the first set of words corresponding to the first text D1 is { v1, v2, v3}, and the second set of words corresponding to the second text D2 is { v4, v5, v6}.
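The example of step 11 can be sketched in Python as follows. This is only a minimal illustration: the word segmenter itself is not shown (the token lists are taken as given), and the function name `build_participle_set` is a hypothetical helper, not part of the disclosure.

```python
def build_participle_set(participles, stop_words):
    """Filter stop words, then remove repeated participles (first-seen order kept)."""
    # Stop word filtering: keep every participle that is not a stop word.
    filtered = [p for p in participles if p not in stop_words]
    # dict.fromkeys keeps insertion order, so repeats collapse onto their first occurrence.
    unique = list(dict.fromkeys(filtered))
    return filtered, unique


# Token lists and stop words from the example above (output of an assumed segmenter).
stop_words = {"v7", "v8", "v9"}
d1 = ["v1", "v7", "v1", "v8", "v2", "v9", "v3"]
d2 = ["v4", "v5", "v7", "v4", "v6"]

filtered_1, set_1 = build_participle_set(d1, stop_words)
filtered_2, set_2 = build_participle_set(d2, stop_words)
print(set_1, set_2)
```

The filtered lists are returned alongside the de-duplicated sets because the occurrence counts of each participle are still needed when the information amounts are determined.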
In step 12, semantic information transfer cost between the first text and the second text is determined according to the information amount carried in the text by each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle.
The calculation of the similarity between texts can be regarded as the calculation of the semantic similarity between the texts. From another perspective, it suffices to determine how much cost must be paid to convert the semantic information contained in one text into the semantic information contained in the other text, that is, the cost of transferring the semantic information of one text into that of the other, referred to as the semantic information transfer cost. The semantic information transfer cost between texts can be determined from the information amount that each participle carries in its text and the word embedding vector corresponding to each participle.
First, the information amount carried in its text by each participle in the first participle set and the second participle set is determined. The information amount that a participle carries in the text can be reflected by how the participle compares with all participles in the text, where all participles refers to all the participles contained in the participle set in which the participle is located. For example, the information amount may be reflected by the proportion of the occurrences of the participle in the text relative to the occurrences of all the participles. The larger the proportion, the more information the participle carries in the text; the smaller the proportion, the less information the participle carries in the text.
The word embedding vector corresponding to a participle may be determined by a word embedding algorithm, which maps each participle in the first participle set and the second participle set into the same word embedding space. Illustratively, the word embedding model used by the word embedding algorithm may be an existing word embedding model. For another example, it may be a new, targeted word embedding model obtained by training an existing word embedding model on existing data; for example, the new word embedding model may be obtained by training Word2vec.
In step 13, the similarity between the first text and the second text is determined according to the semantic information transfer cost.
As described above, the semantic information transfer cost between the first text and the second text may reflect the cost required for transferring the semantic information of the first text to the semantic information of the second text. If the semantic information transfer cost between the first text and the second text is higher, the similarity between the first text and the second text is lower; and if the semantic information transfer cost between the first text and the second text is smaller, the similarity between the first text and the second text is higher. Therefore, according to the semantic information transfer cost between the first text and the second text determined in step 12, the similarity between the first text and the second text can be further determined. Illustratively, a function for obtaining the similarity between texts may be preset, so that the similarity between texts and the semantic information transfer cost between texts satisfy a negative correlation relationship.
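One function with the negative correlation described here is the exponential form that the disclosure gives as formula (4), SIM = e^(-G). The sketch below simply evaluates that formula; the function name `similarity_from_cost` is a hypothetical label, and the cost values passed in are arbitrary illustrative numbers.

```python
import math


def similarity_from_cost(cost):
    """Map a semantic information transfer cost G to a similarity via SIM = e^(-G)."""
    return math.exp(-cost)


# Arbitrary illustrative costs: a larger cost gives a lower similarity.
for g in (0.0, 0.5, 2.0):
    print(g, similarity_from_cost(g))
```

With this choice, a cost of zero maps to a similarity of 1, and the similarity decreases toward 0 as the cost grows.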
According to the above scheme, for two texts whose similarity is to be calculated, word segmentation and stop word filtering are first performed, and two participle sets that contain no repeated participles and respectively correspond to the two texts are obtained from the processing results. The semantic information transfer cost between the two texts is then determined according to the information amount that each participle carries in its text and the word embedding vector corresponding to each participle, and the similarity between the two texts is determined according to the semantic information transfer cost. The semantic information transfer cost between texts thus serves as the standard for measuring the similarity between them. Because the semantic influence of each word and its context on the text is fully considered, the basis of the similarity calculation is closer to the semantics of the texts, and the calculated similarity is more accurate.
To help those skilled in the art better understand the technical solutions provided by the embodiments of the present invention, the corresponding steps above are described in detail below.
First, a method for determining the amount of information carried in a text where a word is located is exemplified. In one possible implementation, the amount of information carried by the segmented word in the text may be determined by the number of occurrences of the segmented word in the text.
Illustratively, the information amount $wp_k$ carried in the text by the kth participle in a participle set w can be determined by the following formula:
$$wp_k = \frac{c_k}{\sum_{t=1}^{l} c_t}$$
wherein $c_k$ is the number of times the kth participle appears in the text, l is the total number of participles contained in the participle set w, and $\sum_{t=1}^{l} c_t$ represents the sum of the numbers of times the participles in the participle set w appear in the text.
With this method, the information amount carried by a participle in the text is determined by the number of occurrences of the participle in the text relative to the sum of the numbers of occurrences in the text of all participles in the participle set in which the participle is located. Because the information amount carried by a participle is related to how often it appears in the text, this calculation offers a simple and convenient way to determine that information amount.
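The occurrence-based information amount can be sketched as a normalized count. The code below assumes the stop-word-filtered participle list (with repeats retained) is already available; the function name `information_amounts` is a hypothetical helper. The inputs reuse the D1 and D2 example, where v1 occurs twice in D1 and v4 occurs twice in D2 after filtering.

```python
from collections import Counter


def information_amounts(filtered_participles):
    """Return {participle: occurrences / total occurrences} for one text."""
    counts = Counter(filtered_participles)
    total = sum(counts.values())  # sum of occurrences over the whole participle set
    return {p: c / total for p, c in counts.items()}


# Filtered participles of D1 and D2, before de-duplication.
wp = information_amounts(["v1", "v1", "v2", "v3"])
wq = information_amounts(["v4", "v5", "v4", "v6"])
print(wp)
print(wq)
```

By construction the information amounts of one text sum to 1, which is what allows the two texts to be compared on a common scale.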
Next, an example is given of how, in step 12, the semantic information transfer cost between the first text and the second text is determined according to the information amount carried in its text by each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle.
In one possible embodiment, step 12 may include the following steps, as shown in FIG. 2.
In step 21, according to the information amount carried by each participle in the text, the information transfer amount corresponding to each participle in the first participle set when each participle is transferred to each participle in the second participle set is respectively determined.
Exemplarily, the information transfer amount $T_{ij}$ when the ith participle in the first participle set is transferred to the jth participle in the second participle set may satisfy the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set.
Since the information amount corresponding to a participle is limited, once the information amount carried by the participle in its text is known, the total information transfer amount from that participle to the participles of the other text should equal the information amount carried by the participle in its text, which is the condition $\sum_{j=1}^{n} T_{ij} = wp_i$ above.
Similarly, the total information transfer amount that a participle receives from the participles of the other text, that is, the information amount that the participle can receive, should equal the information amount carried by that participle in its text, which is the condition $\sum_{i=1}^{m} T_{ij} = wq_j$ above.
According to the above conditions, at least one group of information transfer amounts meeting the conditions can be determined, that is, the information transfer amounts corresponding to the transfer of each participle in the first participle set to each participle in the second participle set. For example, if the total number of participles in the first participle set is 3 and the total number of participles in the second participle set is 2, a group of information transfer amounts satisfying the above conditions includes $T_{11}$, $T_{21}$, $T_{31}$, $T_{12}$, $T_{22}$, and $T_{32}$.
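The two conditions amount to row sums and column sums of the m-by-n table of transfer amounts. The sketch below checks them for a candidate group. The helper names and the information amounts used are illustrative assumptions, and the product construction `product_transfer` is not taken from the disclosure: it is just one simple way to obtain a feasible group when both lists of information amounts sum to 1, since the ith row then sums to wp_i and the jth column to wq_j.

```python
def satisfies_conditions(T, wp, wq, tol=1e-9):
    """Check that row i of T sums to wp[i] and column j of T sums to wq[j]."""
    rows_ok = all(abs(sum(T[i]) - wp[i]) <= tol for i in range(len(wp)))
    cols_ok = all(
        abs(sum(T[i][j] for i in range(len(wp))) - wq[j]) <= tol
        for j in range(len(wq))
    )
    return rows_ok and cols_ok


def product_transfer(wp, wq):
    """One feasible group (assumption, not the patent's method): T[i][j] = wp[i] * wq[j]."""
    return [[p * q for q in wq] for p in wp]


# Illustrative information amounts for m = 3 and n = 2; each list sums to 1.
wp = [0.25, 0.5, 0.25]
wq = [0.5, 0.5]
T = product_transfer(wp, wq)
print(T, satisfies_conditions(T, wp, wq))
```

A group that sends everything to the first participle of the second set, such as `[[0.25, 0.0], [0.5, 0.0], [0.25, 0.0]]`, meets the row condition but fails the column condition, so the check rejects it.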
In step 22, according to the word embedding vector corresponding to each participle, the semantic transfer amount between each participle in the first participle set and each participle in the second participle set is respectively determined.
Through a word embedding algorithm, word embedding vectors of each participle in the same word embedding space can be determined, and the spatial distance between two word embedding vectors in the space can reflect the semantic similarity of words corresponding to the two word embedding vectors. The closer the spatial distance of two word embedding vectors is, the more similar the semantics of the corresponding words of the two word embedding vectors can be considered.
For example, the spatial distance between two word embedding vectors may be taken as the semantic transfer amount between the words corresponding to the two vectors. Therefore, the semantic transfer amount $S_{ij}$ between the ith participle in the first participle set and the jth participle in the second participle set can be calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \tag{1}$$
wherein $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set.
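Formula (1) is the Euclidean distance between two embedding vectors, so the full table of semantic transfer amounts can be sketched with the standard library alone. The two-dimensional vectors below are made-up placeholders chosen so that the distances are easy to check by hand; real word embeddings would come from a trained model and have many more dimensions. The function name is hypothetical.

```python
import math


def semantic_transfer_amounts(X_first, X_second):
    """Return S with S[i][j] = Euclidean distance between X_first[i] and X_second[j]."""
    return [[math.dist(xi, xj) for xj in X_second] for xi in X_first]


# Placeholder embeddings (assumed values, not from any real model).
X_first = [(0.0, 0.0), (3.0, 4.0)]
X_second = [(0.0, 0.0), (6.0, 8.0)]

S = semantic_transfer_amounts(X_first, X_second)
print(S)
```

Identical vectors give a semantic transfer amount of zero, which matches the statement that closer vectors correspond to more similar semantics.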
In step 23, semantic information transfer costs between the first text and the second text are determined according to the information transfer amount and the semantic transfer amount.
The semantic information transfer cost between the first text and the second text can be determined directly from the determined information transfer amounts and semantic transfer amounts. Illustratively, the semantic information transfer cost may be calculated according to the following formula (2):
$$G = \min \sum_{i=1}^{m}\sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the ith participle in the first participle set is transferred to the jth participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the ith participle in the first participle set and the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
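As a minimal sketch, the double sum over $T_{ij} S_{ij}$ for one group of information transfer amounts could be evaluated as follows. The function name `transfer_cost` and all numeric values are hypothetical; the sketch covers one group only and does not perform the minimization over groups.

```python
def transfer_cost(T, S):
    """Sum of T[i][j] * S[i][j] over all i, j for one group of transfer amounts."""
    if len(T) != len(S) or any(len(t) != len(s) for t, s in zip(T, S)):
        raise ValueError("T and S must have the same m x n shape")
    return sum(t_ij * s_ij
               for t_row, s_row in zip(T, S)
               for t_ij, s_ij in zip(t_row, s_row))


# Hypothetical values: m = 2 participles in the first set, n = 2 in the second.
T = [[0.5, 0.0], [0.0, 0.5]]   # information transfer amounts
S = [[1.0, 4.0], [3.0, 2.0]]   # semantic transfer amounts
cost = transfer_cost(T, S)
print(cost)  # 1.5
```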
As mentioned above, for the ith participle in the first participle set and the jth participle in the second participle set, the word embedding vectors corresponding to the two participles are fixed, so the semantic transfer amount $S_{ij}$ is a constant once it has been calculated by formula (1). The information transfer amount $T_{ij}$ for the two participles only needs to satisfy the conditions mentioned above,
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j,$$
that is, there may be multiple groups of information transfer amounts $T_{ij}$ satisfying the conditions. For example, if the first participle set contains 3 participles, the second participle set contains 2 participles, and $wp_1$ is 0.25, $wp_2$ is 0.5, $wp_3$ is 0.25, $wq_1$ is 0.5 and $wq_2$ is 0.5, the information transfer amounts should satisfy the following conditions:
$T_{11} + T_{12} = wp_1 = 0.25$;
$T_{21} + T_{22} = wp_2 = 0.5$;
$T_{31} + T_{32} = wp_3 = 0.25$;
$T_{11} + T_{21} + T_{31} = wq_1 = 0.5$;
$T_{12} + T_{22} + T_{32} = wq_2 = 0.5$.
Multiple groups of $T_{11}$, $T_{12}$, $T_{21}$, $T_{22}$, $T_{31}$, $T_{32}$ meeting the conditions can be obtained through calculation. For example, one group of eligible information transfer amounts is: $T_{11}=0.125$, $T_{12}=0.125$, $T_{21}=0.25$, $T_{22}=0.25$, $T_{31}=0.125$, $T_{32}=0.125$. Another group of eligible information transfer amounts is: $T_{11}=0.05$, $T_{12}=0.2$, $T_{21}=0.25$, $T_{22}=0.25$, $T_{31}=0.2$, $T_{32}=0.05$. Other groups of information transfer amounts meeting the conditions also exist and are not listed one by one. Thus, a plurality of different groups of information transfer amounts can be selected, for example all of the groups that meet the conditions, or several groups chosen from among them. After the groups are selected, each group is substituted into formula (2), so that multiple different values of
$$\sum_{i=1}^{m}\sum_{j=1}^{n} T_{ij} S_{ij}$$
are obtained, and the minimum of these values is determined as the final semantic information transfer cost. To obtain a more accurate semantic information transfer cost, as many groups of information transfer amounts as possible can be selected for calculation.
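The selection procedure described above can be sketched in Python as follows. The weights and the two candidate groups are the ones given in the example; the semantic transfer amounts `S`, the function names and the tolerance are assumptions made only for illustration.

```python
import math


def satisfies_conditions(T, wp, wq, tol=1e-9):
    """Check that row i of T sums to wp[i] and column j of T sums to wq[j]."""
    rows_ok = all(math.isclose(sum(row), w, abs_tol=tol)
                  for row, w in zip(T, wp))
    cols_ok = all(math.isclose(sum(row[j] for row in T), w, abs_tol=tol)
                  for j, w in enumerate(wq))
    return rows_ok and cols_ok


def transfer_cost(T, S):
    """Sum of T[i][j] * S[i][j] for one group of information transfer amounts."""
    return sum(t * s for t_row, s_row in zip(T, S) for t, s in zip(t_row, s_row))


def min_cost_over_candidates(candidates, S, wp, wq):
    """Keep only the groups that meet the conditions and return the smallest cost."""
    valid = [T for T in candidates if satisfies_conditions(T, wp, wq)]
    if not valid:
        raise ValueError("no candidate group satisfies the conditions")
    return min(transfer_cost(T, S) for T in valid)


wp = [0.25, 0.5, 0.25]   # information amounts in the first text (from the example)
wq = [0.5, 0.5]          # information amounts in the second text (from the example)
candidates = [
    [[0.125, 0.125], [0.25, 0.25], [0.125, 0.125]],
    [[0.05, 0.2], [0.25, 0.25], [0.2, 0.05]],
]
S = [[2.0, 1.0], [1.0, 1.0], [1.0, 2.0]]   # hypothetical semantic transfer amounts

G = min_cost_over_candidates(candidates, S, wp, wq)
print(G)  # about 1.1 (second group); the first group gives 1.25
```

Only the groups that were selected are compared, so the result is at best an upper bound on the true minimum over all groups; this is why the text recommends selecting as many groups as possible.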
For example, if the first participle set contains 1 participle and the second participle set contains 2 participles, the information transfer amounts are $T_{11}$ and $T_{12}$ and the semantic transfer amounts are $S_{11}$ and $S_{12}$, then
$$G = \min \sum_{i=1}^{1}\sum_{j=1}^{2} T_{ij} S_{ij} = \min\left(T_{11} S_{11} + T_{12} S_{12}\right)$$
Different groups of $T_{11}$ and $T_{12}$ are selected, and the minimum of the results is determined as the semantic information transfer cost. For example, if the values of G obtained from two different groups of $T_{11}$ and $T_{12}$ are 0.78 and 0.48 respectively, the final G is 0.48.
By adopting the method, the information transfer amount corresponding to each participle transferred from one text to the other is determined, the semantic transfer amount between the participles of the two texts is determined, and the semantic information transfer cost between the two texts is then determined by combining the two. Because both the information transfer amount and the semantic transfer amount are fully considered, the semantic information transfer cost between the texts can be determined more accurately, which provides a more accurate basis for the subsequent text similarity calculation.
In another possible embodiment, for a given equation (2), an appropriate solution may be made to obtain a simpler way of calculation. For example, the following solving process may be referred to for the solution of equation (2):
Figure BDA0001880378890000141
As described above, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set.
Therefore, in correspondence with the above solution, in combination with the case where equation (2) takes the minimum value for the solution, step 12 may include the following steps:
calculating the semantic information transfer cost according to the following formula (3):
Figure BDA0001880378890000151
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set, $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
By adopting the method, the formula is solved and converted, the corresponding semantic information transfer cost can be calculated only by obtaining the information quantity carried by the participle in the text and the word embedding vector corresponding to the participle, other intermediate data do not need to be calculated, the solution is convenient, and the calculation space is saved.
The following is an example of determining the similarity between the first text and the second text according to the semantic information transfer cost in step 13.
In one possible embodiment, step 13 may include the steps of:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
According to the formula, the higher the transfer cost of semantic information among texts is, the smaller the similarity among the texts is; the smaller the semantic information transfer cost between texts is, the greater the similarity between texts is.
By adopting the mode, the obtained semantic information transfer cost between the texts is converted into the similarity between the texts, and the calculation is simple and convenient.
In other embodiments, the base e in formula (4) may be replaced by another base, such as $e^2$ or $e^3$, depending on the specific application scenario, and the present disclosure is not limited thereto.
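Formula (4) can be sketched in a few lines of Python. The optional `base` parameter is a hypothetical way of illustrating the replacement of the base mentioned above, and the requirement that the base be greater than 1 is an assumption made here so that the similarity still decreases as the cost increases.

```python
import math


def similarity(cost, base=math.e):
    """Convert a semantic information transfer cost G into a similarity base**(-G)."""
    if base <= 1:
        raise ValueError("base must be greater than 1 so that similarity falls as cost rises")
    return base ** (-cost)


print(similarity(0.0))    # 1.0, zero transfer cost means maximal similarity
print(similarity(0.48))   # roughly 0.62 for the cost 0.48 used in the earlier example
```

For non-negative costs the result lies in the interval (0, 1], which makes the value convenient to compare across text pairs.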
Fig. 3 is a block diagram of an inter-text similarity calculation apparatus provided according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 includes:
the processing module 31 is configured to perform word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text, which do not contain repeated word segmentation, according to a processing result;
a first determining module 32, configured to determine semantic information transfer cost between the first text and the second text according to an information amount carried in a text where each participle in the first participle set and the second participle set is located, and a word embedding vector corresponding to each participle;
a second determining module 33, configured to determine, according to the semantic information transfer cost, a similarity between the first text and the second text.
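The behaviour of the processing module 31 can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the stop word list, function name and sample sentences are hypothetical.

```python
def build_participle_set(text, stop_words):
    """Segment a text, filter stop words, and drop repeated participles.

    The order of first appearance is preserved so that indices i and j
    stay stable for later calculations.
    """
    tokens = text.lower().split()                 # stand-in for word segmentation
    kept = [t for t in tokens if t not in stop_words]
    return list(dict.fromkeys(kept))              # de-duplicate, keep order


STOP_WORDS = {"the", "a", "is", "on"}             # hypothetical stop word list

first_set = build_participle_set("The cat is on the mat", STOP_WORDS)
second_set = build_participle_set("A cat sat on a mat a cat", STOP_WORDS)
print(first_set)    # ['cat', 'mat']
print(second_set)   # ['cat', 'sat', 'mat']
```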
Optionally, the first determining module 32 includes:
the first determining sub-module is used for respectively determining information transfer amounts corresponding to the word segmentation in the first word segmentation set when each word segmentation is transferred to each word segmentation in the second word segmentation set according to the information amount carried by each word segmentation in the text;
the second determining submodule is used for respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
and the third determining submodule is used for determining the semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount.
Optionally, the information transfer amount $T_{ij}$ when the ith participle in the first participle set is transferred to the jth participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the ith participle in the first participle set and the jth participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set;
the third determining submodule is used for calculating the semantic information transfer cost according to the following formula (2):
$$G = \min \sum_{i=1}^{m}\sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the ith participle in the first participle set is transferred to the jth participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the ith participle in the first participle set and the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the first determining module 32 is configured to calculate the semantic information transfer cost according to the following formula (3):
Figure BDA0001880378890000172
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set, $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text.
Optionally, the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
Optionally, the second determining module 33 is configured to determine the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
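As noted above, the information amount carried by a participle may be determined by its number of occurrences in the text. A minimal sketch of one such scheme follows; normalizing the counts so that the amounts sum to 1 is an assumption made here to match the earlier example weights (0.25, 0.5, 0.25), and the function name and tokens are hypothetical.

```python
from collections import Counter


def information_amounts(tokens, participle_set):
    """Relative occurrence count of each participle in its own text.

    `tokens` is the segmented, stop-word-filtered text with repeats kept;
    `participle_set` is the same text after removing repeats.
    """
    counts = Counter(tokens)
    total = sum(counts[p] for p in participle_set)
    if total == 0:
        raise ValueError("the text contains none of the given participles")
    return [counts[p] / total for p in participle_set]


# Hypothetical text in which "rise" occurs twice.
tokens = ["price", "rise", "rise", "market"]
participles = ["price", "rise", "market"]
weights = information_amounts(tokens, participles)
print(weights)  # [0.25, 0.5, 0.25]
```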
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, electronic device 1900 may be provided as a server. Referring to fig. 4, an electronic device 1900 includes a processor 1922, which may be one or more in number, and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processor 1922 may be configured to execute the computer program to perform the above-described inter-text similarity calculation method.
Additionally, the electronic device 1900 may also include a power component 1926 and a communication component 1950. The power component 1926 may be configured to perform power management for the electronic device 1900, and the communication component 1950 may be configured to enable communication for the electronic device 1900, e.g., wired or wireless communication. In addition, the electronic device 1900 may also include input/output (I/O) interfaces 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server, Mac OS X™, Unix™, Linux™, etc.
In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions which, when executed by a processor, implement the steps of the above-described inter-text similarity calculation method. For example, the computer readable storage medium may be the above-mentioned memory 1932 including program instructions executable by the processor 1922 of the electronic device 1900 to perform the above-mentioned inter-text similarity calculation method.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (7)

1. A method for calculating inter-text similarity, the method comprising:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated words and corresponds to the first text and a second word segmentation set which corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
determining the similarity between the first text and the second text according to the semantic information transfer cost;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text of each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle, including:
respectively determining information transfer amounts corresponding to the transfer of each participle in the first participle set to each participle in the second participle set according to the information amount carried by each participle in the text;
respectively determining semantic transfer quantity between each participle in the first participle set and each participle in the second participle set according to the word embedding vector corresponding to each participle;
determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount;
the information transfer amount $T_{ij}$ when the ith participle in the first participle set is transferred to the jth participle in the second participle set satisfies the following conditions:
$$\sum_{j=1}^{n} T_{ij} = wp_i$$
and
$$\sum_{i=1}^{m} T_{ij} = wq_j$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, and $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set;
the semantic transfer amount $S_{ij}$ between the ith participle in the first participle set and the jth participle in the second participle set is calculated by the following formula (1):
$$S_{ij} = \lVert X_i - X_j \rVert_2 \qquad (1)$$
wherein $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, and $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set;
determining semantic information transfer cost between the first text and the second text according to the information transfer amount and the semantic transfer amount, wherein the determining semantic information transfer cost comprises:
calculating the semantic information transfer cost according to the following formula (2):
$$G = \min \sum_{i=1}^{m}\sum_{j=1}^{n} T_{ij} S_{ij} \qquad (2)$$
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $T_{ij}$ is the information transfer amount when the ith participle in the first participle set is transferred to the jth participle in the second participle set, $S_{ij}$ is the semantic transfer amount between the ith participle in the first participle set and the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text;
the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
2. The method according to claim 1, wherein the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
3. A method of calculating inter-text similarity, the method comprising:
performing word segmentation and stop word filtering processing on a first text and a second text with similarity to be calculated, and obtaining a first word segmentation set which does not contain repeated word segmentation and corresponds to the first text and a second word segmentation set which does not contain repeated word segmentation and corresponds to the second text according to a processing result;
determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text by each participle in the first participle set and the second participle set and a word embedding vector corresponding to each participle;
determining the similarity between the first text and the second text according to the semantic information transfer cost;
the determining semantic information transfer cost between the first text and the second text according to the information amount carried by each participle in the first participle set and the second participle set in the text where each participle is located and the word embedding vector corresponding to each participle comprises:
calculating the semantic information transfer cost according to the following formula (3):
Figure FDA0003846122800000031
wherein m is the total number of participles contained in the first participle set, n is the total number of participles contained in the second participle set, $wp_i$ is the information amount carried in the first text by the ith participle in the first participle set, $wq_j$ is the information amount carried in the second text by the jth participle in the second participle set, $X_i$ is the word embedding vector corresponding to the ith participle in the first participle set, $X_j$ is the word embedding vector corresponding to the jth participle in the second participle set, and G is the semantic information transfer cost between the first text and the second text;
the determining the similarity between the first text and the second text according to the semantic information transfer cost includes:
determining the similarity SIM according to the following formula (4):
$$\mathrm{SIM} = e^{-G} \qquad (4)$$
wherein G is a semantic information transfer cost between the first text and the second text.
4. The method according to claim 3, wherein the amount of information carried by the participle in the text is determined by the number of occurrences of the participle in the text.
5. An inter-text similarity calculation apparatus that performs the inter-text similarity calculation method according to claim 1 or 3, the apparatus comprising:
the processing module is used for performing word segmentation and stop word filtering processing on a first text and a second text of which the similarity is to be calculated, and obtaining a first word segmentation set which does not contain repeated words and corresponds to the first text and a second word segmentation set which does not contain repeated words and corresponds to the second text according to a processing result;
the first determining module is used for determining semantic information transfer cost between the first text and the second text according to the information quantity carried in the text of each participle in the first participle set and the second participle set and the word embedding vector corresponding to each participle;
and the second determining module is used for determining the similarity between the first text and the second text according to the semantic information transfer cost.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
7. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1-4.
CN201811420108.0A 2018-11-26 2018-11-26 Method and device for calculating similarity between texts, storage medium and electronic equipment Active CN109684629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811420108.0A CN109684629B (en) 2018-11-26 2018-11-26 Method and device for calculating similarity between texts, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109684629A CN109684629A (en) 2019-04-26
CN109684629B true CN109684629B (en) 2022-12-16

Family

ID=66185547

Country Status (1)

Country Link
CN (1) CN109684629B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348022A (en) * 2019-07-18 2019-10-18 北京香侬慧语科技有限责任公司 A kind of method, apparatus of similarity analysis, storage medium and electronic equipment
CN110597980B (en) * 2019-09-12 2021-04-30 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN110704621B (en) * 2019-09-25 2023-04-21 北京大米科技有限公司 Text processing method and device, storage medium and electronic equipment
CN111160028B (en) * 2019-12-31 2023-05-16 东软集团股份有限公司 Method, device, storage medium and equipment for judging semantic similarity of two texts

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377239A (en) * 2012-04-26 2013-10-30 腾讯科技(深圳)有限公司 Method and device for calculating inter-textual similarity
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity computing method
CN104462323A (en) * 2014-12-02 2015-03-25 百度在线网络技术(北京)有限公司 Semantic similarity computing method, search result processing method and search result processing device
CN106776503A (en) * 2016-12-22 2017-05-31 东软集团股份有限公司 The determination method and device of text semantic similarity
CN108182222A (en) * 2017-12-26 2018-06-19 东软集团股份有限公司 A kind of text matching technique and device
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN108763333A (en) * 2018-05-11 2018-11-06 北京航空航天大学 A kind of event collection of illustrative plates construction method based on Social Media

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430463B2 (en) * 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《基于词汇语义信息的文本相似度计算》;谷重阳;《计算机应用研究》;20180228;全文 *

Also Published As

Publication number Publication date
CN109684629A (en) 2019-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant