Disclosure of Invention
Embodiments of the invention provide a text similarity calculation method and apparatus, an electronic device, and a storage medium.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical scheme:
In a first aspect, an embodiment of the present invention provides a text similarity calculation method, where the method includes:
calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
calculating the topic similarity between the two text sentences to be matched based on LDA (Latent Dirichlet Allocation, document topic generation model);
and determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Further, the calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model includes:
mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched;
and calculating the semantic similarity between the two text sentences to be matched based on the text vector.
Further, the calculating the semantic similarity between the two text sentences to be matched based on the text vector includes:
and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
vecSim(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( sqrt( Σ_{i=1}^{n} A_i² ) · sqrt( Σ_{i=1}^{n} B_i² ) )    (1)
wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, vec(A) = (A_1, ..., A_n) represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, vec(B) = (B_1, ..., B_n) represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors vec(A) and vec(B).
Further, before the semantic similarity between the two text sentences to be matched is calculated based on the word2vec space vector model, the method further comprises:
collecting text sentences of a target field to form a corpus aiming at the target field;
and generating the word2vec space vector model by taking text sentences in the corpus as training data.
Further, the calculating, based on the document topic generation model LDA, the topic similarity between the two text sentences to be matched includes:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
Sim_LDA(A, B) = Max_i { ( Σ_{j=1}^{m} P_A(V_j | D_i) · P_B(V_j | D_i) ) / (L_A · L_B) }    (2)
wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, D_i represents the i-th topic in the topic set of the LDA model, P_A(V_j | D_i) represents the distribution probability of word V_j in the text sentence A to be matched under topic D_i, P_B(V_j | D_i) represents the distribution probability of word V_j in the text sentence B to be matched under topic D_i, L_A represents the total number of words in the text sentence A to be matched, L_B represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set consisting of the words in the text sentence A to be matched and the words in the text sentence B to be matched.
Further, the determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity includes:
and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
SenSim(A, B) = α · vecSim(A, B) + β · Sim_LDA(A, B)    (3)
wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, α represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and β represents the weight corresponding to the topic similarity between the text sentence A to be matched and the text sentence B to be matched.
Further, before calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model, or before calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA, the method further comprises:
and performing word segmentation on the two text sentences to be matched.
In a second aspect, an embodiment of the present invention provides a text similarity calculation apparatus, including:
the semantic similarity calculation module is used for calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
the topic similarity calculation module is used for calculating topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
and the comprehensive similarity calculation module is used for determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the text similarity calculation method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a storage medium containing computer executable instructions which, when executed by a computer processor, implement a text similarity calculation method as described in the first aspect above.
According to the text similarity calculation method provided by the embodiment of the present invention, the matching precision of the text similarity is improved by comprehensively considering the similarity of the text sentences to be matched in the vector space and their similarity under each topic of the LDA model. In a robot session, the text similarity calculation method provided by the embodiment of the present invention can improve the completeness of the answer and the correlation between the answer and the question.
Detailed Description
To make the technical problems solved, the technical solutions adopted, and the technical effects achieved by the present invention clearer, the technical solutions of the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
Example 1
Fig. 1 is a schematic flow chart of a text similarity calculation method according to an embodiment of the present invention. The text similarity calculation method disclosed in this embodiment is suitable for the robot conversation field: the answer sentence with the highest semantic similarity to an input sentence is matched from an answer library, so that the input sentence is replied to automatically. In this scenario, the input sentence is the text sentence A to be matched, and any sentence in the answer library is the text sentence B to be matched. The text similarity calculation method disclosed in this embodiment is also suitable for matching the sentence with the highest similarity to a viewer's bullet-screen comment in a live streaming room, so that a robot can automatically reply to viewers' bullet-screen comments. The text similarity calculation method may be performed by a text similarity calculation device, which may be implemented in software and/or hardware and is typically integrated in a terminal, such as a server. Referring specifically to fig. 1, the method comprises the following steps:
110. and calculating the semantic similarity between the two text sentences to be matched based on the word2vec space vector model.
The calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model includes:
mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched;
and calculating the semantic similarity between the two text sentences to be matched based on the text vector.
The word2vec space vector model specifically refers to a correspondence between words and vectors: each word is represented in vector form, and the vectors also encode the contextual associations between words. Therefore, the similarity between text sentences to be matched can be obtained by comparing the similarity between the text vectors corresponding to the text sentences to be matched.
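The patent does not specify how the word vectors of a sentence are combined into a single text vector; averaging the word vectors is one common convention, and the sketch below assumes it. The toy word-to-vector mapping `toy_model` is a hypothetical stand-in for a trained word2vec model, not data from the source.

```python
def sentence_vector(words, word_vectors):
    """Average the vectors of the words that appear in the model
    (an assumed composition; the patent only says sentences are
    'mapped' in the word2vec space vector model)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 3-dimensional "model" (real word2vec vectors have far more dimensions).
toy_model = {
    "Beijing":    [0.9, 0.1, 0.0],
    "university": [0.8, 0.2, 0.1],
    "fun":        [0.1, 0.9, 0.3],
}

vec_b = sentence_vector(["Beijing", "university", "fun"], toy_model)
```

With the toy vectors above, the first component of `vec_b` is the mean (0.9 + 0.8 + 0.1) / 3 = 0.6.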
The word2vec space vector model is trained in advance through a corpus in the corresponding field, and specifically, the generating of the word2vec space vector model comprises the following steps:
collecting text sentences of a target field to form a corpus aiming at the target field;
and generating the word2vec space vector model by taking text sentences in the corpus as training data.
The target field may be, for example, the field of bullet-screen comments sent in live streaming rooms. The live video content of each room is different, so the comments sent in each room differ; however, comments sent in the same room usually share much similar content. Therefore, all comments sent in the same live streaming room may be used as the text sentences of that field and may form a corpus of the field, and all text sentences in the corpus are then used as training data to train a neural network model, obtaining the word2vec space vector model. The word2vec space vector model can then be used to perform similarity matching on text sentences within the field.
Specifically, calculating the semantic similarity between the two text sentences to be matched based on the text vector includes:
and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
vecSim(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( sqrt( Σ_{i=1}^{n} A_i² ) · sqrt( Σ_{i=1}^{n} B_i² ) )    (1)
wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, vec(A) = (A_1, ..., A_n) represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, vec(B) = (B_1, ..., B_n) represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors vec(A) and vec(B).
The text sentences to be matched are input into the word2vec space vector model to obtain the text vectors corresponding to them in the vector space. The semantic similarity between the text sentences to be matched can then be obtained through formula (1).
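The description of formula (1) (a dot product normalized by the two vector magnitudes) matches standard cosine similarity, so a minimal sketch of this step can be written as follows; the function name `vec_sim` is an illustrative choice, not from the source.

```python
import math

def vec_sim(a, b):
    """Cosine similarity between two equal-length text vectors,
    following the form described for formula (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # guard against zero vectors, which the formula leaves undefined
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why a higher value indicates closer semantics in the vector space.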
120. And calculating the topic similarity between the two text sentences to be matched based on a document topic generation model LDA.
Illustratively, the calculating, based on the document topic generation model LDA, the topic similarity between the two text sentences to be matched includes:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
Sim_LDA(A, B) = Max_i { ( Σ_{j=1}^{m} P_A(V_j | D_i) · P_B(V_j | D_i) ) / (L_A · L_B) }    (2)
wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, D_i represents the i-th topic in the topic set of the LDA model, P_A(V_j | D_i) represents the distribution probability of word V_j in the text sentence A to be matched under topic D_i, P_B(V_j | D_i) represents the distribution probability of word V_j in the text sentence B to be matched under topic D_i, L_A represents the total number of words in the text sentence A to be matched, L_B represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set consisting of the words in the text sentence A to be matched and the words in the text sentence B to be matched.
The LDA model is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics, and documents. It is an unsupervised machine learning technique whose principle is as follows: each word of each document is considered to be obtained by a process of "selecting a certain topic with a certain probability, and selecting a certain word from that topic with a certain probability"; document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution. The LDA model can be obtained by training on a corpus consisting of text sentences collected for the target field. Inputting a text sentence to be matched into the trained LDA model yields the distribution probability of each word in the sentence under each topic of the model. The text sentence A to be matched and the text sentence B to be matched are mapped under each topic based on the topic-vocabulary probability distribution of the LDA model, the similarity of the two sentences is calculated under each topic, and finally the topic probability distribution with the highest similarity is taken. Formula (2) can reflect the similarity of the text sentence A to be matched and the text sentence B to be matched in different scenes (topics), and the result is relatively objective.
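The exact normalization inside formula (2) is not fully recoverable from the text, but its Max{...} structure (a per-topic similarity, then the maximum over topics) is stated explicitly. The sketch below assumes cosine similarity over each topic's word-probability profiles as the per-topic score; that choice, and the names `topic_sim` / `sim_lda`, are illustrative assumptions rather than the patent's exact formula.

```python
import math

def topic_sim(probs_a, probs_b, vocab):
    """Assumed per-topic score: cosine similarity of the two sentences'
    word-probability profiles under one topic."""
    pa = [probs_a.get(w, 0.0) for w in vocab]
    pb = [probs_b.get(w, 0.0) for w in vocab]
    dot = sum(x * y for x, y in zip(pa, pb))
    na = math.sqrt(sum(x * x for x in pa))
    nb = math.sqrt(sum(y * y for y in pb))
    return dot / (na * nb) if na and nb else 0.0

def sim_lda(topic_probs_a, topic_probs_b):
    """Max over topics of the per-topic similarity (the Max{...} step of
    formula (2)). Each argument maps topic id -> {word: P(word | topic)}."""
    sims = []
    for topic in topic_probs_a:
        other = topic_probs_b.get(topic, {})
        vocab = sorted(set(topic_probs_a[topic]) | set(other))
        sims.append(topic_sim(topic_probs_a[topic], other, vocab))
    return max(sims) if sims else 0.0

# Hypothetical per-topic probabilities: the sentences disagree under
# topic 1 but have identical profiles under topic 2, so topic 2 wins.
profile_a = {1: {"x": 1.0}, 2: {"x": 0.6, "y": 0.8}}
profile_b = {1: {"y": 1.0}, 2: {"x": 0.6, "y": 0.8}}
best = sim_lda(profile_a, profile_b)
```

Taking the maximum mirrors the step in the text where the topic with the highest similarity is finally retained.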
130. And determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Illustratively, the determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity includes:
and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
SenSim(A, B) = α · vecSim(A, B) + β · Sim_LDA(A, B)    (3)
wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, α represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and β represents the weight corresponding to the topic similarity between the text sentence A to be matched and the text sentence B to be matched.
In general, the topic similarity of two text sentences to be matched is considered to better reflect the relationship between them, so the weight β corresponding to the topic similarity is usually set higher, and the weight α corresponding to the semantic similarity is usually set lower. Preferably, the weight α corresponding to the semantic similarity may be set to 0.4, and the weight β corresponding to the topic similarity may be set to 0.6. In the robot conversation field, by integrating the similarity of the text sentence A to be matched and the text sentence B to be matched in the vector space and under different scenes (topics), a relatively accurate and complete answer reply set can be selected from the answer library.
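The weighted combination described for formula (3), with the preferred 0.4/0.6 weighting as defaults, can be sketched in a few lines; the name `sen_sim` is an illustrative choice.

```python
def sen_sim(vec_sim_ab, sim_lda_ab, alpha=0.4, beta=0.6):
    """Comprehensive similarity as a weighted sum of the semantic
    similarity (weight alpha) and the topic similarity (weight beta),
    following the form described for formula (3)."""
    return alpha * vec_sim_ab + beta * sim_lda_ab

# Hypothetical scores: semantic similarity 0.5, topic similarity 0.85.
combined = sen_sim(0.5, 0.85)  # 0.4 * 0.5 + 0.6 * 0.85 ≈ 0.71
```

Because β > α, a pair of sentences that share a topic but differ in surface wording still scores relatively high, which matches the rationale given above.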
The execution order of steps 110 and 120 is not limited: step 120 may be executed first, or step 110 may be executed first. This embodiment is described taking the case where step 110 is executed first as an example.
Further, before calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model, or before calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA, the method further comprises:
the word segmentation is performed on the two text sentences to be matched, specifically, the word segmentation is performed on the two text sentences to be matched by using a jieba word segmentation tool in python, which is not described in detail in this embodiment.
In summary, the text similarity calculation method provided in this embodiment mainly includes the following steps:
step 1, preparing a corpus in the corresponding field in advance, and training word2vec and LDA by using the corpus to obtain a word2vec space vector model and an LDA model respectively.
Step 2: input the text sentence A to be matched and the text sentence B to be matched into the word2vec space vector model to obtain the text vector vec(A) corresponding to text sentence A and the text vector vec(B) corresponding to text sentence B, respectively; input text sentence A and text sentence B into the LDA model to obtain the distribution probabilities P_A(V_j | D_i) and P_B(V_j | D_i) of each word under each topic D_i. If a word V_j does not exist under topic D_i, its distribution probability under topic D_i takes the value 1; if word V_j exists under topic D_i, the corresponding probability is taken.
And 3, calculating the semantic similarity of the text sentence A to be matched and the text sentence B to be matched in the space vector according to the formula (1).
And 4, calculating the topic similarity of the text sentence A to be matched and the text sentence B to be matched under the LDA according to the formula (2).
And 5, synthesizing the semantic similarity and the topic similarity of the text sentence A to be matched and the text sentence B to be matched according to the formula (3) to obtain the final matching degree of the text sentence A to be matched and the text sentence B to be matched.
The above calculation procedure is illustrated with an example. Assume that the word2vec and LDA models have been trained on a corpus of the corresponding field, that the text sentence A to be matched = "I want to go to Beijing University to study", and that the text sentence B to be matched = "Beijing University is really fun". Word segmentation processing is performed on text sentence A and text sentence B using the jieba word segmentation tool:
text sentence a=i want to go to beijing university of reading to be matched
University of text sentence b=beijing to be matched is truly playable
The total number of words L in the text sentence A to be matched A =5, total number of words L in text sentence B to be matched B The set of words in the text sentence a to be matched and the words in the text sentence B to be matched is { i want to go to true fun of the university of Beijing read }, and the total number of words in the set is m=8.
Text sentences A and B are input into the trained word2vec model to obtain the text vector vec(A) corresponding to the text sentence A to be matched and the text vector vec(B) corresponding to the text sentence B to be matched in the word2vec space vector model, and into the trained LDA model to obtain the distribution probability P_A(V_j | D_i) of each word V_j in text sentence A under topic D_i and the distribution probability P_B(V_j | D_i) of each word V_j in text sentence B under topic D_i.
Substituting the above data into formula (1), the semantic similarity vecSim(A, B) between the text sentence A to be matched and the text sentence B to be matched is calculated.
according to the above formula (2), it is assumed that the subject D 1 Then, the similarity between the text sentence A to be matched and the text sentence B to be matched is as follows:at subject D 2 Then, the similarity between the text sentence A to be matched and the text sentence B to be matched is as follows:thus, the result of the above equation (2) is Max {0.35,0.85} = 0.85;
setting upAccording to the formula (3), calculating the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched as follows:
according to the text similarity calculation method provided by the embodiment of the invention, the similarity of the text sentence to be matched in the vector space and the similarity of the text sentence to be matched under each topic of the LDA are comprehensively considered, so that the matching precision of the text similarity is improved, and in a robot session, the text similarity calculation method provided by the embodiment of the invention can be used for selecting a relatively accurate and complete answer reply set from an answer library, so that the completeness of an answer and the relativity between the answer and a question are improved.
Example two
Fig. 2 is a schematic structural diagram of a text similarity calculating device according to a second embodiment of the present invention. Referring to fig. 2, the apparatus includes: a semantic similarity calculation module 210, a topic similarity calculation module 220, and a comprehensive similarity calculation module 230;
the semantic similarity calculation module 210 is configured to calculate semantic similarity between two text sentences to be matched based on a word2vec space vector model;
the topic similarity calculation module 220 is configured to calculate topic similarity between the two text sentences to be matched based on the document topic generation model LDA;
the comprehensive similarity calculation module 230 is configured to determine a comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Further, the semantic similarity calculation module 210 includes: the mapping unit is used for mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched; the calculating unit is used for calculating the semantic similarity between the two text sentences to be matched based on the text vector.
Further, the computing unit is specifically configured to: and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
vecSim(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( sqrt( Σ_{i=1}^{n} A_i² ) · sqrt( Σ_{i=1}^{n} B_i² ) )    (1)
wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, vec(A) = (A_1, ..., A_n) represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, vec(B) = (B_1, ..., B_n) represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors vec(A) and vec(B).
Further, the device further comprises: the collection module is used for collecting text sentences in the target field to form a corpus aiming at the target field; and the generation module is used for generating the word2vec space vector model by taking the text sentences in the corpus as training data.
Further, the topic similarity calculation module 220 is specifically configured to:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
Sim_LDA(A, B) = Max_i { ( Σ_{j=1}^{m} P_A(V_j | D_i) · P_B(V_j | D_i) ) / (L_A · L_B) }    (2)
wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, D_i represents the i-th topic in the topic set of the LDA model, P_A(V_j | D_i) represents the distribution probability of word V_j in the text sentence A to be matched under topic D_i, P_B(V_j | D_i) represents the distribution probability of word V_j in the text sentence B to be matched under topic D_i, L_A represents the total number of words in the text sentence A to be matched, L_B represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set consisting of the words in the text sentence A to be matched and the words in the text sentence B to be matched.
Further, the integrated similarity calculation module 230 is specifically configured to: and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
SenSim(A, B) = α · vecSim(A, B) + β · Sim_LDA(A, B)    (3)
wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, α represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and β represents the weight corresponding to the topic similarity between the text sentence A to be matched and the text sentence B to be matched.
Further, the device further comprises: the processing module is used for word segmentation processing of the two text sentences to be matched before semantic similarity between the two text sentences to be matched is calculated based on the word2vec space vector model or before topic similarity between the two text sentences to be matched is calculated based on the document topic generation model LDA.
According to the text similarity calculation device provided by the embodiment of the present invention, the similarity of the text sentences to be matched in the vector space and their similarity under each topic of the LDA model are comprehensively considered, so that the matching precision of the text similarity is improved. In a robot session, the device can select a relatively accurate and complete answer reply set from the answer library, thereby improving the completeness of the answer and the correlation between the answer and the question.
Example III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device includes: a processor 670, a memory 671, and a computer program stored on the memory 671 and executable on the processor 670; wherein the number of processors 670 may be one or more, one processor 670 is illustrated in FIG. 3; the processor 670 implements the text similarity calculation method as described in the above embodiment one when executing the computer program. As shown in fig. 3, the electronic device may further comprise input means 672 and output means 673. The processor 670, memory 671, input device 672 and output device 673 may be connected by a bus or other means, for example in fig. 3.
The memory 671 is used as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as text similarity calculation means/modules (e.g., a semantic similarity calculation module 210, a topic similarity calculation module 220, and a comprehensive similarity calculation module 230 in a text similarity calculation means, etc.) in the embodiments of the present invention. The processor 670 executes various functional applications of the electronic device and data processing, i.e., implements the text similarity calculation method described above, by running software programs, instructions, and modules stored in the memory 671.
The memory 671 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, the memory 671 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 671 may further include memory located remotely from processor 670, which may be connected to the electronic device/storage medium via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 672 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 673 may include a display device such as a display screen.
Example IV
A fourth embodiment of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a text similarity calculation method, the method comprising:
calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
calculating the topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
and determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the text similarity calculation related operations provided in any of the embodiments of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a storage medium, or a network device, etc.) to execute the embodiments of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.