CN110895656B - Text similarity calculation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN110895656B
Application number: CN201811066429.5A
Authority: CN (China)
Prior art keywords: matched, text, similarity, topic, sentences
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN110895656A
Inventor: 徐乐乐
Current assignees: Beijing Orange Fruit Zhuanhua Technology Co ltd; Beijing Peihong Wangzhi Technology Co ltd
Original assignee: Beijing Orange Fruit Zhuanhua Technology Co ltd
Application filed by Beijing Orange Fruit Zhuanhua Technology Co ltd, with priority to CN201811066429.5A
Publication of application CN110895656A, followed by grant and publication of CN110895656B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention disclose a text similarity calculation method and device, an electronic device, and a storage medium, wherein the method comprises the following steps: calculating the semantic similarity between two text sentences to be matched based on a word2vec space vector model; calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA; and determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity. By adopting this technical scheme, the candidate answer set that best fits an input text can be calculated and a robot can reply to the input text automatically, which effectively improves the relevance between the candidate answers and the input text, the completeness of the answers, and the accuracy of text similarity calculation.

Description

Text similarity calculation method and device, electronic equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of data processing, and in particular to a text similarity calculation method and device, an electronic device, and a storage medium.
Background
Currently, live-streaming room applications based on the iOS or Android platform are developing rapidly and are popular with users. The bullet comment (barrage) is a very popular form of information exchange and sharing on live-streaming platforms; it enables interaction between the audience and the streamer and helps create a good live-streaming atmosphere.
In the field of robot conversation, one important task is to find the reply with the highest semantic similarity to an input sentence. Similarly, in a live-streaming room, a reply with high similarity to a viewer's bullet comment is often computed so that a robot can reply to the comment automatically. At present, live-streaming rooms generally use the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to calculate the similarity between two bullet comments: the keywords of each document are determined from the frequency distribution of words or phrases in the document set, word-frequency vectors are then built from the frequencies of those keywords, and the similarity between documents is taken to be the similarity between their word-frequency vectors. The TF-IDF algorithm therefore considers only the word frequency of words in the documents, or only the importance of words in the documents.
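For context, the TF-IDF baseline just described can be sketched in a few lines. This is the prior art, not the patent's method, and the weighting variant used here (raw term frequency times log inverse document frequency, compared by cosine) is an assumed common choice.

```python
# Minimal sketch of the TF-IDF baseline: keyword weights from frequency
# statistics over a document set, documents compared by cosine similarity.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> one sparse {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                          # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

As the sketch makes visible, nothing here looks at word meaning or topics — only counts — which is exactly the limitation the background paragraph points out.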
Therefore, in order to improve the accuracy of text similarity calculation, existing similarity calculation algorithms need to be improved.
Disclosure of Invention
The embodiment of the invention provides a text similarity calculation method, a device, electronic equipment and a storage medium.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a text similarity calculation method, where the method includes:
calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA (Latent Dirichlet Allocation);
and determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Further, the calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model includes:
mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched;
and calculating the semantic similarity between the two text sentences to be matched based on the text vector.
Further, the calculating the semantic similarity between the two text sentences to be matched based on the text vector includes:
and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{vecSim}(A,B)=\frac{\sum_{i=1}^{n} V_{A,i}\,V_{B,i}}{\sqrt{\sum_{i=1}^{n} V_{A,i}^{2}}\,\sqrt{\sum_{i=1}^{n} V_{B,i}^{2}}}\tag{1}$$

wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, $V_A$ represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, $V_B$ represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors $V_A$ and $V_B$ (with components $V_{A,i}$ and $V_{B,i}$).
Further, before the semantic similarity between the two text sentences to be matched is calculated based on the word2vec space vector model, the method further comprises:
collecting text sentences of a target field to form a corpus aiming at the target field;
and generating the word2vec space vector model by taking text sentences in the corpus as training data.
Further, the calculating, based on the document topic generation model LDA, the topic similarity between the two text sentences to be matched includes:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{Sim}_{LDA}(A,B)=\max_{i}\frac{\sum_{j=1}^{m} P_{A}(D_{i}\mid V_{j})\,P_{B}(D_{i}\mid V_{j})}{\sqrt{\sum_{j=1}^{L_{A}} P_{A}(D_{i}\mid V_{j})^{2}}\,\sqrt{\sum_{j=1}^{L_{B}} P_{B}(D_{i}\mid V_{j})^{2}}}\tag{2}$$

wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, $D_i$ represents the i-th topic in the topic set of the LDA model, $P_A(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence A to be matched under the topic $D_i$, $P_B(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence B to be matched under the topic $D_i$, $L_A$ represents the total number of words in the text sentence A to be matched, $L_B$ represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set formed by the words of the text sentence A to be matched and the words of the text sentence B to be matched.
Further, the determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity includes:
and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{SenSim}(A,B)=\alpha\cdot\mathrm{vecSim}(A,B)+\beta\cdot\mathrm{Sim}_{LDA}(A,B)\tag{3}$$

wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between them, Sim_LDA(A, B) represents the topic similarity between them, $\alpha$ represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and $\beta$ represents the weight corresponding to the topic similarity between them.
Further, before calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model, or before calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA, the method further comprises:
and performing word segmentation on the two text sentences to be matched.
In a second aspect, an embodiment of the present invention provides a text similarity calculation apparatus, including:
the semantic similarity calculation module is used for calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
the topic similarity calculation module is used for calculating topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
and the comprehensive similarity calculation module is used for determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the text similarity calculation method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a storage medium containing computer executable instructions which, when executed by a computer processor, implement a text similarity calculation method as described in the first aspect above.
According to the text similarity calculation method provided by the embodiments of the invention, the matching precision of text similarity is improved by jointly considering the similarity of the text sentences to be matched in the vector space and their similarity under each LDA topic; in a robot conversation, the method can also improve the completeness of answers and the relevance between answers and questions.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly explain the drawings needed in the description of the embodiments of the present invention, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the contents of the embodiments of the present invention and these drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic flow chart of a text similarity calculation method according to a first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a text similarity calculating device according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the technical problems solved by the present invention, the technical solutions adopted and the technical effects achieved more clear, the technical solutions of the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Example 1
Fig. 1 is a schematic flow chart of a text similarity calculation method according to an embodiment of the present invention. The method disclosed in this embodiment is suitable for the field of robot conversation: the answer sentence with the highest semantic similarity to an input sentence is matched from an answer library so that the input sentence can be replied to automatically. In this scenario, the input sentence is the text sentence A to be matched, and any sentence in the answer library is a text sentence B to be matched. The method is also suitable for matching, in a live-streaming room, the sentence most similar to a viewer's bullet comment so that a robot can reply to the comment automatically. The method may be performed by a text similarity calculation device, which may be implemented in software and/or hardware and is typically integrated in a terminal such as a server. Referring to fig. 1, the method comprises the following steps:
110. and calculating the semantic similarity between the two text sentences to be matched based on the word2vec space vector model.
The calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model includes:
mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched;
and calculating the semantic similarity between the two text sentences to be matched based on the text vector.
The word2vec space vector model specifically refers to a correspondence between words and vectors: each word is represented as a vector, and the vectors also encode the contextual associations between words, so the similarity between text sentences to be matched can be obtained by comparing the similarity between the text vectors corresponding to those sentences.
The word2vec space vector model is trained in advance through a corpus in the corresponding field, and specifically, the generating of the word2vec space vector model comprises the following steps:
collecting text sentences of a target field to form a corpus aiming at the target field;
and generating the word2vec space vector model by taking text sentences in the corpus as training data.
The target field may be, for example, the bullet comments sent in a live-streaming room. The video content of each live-streaming room differs, so the bullet comments sent in each room differ; however, bullet comment texts sent in the same room usually share much similar content. All bullet comments sent in the same live-streaming room can therefore be taken as text sentences of that field and form a corpus for the field, and all text sentences in the corpus are then used as training data to train a neural network model, yielding the word2vec space vector model. This model can then be used for similarity matching of text sentences within the field.
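The patent does not spell out how word vectors are composed into a sentence-level text vector; averaging the word vectors is one common choice and is sketched here as an assumption, with a hand-written embedding table standing in for a trained word2vec model.

```python
# Hypothetical sketch: map a segmented sentence to a text vector by
# averaging the word2vec vectors of its words. The embedding dict stands
# in for a model trained on the field corpus.
def sentence_vector(tokens, embeddings):
    """Mean of the word vectors for the tokens present in `embeddings`."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None                      # no known word: no vector
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

In practice the embedding table would come from training on the field corpus described above; out-of-vocabulary handling here (skip unknown words) is likewise an illustrative choice.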
Specifically, calculating the semantic similarity between the two text sentences to be matched based on the text vector includes:
and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{vecSim}(A,B)=\frac{\sum_{i=1}^{n} V_{A,i}\,V_{B,i}}{\sqrt{\sum_{i=1}^{n} V_{A,i}^{2}}\,\sqrt{\sum_{i=1}^{n} V_{B,i}^{2}}}\tag{1}$$

wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, $V_A$ represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, $V_B$ represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors $V_A$ and $V_B$ (with components $V_{A,i}$ and $V_{B,i}$).
The text sentences to be matched are input into the word2vec space vector model to obtain the corresponding text vectors in the vector space; the similarity of the text sentences in terms of semantics can then be obtained through formula (1).
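Formula (1) is the cosine similarity between the two text vectors; a direct transcription:

```python
# Formula (1): cosine similarity between the text vectors V_A and V_B of
# sentences A and B produced by the word2vec space vector model.
import math

def vec_sim(v_a, v_b):
    """Semantic similarity vecSim(A, B) of two equal-length text vectors."""
    dot = sum(a * b for a, b in zip(v_a, v_b))
    norm_a = math.sqrt(sum(a * a for a in v_a))
    norm_b = math.sqrt(sum(b * b for b in v_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical vectors score 1, orthogonal vectors score 0; the zero-norm guard is an added safety check, not part of the formula.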
120. And calculating the topic similarity between the two text sentences to be matched based on a document topic generation model LDA.
Illustratively, the calculating, based on the document topic generation model LDA, the topic similarity between the two text sentences to be matched includes:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{Sim}_{LDA}(A,B)=\max_{i}\frac{\sum_{j=1}^{m} P_{A}(D_{i}\mid V_{j})\,P_{B}(D_{i}\mid V_{j})}{\sqrt{\sum_{j=1}^{L_{A}} P_{A}(D_{i}\mid V_{j})^{2}}\,\sqrt{\sum_{j=1}^{L_{B}} P_{B}(D_{i}\mid V_{j})^{2}}}\tag{2}$$

wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, $D_i$ represents the i-th topic in the topic set of the LDA model, $P_A(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence A to be matched under the topic $D_i$, $P_B(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence B to be matched under the topic $D_i$, $L_A$ represents the total number of words in the text sentence A to be matched, $L_B$ represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set formed by the words of the text sentence A to be matched and the words of the text sentence B to be matched.
The LDA model is a document topic generation model, also called a three-layer Bayesian probability model, with a word-topic-document three-layer structure; it is an unsupervised machine learning technique whose principle is as follows: each word of each document is considered to be generated by "selecting a topic with a certain probability, then selecting a word from that topic with a certain probability", where documents-to-topics follow a multinomial distribution and topics-to-words follow a multinomial distribution. The LDA model can be obtained by training on the corpus of text sentences collected for the target field. A text sentence to be matched is input into the trained LDA model to obtain the distribution probability of each of its words under each topic of the model. Based on the topic-vocabulary probability distribution of LDA, the text sentence A to be matched and the text sentence B to be matched are mapped under each topic, their similarity under each topic is calculated, and the topic probability distribution with the highest similarity is finally taken. Formula (2) can reflect the similarity of the text sentence A to be matched and the text sentence B to be matched in different scenes (topics), and the result is relatively objective.
130. And determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Illustratively, the determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity includes:
and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{SenSim}(A,B)=\alpha\cdot\mathrm{vecSim}(A,B)+\beta\cdot\mathrm{Sim}_{LDA}(A,B)\tag{3}$$

wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between them, Sim_LDA(A, B) represents the topic similarity between them, $\alpha$ represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and $\beta$ represents the weight corresponding to the topic similarity between them.
In general, the topic similarity of two text sentences to be matched is considered to reflect the relationship between the two sentences well, so the weight corresponding to the topic similarity is usually set higher and the weight corresponding to the semantic similarity is usually set lower; preferably, the weight corresponding to the semantic similarity may be set to 0.4 and the weight corresponding to the topic similarity to 0.6. In the field of robot conversation, a relatively accurate and complete answer reply set can be selected from the answer library by combining the similarity of the text sentence A to be matched and the text sentence B to be matched in the vector space with their similarity under different scenes (topics).
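The weighting itself is a one-liner; the defaults below follow the preferred 0.4/0.6 setting described above.

```python
# Formula (3): comprehensive similarity as a weighted sum of the semantic
# and topic similarities. Defaults follow the embodiment's preferred
# weights (0.4 for semantic, 0.6 for topic).
def sen_sim(vec_sim_ab, lda_sim_ab, alpha=0.4, beta=0.6):
    """SenSim(A, B) = alpha * vecSim(A, B) + beta * Sim_LDA(A, B)."""
    return alpha * vec_sim_ab + beta * lda_sim_ab
```

Since both inputs and both weights lie in [0, 1] with the weights summing to 1, the comprehensive score stays in [0, 1] as well.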
The execution order of step 110 and step 120 is not limited: step 120 may be executed first, or step 110 may be executed first; this embodiment merely takes executing step 110 first as an example.
Further, before calculating the semantic similarity between two text sentences to be matched based on the word2vec space vector model, or before calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA, the method further comprises:
the word segmentation is performed on the two text sentences to be matched, specifically, the word segmentation is performed on the two text sentences to be matched by using a jieba word segmentation tool in python, which is not described in detail in this embodiment.
Further, summarizing a text similarity calculation method provided in this embodiment, the method mainly includes the following steps:
step 1, preparing a corpus in the corresponding field in advance, and training word2vec and LDA by using the corpus to obtain a word2vec space vector model and an LDA model respectively.
Step 2, inputting the text sentence A to be matched and the text sentence B to be matched into the word2vec space vector model to obtain the text vector $V_A$ corresponding to the text sentence A to be matched and the text vector $V_B$ corresponding to the text sentence B to be matched; inputting the two sentences into the LDA model to obtain, for each word $V_j$ in the two sentences, the distribution probabilities $P_A(D_i\mid V_j)$ and $P_B(D_i\mid V_j)$ under the different topics $D_i$. If the word $V_j$ does not exist under topic $D_i$, its distribution probability under $D_i$ takes the value 1; otherwise the corresponding probability under $D_i$ is taken.
And 3, calculating the semantic similarity of the text sentence A to be matched and the text sentence B to be matched in the space vector according to the formula (1).
And 4, calculating the topic similarity of the text sentence A to be matched and the text sentence B to be matched under the LDA according to the formula (2).
And 5, synthesizing the semantic similarity and the topic similarity of the text sentence A to be matched and the text sentence B to be matched according to the formula (3) to obtain the final matching degree of the text sentence A to be matched and the text sentence B to be matched.
The above calculation procedure is illustrated with an example. Assume the word2vec and LDA models have been trained on a corpus of the corresponding field, the text sentence A to be matched is "I want to go to Beijing University to study", and the text sentence B to be matched is "Beijing University is truly fun". The jieba word segmentation tool is used to segment the two text sentences:

text sentence A to be matched = I / want / to go / Beijing University / to study

text sentence B to be matched = Beijing University / is / truly / fun

The total number of words in the text sentence A to be matched is L_A = 5, and the total number of words in the text sentence B to be matched is L_B; the set formed by the words of the two sentences is {I, want, to go, Beijing University, to study, is, truly, fun}, and the total number of words in the set is m = 8.
The segmented sentences are input into the trained word2vec and LDA models, so as to obtain the text vector $V_A$ corresponding to the text sentence A to be matched in the word2vec space vector model, the text vector $V_B$ corresponding to the text sentence B to be matched, the distribution probability $P_A(D_i\mid V_j)$ of each word $V_j$ in the text sentence A to be matched under each topic $D_i$, and the distribution probability $P_B(D_i\mid V_j)$ of each word $V_j$ in the text sentence B to be matched under each topic $D_i$.
Substituting these data into formula (1), the semantic similarity vecSim(A, B) of the text sentence A to be matched and the text sentence B to be matched is calculated.

According to formula (2), assume that under topic D_1 the similarity between the text sentence A to be matched and the text sentence B to be matched is 0.35, and under topic D_2 it is 0.85; the result of formula (2) is therefore Max{0.35, 0.85} = 0.85.

Setting the weights, the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched is calculated according to formula (3).
according to the text similarity calculation method provided by the embodiment of the invention, the similarity of the text sentence to be matched in the vector space and the similarity of the text sentence to be matched under each topic of the LDA are comprehensively considered, so that the matching precision of the text similarity is improved, and in a robot session, the text similarity calculation method provided by the embodiment of the invention can be used for selecting a relatively accurate and complete answer reply set from an answer library, so that the completeness of an answer and the relativity between the answer and a question are improved.
Example two
Fig. 2 is a schematic structural diagram of a text similarity calculating device according to a second embodiment of the present invention. Referring to fig. 2, the apparatus includes: a semantic similarity calculation module 210, a topic similarity calculation module 220, and a comprehensive similarity calculation module 230;
the semantic similarity calculation module 210 is configured to calculate semantic similarity between two text sentences to be matched based on a word2vec space vector model;
the topic similarity calculation module 220 is configured to calculate topic similarity between the two text sentences to be matched based on the document topic generation model LDA;
the comprehensive similarity calculation module 230 is configured to determine a comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
Further, the semantic similarity calculation module 210 includes: the mapping unit is used for mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched; the calculating unit is used for calculating the semantic similarity between the two text sentences to be matched based on the text vector.
Further, the computing unit is specifically configured to: and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{vecSim}(A,B)=\frac{\sum_{i=1}^{n} V_{A,i}\,V_{B,i}}{\sqrt{\sum_{i=1}^{n} V_{A,i}^{2}}\,\sqrt{\sum_{i=1}^{n} V_{B,i}^{2}}}\tag{1}$$

wherein vecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, $V_A$ represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, $V_B$ represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors $V_A$ and $V_B$ (with components $V_{A,i}$ and $V_{B,i}$).
Further, the device further comprises: the collection module is used for collecting text sentences in the target field to form a corpus aiming at the target field; and the generation module is used for generating the word2vec space vector model by taking the text sentences in the corpus as training data.
Further, the topic similarity calculation module 220 is specifically configured to:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{Sim}_{LDA}(A,B)=\max_{i}\frac{\sum_{j=1}^{m} P_{A}(D_{i}\mid V_{j})\,P_{B}(D_{i}\mid V_{j})}{\sqrt{\sum_{j=1}^{L_{A}} P_{A}(D_{i}\mid V_{j})^{2}}\,\sqrt{\sum_{j=1}^{L_{B}} P_{B}(D_{i}\mid V_{j})^{2}}}\tag{2}$$

wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, $D_i$ represents the i-th topic in the topic set of the LDA model, $P_A(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence A to be matched under the topic $D_i$, $P_B(D_i\mid V_j)$ represents the distribution probability of the word $V_j$ in the text sentence B to be matched under the topic $D_i$, $L_A$ represents the total number of words in the text sentence A to be matched, $L_B$ represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set formed by the words of the text sentence A to be matched and the words of the text sentence B to be matched.
Further, the integrated similarity calculation module 230 is specifically configured to: and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
$$\mathrm{SenSim}(A,B)=\alpha\cdot\mathrm{vecSim}(A,B)+\beta\cdot\mathrm{Sim}_{LDA}(A,B)\tag{3}$$

wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, vecSim(A, B) represents the semantic similarity between them, Sim_LDA(A, B) represents the topic similarity between them, $\alpha$ represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and $\beta$ represents the weight corresponding to the topic similarity between them.
Further, the device further comprises: the processing module is used for word segmentation processing of the two text sentences to be matched before semantic similarity between the two text sentences to be matched is calculated based on the word2vec space vector model or before topic similarity between the two text sentences to be matched is calculated based on the document topic generation model LDA.
The text similarity calculation device provided by this embodiment of the present invention comprehensively considers the similarity of the text sentences to be matched in the vector space and their similarity under each topic of the LDA model, which improves the matching precision of text similarity. In a robot conversation, the text similarity calculation method provided by the embodiments of the present invention can be used to select a relatively accurate and complete set of answer replies from an answer library, thereby improving the completeness of an answer and the relevance between the answer and the question.
Example III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in Fig. 3, the electronic device includes: a processor 670, a memory 671, and a computer program stored in the memory 671 and executable on the processor 670. The number of processors 670 may be one or more; one processor 670 is taken as an example in Fig. 3. When executing the computer program, the processor 670 implements the text similarity calculation method described in the first embodiment above. As shown in Fig. 3, the electronic device may further comprise an input device 672 and an output device 673. The processor 670, the memory 671, the input device 672, and the output device 673 may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 3.
As a computer-readable storage medium, the memory 671 is used to store software programs, computer-executable programs, and modules, such as the modules corresponding to the text similarity calculation device in the embodiments of the present invention (e.g., the semantic similarity calculation module 210, the topic similarity calculation module 220, and the comprehensive similarity calculation module 230). By running the software programs, instructions, and modules stored in the memory 671, the processor 670 executes the various functional applications and data processing of the electronic device, i.e., implements the text similarity calculation method described above.
The memory 671 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, etc. In addition, the memory 671 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 671 may further include memories remotely located relative to the processor 670, and these remote memories may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 672 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 673 may include a display device such as a display screen.
Example IV
A fourth embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a text similarity calculation method, the method comprising:
calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
calculating the topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
and determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity.
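The first of the three steps above can be sketched as follows. This excerpt does not specify how a sentence vector is built from the word2vec model, so the sketch assumes a common construction, averaging the word vectors of the segmented words; the cosine measure matches the VecSim(A, B) formula stated later in the claims. Names are illustrative.

```python
import math

def sentence_vector(words, word_vectors):
    """Assumed construction: average the word2vec vectors of the words in
    the (segmented) sentence. word_vectors maps word -> list[float] and
    stands in for a trained word2vec space vector model."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    n = len(vecs[0])  # n: dimension of the text vectors
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(n)]

def vec_sim(a, b):
    """VecSim(A, B): cosine similarity between the two text vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

The resulting semantic similarity would then be combined with the LDA topic similarity, per the third step, into the comprehensive similarity.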
Of course, in the storage medium containing computer-executable instructions provided in the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the text similarity calculation method provided in any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by means of software plus necessary general-purpose hardware, and certainly may also be implemented by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present invention.
It should be noted that the above are only preferred embodiments of the present invention and the technical principles applied thereto. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to these embodiments and may include other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (9)

1. A text similarity calculation method, comprising:
calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
calculating the topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity;
the calculating the topic similarity between the two text sentences to be matched based on the document topic generation model LDA comprises the following steps:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, D_i represents the i-th topic in the topic set of the LDA model, x_{i,D_i} represents the distribution probability that the word V_i in the text sentence A to be matched belongs to the topic D_i, y_{i,D_i} represents the distribution probability that the word V_i in the text sentence B to be matched belongs to the topic D_i, L_A represents the total number of words in the text sentence A to be matched, L_B represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set consisting of the words in the text sentence A to be matched and the words in the text sentence B to be matched.
2. The method of claim 1, wherein the calculating semantic similarity between two text sentences to be matched based on the word2vec space vector model comprises:
mapping the two text sentences to be matched in the word2vec space vector model to respectively obtain text vectors corresponding to the two text sentences to be matched;
and calculating the semantic similarity between the two text sentences to be matched based on the text vector.
3. The method of claim 2, wherein the calculating semantic similarity between the two text sentences to be matched based on the text vector comprises:
and calculating the semantic similarity between the two text sentences to be matched according to the following formula:
wherein VecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, vec(A) represents the text vector corresponding to the text sentence A to be matched in the word2vec space vector model, vec(B) represents the text vector corresponding to the text sentence B to be matched in the word2vec space vector model, and n represents the dimension of the text vectors vec(A) and vec(B).
4. The method of claim 2, wherein prior to the calculating semantic similarity between two text sentences to be matched based on the word2vec space vector model, the method further comprises:
collecting text sentences of a target field to form a corpus aiming at the target field;
and generating the word2vec space vector model by taking text sentences in the corpus as training data.
5. The method of claim 1, wherein the determining the integrated similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity comprises:
and calculating the comprehensive similarity between the two text sentences to be matched according to the following formula:
wherein SenSim(A, B) represents the comprehensive similarity between the text sentence A to be matched and the text sentence B to be matched, VecSim(A, B) represents the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, one weight coefficient represents the weight corresponding to the semantic similarity between the text sentence A to be matched and the text sentence B to be matched, and the other weight coefficient represents the weight corresponding to the topic similarity between the text sentence A to be matched and the text sentence B to be matched.
6. The method of claim 1, wherein prior to calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model or prior to calculating topic similarity between the two text sentences to be matched based on a document topic generation model LDA, the method further comprises:
and performing word segmentation on the two text sentences to be matched.
7. A text similarity calculation device, the device comprising:
the semantic similarity calculation module is used for calculating semantic similarity between two text sentences to be matched based on a word2vec space vector model;
the topic similarity calculation module is used for calculating topic similarity between the two text sentences to be matched based on a document topic generation model LDA;
the comprehensive similarity calculation module is used for determining the comprehensive similarity between the two text sentences to be matched according to the semantic similarity and the topic similarity;
the topic similarity calculation module is specifically configured to:
and calculating the topic similarity between the two text sentences to be matched according to the following formula:
wherein Sim_LDA(A, B) represents the topic similarity between the text sentence A to be matched and the text sentence B to be matched, D_i represents the i-th topic in the topic set of the LDA model, x_{i,D_i} represents the distribution probability that the word V_i in the text sentence A to be matched belongs to the topic D_i, y_{i,D_i} represents the distribution probability that the word V_i in the text sentence B to be matched belongs to the topic D_i, L_A represents the total number of words in the text sentence A to be matched, L_B represents the total number of words in the text sentence B to be matched, and m represents the total number of words in the set consisting of the words in the text sentence A to be matched and the words in the text sentence B to be matched.
8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text similarity calculation method according to any one of claims 1-6.
9. A storage medium containing computer-executable instructions which, when executed by a computer processor, implement the text similarity calculation method according to any one of claims 1-6.
CN201811066429.5A 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium Active CN110895656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811066429.5A CN110895656B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811066429.5A CN110895656B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110895656A CN110895656A (en) 2020-03-20
CN110895656B true CN110895656B (en) 2023-12-29

Family

ID=69785340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811066429.5A Active CN110895656B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110895656B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667806B (en) * 2020-10-20 2024-07-16 上海金桥信息股份有限公司 Text classification screening method using LDA
CN112765321A (en) * 2021-01-22 2021-05-07 中信银行股份有限公司 Interface query method and device, equipment and computer readable storage medium
CN113239666B (en) * 2021-05-13 2023-09-29 深圳市智灵时代科技有限公司 Text similarity calculation method and system
CN113239150B (en) * 2021-05-17 2024-02-27 平安科技(深圳)有限公司 Text matching method, system and equipment
CN113591462A (en) * 2021-07-28 2021-11-02 咪咕数字传媒有限公司 Bullet screen reply generation method and device and electronic equipment

Citations (5)

Publication number Priority date Publication date Assignee Title
CN103678275A (en) * 2013-04-15 2014-03-26 南京邮电大学 Two-level text similarity calculation method based on subjective and objective semantics
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Problem similarity calculation method based on subjects and focuses of problems
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN108090047A (en) * 2018-01-10 2018-05-29 华南师范大学 A kind of definite method and apparatus of text similarity

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
TWI621084B (en) * 2016-12-01 2018-04-11 財團法人資訊工業策進會 System, method and non-transitory computer readable storage medium for matching cross-area products

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN103678275A (en) * 2013-04-15 2014-03-26 南京邮电大学 Two-level text similarity calculation method based on subjective and objective semantics
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Problem similarity calculation method based on subjects and focuses of problems
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN108090047A (en) * 2018-01-10 2018-05-29 华南师范大学 A kind of definite method and apparatus of text similarity

Non-Patent Citations (6)

Title
Zhang Lu, Lu Tianliang, Du Yanhui. Text similarity calculation based on the WMF_LDA topic model. Application Research of Computers. 2018, Section 2.2, Figs. 1-2. *
Li Guo; Zhang Chunjie; Zhang Zhiyuan. A text clustering method based on a weighted LDA model. Journal of Civil Aviation University of China. 2016, (02), entire document. *
Wang Suge; Li Shuming; Chen Xin; Mu Wanqing; Qiao Pei. An answer extraction method for opinion-type questions in college entrance examination reading comprehension. Journal of Zhengzhou University (Natural Science Edition). 2018, (01), entire document. *
Bai Rujiang; Leng Fuhai; Liao Junhua. An improved Cosine text similarity calculation method based on semantic chunk features. Data Analysis and Knowledge Discovery. 2017, (06), entire document. *
Qiu Xianbiao, Chen Xiaorong. A text similarity calculation algorithm based on feature weighting. Journal of Guizhou University (Natural Sciences). 2018, Vol. 35, (1), Chapter 2, Fig. 2. *
Chen Erjing; Jiang Enbo. A survey of text similarity calculation methods. Data Analysis and Knowledge Discovery. 2017, (06), entire document. *

Also Published As

Publication number Publication date
CN110895656A (en) 2020-03-20

Similar Documents

Publication Publication Date Title
CN111625635B (en) Question-answering processing method, device, equipment and storage medium
CN110895656B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN110209897B (en) Intelligent dialogue method, device, storage medium and equipment
CN106328147B (en) Speech recognition method and device
US11704501B2 (en) Providing a response in a session
CN109284490B (en) Text similarity calculation method and device, electronic equipment and storage medium
US20150243279A1 (en) Systems and methods for recommending responses
JP6677419B2 (en) Voice interaction method and apparatus
CN109710916B (en) Label extraction method and device, electronic equipment and storage medium
CN108664465B (en) Method and related device for automatically generating text
CN108846138B (en) Question classification model construction method, device and medium fusing answer information
WO2015021937A1 (en) Method and device for user recommendation
CN112287085B (en) Semantic matching method, system, equipment and storage medium
CN111723784A (en) Risk video identification method and device and electronic equipment
CN110019691A (en) Conversation message treating method and apparatus
CN113342968A (en) Text abstract extraction method and device
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium
CN116150306A (en) Training method of question-answering robot, question-answering method and device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN110347807B (en) Problem information processing method and device
CN112182159A (en) Personalized retrieval type conversation method and system based on semantic representation
Deena et al. Exploring the use of acoustic embeddings in neural machine translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231127

Address after: B302, 3rd Floor, Building 106 Lize Zhongyuan, Chaoyang District, Beijing, 100000

Applicant after: Beijing Orange Fruit Zhuanhua Technology Co.,Ltd.

Address before: Room 528, 5th Floor, Building D, Building 33, No. 99 Kechuang 14th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100176 (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)

Applicant before: Beijing Peihong Wangzhi Technology Co.,Ltd.

Effective date of registration: 20231127

Address after: Room 528, 5th Floor, Building D, Building 33, No. 99 Kechuang 14th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100176 (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)

Applicant after: Beijing Peihong Wangzhi Technology Co.,Ltd.

Address before: 11 / F, building B1, phase 4.1, software industry, No.1, Software Park East Road, Wuhan East Lake Development Zone, Wuhan City, Hubei Province, 430070

Applicant before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant