CN109885657B - Text similarity calculation method and device and storage medium

Info

Publication number: CN109885657B (granted publication of application CN201910124084.2A; earlier published as CN109885657A)
Authority: CN (China)
Legal status: Active
Prior art keywords: similarity, texts, text, vocabulary sets, preset
Other languages: Chinese (zh)
Inventor: 徐乐乐
Original and current assignee: Wuhan Ouyuan Network Video Co., Ltd.
Filing and priority date: 2019-02-18
Publication date of CN109885657A: 2019-06-14
Grant publication date of CN109885657B: 2021-04-27

Abstract

A text similarity calculation method is applied in the technical field of computer applications and comprises the following steps: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. The disclosure also provides a text similarity calculation device and a storage medium. In the above process, both the semantic similarity of the texts and the similarity of the words used by the texts are considered when calculating the text similarity, so the calculation of the text similarity is more accurate.

Description

Text similarity calculation method and device and storage medium
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a method and an apparatus for calculating text similarity, and a storage medium.
Background
Text similarity is a representation method for quantifying the degree of similarity between texts, and has been widely used in recent years in the fields of information retrieval, document copy detection, machine translation, public opinion monitoring, and the like.
Among existing techniques for calculating text similarity, a common approach is the vector space model: a text is mapped to word vectors in a semantic space, and the spatial distance between the word vectors is calculated as the measure of similarity.
This existing way of representing text similarity by the distance between word vectors captures similarity only from the semantic point of view and generally does not consider the similarity of the words actually used by the texts, so its evaluation of text similarity is unsatisfactory.
Disclosure of Invention
One aspect of the present disclosure provides a text similarity calculation method, including: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Optionally, calculating the first similarity based on the two first vocabulary sets comprises: letting the two first vocabulary sets be A1 and B1 respectively, and letting the vectors obtained by vectorizing the two first vocabulary sets be vec(A1) and vec(B1) respectively; the first similarity of the two texts is denoted score(A,B)_semantic and is computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication).
Optionally, inputting the two texts into a preset N-gram language model respectively to obtain two second vocabulary sets includes: respectively inputting the two texts into the preset N-gram language model and outputting the two second vocabulary sets, denoted A2 and B2; and comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2).
Optionally, calculating the second similarity of the two texts based on the two second vocabulary sets further includes: letting the second similarity of the two texts be score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication).
Optionally, the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1.
Optionally, deriving the similarity of the two texts based on the first similarity and the second similarity comprises: letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A,B)_semantic + β*score(A,B)_text.
Optionally, the two texts exist in a corpus of a specific field, and performing word segmentation processing on the two texts respectively to obtain the two first vocabulary sets includes: performing word segmentation processing on all texts in the corpus of the specific field and removing stop words to obtain a set of all words contained in the corpus of the specific field; and acquiring the two first vocabulary sets from the set of all words.
Another aspect of the present disclosure provides a text similarity calculation apparatus, including:
the first calculation module is used for performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating first similarity of the two texts based on the two first vocabulary sets;
the second calculation module is used for respectively inputting the two texts to a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets;
and the third calculating module is used for calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Another aspect of the present disclosure provides an electronic device including: a processor; and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Another aspect of the present disclosure provides a computer-readable medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
The at least one technical scheme adopted in the embodiment of the disclosure can achieve the following beneficial effects:
in the embodiment of the disclosure, word segmentation processing may be performed on two texts to be processed respectively to obtain two first vocabulary sets, and a first similarity between the two texts is calculated based on the two first vocabulary sets; then, the two texts are respectively input into a preset N-gram language model to obtain two second vocabulary sets, and the second similarity of the two texts is calculated based on the two second vocabulary sets; and finally, calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. In the above process, the similarity between text semantics and the similarity of words used by the text are considered when calculating the text similarity, so that the calculation of the text similarity is more accurate.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 schematically illustrates a flowchart of a text similarity calculation method provided by an embodiment of the present disclosure;
fig. 2 schematically shows a block diagram of a text similarity calculation device provided by an embodiment of the present disclosure;
fig. 3 schematically illustrates a block diagram of a computer system provided by an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises", "comprising", and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable medium having instructions stored thereon for use by or in connection with an instruction execution system. In the context of this disclosure, a computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the instructions. For example, the computer readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the computer readable medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
Fig. 1 schematically shows a flowchart of a text similarity calculation method according to an embodiment of the present disclosure.
Specifically, as shown in fig. 1, a method for calculating text similarity according to an embodiment of the present disclosure includes the following operations:
step 101, performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets.
In the disclosed embodiment, a domain-specific corpus can be established prior to step 101.
The two texts exist in a corpus of a specific field. Performing word segmentation processing on the two texts respectively to obtain the two first vocabulary sets comprises: performing word segmentation processing on all texts in the corpus of the specific field and removing stop words to obtain a set of all words contained in the corpus of the specific field; and acquiring the two first vocabulary sets from the set of all words.
The word segmentation processing of all texts in the corpus of the specific field can use word segmentation tools such as jieba, THULAC, or SnowNLP. Given an input sentence, such a tool outputs the sentence split into its constituent words, e.g. in the form "word1/word2/word3/word4".
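As an illustration of this segmentation step, the minimal sketch below uses the jieba tokenizer named above; the sample sentences and the stop-word list are illustrative placeholders rather than data from the patent.

```python
# Minimal sketch of the word segmentation step, assuming the jieba tokenizer.
# The corpus sentences and the stop-word list are illustrative placeholders.
import jieba

STOP_WORDS = {"的", "很", "了", "，", "。"}  # illustrative stop words

def segment(text: str) -> list[str]:
    """Segment a text with jieba and drop stop words, yielding a first vocabulary set."""
    return [w for w in jieba.lcut(text) if w not in STOP_WORDS and w.strip()]

# illustrative bullet-screen (danmu) corpus of a specific field
corpus = ["小姐姐歌声好听很喜欢", "声音甜美的小姐姐，歌声很好"]
first_vocab_sets = [segment(t) for t in corpus]
print(first_vocab_sets)
```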
After the two first vocabulary sets are obtained, vectorization processing can be performed on the two first vocabulary sets.
Vocabulary vectorization is a method of expressing words mathematically. Word vector training is performed on the vocabulary contained in the corpus to obtain the corresponding word vectors. Existing vocabulary vectorization techniques can effectively capture synonymous expressions, polysemy, and the lexical meaning of words, so that the spatial distance between word vectors effectively reflects how similar the semantics they express are.
Specifically, the two first vocabulary sets can be represented as word vectors using an existing vocabulary vectorization technique such as Doc2vec; this can be realized with the Doc2vec algorithm model in the gensim toolkit.
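The vectorization step can be sketched with the Doc2Vec class from the gensim toolkit mentioned above; the vector size, epoch count, and the reuse of first_vocab_sets from the previous sketch are illustrative assumptions, not parameters specified by the patent.

```python
# Sketch of vectorizing the first vocabulary sets with gensim's Doc2vec model.
# vector_size, min_count and epochs are illustrative choices.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(vocab_sets):
    """Train a Doc2vec model over the segmented corpus."""
    tagged = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(vocab_sets)]
    return Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)

# first_vocab_sets comes from the segmentation sketch above
model = train_doc2vec(first_vocab_sets)
vec_a1 = model.infer_vector(first_vocab_sets[0])  # vector for A1
vec_b1 = model.infer_vector(first_vocab_sets[1])  # vector for B1
```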
Calculating the first similarity of the two texts based on the two first vocabulary sets comprises the following steps: let the two first vocabulary sets be A1 and B1 respectively, and let the vectors obtained by vectorizing A1 and B1 be vec(A1) and vec(B1) respectively; the first similarity of the two texts, denoted score(A,B)_semantic, is then computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication).
For example, suppose there is a bullet-screen (danmu) comment corpus containing two comments A and B, where A is "the young lady's singing voice is nice, I like it" and B is "a young lady with a sweet, beautiful voice, her singing is good". After A and B are segmented and stop words are removed, the first vocabulary set A1 output for A is "young lady / singing voice / nice / good / like", and the first vocabulary set B1 output for B is "beautiful / sweet / young lady / singing voice / good". Vectorizing A1 and B1 respectively yields the word vectors vec(A1) and vec(B1) (their concrete values are given only as images in the original publication). The first similarity between A and B can then be calculated as:
score(A,B)_semantic = 1.41.
and 102, respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets.
It should be noted that the N-gram is a language model and can also implement a word segmentation function. Commonly used N-grams are the Bi-gram (N = 2) and the Tri-gram (N = 3). For example, the text "我爱深度学习" ("I love deep learning") decomposes into the following Bi-gram and Tri-gram results:
Bi-gram: {"我爱", "爱深", "深度", "度学", "学习"},
Tri-gram: {"我爱深", "爱深度", "深度学", "度学习"}.
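The character N-gram decomposition shown above can be reproduced with a short generic sketch (not code from the patent):

```python
# Generic character N-gram decomposition, as illustrated above.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the sliding character n-grams of a text (duplicates kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("我爱深度学习", 2))  # Bi-gram: ['我爱', '爱深', '深度', '度学', '学习']
print(char_ngrams("我爱深度学习", 3))  # Tri-gram: ['我爱深', '爱深度', '深度学', '度学习']
```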
Inputting the two texts into a preset N-gram language model respectively to obtain the two second vocabulary sets comprises: inputting the two texts into the preset N-gram language model respectively and outputting the two second vocabulary sets, denoted A2 and B2; and comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2).
Still taking the example from step 101, let A be "the young lady's singing voice is nice, I like it", let B be "a young lady with a sweet, beautiful voice, her singing is good", and let N = 3. After A and B are respectively input into the Tri-gram model, the two second vocabulary sets A2 and B2 are output; each consists of the character tri-grams of the corresponding text (the concrete tri-grams are listed in Chinese in the original publication). From these two second vocabulary sets it follows that the total number of words in A2 is len(A2)_n_text = 8, the total number of words in B2 is len(B2)_n_text = 10, the number of identical words in the two second vocabulary sets is N_n_text = 3, and the number of all non-repeating words in the two second vocabulary sets is len(A2 ∪ B2) = 15.
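The four quantities used for the second similarity can be derived from two second vocabulary sets as in the sketch below; the helper name and the dictionary keys are illustrative.

```python
# Sketch of deriving the four quantities used for the second similarity
# from two second vocabulary sets a2 and b2 (lists of N-grams).
def ngram_statistics(a2: list[str], b2: list[str]) -> dict[str, int]:
    set_a2, set_b2 = set(a2), set(b2)
    return {
        "len_a2": len(a2),                  # total number of words in A2
        "len_b2": len(b2),                  # total number of words in B2
        "n_same": len(set_a2 & set_b2),     # identical words in the two sets
        "len_union": len(set_a2 | set_b2),  # all non-repeating words
    }

# In the worked example these quantities are 8, 10, 3 and 15 respectively.
```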
Calculating the second similarity of the two texts based on the two second vocabulary sets further comprises: letting the second similarity of the two texts be score(A,B)_text, computed from len(A2)_n_text, len(B2)_n_text, N_n_text, and len(A2 ∪ B2) (the formula is given only as an image in the original publication). Substituting the values obtained above from the two second vocabulary sets, the second similarity of the two texts A and B is:
score(A,B)_text = 0.1.
and 103, calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
The sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1.
Based on the first similarity and the second similarity, deriving the similarity of the two texts comprises: letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A1,B1)_semantic + β*score(A2,B2)_text,
wherein α + β = 1, 0 ≤ α ≤ 1, and 0 ≤ β ≤ 1.
According to the results of the example calculations in step 101 and step 102, the first similarity of the two texts A and B is 1.41 and the second similarity is 0.1. Taking, for example, α = 0.6 and β = 0.4, the text similarity of the two texts is:
sim(A,B) = α*score(A1,B1)_semantic + β*score(A2,B2)_text = 0.6 × 1.41 + 0.4 × 0.1 = 0.886.
That is, the text similarity of the two texts A and B is 0.886.
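The final weighted combination can be sketched as follows, with α = 0.6 and β = 0.4 as in the worked example; the function name is illustrative.

```python
# Sketch of the final weighted combination of the two similarities.
def text_similarity(score_semantic: float, score_text: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    assert abs(alpha + beta - 1.0) < 1e-9, "adjusting parameters must sum to 1"
    return alpha * score_semantic + beta * score_text

print(text_similarity(1.41, 0.1))  # ≈ 0.886, matching the worked example
```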
In the embodiment of the disclosure, word segmentation processing may be performed on two texts to be processed respectively to obtain two first vocabulary sets, and a first similarity of the two texts is calculated based on the two first vocabulary sets; then, the two texts are respectively input into a preset N-gram language model to obtain two second vocabulary sets, and a second similarity of the two texts is calculated based on the two second vocabulary sets; and finally, the similarity of the two texts is calculated based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. In the above process, both the semantic similarity of the texts and the similarity of the words used by the texts are considered when calculating the text similarity, so the calculation of the text similarity is more accurate.
Fig. 2 is a block diagram of a text similarity calculation apparatus according to an embodiment of the present disclosure.
As shown in fig. 2, the text similarity calculation device includes: a first calculation module 210, a second calculation module 220, and a third calculation module 230.
Specifically, the first calculation module 210 performs word segmentation processing on the two texts to be processed. The two texts exist in a corpus of a specific field; word segmentation is performed on all texts in the corpus of the specific field and stop words are removed to obtain a set of all words contained in the corpus, and the two first vocabulary sets are acquired from this set of all words. The two first vocabulary sets are then vectorized based on all the vocabulary contained in the corpus to obtain two word vectors, and the first similarity of the two texts is calculated from the word vectors obtained from the two first vocabulary sets.
The second calculation module 220 is configured to input the two texts into a preset N-gram language model respectively for word segmentation processing to obtain the two second vocabulary sets, denoted A2 and B2; compare the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2); and calculate the second similarity of the two texts based on the parameters acquired from the two second vocabulary sets.
The third calculation module 230 is configured to calculate the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity, where the sum of the two preset adjusting parameters is 1.
It is understood that the first calculation module 210, the second calculation module 220, and the third calculation module 230 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present invention, at least one of the first computing module 210, the second computing module 220, and the third computing module 230 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in a suitable combination of three implementations of software, hardware, and firmware. Alternatively, at least one of the first, second and third computing modules 210, 220, 230 may be implemented at least in part as a computer program module that, when executed by a computer, may perform the functions of the respective module.
Fig. 3 schematically illustrates a block diagram of a computer system provided by an embodiment of the present disclosure.
As shown in fig. 3, computer system 300 includes a processor 310, a computer-readable storage medium 320, a signal transmitter 330, and a signal receiver 340. The computer system 300 may perform a method according to an embodiment of the present disclosure.
In particular, processor 310 may include, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 310 may also include on-board memory for caching purposes. The processor 310 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 320 may be, for example, any medium that can contain, store, communicate, propagate, or transport the instructions. For example, a readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the readable storage medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 320 may include a computer program 321, which computer program 321 may include code/computer-executable instructions that, when executed by the processor 310, cause the processor 310 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 321 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, the code in the computer program 321 may include one or more program modules, for example module 321A, module 321B, … …. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or program module combinations according to the actual situation, so that when these program modules are executed by the processor 310, the processor 310 can execute the method according to the embodiment of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, the processor 310 may interact with the signal transmitter 330 and the signal receiver 340 to perform a method according to an embodiment of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the first calculation module 210, the second calculation module 220, and the third calculation module 230 may be implemented as a computer program module as described with reference to fig. 3 which, when executed by the processor 310, can implement the corresponding operations described above.
The present disclosure also provides a computer-readable medium, which may be embodied in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer readable medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, optical fiber cable, radio frequency signals, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (6)

1. A text similarity calculation method, characterized by comprising the following steps:
performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets, which comprises:
letting the two first vocabulary sets be A1 and B1 respectively, and letting the vectors obtained by vectorizing the two first vocabulary sets be vec(A1) and vec(B1) respectively;
letting the first similarity of the two texts be score(A,B)_semantic, computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication);
respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets, which comprises:
respectively inputting the two texts into the preset N-gram language model and outputting the two second vocabulary sets, denoted A2 and B2;
comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2)_n_text;
letting the second similarity of the two texts be score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication); and
calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity;
wherein the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1, that is:
α + β = 1,
where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
2. The method of claim 1, wherein deriving the similarity of the two texts based on the first similarity and the second similarity comprises:
letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A,B)_semantic + β*score(A,B)_text.
3. the method according to claim 1, wherein the two texts exist in a corpus in a specific domain, and the performing word segmentation on the two texts to be processed respectively to obtain two first vocabulary sets comprises:
performing word segmentation processing on all texts in the corpus of the specific field, and removing stop words to obtain a set of all words contained in the corpus of the specific field;
and acquiring the two first vocabulary sets from the set of all vocabularies.
4. A device for calculating text similarity, comprising:
a first calculation module, configured to perform word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculate a first similarity of the two texts based on the two first vocabulary sets, wherein:
the two first vocabulary sets are A1 and B1 respectively, and the vectors obtained by vectorizing the two first vocabulary sets are vec(A1) and vec(B1) respectively;
the first similarity of the two texts is score(A,B)_semantic, computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication);
a second calculation module, configured to input the two texts into a preset N-gram language model respectively to obtain two second vocabulary sets, and calculate a second similarity of the two texts based on the two second vocabulary sets, wherein:
the two texts are respectively input into the preset N-gram language model, and the two second vocabulary sets, denoted A2 and B2, are output;
the two second vocabulary sets are compared to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2)_n_text;
the second similarity of the two texts is score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication); and
a third calculation module, configured to calculate the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity;
wherein the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1, that is:
α + β = 1,
where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
5. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text similarity calculation method according to any one of claims 1 to 3 when executing the computer program.
6. A computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the steps in the text similarity calculation method according to any one of claims 1 to 3.