CN109885657B - Text similarity calculation method and device and storage medium

Info

Publication number: CN109885657B (granted publication of application CN201910124084.2A; earlier published as CN109885657A)
Authority: CN (China)
Legal status: Active
Prior art keywords: similarity, texts, text, vocabulary sets, preset
Other languages: Chinese (zh)
Inventor: 徐乐乐
Original and current assignee: Wuhan Ouyuan Network Video Co., Ltd.
Filing and priority date: 2019-02-18
Publication date of CN109885657A: 2019-06-14
Grant publication date of CN109885657B: 2021-04-27

Abstract

A text similarity calculation method is applied in the technical field of computer applications and comprises the following steps: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. The disclosure also provides a text similarity calculation device and a storage medium. In the above process, both the semantic similarity of the texts and the similarity of the words used by the texts are considered when calculating the text similarity, so the calculation of the text similarity is more accurate.

Description

Text similarity calculation method and device and storage medium
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a method and an apparatus for calculating text similarity, and a storage medium.
Background
Text similarity is a representation method for quantifying the degree of similarity between texts, and has been widely used in recent years in the fields of information retrieval, document copy detection, machine translation, public opinion monitoring, and the like.
Among existing techniques for calculating text similarity, a common approach is the vector space model: a text is mapped to word vectors in a semantic space, and the spatial distance between the word vectors is calculated as the measure of similarity.
This existing way of representing text similarity by the distance between word vectors captures similarity only from the semantic point of view and generally does not consider the similarity of the words actually used by the texts, so its evaluation of text similarity is unsatisfactory.
Disclosure of Invention
One aspect of the present disclosure provides a text similarity calculation method, including: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Optionally, calculating the first similarity based on the two first vocabulary sets comprises: letting the two first vocabulary sets be A1 and B1 respectively, and letting the vectors obtained by vectorizing the two first vocabulary sets be vec(A1) and vec(B1) respectively; the first similarity of the two texts is denoted score(A,B)_semantic and is computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication).
Optionally, inputting the two texts into a preset N-gram language model respectively to obtain two second vocabulary sets includes: respectively inputting the two texts into the preset N-gram language model and outputting the two second vocabulary sets, denoted A2 and B2; and comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2).
Optionally, calculating the second similarity of the two texts based on the two second vocabulary sets further includes: letting the second similarity of the two texts be score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication).
Optionally, the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1.
Optionally, deriving the similarity of the two texts based on the first similarity and the second similarity comprises: letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A,B)_semantic + β*score(A,B)_text.
Optionally, the two texts exist in a corpus of a specific field, and performing word segmentation processing on the two texts respectively to obtain the two first vocabulary sets includes: performing word segmentation processing on all texts in the corpus of the specific field and removing stop words to obtain a set of all words contained in the corpus of the specific field; and acquiring the two first vocabulary sets from the set of all words.
Another aspect of the present disclosure provides a text similarity calculation apparatus, including:
the first calculation module is used for performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating first similarity of the two texts based on the two first vocabulary sets;
the second calculation module is used for respectively inputting the two texts to a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets;
and the third calculating module is used for calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Another aspect of the present disclosure provides an electronic device including: a processor; and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform: performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets; respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets; and calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
Another aspect of the present disclosure provides a computer-readable medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
The at least one technical scheme adopted in the embodiment of the disclosure can achieve the following beneficial effects:
in the embodiment of the disclosure, word segmentation processing may be performed on two texts to be processed respectively to obtain two first vocabulary sets, and a first similarity between the two texts is calculated based on the two first vocabulary sets; then, the two texts are respectively input into a preset N-gram language model to obtain two second vocabulary sets, and the second similarity of the two texts is calculated based on the two second vocabulary sets; and finally, calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. In the above process, the similarity between text semantics and the similarity of words used by the text are considered when calculating the text similarity, so that the calculation of the text similarity is more accurate.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 schematically illustrates a flowchart of a text similarity calculation method provided by an embodiment of the present disclosure;
fig. 2 schematically shows a block diagram of a text similarity calculation device provided by an embodiment of the present disclosure;
fig. 3 schematically illustrates a block diagram of a computer system provided by an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises", "comprising", and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable medium having instructions stored thereon for use by or in connection with an instruction execution system. In the context of this disclosure, a computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the instructions. For example, the computer readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the computer readable medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
Fig. 1 schematically shows a flowchart of a text similarity calculation method according to an embodiment of the present disclosure.
Specifically, as shown in fig. 1, a method for calculating text similarity according to an embodiment of the present disclosure includes the following operations:
step 101, performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets.
In the disclosed embodiment, a domain-specific corpus can be established prior to step 101.
The two texts exist in a corpus of a specific field. Performing word segmentation processing on the two texts respectively to obtain the two first vocabulary sets comprises: performing word segmentation processing on all texts in the corpus of the specific field and removing stop words to obtain a set of all words contained in the corpus of the specific field; and acquiring the two first vocabulary sets from the set of all words.
The word segmentation processing of all texts in the corpus of the specific field can use word segmentation tools such as jieba, THULAC, or SnowNLP. Given an input sentence, such a tool outputs the sentence split into its constituent words, e.g. in the form "word1/word2/word3/word4".
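As an illustration of this segmentation step, the minimal sketch below uses the jieba tokenizer named above; the sample sentences and the stop-word list are illustrative placeholders rather than data from the patent.

```python
# Minimal sketch of the word segmentation step, assuming the jieba tokenizer.
# The corpus sentences and the stop-word list are illustrative placeholders.
import jieba

STOP_WORDS = {"的", "很", "了", "，", "。"}  # illustrative stop words

def segment(text: str) -> list[str]:
    """Segment a text with jieba and drop stop words, yielding a first vocabulary set."""
    return [w for w in jieba.lcut(text) if w not in STOP_WORDS and w.strip()]

# illustrative bullet-screen (danmu) corpus of a specific field
corpus = ["小姐姐歌声好听很喜欢", "声音甜美的小姐姐，歌声很好"]
first_vocab_sets = [segment(t) for t in corpus]
print(first_vocab_sets)
```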
After the two first vocabulary sets are obtained, vectorization processing can be performed on the two first vocabulary sets.
Vocabulary vectorization is a method of expressing words mathematically. Word vector training is performed on the vocabulary contained in the corpus to obtain the corresponding word vectors. Existing vocabulary vectorization techniques can effectively capture synonymous expressions, polysemy, and the lexical meaning of words, so that the spatial distance between word vectors effectively reflects how similar the semantics they express are.
Specifically, the two first vocabulary sets can be represented as word vectors using an existing vocabulary vectorization technique such as Doc2vec; this can be realized with the Doc2vec algorithm model in the gensim toolkit.
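The vectorization step can be sketched with the Doc2Vec class from the gensim toolkit mentioned above; the vector size, epoch count, and the reuse of first_vocab_sets from the previous sketch are illustrative assumptions, not parameters specified by the patent.

```python
# Sketch of vectorizing the first vocabulary sets with gensim's Doc2vec model.
# vector_size, min_count and epochs are illustrative choices.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(vocab_sets):
    """Train a Doc2vec model over the segmented corpus."""
    tagged = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(vocab_sets)]
    return Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)

# first_vocab_sets comes from the segmentation sketch above
model = train_doc2vec(first_vocab_sets)
vec_a1 = model.infer_vector(first_vocab_sets[0])  # vector for A1
vec_b1 = model.infer_vector(first_vocab_sets[1])  # vector for B1
```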
Calculating the first similarity of the two texts based on the two first vocabulary sets comprises the following steps: let the two first vocabulary sets be A1 and B1 respectively, and let the vectors obtained by vectorizing A1 and B1 be vec(A1) and vec(B1) respectively; the first similarity of the two texts, denoted score(A,B)_semantic, is then computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication).
For example, suppose there is a bullet-screen (danmu) comment corpus containing two comments A and B, where A is "the young lady's singing voice is nice, I like it" and B is "a young lady with a sweet, beautiful voice, her singing is good". After A and B are segmented and stop words are removed, the first vocabulary set A1 output for A is "young lady / singing voice / nice / good / like", and the first vocabulary set B1 output for B is "beautiful / sweet / young lady / singing voice / good". Vectorizing A1 and B1 respectively yields the word vectors vec(A1) and vec(B1) (their concrete values are given only as images in the original publication). The first similarity between A and B can then be calculated as:
score(A,B)_semantic = 1.41.
and 102, respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets.
It should be noted that the N-gram is a language model and can also implement a word segmentation function. Commonly used N-grams are the Bi-gram (N = 2) and the Tri-gram (N = 3). For example, the text "我爱深度学习" ("I love deep learning") decomposes into the following Bi-gram and Tri-gram results:
Bi-gram: {"我爱", "爱深", "深度", "度学", "学习"},
Tri-gram: {"我爱深", "爱深度", "深度学", "度学习"}.
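The character N-gram decomposition shown above can be reproduced with a short generic sketch (not code from the patent):

```python
# Generic character N-gram decomposition, as illustrated above.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the sliding character n-grams of a text (duplicates kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("我爱深度学习", 2))  # Bi-gram: ['我爱', '爱深', '深度', '度学', '学习']
print(char_ngrams("我爱深度学习", 3))  # Tri-gram: ['我爱深', '爱深度', '深度学', '度学习']
```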
Inputting the two texts into a preset N-gram language model respectively to obtain the two second vocabulary sets comprises: inputting the two texts into the preset N-gram language model respectively and outputting the two second vocabulary sets, denoted A2 and B2; and comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2).
Still taking the example from step 101, let A be "the young lady's singing voice is nice, I like it", let B be "a young lady with a sweet, beautiful voice, her singing is good", and let N = 3. After A and B are respectively input into the Tri-gram model, the two second vocabulary sets A2 and B2 are output; each consists of the character tri-grams of the corresponding text (the concrete tri-grams are listed in Chinese in the original publication). From these two second vocabulary sets it follows that the total number of words in A2 is len(A2)_n_text = 8, the total number of words in B2 is len(B2)_n_text = 10, the number of identical words in the two second vocabulary sets is N_n_text = 3, and the number of all non-repeating words in the two second vocabulary sets is len(A2 ∪ B2) = 15.
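The four quantities used for the second similarity can be derived from two second vocabulary sets as in the sketch below; the helper name and the dictionary keys are illustrative.

```python
# Sketch of deriving the four quantities used for the second similarity
# from two second vocabulary sets a2 and b2 (lists of N-grams).
def ngram_statistics(a2: list[str], b2: list[str]) -> dict[str, int]:
    set_a2, set_b2 = set(a2), set(b2)
    return {
        "len_a2": len(a2),                  # total number of words in A2
        "len_b2": len(b2),                  # total number of words in B2
        "n_same": len(set_a2 & set_b2),     # identical words in the two sets
        "len_union": len(set_a2 | set_b2),  # all non-repeating words
    }

# In the worked example these quantities are 8, 10, 3 and 15 respectively.
```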
Calculating the second similarity of the two texts based on the two second vocabulary sets further comprises: letting the second similarity of the two texts be score(A,B)_text, computed from len(A2)_n_text, len(B2)_n_text, N_n_text, and len(A2 ∪ B2) (the formula is given only as an image in the original publication). Substituting the values obtained above from the two second vocabulary sets, the second similarity of the two texts A and B is:
score(A,B)_text = 0.1.
and 103, calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity.
The sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1.
Based on the first similarity and the second similarity, deriving the similarity of the two texts comprises: letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A1,B1)_semantic + β*score(A2,B2)_text,
wherein α + β = 1, 0 ≤ α ≤ 1, and 0 ≤ β ≤ 1.
According to the results of the example calculations in step 101 and step 102, the first similarity of the two texts A and B is 1.41 and the second similarity is 0.1. Taking, for example, α = 0.6 and β = 0.4, the text similarity of the two texts is:
sim(A,B) = α*score(A1,B1)_semantic + β*score(A2,B2)_text = 0.6 × 1.41 + 0.4 × 0.1 = 0.886.
That is, the text similarity of the two texts A and B is 0.886.
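The final weighted combination can be sketched as follows, with α = 0.6 and β = 0.4 as in the worked example; the function name is illustrative.

```python
# Sketch of the final weighted combination of the two similarities.
def text_similarity(score_semantic: float, score_text: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    assert abs(alpha + beta - 1.0) < 1e-9, "adjusting parameters must sum to 1"
    return alpha * score_semantic + beta * score_text

print(text_similarity(1.41, 0.1))  # ≈ 0.886, matching the worked example
```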
In the embodiment of the disclosure, word segmentation processing may be performed on two texts to be processed respectively to obtain two first vocabulary sets, and a first similarity of the two texts is calculated based on the two first vocabulary sets; then, the two texts are respectively input into a preset N-gram language model to obtain two second vocabulary sets, and a second similarity of the two texts is calculated based on the two second vocabulary sets; and finally, the similarity of the two texts is calculated based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity. In the above process, both the semantic similarity of the texts and the similarity of the words used by the texts are considered when calculating the text similarity, so the calculation of the text similarity is more accurate.
Fig. 2 is a block diagram of a text similarity calculation apparatus according to an embodiment of the present disclosure.
As shown in fig. 2, the text similarity calculation device includes: a first calculation module 210, a second calculation module 220, and a third calculation module 230.
Specifically, the first calculation module 210 performs word segmentation processing on the two texts to be processed. The two texts exist in a corpus of a specific field; word segmentation is performed on all texts in the corpus of the specific field and stop words are removed to obtain a set of all words contained in the corpus, and the two first vocabulary sets are acquired from this set of all words. The two first vocabulary sets are then vectorized based on all the vocabulary contained in the corpus to obtain two word vectors, and the first similarity of the two texts is calculated from the word vectors obtained from the two first vocabulary sets.
The second calculation module 220 is configured to input the two texts into a preset N-gram language model respectively for word segmentation processing to obtain the two second vocabulary sets, denoted A2 and B2; compare the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2); and calculate the second similarity of the two texts based on the parameters acquired from the two second vocabulary sets.
The third calculation module 230 is configured to calculate the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity, where the sum of the two preset adjusting parameters is 1.
It is understood that the first calculation module 210, the second calculation module 220, and the third calculation module 230 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present invention, at least one of the first computing module 210, the second computing module 220, and the third computing module 230 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in a suitable combination of three implementations of software, hardware, and firmware. Alternatively, at least one of the first, second and third computing modules 210, 220, 230 may be implemented at least in part as a computer program module that, when executed by a computer, may perform the functions of the respective module.
Fig. 3 schematically illustrates a block diagram of a computer system provided by an embodiment of the present disclosure.
As shown in fig. 3, computer system 300 includes a processor 310, a computer-readable storage medium 320, a signal transmitter 330, and a signal receiver 340. The computer system 300 may perform a method according to an embodiment of the present disclosure.
In particular, processor 310 may include, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 310 may also include on-board memory for caching purposes. The processor 310 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 320 may be, for example, any medium that can contain, store, communicate, propagate, or transport the instructions. For example, a readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the readable storage medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 320 may include a computer program 321, which computer program 321 may include code/computer-executable instructions that, when executed by the processor 310, cause the processor 310 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 321 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, the code in the computer program 321 may include one or more program modules, for example module 321A, module 321B, … …. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or program module combinations according to the actual situation, so that when these program modules are executed by the processor 310, the processor 310 can execute the method according to the embodiment of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, the processor 310 may interact with the signal transmitter 330 and the signal receiver 340 to perform a method according to an embodiment of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the first calculation module 210, the second calculation module 220, and the third calculation module 230 may be implemented as a computer program module as described with reference to fig. 3 which, when executed by the processor 310, can implement the corresponding operations described above.
The present disclosure also provides a computer-readable medium, which may be embodied in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer readable medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, optical fiber cable, radio frequency signals, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (6)

1. A text similarity calculation method, characterized by comprising the following steps:
performing word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculating a first similarity of the two texts based on the two first vocabulary sets, which comprises:
letting the two first vocabulary sets be A1 and B1 respectively, and letting the vectors obtained by vectorizing the two first vocabulary sets be vec(A1) and vec(B1) respectively;
letting the first similarity of the two texts be score(A,B)_semantic, computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication);
respectively inputting the two texts into a preset N-gram language model to obtain two second vocabulary sets, and calculating a second similarity of the two texts based on the two second vocabulary sets, which comprises:
respectively inputting the two texts into the preset N-gram language model and outputting the two second vocabulary sets, denoted A2 and B2;
comparing the two second vocabulary sets to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2)_n_text;
letting the second similarity of the two texts be score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication); and
calculating the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity;
wherein the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1, that is:
α + β = 1,
where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
2. The method of claim 1, wherein deriving the similarity of the two texts based on the first similarity and the second similarity comprises:
letting the two texts be A and B respectively, letting the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A,B), then:
sim(A,B) = α*score(A,B)_semantic + β*score(A,B)_text.
3. the method according to claim 1, wherein the two texts exist in a corpus in a specific domain, and the performing word segmentation on the two texts to be processed respectively to obtain two first vocabulary sets comprises:
performing word segmentation processing on all texts in the corpus of the specific field, and removing stop words to obtain a set of all words contained in the corpus of the specific field;
and acquiring the two first vocabulary sets from the set of all vocabularies.
4. A device for calculating text similarity, comprising:
a first calculation module, configured to perform word segmentation processing on two texts to be processed respectively to obtain two first vocabulary sets, and calculate a first similarity of the two texts based on the two first vocabulary sets, wherein:
the two first vocabulary sets are A1 and B1 respectively, and the vectors obtained by vectorizing the two first vocabulary sets are vec(A1) and vec(B1) respectively;
the first similarity of the two texts is score(A,B)_semantic, computed from vec(A1) and vec(B1) (the formula is given only as an image in the original publication);
a second calculation module, configured to input the two texts into a preset N-gram language model respectively to obtain two second vocabulary sets, and calculate a second similarity of the two texts based on the two second vocabulary sets, wherein:
the two texts are respectively input into the preset N-gram language model, and the two second vocabulary sets, denoted A2 and B2, are output;
the two second vocabulary sets are compared to obtain the total number of words in A2, len(A2)_n_text, the total number of words in B2, len(B2)_n_text, the number of identical words in the two second vocabulary sets, N_n_text, and the number of all non-repeating words in the two second vocabulary sets, len(A2 ∪ B2)_n_text;
the second similarity of the two texts is score(A,B)_text, computed from the four quantities above (the formula is given only as an image in the original publication); and
a third calculation module, configured to calculate the similarity of the two texts based on the first similarity and the second similarity according to a preset adjusting parameter of the first similarity and a preset adjusting parameter of the second similarity;
wherein the sum of the preset adjusting parameter of the first similarity and the preset adjusting parameter of the second similarity is 1, that is:
α + β = 1,
where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
5. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text similarity calculation method according to any one of claims 1 to 3 when executing the computer program.
6. A computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the steps in the text similarity calculation method according to any one of claims 1 to 3.