CN109885657A - Text similarity calculation method, device and storage medium - Google Patents

Text similarity calculation method, device and storage medium Download PDF

Info

Publication number
CN109885657A
CN109885657A CN201910124084.2A CN201910124084A
Authority
CN
China
Prior art keywords
similarity
text
texts
lexical
sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910124084.2A
Other languages
Chinese (zh)
Other versions
CN109885657B (en)
Inventor
徐乐乐 (Xu Lele)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Ouyue Netvision Co Ltd
Original Assignee
Wuhan Ouyue Netvision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Ouyue Netvision Co Ltd filed Critical Wuhan Ouyue Netvision Co Ltd
Priority to CN201910124084.2A priority Critical patent/CN109885657B/en
Publication of CN109885657A publication Critical patent/CN109885657A/en
Application granted granted Critical
Publication of CN109885657B publication Critical patent/CN109885657B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A text similarity calculation method, applied in the field of computer application technology, comprising: performing word segmentation on each of two texts to be processed to obtain two first lexical sets, and calculating a first similarity of the two texts based on the two first lexical sets; inputting each of the two texts into a preset N-gram language model to obtain two second lexical sets, and calculating a second similarity of the two texts based on the two second lexical sets; and calculating the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and a preset adjustment parameter of the second similarity. The disclosure also provides a text similarity calculation device and a storage medium. In this process, both the semantic similarity between the texts and the similarity of their wording are taken into account when calculating text similarity, so that the text similarity is calculated accurately.

Description

Text similarity calculation method, device and storage medium
Technical field
The present disclosure relates to the field of computer application technology, and more particularly to a text similarity calculation method, device and storage medium.
Background art
Text similarity is a quantitative representation of the degree of similarity between texts. In recent years it has been widely used in fields such as information retrieval, document plagiarism detection, machine translation, and public opinion monitoring.
Among existing techniques for calculating text similarity, the common practice is to map the texts to word vectors in a semantic space using a vector space model, and to calculate the spatial distance between the word vectors.
Existing methods that express text similarity through the distance between word vectors measure similarity only from the semantic angle; they generally do not consider the similarity of the words actually used in the texts, so they evaluate text similarity poorly.
Summary of the invention
One aspect of the present disclosure provides a text similarity calculation method, comprising: performing word segmentation on each of two texts to be processed to obtain two first lexical sets, and calculating a first similarity of the two texts based on the two first lexical sets; inputting each of the two texts into a preset N-gram language model to obtain two second lexical sets, and calculating a second similarity of the two texts based on the two second lexical sets; and calculating the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and a preset adjustment parameter of the second similarity.
Optionally, calculating the first similarity between the two first lexical sets includes: letting the two first lexical sets be A1 and B1 respectively, letting the vectors obtained by vectorizing the two first lexical sets be vec(A1) and vec(B1) respectively, and letting the first similarity of the two texts be score(A, B)_semantic, then:
Optionally, inputting each of the two texts into the preset N-gram language model to obtain two second lexical sets includes: inputting the two texts into the preset N-gram language model respectively and outputting the two second lexical sets, letting the two second lexical sets be A2 and B2 respectively; and comparing the two second lexical sets to obtain the total number of words in A2, len(A2_n_text), the total number of words in B2, len(B2_n_text), the number of identical words in the two second lexical sets, N_n_text, and the number of distinct words across the two second lexical sets, len(A2 ∪ B2).
Optionally, calculating the second similarity of the two texts based on the two second lexical sets further includes: letting the second similarity of the two texts be score(A, B)_text, then:
Optionally, the preset adjustment parameter of the first similarity and the adjustment parameter of the second similarity sum to 1.
Optionally, obtaining the similarity of the two texts based on the first similarity and the second similarity includes: letting the two texts be A and B respectively, letting the preset adjustment parameters of the first similarity and the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A, B), then:
sim(A, B) = α × score(A, B)_semantic + β × score(A, B)_text.
Optionally, the two texts are present in a corpus of a specific domain, and performing word segmentation on each of the two texts to obtain the two first lexical sets includes: performing word segmentation on all texts in the corpus of the specific domain and removing stop words, to obtain the set of all vocabulary contained in the corpus of the specific domain; and obtaining the two first lexical sets from the set of all vocabulary.
Another aspect of the present disclosure provides a text similarity calculation device, comprising:
a first calculation module, configured to perform word segmentation on each of two texts to be processed to obtain two first lexical sets, and to calculate a first similarity of the two texts based on the two first lexical sets;
a second calculation module, configured to input each of the two texts into a preset N-gram language model to obtain two second lexical sets, and to calculate a second similarity of the two texts based on the two second lexical sets;
a third calculation module, configured to calculate the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and an adjustment parameter of the second similarity.
Another aspect of the present disclosure provides an electronic device, comprising: a processor; and a memory storing computer-executable instructions which, when executed by the processor, cause the processor to: perform word segmentation on each of two texts to be processed to obtain two first lexical sets, and calculate a first similarity of the two texts based on the two first lexical sets; input each of the two texts into a preset N-gram language model to obtain two second lexical sets, and calculate a second similarity of the two texts based on the two second lexical sets; and calculate the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and a preset adjustment parameter of the second similarity.
Another aspect of the present disclosure provides a computer-readable medium storing computer-executable instructions which, when executed, implement the method described above.
Another aspect of the present disclosure provides a computer program comprising computer-executable instructions which, when executed, implement the method described above.
By adopting at least one of the above technical solutions, the embodiments of the present disclosure can achieve the following beneficial effects:
In the embodiments of the present disclosure, word segmentation is first performed on each of the two texts to be processed to obtain two first lexical sets, and a first similarity of the two texts is calculated based on the two first lexical sets; the two texts are then input into a preset N-gram language model to obtain two second lexical sets, and a second similarity of the two texts is calculated based on the two second lexical sets; finally, the similarity of the two texts is calculated based on the first similarity and the second similarity, according to the preset adjustment parameters of the first similarity and the second similarity. In this process, both the semantic similarity between the texts and the similarity of their wording are taken into account when calculating text similarity, so that the text similarity is calculated accurately.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows a flowchart of a text similarity calculation method provided by an embodiment of the present disclosure;
Fig. 2 schematically shows a structural block diagram of a text similarity calculation device provided by an embodiment of the present disclosure;
Fig. 3 schematically shows a block diagram of a computer system provided by an embodiment of the present disclosure.
Detailed description of embodiments
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and technologies are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the words "a", "an" and "the" are intended to include the plural as well, unless the context clearly indicates otherwise. The terms "include", "comprise" and the like indicate the presence of the stated features, steps, operations and/or components, but do not preclude the presence or addition of one or more other features, steps, operations or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that terms used herein should be interpreted to have meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some blocks of the block diagrams and/or flowcharts, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.
Accordingly, the techniques of the present disclosure may be implemented in the form of hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of the present disclosure may take the form of a computer program product on a computer-readable medium storing instructions, the computer program product being for use by, or in connection with, an instruction execution system. In the context of the present disclosure, a computer-readable medium may be any medium that can contain, store, communicate, propagate or transport instructions. For example, the computer-readable medium may include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus, device or propagation medium. Specific examples of computer-readable media include: magnetic storage devices such as magnetic tape or hard disk (HDD); optical storage devices such as compact disc (CD-ROM); memory such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
Fig. 1 schematically shows a flowchart of a text similarity calculation method according to an embodiment of the present disclosure.
Specifically, as shown in Fig. 1, the text similarity calculation method of the embodiment of the present disclosure includes the following operations:
Step 101: perform word segmentation on each of the two texts to be processed to obtain two first lexical sets, and calculate a first similarity of the two texts based on the two first lexical sets.
In the embodiment of the present disclosure, before step 101, a corpus of a specific domain may be established.
The two texts are present in the corpus of the specific domain. Performing word segmentation on each of the two texts to obtain the two first lexical sets includes: performing word segmentation on all texts in the corpus of the specific domain and removing stop words, to obtain the set of all vocabulary contained in the corpus of the specific domain; and obtaining the two first lexical sets from the set of all vocabulary.
Word segmentation of all texts in the corpus of the specific domain can be performed with a segmentation tool such as jieba, THULAC or SnowNLP. For example, segmenting the text "小明来到荔湾区" ("Xiao Ming came to Liwan District") with such a tool outputs "小明 / 来到 / 荔湾 / 区".
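As a minimal sketch of this step (not part of the patent text), segmentation with jieba might look as follows; the stop-word list is a toy assumption, and the exact token boundaries depend on the tool's dictionary.

```python
# Sketch of the segmentation-plus-stop-word step using jieba, one of the
# tools the description names; STOP_WORDS is a hypothetical example list.
import jieba

STOP_WORDS = {"的", "了", "啊"}  # assumed stop-word list for illustration

def segment(text: str) -> list[str]:
    """Segment a Chinese text with jieba and drop stop words."""
    return [tok for tok in jieba.cut(text) if tok.strip() and tok not in STOP_WORDS]

print(segment("小明来到荔湾区"))  # e.g. ['小明', '来到', '荔湾', '区'], dictionary-dependent
```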
After the two first lexical sets are obtained, vectorization may be performed on the two first lexical sets.
It should be noted that vocabulary vectorization is a method of representing vocabulary mathematically. Word-vector training is performed on the vocabulary contained in the corpus to obtain the corresponding word vectors. Existing vocabulary vectorization techniques can effectively distinguish synonymy, ambiguity and paraphrase among words, so the spatial distance between word vectors can effectively reflect the similarity of the semantics the word vectors express.
Specifically, an existing vocabulary vectorization technique such as Doc2vec can be used. In this embodiment, Doc2vec is used to represent the word vectors, which can be realized through the Doc2vec algorithm model in the gensim toolkit.
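A sketch of this vectorization step with gensim's Doc2vec follows; the corpus contents and hyper-parameters are assumptions, and only the general API usage is taken from gensim. The vectors vec_a1 and vec_b1 are reused in the similarity sketch further below.

```python
# Sketch of vectorizing the first lexical sets with gensim's Doc2vec, as the
# description suggests; corpus and hyper-parameters are illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus_tokens = [
    ["小姐姐", "歌声", "好听", "很", "喜欢"],          # first lexical set A1
    ["人美", "声甜", "的", "小姐姐", "歌声", "不错"],  # first lexical set B1
    # ... remaining segmented texts of the domain corpus
]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(corpus_tokens)]
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

vec_a1 = model.infer_vector(corpus_tokens[0])  # vector for A1
vec_b1 = model.infer_vector(corpus_tokens[1])  # vector for B1
```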
Performing word segmentation on each of the two texts to be processed to obtain two first lexical sets, and calculating the first similarity of the two texts based on them, includes: letting the two first lexical sets be A1 and B1 respectively, letting the vectors obtained by vectorizing the two first lexical sets be vec(A1) and vec(B1), and letting the first similarity between the two first lexical sets be score(A1, B1)_semantic, then:
For example, suppose there is a corpus of bullet-screen (danmaku) comments containing two comments A and B, where A is "小姐姐歌声好听，很喜欢" ("the young lady's singing is lovely, I really like it") and B is "人美声甜的小姐姐，歌声不错" ("a pretty young lady with a sweet voice, her singing is quite good"). Word segmentation is performed on A and B respectively and stop words are removed; the first lexical set output for A is A1 = "小姐姐 / 歌声 / 好听 / 很 / 喜欢", and the first lexical set output for B is B1 = "人美 / 声甜 / 的 / 小姐姐 / 歌声 / 不错". Vectorizing A1 and B1 respectively yields the word vectors vec(A1) and vec(B1), where:
The first similarity between A and B can then be calculated as:
score(A, B)_semantic = 1.41.
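The formula for score(A, B)_semantic appears only as an image in the original publication and is not reproduced above; note that the worked value 1.41 exceeds 1, so it cannot be a plain cosine. As a hedged stand-in, the sketch below scores the Doc2vec vectors with cosine similarity, one common vector-space measure, under the assumption that any monotone vector-distance measure serves the same role.

```python
# Stand-in for the patent's (unreproduced) semantic-similarity formula:
# cosine similarity between the two Doc2vec vectors from the sketch above.
import numpy as np

def semantic_score(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine similarity between two text vectors (assumed measure)."""
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

score_semantic = semantic_score(vec_a1, vec_b1)
```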
Step 102: input each of the two texts into the preset N-gram language model to obtain two second lexical sets, and calculate a second similarity of the two texts based on the two second lexical sets.
It should be noted that N-gram is a language model that can also perform segmentation. Common N-grams are the Bi-gram (N=2) and the Tri-gram (N=3). For example, the text "我爱深度学习" ("I love deep learning") is decomposed under Bi-gram and Tri-gram respectively as follows (a minimal sketch of this decomposition is given after the example):
Bi-gram: {"我爱", "爱深", "深度", "度学", "学习"},
Tri-gram: {"我爱深", "爱深度", "深度学", "度学习"}.
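A minimal sketch of this character-level decomposition, a sliding window of N characters over the text:

```python
# Character-level N-gram decomposition as in the example above.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("我爱深度学习", 2))  # ['我爱', '爱深', '深度', '度学', '学习']
print(char_ngrams("我爱深度学习", 3))  # ['我爱深', '爱深度', '深度学', '度学习']
```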
Inputting each of the two texts into the preset N-gram language model to obtain the two second lexical sets includes: inputting the two texts into the preset N-gram language model respectively and outputting the two second lexical sets, letting the two second lexical sets be A2 and B2 respectively; and comparing the two second lexical sets to obtain the total number of words in A2, len(A2_n_text), the total number of words in B2, len(B2_n_text), the number of identical words in the two second lexical sets, N_n_text, and the number of distinct words across the two second lexical sets, len(A2 ∪ B2).
Continuing with the example from step 101, where A is "小姐姐歌声好听，很喜欢" and B is "人美声甜的小姐姐，歌声不错", and assuming N=3, inputting A and B into the Tri-gram model respectively outputs the two second lexical sets A2 and B2:
A2 = {"小姐姐", "姐姐歌", "姐歌声", "歌声好", "声好听", "好听很", "听很喜", "很喜欢"},
B2 = {"人美声", "美声甜", "声甜的", "甜的小", "的小姐", "小姐姐", "姐姐歌", "姐歌声", "歌声不", "声不错"}.
From the above two second lexical sets, the total number of words in A2, len(A2_n_text), is 8; the total number of words in B2, len(B2_n_text), is 10; the number of identical words in the two second lexical sets, N_n_text, is 3; and the number of distinct words across the two second lexical sets, len(A2 ∪ B2), is 15.
Calculating the second similarity of the two texts based on the two second lexical sets further includes: letting the second similarity of the two texts be score(A, B)_text, then:
From the statistics of the two second lexical sets obtained above, namely len(A2_n_text), len(B2_n_text), N_n_text and len(A2 ∪ B2), the second similarity of the two texts A and B can be calculated as:
score(A, B)_text = 0.1.
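The formula for score(A, B)_text is likewise an image in the original publication. The sketch below computes the four statistics the patent names and substitutes a Jaccard-style overlap ratio; this stand-in gives 3/15 = 0.2 on the worked example rather than the patent's 0.1, so it is an assumption, not the patented formula.

```python
# Computes the statistics named in step 102 and a stand-in overlap score.
# Reuses char_ngrams from the earlier sketch; punctuation is stripped first.
def text_score(a2: list[str], b2: list[str]) -> float:
    len_a2 = len(a2)                    # len(A2_n_text): 8 in the example
    len_b2 = len(b2)                    # len(B2_n_text): 10
    n_same = len(set(a2) & set(b2))     # N_n_text: 3
    len_union = len(set(a2) | set(b2))  # len(A2 ∪ B2): 15
    return n_same / len_union           # Jaccard-style stand-in measure

a2 = char_ngrams("小姐姐歌声好听很喜欢", 3)
b2 = char_ngrams("人美声甜的小姐姐歌声不错", 3)
print(text_score(a2, b2))  # 0.2 with this stand-in (the patent's formula yields 0.1)
```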
Step 103: calculate the similarity of the two texts based on the first similarity and the second similarity, according to the preset adjustment parameter of the first similarity and the adjustment parameter of the second similarity.
The preset adjustment parameter of the first similarity and the adjustment parameter of the second similarity sum to 1.
Obtaining the similarity of the two texts based on the first similarity and the second similarity includes: letting the two texts be A and B respectively, letting the preset adjustment parameters of the first similarity and the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A, B), then:
sim(A, B) = α × score(A1, B1)_semantic + β × score(A2, B2)_text,
where α + β = 1, 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
According to the example calculations in steps 101 and 102, the first similarity of the two texts A and B is 1.41 and the second similarity is 0.1. Taking the typical values α = 0.6 and β = 0.4, the text similarity of the two texts is:
sim(A, B) = α × score(A1, B1)_semantic + β × score(A2, B2)_text = 0.6 × 1.41 + 0.4 × 0.1 = 0.886.
It follows that the text similarity of the two texts A and B is 0.886.
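Putting step 103 together as a sketch (the 0.6/0.4 split and the two score values are from the worked example; the helper name is ours):

```python
# Weighted combination of the two similarities as in step 103.
def combined_similarity(score_semantic: float, score_text: float,
                        alpha: float = 0.6, beta: float = 0.4) -> float:
    assert abs(alpha + beta - 1.0) < 1e-9, "adjustment parameters must sum to 1"
    return alpha * score_semantic + beta * score_text

print(combined_similarity(1.41, 0.1))  # ≈ 0.886, matching the worked example
```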
In the embodiments of the present disclosure, word segmentation is first performed on each of the two texts to be processed to obtain two first lexical sets, and a first similarity of the two texts is calculated based on the two first lexical sets; the two texts are then input into the preset N-gram language model to obtain two second lexical sets, and a second similarity of the two texts is calculated based on the two second lexical sets; finally, the similarity of the two texts is calculated based on the first similarity and the second similarity, according to the preset adjustment parameters of the first similarity and the second similarity. In this process, both the semantic similarity between the texts and the similarity of their wording are taken into account when calculating text similarity, so that the text similarity is calculated accurately.
Fig. 2 is a structural block diagram of the text similarity calculation device provided by an embodiment of the present disclosure.
As shown in Fig. 2, the text similarity calculation device includes a first calculation module 210, a second calculation module 220 and a third calculation module 230.
Specifically, the first calculation module 210 performs word segmentation on the two texts to be processed. The two texts are present in the corpus of a specific domain; word segmentation is performed on all texts in the corpus of the specific domain and stop words are removed, to obtain the set of all vocabulary contained in the corpus of the specific domain, and the two first lexical sets are obtained from the set of all vocabulary. The two first lexical sets are then vectorized based on all vocabulary contained in the corpus to obtain two word vectors, and the first similarity of the two texts is calculated from the word vectors obtained from the two first lexical sets.
The second calculation module 220 is configured to input each of the two texts into the preset N-gram language model for segmentation to obtain two second lexical sets, letting the two second lexical sets be A2 and B2 respectively; to compare the two second lexical sets to obtain the total number of words in A2, len(A2_n_text), the total number of words in B2, len(B2_n_text), the number of identical words in the two second lexical sets, N_n_text, and the number of distinct words across the two second lexical sets, len(A2 ∪ B2); and, based on these parameters obtained from the two second lexical sets, to calculate the second similarity of the two texts.
The third calculation module 230 is configured to calculate the similarity of the two texts based on the first similarity and the second similarity, according to the preset adjustment parameter of the first similarity and the adjustment parameter of the second similarity, where the two adjustment parameters sum to 1.
It can be understood that the first calculation module 210, the second calculation module 220 and the third calculation module 230 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to an embodiment of the present invention, at least one of the first calculation module 210, the second calculation module 220 and the third calculation module 230 may be implemented at least partially as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, or an application-specific integrated circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable means of integrating or packaging a circuit, or by an appropriate combination of the three implementation forms of software, hardware and firmware. Alternatively, at least one of the first calculation module 210, the second calculation module 220 and the third calculation module 230 may be implemented at least partially as a computer program module which, when run by a computer, performs the functions of the corresponding module.
Fig. 3 schematically shows a block diagram of a computer system provided by an embodiment of the present disclosure.
As shown in Fig. 3, the computer system 300 includes a processor 310, a computer-readable storage medium 320, a signal transmitter 330 and a signal receiver 340. The computer system 300 can perform the method according to the embodiments of the present disclosure.
Specifically, the processor 310 may include, for example, a general-purpose microprocessor, an instruction-set processor and/or a related chipset and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)), and so on. The processor 310 may also include onboard memory for caching purposes. The processor 310 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiments of the present disclosure.
The computer-readable storage medium 320 may be, for example, any medium that can contain, store, communicate, propagate or transport instructions. For example, the readable storage medium may include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus, device or propagation medium. Specific examples of readable storage media include: magnetic storage devices such as magnetic tape or hard disk (HDD); optical storage devices such as compact disc (CD-ROM); memory such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 320 may contain a computer program 321, which may include code/computer-executable instructions that, when executed by the processor 310, cause the processor 310 to perform the method according to the embodiments of the present disclosure or any variation thereof.
The computer program 321 may be configured with computer program code, for example comprising computer program modules. For example, in an exemplary embodiment, the code in the computer program 321 may include one or more program modules, for example module 321A, module 321B, and so on. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, and when these program-module combinations are executed by the processor 310, the processor 310 performs the method according to the embodiments of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, the processor 310 may interact with the signal transmitter 330 and the signal receiver 340 to perform the method according to the embodiments of the present disclosure or any variation thereof.
According to an embodiment of the present invention, at least one of the first calculation module 210, the second calculation module 220 and the third calculation module 230 may be implemented as a computer program module as described with reference to Fig. 3, which, when executed by the processor 310, can implement the corresponding operations described above.
The present disclosure also provides a computer-readable medium, which may be included in the apparatus/device/system described in the above embodiments, or may exist separately without being assembled into the apparatus/device/system. The above computer-readable medium carries one or more programs which, when executed, implement the method according to the embodiments of the present disclosure.
According to an embodiment of the present disclosure, the computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by, or in connection with, an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable or radio-frequency signals, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in many ways, even if such combinations are not explicitly recited in the present disclosure. In particular, without departing from the spirit or teaching of the present disclosure, the features recited in the various embodiments and/or claims may be combined in many ways, and all such combinations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by their equivalents.

Claims (10)

1. A text similarity calculation method, characterized by comprising:
performing word segmentation on each of two texts to be processed to obtain two first lexical sets, and calculating a first similarity of the two texts based on the two first lexical sets;
inputting each of the two texts into a preset N-gram language model to obtain two second lexical sets, and calculating a second similarity of the two texts based on the two second lexical sets;
calculating the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and an adjustment parameter of the second similarity.
2. The method according to claim 1, characterized in that calculating the first similarity between the two first lexical sets comprises:
letting the two first lexical sets be A1 and B1 respectively, letting the vectors obtained by vectorizing the two first lexical sets be vec(A1) and vec(B1) respectively, and letting the first similarity of the two texts be score(A, B)_semantic, then:
3. The method according to claim 1, characterized in that inputting each of the two texts into the preset N-gram language model to obtain two second lexical sets comprises:
inputting the two texts into the preset N-gram language model respectively and outputting the two second lexical sets, letting the two second lexical sets be A2 and B2 respectively;
comparing the two second lexical sets to obtain the total number of words in A2, len(A2_n_text), the total number of words in B2, len(B2_n_text), the number of identical words in the two second lexical sets, N_n_text, and the number of distinct words across the two second lexical sets, len(A2 ∪ B2).
4. The method according to claim 3, characterized in that calculating the second similarity of the two texts based on the two second lexical sets further comprises:
letting the second similarity of the two texts be score(A, B)_text, then:
5. The method according to claim 1, characterized in that the preset adjustment parameter of the first similarity and the adjustment parameter of the second similarity sum to 1, that is:
α + β = 1,
where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
6. The method according to claim 1 or 5, characterized in that obtaining the similarity of the two texts based on the first similarity and the second similarity comprises:
letting the two texts be A and B respectively, letting the preset adjustment parameters of the first similarity and the second similarity be α and β respectively, and letting the similarity of the two texts be sim(A, B), then:
sim(A, B) = α × score(A, B)_semantic + β × score(A, B)_text.
7. The method according to claim 1, characterized in that the two texts are present in a corpus of a specific domain, and performing word segmentation on each of the two texts to obtain two first lexical sets comprises:
performing word segmentation on all texts in the corpus of the specific domain and removing stop words, to obtain the set of all vocabulary contained in the corpus of the specific domain;
obtaining the two first lexical sets from the set of all vocabulary.
8. A text similarity calculation device, characterized by comprising:
a first calculation module, configured to perform word segmentation on each of two texts to be processed to obtain two first lexical sets, and to calculate a first similarity of the two texts based on the two first lexical sets;
a second calculation module, configured to input each of the two texts into a preset N-gram language model to obtain two second lexical sets, and to calculate a second similarity of the two texts based on the two second lexical sets;
a third calculation module, configured to calculate the similarity of the two texts based on the first similarity and the second similarity, according to a preset adjustment parameter of the first similarity and an adjustment parameter of the second similarity.
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, the steps of the text similarity calculation method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, the steps of the text similarity calculation method according to any one of claims 1 to 7 are implemented.
CN201910124084.2A 2019-02-18 2019-02-18 Text similarity calculation method and device and storage medium Active CN109885657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910124084.2A CN109885657B (en) 2019-02-18 2019-02-18 Text similarity calculation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910124084.2A CN109885657B (en) 2019-02-18 2019-02-18 Text similarity calculation method and device and storage medium

Publications (2)

Publication Number Publication Date
CN109885657A true CN109885657A (en) 2019-06-14
CN109885657B CN109885657B (en) 2021-04-27

Family

ID=66928388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910124084.2A Active CN109885657B (en) 2019-02-18 2019-02-18 Text similarity calculation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN109885657B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111160445A (en) * 2019-12-25 2020-05-15 中国建设银行股份有限公司 Bid document similarity calculation method and device
CN111382563A (en) * 2020-03-20 2020-07-07 腾讯科技(深圳)有限公司 Text relevance determining method and device
CN111737445A (en) * 2020-06-22 2020-10-02 中国银行股份有限公司 Knowledge base searching method and device
CN111814447A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN112529091A (en) * 2020-12-18 2021-03-19 广州视源电子科技股份有限公司 Courseware similarity detection method and device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876995A (en) * 2009-12-18 2010-11-03 南开大学 Method for calculating similarity of XML documents
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics
CN103678275A (en) * 2013-04-15 2014-03-26 南京邮电大学 Two-level text similarity calculation method based on subjective and objective semantics
CN105469104A (en) * 2015-11-03 2016-04-06 小米科技有限责任公司 Text information similarity calculating method, device and server
US20170032781A1 (en) * 2015-07-28 2017-02-02 Google Inc. Collaborative language model biasing
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement
CN107436864A (en) * 2017-08-04 2017-12-05 逸途(北京)科技有限公司 A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec
CN108090047A (en) * 2018-01-10 2018-05-29 华南师范大学 A kind of definite method and apparatus of text similarity
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN109284502A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876995A (en) * 2009-12-18 2010-11-03 南开大学 Method for calculating similarity of XML documents
CN103678275A (en) * 2013-04-15 2014-03-26 南京邮电大学 Two-level text similarity calculation method based on subjective and objective semantics
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement
US20170032781A1 (en) * 2015-07-28 2017-02-02 Google Inc. Collaborative language model biasing
CN105469104A (en) * 2015-11-03 2016-04-06 小米科技有限责任公司 Text information similarity calculating method, device and server
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN107436864A (en) * 2017-08-04 2017-12-05 逸途(北京)科技有限公司 A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec
CN108090047A (en) * 2018-01-10 2018-05-29 华南师范大学 A kind of definite method and apparatus of text similarity
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN109284502A (en) * 2018-09-13 2019-01-29 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李俊峰 (Li Junfeng): "Multi-feature-fusion similarity calculation method for news clustering" (多特征融合的新闻聚类相似度计算方法), 《软件》 (Software) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111160445A (en) * 2019-12-25 2020-05-15 中国建设银行股份有限公司 Bid document similarity calculation method and device
CN111160445B (en) * 2019-12-25 2023-06-16 中国建设银行股份有限公司 Bid file similarity calculation method and device
CN111382563A (en) * 2020-03-20 2020-07-07 腾讯科技(深圳)有限公司 Text relevance determining method and device
CN111382563B (en) * 2020-03-20 2023-09-08 腾讯科技(深圳)有限公司 Text relevance determining method and device
CN111737445A (en) * 2020-06-22 2020-10-02 中国银行股份有限公司 Knowledge base searching method and device
CN111737445B (en) * 2020-06-22 2023-09-01 中国银行股份有限公司 Knowledge base searching method and device
CN111814447A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN111814447B (en) * 2020-06-24 2022-05-27 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN112529091A (en) * 2020-12-18 2021-03-19 广州视源电子科技股份有限公司 Courseware similarity detection method and device and storage medium
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN112364947B (en) * 2021-01-14 2021-06-29 北京育学园健康管理中心有限公司 Text similarity calculation method and device

Also Published As

Publication number Publication date
CN109885657B (en) 2021-04-27

Similar Documents

Publication Publication Date Title
CN109885657A (en) A kind of calculation method of text similarity, device and storage medium
KR101873619B1 (en) Boolean logic in a state machine lattice
TWI515668B (en) Methods and systems for detection in a state machine
KR101996961B1 (en) Methods and systems for data analysis in a state machine
US10255911B2 (en) System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding
CN108288468A (en) Audio recognition method and device
CN104487956B (en) The method and system of state vector data in being held up for use state power traction
KR20140103143A (en) Counter operation in a state machine lattice
CN106663423A (en) System and method of automatic speech recognition using on-the-fly word lattice generation with word histories
US11341945B2 (en) Techniques for learning effective musical features for generative and retrieval-based applications
US20230091272A1 (en) Audio content recognition method and apparatus, and device and computer-readable medium
CN109635094A (en) Method and apparatus for generating answer
CN108428451A (en) Sound control method, electronic equipment and speech control system
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
CN109858045A (en) Machine translation method and device
CN111382270A (en) Intention recognition method, device and equipment based on text classifier and storage medium
CN110473571A (en) Emotion identification method and device based on short video speech
CN105229625B (en) Method for voice recognition and acoustic processing device
CN109766496A (en) A kind of content risks recognition methods, system, equipment and medium
CN109388696A (en) Delete method, apparatus, storage medium and the electronic equipment of rumour article
CN113299298B (en) Residual error unit, network and target identification method, system, device and medium
CN109213916A (en) Method and apparatus for generating information
CN107680587A (en) Acoustic training model method and apparatus
CN110990531B (en) Text emotion recognition method and device
CN105340005A (en) Histogram based pre-pruning scheme for active hmms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant