CN109284490A - Text similarity calculation method, apparatus, electronic device and storage medium

Info

Publication number: CN109284490A
Application number: CN201811067314.8A
Authority: CN
Inventors: 徐乐乐
Original assignee: Wuhan Douyu Network Technology Co Ltd
Current assignee: Changsha Jinlv Network Technology Co ltd
Priority date: 2018-09-13
Filing date: 2018-09-13
Publication date: 2019-01-29
Anticipated expiration: 2038-09-13
Also published as: CN109284490B

Abstract

An embodiment of the invention discloses a text similarity calculation method, an apparatus, an electronic device and a storage medium. The method comprises: calculating a part-of-speech similarity between two texts based on preset part-of-speech weights; calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm; and determining a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity. This technical solution improves the accuracy of the text similarity calculation and, in turn, the accuracy of matching similar texts.

Description

Text similarity calculation method, apparatus, electronic device and storage medium

Technical field

Embodiments of the present invention relate to the technical field of data processing, and in particular to a text similarity calculation method, an apparatus, an electronic device and a storage medium.

Background art

Live-streaming applications for the iOS and Android platforms are currently developing rapidly and are very popular with users. Bullet-screen comments are a popular way of exchanging and sharing information on live-streaming platforms; they enable interaction between viewers and the streamer and help create a good live-streaming atmosphere.

In the field of machine conversation, one important step is to find the reply whose semantic similarity to the input sentence is highest. Likewise, a live-streaming room often needs a robot to find, for a viewer's bullet-screen comment, the reply most similar to it, so that the comment can be answered automatically. At present, live-streaming rooms generally use the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to calculate the similarity between two bullet-screen comments. The main idea of the TF-IDF algorithm is to determine the keywords of each document from the frequency distribution of words or phrases across the document set, to build a term-frequency vector from the number of times the keywords occur in the document set, and to determine the similarity between documents by calculating the similarity between their term-frequency vectors. The TF-IDF algorithm therefore considers only the frequency of the words in a document, in other words only the importance of the words in the document.

Therefore, to improve the accuracy of text similarity calculation, existing similarity calculation algorithms need further improvement.

Summary of the invention

Embodiments of the present invention provide a text similarity calculation method, an apparatus, an electronic device and a storage medium, by which the accuracy of text similarity calculation can be improved.

To achieve the above object, the embodiments of the present invention adopt the following technical solutions:

In a first aspect, an embodiment of the present invention provides a text similarity calculation method, the method comprising:

calculating a part-of-speech similarity between two texts based on preset part-of-speech weights;

calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;

determining a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

Further, calculating the part-of-speech similarity between two texts based on preset part-of-speech weights comprises:

calculating the part-of-speech similarity between the two texts according to the following formula:

[formula not reproduced]

where Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, g_i denotes the part-of-speech weight of the i-th word in text A, g'_i denotes the part-of-speech weight of the i-th word in text B, n denotes the total number of words in the set formed by the words of text A and the words of text B, L_A denotes the total number of words in text A, and L_B denotes the total number of words in text B.

Further, calculating the text similarity between the two texts based on the improved TF-IDF algorithm comprises:

calculating the TF-IDF weight of each word in each text according to the following formula:

[formula not reproduced]

where W_ij denotes the TF-IDF weight of word j in text i, tf_ij denotes the number of times word j occurs in text i, N denotes the total number of texts in the text set, n_j denotes the number of texts in the text set that contain word j, i is the identifier of a text, and j is the identifier of a word in the text;

calculating the text similarity between the two texts based on the TF-IDF weights of the words in the two texts.

Further, calculating the text similarity between the two texts based on the TF-IDF weights of the words in the two texts comprises:

calculating the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:

$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$

where Sim_tf-idf(A, B) denotes the text similarity between text A and text B, W_ai denotes the TF-IDF weight of the i-th word in text A, W_bi denotes the TF-IDF weight of the i-th word in text B, and n denotes the total number of words in the set formed by the words of text A and the words of text B.

Further, determining the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity comprises:

determining the comprehensive similarity between the two texts according to the following formula:

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)

where Sim(A, B) denotes the comprehensive similarity between text A and text B, Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, and Sim_tf-idf(A, B) denotes the text similarity between text A and text B.

Further, before calculating the part-of-speech similarity between two texts based on preset part-of-speech weights, or before calculating the text similarity between the two texts based on the improved TF-IDF algorithm, the method further comprises:

performing word segmentation and part-of-speech tagging on the two texts.

Further, performing word segmentation and part-of-speech tagging on the two texts comprises:

performing word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.

In a second aspect, an embodiment of the present invention provides a text similarity calculation apparatus, the apparatus comprising:

a part-of-speech similarity calculation module, configured to calculate a part-of-speech similarity between two texts based on preset part-of-speech weights;

a text similarity calculation module, configured to calculate a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;

a comprehensive similarity calculation module, configured to determine a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

Further, the part-of-speech similarity calculation module is specifically configured to calculate the part-of-speech similarity between two texts according to the following formula:

[formula not reproduced]

where Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, g_i denotes the part-of-speech weight of the i-th word in text A, g'_i denotes the part-of-speech weight of the i-th word in text B, n denotes the total number of words in the set formed by the words of text A and the words of text B, L_A denotes the total number of words in text A, and L_B denotes the total number of words in text B.

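Further, the text similarity calculation module comprises:

a TF-IDF weight calculation unit, configured to calculate the TF-IDF weight of each word in each text according to the following formula:

[formula not reproduced]

a text similarity calculation unit, configured to calculate the text similarity between the two texts based on the TF-IDF weights of the words in the two texts.

Further, the text similarity calculation unit is specifically configured to calculate the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:

$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$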

Further, the comprehensive similarity calculation module is specifically configured to determine the comprehensive similarity between the two texts according to the following formula:

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)

Further, the apparatus also comprises a processing module, configured to perform word segmentation and part-of-speech tagging on the two texts before the part-of-speech similarity between the two texts is calculated based on preset part-of-speech weights, or before the text similarity between the two texts is calculated based on the improved TF-IDF algorithm.

Further, the processing module is specifically configured to perform word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.

In a third aspect, an embodiment of the present invention provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text similarity calculation method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the text similarity calculation method according to the first aspect.

The text similarity calculation method provided by the embodiments of the present invention evaluates the similarity between texts comprehensively by combining the part-of-speech similarity and the text similarity between them. This improves the accuracy of the text similarity calculation and, in turn, the accuracy of matching similar texts.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the content of the embodiments and these drawings without creative effort.

Fig. 1 is a schematic flowchart of a text similarity calculation method provided by Embodiment one of the present invention;

Fig. 2 is a schematic structural diagram of a text similarity calculation apparatus provided by Embodiment two of the present invention;

Fig. 3 is a schematic structural diagram of an electronic device provided by Embodiment three of the present invention.

Detailed description of the embodiments

To make the technical problem solved, the technical solution adopted and the technical effect achieved by the present invention clearer, the technical solutions of the embodiments are described in further detail below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Embodiment one

Fig. 1 is a schematic flowchart of a text similarity calculation method provided by Embodiment one of the present invention. The method disclosed in this embodiment is suitable for the field of machine conversation, where the reply sentence with the highest semantic similarity to an input sentence is matched from a corpus so that the input sentence can be answered automatically. It is particularly suitable for matching, in a live-streaming room, the sentence most similar to a viewer's bullet-screen comment, so that a robot can reply to the comment automatically. The method may be executed by a text similarity calculation apparatus, which may be implemented in software and/or hardware and is typically integrated in a terminal, for example a server. As shown in Fig. 1, the method comprises the following steps:

110. Calculate the part-of-speech similarity between two texts based on preset part-of-speech weights.

The parts of speech specifically include nouns, verbs, interrogatives, adjectives, adverbs and so on. The parts of speech of corresponding words in two texts reflect the similarity of the two texts to some extent. Taking the part-of-speech similarity into account when calculating the similarity between two texts can therefore improve the accuracy of the calculation.

Illustratively, the part-of-speech similarity between two texts is calculated from the preset part-of-speech weights according to formula (1):

[formula (1) not reproduced]

where Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, g_i denotes the part-of-speech weight of the i-th word in text A, g'_i denotes the part-of-speech weight of the i-th word in text B, n denotes the total number of words in the set formed by the words of text A and the words of text B, L_A denotes the total number of words in text A, and L_B denotes the total number of words in text B. When i is greater than L_A, g_i = 0; when i is greater than L_B, g'_i = 0. The example below illustrates these definitions. The way the denominator of formula (1) is set avoids a zero denominator, which widens the applicability of formula (1).

The preset part-of-speech weights can be derived for a specific business scenario by applying formula (1) to multiple groups of texts whose part-of-speech similarity is known and solving backwards for the weight of each part of speech. In general, the nouns and verbs of a sentence express most of its meaning, that is, nouns and verbs carry a relatively large share of the meaning of the sentence. The weights can therefore also be set from business experience, with relatively high weights for nouns and verbs and relatively low weights for other parts of speech. Preferably, when the business scenario concerns bullet-screen text sent on a live-streaming platform, the weight of nouns may take a value between 0.7 and 0.8, the weight of verbs a value between 0.6 and 0.7, and the weight of interrogatives a value between 0.5 and 0.6. This embodiment is illustrated with a noun weight of 0.7, a verb weight of 0.6, an interrogative weight of 0.5, and a weight of 0 for all other words.
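The weight choices above can be captured in a small lookup table. The following is only a minimal sketch: the tag name `interrogative`, the names `POS_WEIGHTS` and `SUGGESTED_RANGES`, and both helper functions are illustrative assumptions, not part of the described method.

```python
# Example weights from the embodiment: noun 0.7, verb 0.6, interrogative 0.5.
# The tag strings are hypothetical labels chosen for this sketch.
POS_WEIGHTS = {"n": 0.7, "v": 0.6, "interrogative": 0.5}

# Preferred value ranges for bullet-screen text, as stated in the description.
SUGGESTED_RANGES = {
    "n": (0.7, 0.8),
    "v": (0.6, 0.7),
    "interrogative": (0.5, 0.6),
}


def pos_weight(tag: str) -> float:
    """Return the preset weight for a tag; every unlisted tag weighs 0."""
    return POS_WEIGHTS.get(tag, 0.0)


def within_suggested_ranges(weights: dict) -> bool:
    """Check that each listed weight lies inside its preferred range."""
    return all(
        low <= weights[tag] <= high
        for tag, (low, high) in SUGGESTED_RANGES.items()
    )
```

Keeping the weights in one table makes it straightforward to swap in values derived for another business scenario.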

Assume that text A is: "I want to go to Beijing to attend university";

and text B is: "The universities in Beijing are really fun".

After word segmentation and part-of-speech tagging of text A and text B, we obtain:

A = I/n want-to-go/adv Beijing/n attend/v university/n

B = Beijing/n of/adv university/n really/adj fun/adj

The set formed by the words of text A and the words of text B is {I, want-to-go, Beijing, attend, university, of, really, fun}, so n equals 8. The part-of-speech weights of the words in the set are U = {0.7, 0, 0.7, 0.6, 0.7, 0, 0, 0}. The part-of-speech weights of the words in text A are therefore g_i = {0.7, 0, 0.7, 0.6, 0.7, 0, 0, 0}, and those of the words in text B are g'_i = {0.7, 0, 0.7, 0, 0, 0, 0, 0}. Since text A contains five words and text B contains five words, L_A = 5 and L_B = 5.
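The construction of g_i, g'_i and n in this example can be sketched as follows. The English glosses stand in for the segmented words, and `weight_vectors` is a hypothetical helper name; the sketch stops at building the vectors and does not evaluate formula (1).

```python
# Example weights from the embodiment; every other tag weighs 0.
POS_WEIGHTS = {"n": 0.7, "v": 0.6}


def weight_vectors(tagged_a, tagged_b, weights=POS_WEIGHTS):
    """Build the part-of-speech weight vectors of two tagged texts.

    Each text is a list of (word, tag) pairs. Both vectors are padded
    with zeros up to n, the size of the set of words from both texts,
    so that g_i = 0 for i > L_A and g'_i = 0 for i > L_B.
    """
    vocabulary = []
    for word, _tag in tagged_a + tagged_b:
        if word not in vocabulary:
            vocabulary.append(word)
    n = len(vocabulary)

    def padded(tagged):
        vector = [weights.get(tag, 0.0) for _word, tag in tagged]
        return vector + [0.0] * (n - len(vector))

    return padded(tagged_a), padded(tagged_b), n


text_a = [("I", "n"), ("want-to-go", "adv"), ("Beijing", "n"),
          ("attend", "v"), ("university", "n")]
text_b = [("Beijing", "n"), ("of", "adv"), ("university", "n"),
          ("really", "adj"), ("fun", "adj")]

g, g_prime, n = weight_vectors(text_a, text_b)
# g       -> [0.7, 0.0, 0.7, 0.6, 0.7, 0.0, 0.0, 0.0]
# g_prime -> [0.7, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0]
# n       -> 8
```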

The part-of-speech similarity between text A and text B calculated from formula (1) is therefore Sim_wordpro(A, B) = 0.458.

120. Calculate the text similarity between the two texts based on the improved term frequency-inverse document frequency (TF-IDF) algorithm.

Specifically, the text similarity between the two texts is calculated with the improved TF-IDF algorithm as follows.

For a specific business scenario, a corpus relevant to that scenario can be prepared in advance. Suppose, for example, that the scenario is calculating the text similarity between bullet-screen comments sent to live-streaming room No. 1. Because each live-streaming room streams different content, different rooms have different themes, and the domain words contained in the bullet-screen comments sent to different rooms differ accordingly. For example, streamer a of room No. 1 is especially good at games, in particular "Honor of Kings", so room No. 1 frequently streams game video. The theme of room No. 1 may then be defined as the name of the frequently streamed game, such as "Honor of Kings", or as content related to the game, such as the names of characters, equipment or game objectives. For a room that frequently streams "Honor of Kings", the theme could also be "Hua Mulan", "Diaochan" or "Luna".

The bullet-screen comments sent to room No. 1 therefore inevitably contain many domain words related to the streamed game. All bullet-screen comments sent to room No. 1 within a set period can be taken as the corpus, containing domain words, for this business scenario. Training on this corpus with formula (2) yields the TF-IDF weight vector space for the scenario, that is, the vector space formed by the TF-IDF weights of the domain words. Each bullet-screen comment to be matched in room No. 1 is then a point, or a vector, in this TF-IDF weight vector space. Mapping a comment to be matched into this vector space therefore gives the TF-IDF weight of each word in that comment.
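Building the corpus statistics described above can be sketched as follows. The sketch only gathers the quantities N, n_j and tf_ij defined for formula (2); the function names and the sample comments are hypothetical.

```python
from collections import Counter


def document_frequencies(corpus):
    """Return (N, n) for a corpus given as a list of tokenized texts.

    N is the number of texts; n[word] is the number of texts that
    contain the word. A word that never occurs yields n[word] == 0.
    """
    n = Counter()
    for tokens in corpus:
        n.update(set(tokens))  # count each word at most once per text
    return len(corpus), n


def term_counts(tokens):
    """Return tf for one text: how often each word occurs in it."""
    return Counter(tokens)


# Hypothetical, already segmented comments collected from one room.
corpus = [
    ["Luna", "strong"],
    ["Luna", "Diaochan"],
    ["nice", "play", "play"],
]
N, n = document_frequencies(corpus)
tf = term_counts(corpus[2])
```

Here `n["Luna"]` is 2 and `n["unseen"]` is 0, which is exactly the new-word case discussed next.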

This embodiment improves on the existing TF-IDF weight calculation formula. The TF-IDF weight calculation formula provided in this embodiment, formula (2), makes it possible to compute TF-IDF weights for new words in a text to be matched, and therefore to perform similarity matching on texts to be matched that contain new words. A new word is a word not contained in the TF-IDF corpus or TF-IDF dictionary, so the number of texts in the text set that contain it is n_j = 0. The existing TF-IDF weight calculation formula cannot handle the case in which new words appear in a text to be matched.
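The new-word problem can be illustrated with the widely used textbook weight tf × log(N / n_j), which divides by zero when n_j = 0. Because formula (2) is not reproduced in this text, `smoothed_weight` below uses add-one smoothing purely as an assumed stand-in; it should not be read as the formula of this embodiment.

```python
import math


def classic_weight(tf, N, n_j):
    """Textbook TF-IDF weight: tf * log(N / n_j).

    Raises ZeroDivisionError when n_j == 0, i.e. for a new word.
    """
    return tf * math.log(N / n_j)


def smoothed_weight(tf, N, n_j):
    """Assumed stand-in (NOT formula (2)): add-one smoothed TF-IDF.

    Stays finite for n_j == 0 and is 0 when every text contains the word.
    """
    return tf * math.log((N + 1) / (n_j + 1))


def safe_classic_weight(tf, N, n_j):
    """Return the classic weight, or None where it is undefined."""
    try:
        return classic_weight(tf, N, n_j)
    except ZeroDivisionError:
        return None
```

For a new word, `safe_classic_weight(1, 9, 0)` returns `None`, whereas `smoothed_weight(1, 9, 0)` returns a finite positive weight.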

In the TF-IDF algorithm, if a word or phrase occurs with a high frequency tf in one text but rarely in the other texts of the text set, it is considered to discriminate well between classes and to be suitable for classification. Such a word or phrase can serve as a keyword and is given a higher TF-IDF weight. The TF-IDF weight of a word therefore increases with its frequency and with its rarity. The TF-IDF weight of each word in each text is calculated with formula (2), so that each text can be expressed as a real-valued vector of TF-IDF weights. The real-valued vectors of the texts are then normalized to the same length. Finally, the cosine similarity of the real-valued vectors of each pair of texts is calculated with formula (3); this cosine similarity is the text similarity between the two texts.

It should be noted that the order of step 110 and step 120 is not restricted: either step 120 or step 110 may be executed first. This embodiment is illustrated with step 110 executed first, which does not limit the execution order of step 110 and step 120.

Word segmentation and part-of-speech tagging are performed on the two texts, specifically with the jieba word segmentation tool in Python; this is not described in further detail in this embodiment.

130. Determine the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

Illustratively, the comprehensive similarity between the two texts is determined from the part-of-speech similarity and the text similarity according to formula (4):

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B) (4)

Continuing with the example above, assume that text A is: "I want to go to Beijing to attend university";

and text B is: "The universities in Beijing are really fun".

A = I/n want-to-go/adv Beijing/n attend/v university/n

B = Beijing/n of/adv university/n really/adj fun/adj

The mappings of text A and text B into the TF-IDF vector space are, respectively:

W_ai = {0.1, 0.2, 0.3, 0.1, 0.6, 0.1, 0.1, 0.1}

W_bi = {0.1, 0.2, 0.5, 0.2, 0.6, 0.3, 0.4, 0.3}

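The text similarity between text A and text B is then obtained according to formula (3), the cosine similarity of the two vectors:

$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}} = \frac{0.68}{\sqrt{0.54} \, \sqrt{1.04}} \approx 0.907$$

The text similarity Sim_tf-idf(A, B) = cos θ between two texts takes values in [-1, 1]. The closer the calculated value is to 1, the higher the text similarity between the two texts, that is, the closer their semantics.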
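The cosine similarity of the two example vectors can be checked with a few lines of Python. This is a minimal sketch; `cosine_similarity` is a hypothetical helper, and returning 0.0 for an all-zero vector is an assumption that the text does not address.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


# TF-IDF weight vectors of text A and text B from the example.
w_a = [0.1, 0.2, 0.3, 0.1, 0.6, 0.1, 0.1, 0.1]
w_b = [0.1, 0.2, 0.5, 0.2, 0.6, 0.3, 0.4, 0.3]

sim_tf_idf = cosine_similarity(w_a, w_b)
print(round(sim_tf_idf, 3))  # prints 0.907
```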

The comprehensive similarity between text A and text B is then obtained according to formula (4):

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B) = 0.458 * 0.907 = 0.415
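Formula (4) and its use for picking a reply can be sketched as follows. The function names, the candidate labels and the scores of the second candidate are hypothetical; only the values 0.458 and 0.907 come from the example.

```python
def comprehensive_similarity(pos_similarity, text_similarity):
    """Formula (4): product of part-of-speech and text similarity."""
    return pos_similarity * text_similarity


def best_match(candidates):
    """Return the candidate with the highest comprehensive similarity.

    `candidates` maps a candidate label to a pair
    (part-of-speech similarity, text similarity) against the input text.
    """
    return max(
        candidates,
        key=lambda label: comprehensive_similarity(*candidates[label]),
    )


score = comprehensive_similarity(0.458, 0.907)
print(round(score, 3))  # prints 0.415

# Hypothetical candidates: the second has a lower text similarity but a
# higher part-of-speech similarity, so it wins on the combined score.
candidates = {"reply_1": (0.458, 0.907), "reply_2": (0.9, 0.6)}
print(best_match(candidates))  # prints reply_2
```

Because the two similarities are multiplied, a candidate has to score reasonably on both to rank highly.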

As can be seen, the text similarity between text A and text B, 0.907, is very high. If the semantic similarity between text A and text B were determined from the text similarity alone, the result would deviate considerably and accuracy would be low. With the scheme of this embodiment, the comprehensive similarity between text A and text B is not very high, which indicates that the two texts are not very similar semantically and agrees with the actual situation. By combining the part-of-speech similarity and the text similarity between two texts to evaluate their comprehensive similarity, this embodiment therefore improves the accuracy of the semantic similarity calculation and, in turn, the accuracy of matching similar texts.

The text similarity calculation method provided in this embodiment calculates the part-of-speech similarity between two texts based on preset part-of-speech weights, calculates the text similarity between the two texts based on the improved TF-IDF algorithm, and determines the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity. These technical means improve the accuracy of the semantic similarity calculated between two texts and, in turn, the accuracy of matching similar texts.

Embodiment two

Fig. 2 is a schematic structural diagram of a text similarity calculation apparatus provided by Embodiment two of the present invention. As shown in Fig. 2, the apparatus comprises a part-of-speech similarity calculation module 210, a text similarity calculation module 220 and a comprehensive similarity calculation module 230.

The part-of-speech similarity calculation module 210 is configured to calculate the part-of-speech similarity between two texts based on preset part-of-speech weights.

The text similarity calculation module 220 is configured to calculate the text similarity between the two texts based on the improved term frequency-inverse document frequency (TF-IDF) algorithm.

The comprehensive similarity calculation module 230 is configured to determine the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

Further, the part-of-speech similarity calculation module 210 is specifically configured to calculate the part-of-speech similarity between two texts according to the following formula:

[formula not reproduced]

Further, the text similarity calculation module 220 comprises a TF-IDF weight calculation unit, configured to calculate the TF-IDF weight of each word in each text, and a text similarity calculation unit, configured to calculate the text similarity between the two texts based on the TF-IDF weights of the words in the two texts.

Further, the text similarity calculation unit is specifically configured to calculate the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:

$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$

Further, the comprehensive similarity calculation module 230 is specifically configured to determine the comprehensive similarity according to the following formula:

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)

The text similarity calculation apparatus provided in this embodiment calculates the part-of-speech similarity between two texts based on preset part-of-speech weights, calculates the text similarity between the two texts based on the improved TF-IDF algorithm, and determines the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity. These technical means improve the accuracy of the semantic similarity calculated between two texts and, in turn, the accuracy of matching similar texts.

Embodiment three

Fig. 3 is a schematic structural diagram of an electronic device provided by Embodiment three of the present invention. As shown in Fig. 3, the electronic device comprises a processor 670, a memory 671, and a computer program stored on the memory 671 and executable on the processor 670. There may be one or more processors 670; Fig. 3 takes one processor 670 as an example. When executing the computer program, the processor 670 implements the text similarity calculation method described in Embodiment one. As shown in Fig. 3, the electronic device may further comprise an input device 672 and an output device 673. The processor 670, the memory 671, the input device 672 and the output device 673 may be connected by a bus or in other ways; Fig. 3 takes connection by a bus as an example.

As a computer-readable storage medium, the memory 671 can be used to store software programs, computer-executable programs and modules, such as the text similarity calculation apparatus or its modules in the embodiments of the present invention (for example, the part-of-speech similarity calculation module 210, the text similarity calculation module 220 and the comprehensive similarity calculation module 230 of the text similarity calculation apparatus). By running the software programs, instructions and modules stored in the memory 671, the processor 670 executes the various functional applications and data processing of the electronic device, that is, implements the text similarity calculation method described above.

The memory 671 may mainly comprise a program storage area and a data storage area. The program storage area can store an operating system and the application program required for at least one function; the data storage area can store data created through use of the terminal, and so on. In addition, the memory 671 may comprise high-speed random access memory and may also comprise non-volatile memory, for example at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device. In some examples, the memory 671 may further comprise memory located remotely from the processor 670; such remote memory can be connected to the electronic device or storage medium through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The input device 672 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 673 may comprise a display device such as a display screen.

Embodiment four

Embodiment four of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a text similarity calculation method, the method comprising:

calculating a part-of-speech similarity between two texts based on preset part-of-speech weights;

calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;

determining a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above; they can also perform the operations related to text similarity calculation provided by any embodiment of the present invention.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. On this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a storage medium, a network device or the like) to execute the method described in each embodiment of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims

1. A text similarity calculation method, characterized by comprising:

calculating a part-of-speech similarity between two texts based on preset part-of-speech weights;

calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;

determining a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

2. The method according to claim 1, characterized in that calculating the part-of-speech similarity between two texts based on preset part-of-speech weights comprises:

calculating the part-of-speech similarity between the two texts according to the following formula:

[formula not reproduced]

where Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, g_i denotes the part-of-speech weight of the i-th word in text A, g'_i denotes the part-of-speech weight of the i-th word in text B, n denotes the total number of words in the set formed by the words of text A and the words of text B, L_A denotes the total number of words in text A, and L_B denotes the total number of words in text B.

3. The method according to claim 1, characterized in that calculating the text similarity between the two texts based on the improved TF-IDF algorithm comprises:

calculating the TF-IDF weight of each word in each text according to the following formula:

[formula not reproduced]

where W_ij denotes the TF-IDF weight of word j in text i, tf_ij denotes the number of times word j occurs in text i, N denotes the total number of texts in the text set, n_j denotes the number of texts in the text set that contain word j, i is the identifier of a text, and j is the identifier of a word in the text;

calculating the text similarity between the two texts based on the TF-IDF weights of the words in the two texts.

4. The method according to claim 3, characterized in that calculating the text similarity between the two texts based on the TF-IDF weights of the words in the two texts comprises:

calculating the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:

$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$

where Sim_tf-idf(A, B) denotes the text similarity between text A and text B, W_ai denotes the TF-IDF weight of the i-th word in text A, W_bi denotes the TF-IDF weight of the i-th word in text B, and n denotes the total number of words in the set formed by the words of text A and the words of text B.

5. The method according to any one of claims 1-4, characterized in that determining the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity comprises:

determining the comprehensive similarity between the two texts according to the following formula:

Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)

where Sim(A, B) denotes the comprehensive similarity between text A and text B, Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, and Sim_tf-idf(A, B) denotes the text similarity between text A and text B.

6. The method according to any one of claims 1-4, characterized in that, before calculating the part-of-speech similarity between two texts based on preset part-of-speech weights, or before calculating the text similarity between the two texts based on the improved TF-IDF algorithm, the method further comprises:

performing word segmentation and part-of-speech tagging on the two texts.

7. The method according to claim 6, characterized in that performing word segmentation and part-of-speech tagging on the two texts comprises:

performing word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.

8. A text similarity calculation apparatus, characterized in that the apparatus comprises:

a part-of-speech similarity calculation module, configured to calculate a part-of-speech similarity between two texts based on preset part-of-speech weights;

a text similarity calculation module, configured to calculate a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;

a comprehensive similarity calculation module, configured to determine a comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the text similarity calculation method according to any one of claims 1-7.

10. A storage medium containing computer-executable instructions which, when executed by a computer processor, implement the text similarity calculation method according to any one of claims 1-7.