CN109284490A - Text similarity calculation method, apparatus, electronic device and storage medium - Google Patents

Text similarity calculation method, apparatus, electronic device and storage medium

Info

Publication number
CN109284490A
Authority
CN
China
Prior art keywords
text
similarity
word
texts
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811067314.8A
Other languages
Chinese (zh)
Other versions
CN109284490B (en)
Inventor
徐乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Jinlv Network Technology Co ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd
Priority to CN201811067314.8A
Publication of CN109284490A
Application granted
Publication of CN109284490B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

An embodiment of the invention discloses a text similarity calculation method, apparatus, electronic device and storage medium. The method comprises: calculating a part-of-speech similarity between two texts based on preset part-of-speech weights; calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm; and determining a comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity. This technical solution improves the accuracy of text similarity calculation and, in turn, the accuracy of similar-text matching.

Description

Text similarity calculation method, apparatus, electronic device and storage medium
Technical field
Embodiments of the present invention relate to the technical field of data processing, and in particular to a text similarity calculation method, apparatus, electronic device and storage medium.
Background
At present, live-streaming applications for the iOS and Android platforms are developing rapidly and are very popular with users. Bullet comments (barrage) are a popular way of exchanging and sharing information on live-streaming platforms; they enable interaction between viewers and the streamer and help build a good live-streaming atmosphere.
In the field of machine dialogue, one important step is finding the reply with the highest semantic similarity to an input sentence. Likewise, in a live-streaming room a robot often needs to take a viewer's bullet comment, find a reply with high similarity to it, and answer the comment automatically. At present, live-streaming rooms generally use the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to calculate the similarity between two bullet comments. The main idea of TF-IDF is to determine the keywords of each document from the frequency distribution of words or phrases across the document set, build a term-frequency vector from the number of times the keywords occur in the document set, and determine the similarity between documents by calculating the similarity between their term-frequency vectors. TF-IDF therefore considers only the frequency of words in a document, in other words only how important each word is to the document.
Therefore, to improve the accuracy of text similarity calculation, the existing similarity calculation algorithm needs further improvement.
Summary of the invention
Embodiments of the present invention provide a text similarity calculation method, apparatus, electronic device and storage medium, by which the accuracy of text similarity calculation can be improved.
To achieve the above object, embodiments of the present invention adopt the following technical solutions.
In a first aspect, an embodiment of the invention provides a text similarity calculation method, the method comprising:
calculating a part-of-speech similarity between two texts based on preset part-of-speech weights;
calculating a text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;
determining a comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity.
Further, calculating the part-of-speech similarity between two texts based on preset part-of-speech weights comprises:
calculating the part-of-speech similarity between the two texts according to the following formula:
where $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, $g_i$ denotes the part-of-speech weight of the i-th word in text A, $g'_i$ denotes the part-of-speech weight of the i-th word in text B, $n$ denotes the total number of words in the set formed by the words in text A and the words in text B, $L_A$ denotes the total number of words in text A, and $L_B$ denotes the total number of words in text B.
Further, calculating the text similarity between the two texts based on the improved TF-IDF algorithm comprises:
calculating the TF-IDF weight of each word in each text according to the following formula:
where $W_{ij}$ denotes the TF-IDF weight of word j in text i, $tf_{ij}$ denotes the number of times word j occurs in text i, $N$ denotes the total number of texts in the text set, $n_j$ denotes the number of texts in the text set that contain word j, i is the text identifier, and j is the identifier of a word in the text;
calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts.
Further, calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts comprises:
calculating the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:
$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$
where $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B, $W_{ai}$ denotes the TF-IDF weight of the i-th word in text A, $W_{bi}$ denotes the TF-IDF weight of the i-th word in text B, and $n$ denotes the total number of words in the set formed by the words in text A and the words in text B.
Further, determining the comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity comprises:
determining the comprehensive similarity between the two texts according to the following formula:
$$Sim(A, B) = Sim_{wordpro}(A, B) \times Sim_{tf\text{-}idf}(A, B)$$
where $Sim(A, B)$ denotes the comprehensive similarity between text A and text B, $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, and $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B.
Further, before calculating the part-of-speech similarity between two texts based on preset part-of-speech weights, or before calculating the text similarity between the two texts based on the improved TF-IDF algorithm, the method further comprises:
performing word segmentation and part-of-speech tagging on the two texts.
Further, performing word segmentation and part-of-speech tagging on the two texts comprises:
performing word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.
In a second aspect, an embodiment of the invention provides a text similarity calculation apparatus, the apparatus comprising:
a part-of-speech similarity calculation module, configured to calculate a part-of-speech similarity between two texts based on preset part-of-speech weights;
a text similarity calculation module, configured to calculate a text similarity between the two texts based on an improved TF-IDF algorithm;
a comprehensive similarity calculation module, configured to determine a comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity.
Further, the part-of-speech similarity calculation module is specifically configured to calculate the part-of-speech similarity between two texts according to the following formula:
where $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, $g_i$ denotes the part-of-speech weight of the i-th word in text A, $g'_i$ denotes the part-of-speech weight of the i-th word in text B, $n$ denotes the total number of words in the set formed by the words in text A and the words in text B, $L_A$ denotes the total number of words in text A, and $L_B$ denotes the total number of words in text B.
Further, the text similarity calculation module comprises:
a TF-IDF weight calculation unit, configured to calculate the TF-IDF weight of each word in each text according to the following formula:
where $W_{ij}$ denotes the TF-IDF weight of word j in text i, $tf_{ij}$ denotes the number of times word j occurs in text i, $N$ denotes the total number of texts in the text set, $n_j$ denotes the number of texts in the text set that contain word j, i is the text identifier, and j is the identifier of a word in the text;
a text similarity calculation unit, configured to calculate the text similarity between the two texts based on the TF-IDF weight of each word in the two texts.
Further, the text similarity calculation unit is specifically configured to:
calculate the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:
$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$
where $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B, $W_{ai}$ denotes the TF-IDF weight of the i-th word in text A, $W_{bi}$ denotes the TF-IDF weight of the i-th word in text B, and $n$ denotes the total number of words in the set formed by the words in text A and the words in text B.
Further, the comprehensive similarity calculation module is specifically configured to:
determine the comprehensive similarity between the two texts according to the following formula:
$$Sim(A, B) = Sim_{wordpro}(A, B) \times Sim_{tf\text{-}idf}(A, B)$$
where $Sim(A, B)$ denotes the comprehensive similarity between text A and text B, $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, and $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B.
Further, the apparatus further comprises a processing module, configured to perform word segmentation and part-of-speech tagging on the two texts before the part-of-speech similarity between the two texts is calculated based on preset part-of-speech weights, or before the text similarity between the two texts is calculated based on the improved TF-IDF algorithm.
Further, the processing module is specifically configured to perform word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.
In a third aspect, an embodiment of the invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the text similarity calculation method described in the first aspect.
In a fourth aspect, an embodiment of the invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the text similarity calculation method described in the first aspect.
The text similarity calculation method provided by the embodiments of the invention evaluates the similarity between texts by combining their part-of-speech similarity with their text similarity, which improves the accuracy of text similarity calculation and, in turn, the accuracy of similar-text matching.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from the content of the embodiments and these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text similarity calculation method provided by Embodiment one of the present invention;
Fig. 2 is a structural schematic diagram of a text similarity calculation apparatus provided by Embodiment two of the present invention;
Fig. 3 is a structural schematic diagram of an electronic device provided by Embodiment three of the present invention.
Detailed description of the embodiments
To make the technical problem solved, the technical solutions adopted and the technical effects achieved by the invention clearer, the technical solutions of the embodiments are described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment one
Fig. 1 is a schematic flowchart of a text similarity calculation method provided by Embodiment one of the present invention. The method disclosed in this embodiment is suitable for the field of machine dialogue, where the answer sentence with the highest semantic similarity to an input sentence is matched from a corpus so that the input sentence can be answered automatically. It is particularly suitable for matching, in a live-streaming room, the sentence most similar to a viewer's bullet comment so that a robot can reply to the comment automatically. The method can be executed by a text similarity calculation apparatus, which can be implemented in software and/or hardware and is typically integrated in a terminal, such as a server. Referring to Fig. 1, the method comprises the following steps:
110. Calculate the part-of-speech similarity between two texts based on preset part-of-speech weights.
The parts of speech include nouns, verbs, interrogatives, adjectives, adverbs and so on. The parts of speech of corresponding words in two texts reflect, to some extent, how similar the two texts are. Taking the part-of-speech similarity into account when calculating the similarity between two texts can therefore improve the accuracy of the calculation.
Illustratively, the part-of-speech similarity between two texts is calculated from the preset part-of-speech weights according to the following formula (1):
where $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, $g_i$ denotes the part-of-speech weight of the i-th word in text A, $g'_i$ denotes the part-of-speech weight of the i-th word in text B, $n$ denotes the total number of words in the set formed by the words in text A and the words in text B, $L_A$ denotes the total number of words in text A, and $L_B$ denotes the total number of words in text B. When i is greater than $L_A$, $g_i = 0$; when i is greater than $L_B$, $g'_i = 0$; the example below illustrates this. The denominator of formula (1) is constructed so that it cannot be zero, which widens the formula's range of applicability.
The preset part-of-speech weights can be derived for a specific business scenario by applying formula (1) to multiple groups of texts whose part-of-speech similarity is known and solving backwards for the weight of each part of speech. In general, the nouns and verbs in a sentence express most of its meaning, so based on business experience the weights for nouns and verbs can also simply be set relatively high and the weights for other parts of speech relatively low. Preferably, when the business scenario is bullet-comment text sent on a live-streaming platform, the noun weight can take a value between 0.7 and 0.8, the verb weight a value between 0.6 and 0.7, and the interrogative weight a value between 0.5 and 0.6. This embodiment is described with a noun weight of 0.7, a verb weight of 0.6, an interrogative weight of 0.5 and a weight of 0 for all other parts of speech.
Suppose text A is: I want to go to Beijing to attend university;
Text B is: Beijing's universities are really fun;
After word segmentation and part-of-speech tagging of text A and text B, we obtain:
A = I/n want-to-go/adv Beijing/n attend/v university/n
B = Beijing/n 's/adv university/n really/adj fun/adj
The set formed by the words in text A and the words in text B is {I, want-to-go, Beijing, attend, university, 's, really, fun}, so n = 8. The part-of-speech weights corresponding to the words in the set are U = {0.7, 0, 0.7, 0.6, 0.7, 0, 0, 0}. Padded with zeros to length n, the part-of-speech weights of the words in text A are $g_i$ = {0.7, 0, 0.7, 0.6, 0.7, 0, 0, 0} and those of the words in text B are $g'_i$ = {0.7, 0, 0.7, 0, 0, 0, 0, 0}. Text A contains five words and text B contains five words, so $L_A = 5$ and $L_B = 5$.
The part-of-speech similarity between text A and text B calculated from formula (1) is therefore $Sim_{wordpro}(A, B) = 0.458$.
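The weight vectors in this example can be reproduced with a minimal sketch. It assumes the texts are already segmented and tagged; the English tokens are glosses of the example words, the tag name "interrogative" is a placeholder chosen here, and the helper names are hypothetical. Formula (1) itself is not implemented.

```python
from typing import Dict, List, Tuple

# Part-of-speech weights from this embodiment's example. Tag names "n" and "v"
# follow the tagged example; "interrogative" is a hypothetical tag name.
POS_WEIGHTS: Dict[str, float] = {"n": 0.7, "v": 0.6, "interrogative": 0.5}

Tagged = List[Tuple[str, str]]  # [(word, pos_tag), ...]


def union_size(a: Tagged, b: Tagged) -> int:
    """n: number of distinct words in the set formed from both texts."""
    return len({word for word, _ in a} | {word for word, _ in b})


def padded_pos_weights(text: Tagged, n: int) -> List[float]:
    """Weight of the i-th word of the text, zero for i beyond the text length."""
    weights = [POS_WEIGHTS.get(tag, 0.0) for _, tag in text]
    return weights + [0.0] * (n - len(weights))


# English glosses of the example tokens (an assumption of this sketch).
text_a = [("I", "n"), ("want-to-go", "adv"), ("Beijing", "n"),
          ("attend", "v"), ("university", "n")]
text_b = [("Beijing", "n"), ("'s", "adv"), ("university", "n"),
          ("really", "adj"), ("fun", "adj")]

n = union_size(text_a, text_b)
g_a = padded_pos_weights(text_a, n)
g_b = padded_pos_weights(text_b, n)
print(n, g_a, g_b)
```

Unknown tags fall back to a weight of 0, which matches the embodiment's choice of 0 for all other parts of speech.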
120. Calculate the text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm.
Specifically, calculating the text similarity between the two texts based on the improved TF-IDF algorithm comprises:
calculating the TF-IDF weight of each word in each text according to the following formula (2):
where $W_{ij}$ denotes the TF-IDF weight of word j in text i, $tf_{ij}$ denotes the number of times word j occurs in text i, $N$ denotes the total number of texts in the text set, $n_j$ denotes the number of texts in the text set that contain word j, i is the text identifier, and j is the identifier of a word in the text;
calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts.
For a specific business scenario, a corpus relevant to that scenario can be prepared in advance. For example, suppose the scenario is calculating the text similarity between bullet comments sent in live-streaming room No. 1. Because each room streams different content, different rooms have different themes, so the domain words contained in their bullet comments also differ. Suppose streamer a of room No. 1 is especially good at games, in particular "Honor of Kings", and room No. 1 therefore frequently streams game video. The theme of room No. 1 can then be defined as the name of the game that is frequently streamed, such as "Honor of Kings", or as content related to the game, such as character names, equipment names or mission objectives; for a room that often streams "Honor of Kings", the theme could also be "Hua Mulan", "Diao Chan" or "Luna". The bullet comments sent in room No. 1 will inevitably contain many domain words related to the streamed game. All bullet comments sent in room No. 1 within a set time period can therefore be used as the corpus of domain words for this scenario, and training on this corpus with formula (2) yields the TF-IDF weight vector space for the scenario, that is, the vector space formed by the TF-IDF weights of the domain words. Each bullet comment to be matched in room No. 1 is then a point, or vector, in this TF-IDF weight vector space; mapping the comment into the space gives the TF-IDF weight of each word in the comment.
Formula (2) improves on the existing TF-IDF weight formula so that TF-IDF weights can also be calculated for new words in the text to be matched, which in turn allows texts containing new words to be matched by similarity. A new word is a word that is not contained in the TF-IDF corpus or TF-IDF dictionary, so the number of texts in the text set that contain it is $n_j = 0$. The existing TF-IDF weight formula cannot handle the case where a new word appears in the text to be matched.
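The new-word problem can be illustrated with a short sketch. The textbook weight tf × log(N / n_j) is undefined when n_j = 0. The smoothed variant below is one common remedy and is only an assumption of this sketch, not necessarily the patent's formula (2); the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(corpus: List[List[str]]) -> Counter:
    """n_j: number of texts in the corpus that contain each word j."""
    df: Counter = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each word once per text
    return df


def classic_weight(tf: int, n_texts: int, n_j: int) -> float:
    """Textbook tf * log(N / n_j); raises ZeroDivisionError when n_j == 0."""
    return tf * math.log(n_texts / n_j)


def smoothed_weight(tf: int, n_texts: int, n_j: int) -> float:
    """ASSUMED smoothing (not necessarily formula (2)):
    tf * log((N + 1) / (n_j + 1)) stays finite when n_j == 0."""
    return tf * math.log((n_texts + 1) / (n_j + 1))


def tfidf_weights(tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed TF-IDF weight of every word in one tokenized text."""
    df = document_frequencies(corpus)
    tf = Counter(tokens)
    return {word: smoothed_weight(count, len(corpus), df[word])
            for word, count in tf.items()}


# Toy corpus of tokenized bullet comments (hypothetical data).
corpus = [["nice", "play"], ["nice", "skin"], ["play", "again"]]
weights = tfidf_weights(["nice", "newword"], corpus)
print(weights)
```

Here "newword" never occurs in the corpus, yet it still receives a finite weight, whereas `classic_weight(1, 3, 0)` would fail with a division by zero.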
Further, calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts comprises:
calculating the text similarity between the two texts according to the following formula (3), the cosine similarity of their TF-IDF weight vectors:
$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}} \quad (3)$$
where $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B, $W_{ai}$ denotes the TF-IDF weight of the i-th word in text A, $W_{bi}$ denotes the TF-IDF weight of the i-th word in text B, and $n$ denotes the total number of words in the set formed by the words in text A and the words in text B.
In the TF-IDF algorithm, if a word or phrase occurs with high frequency tf in one text but rarely in the other texts of the text set, it is considered to have good class-discriminating ability and to be suitable for classification; it can serve as a keyword and is given a higher TF-IDF weight. The TF-IDF weight of a word therefore increases with its frequency in the text and with its rarity in the text set. Once the TF-IDF weight of each word in each text has been calculated with formula (2), each text can be represented as a real-valued vector of TF-IDF weights. The lengths of these vectors are then normalized so that the vectors for all texts have the same length, and finally the cosine similarity of the real-valued vectors of the two texts is calculated with formula (3); this cosine similarity is the text similarity between the two texts.
It should be noted that steps 110 and 120 need not be executed in any particular order: either may be executed first. This embodiment is described with step 110 executed first, but this does not limit the execution order of steps 110 and 120.
Further, before calculating the part-of-speech similarity between two texts based on preset part-of-speech weights, or before calculating the text similarity between the two texts based on the improved TF-IDF algorithm, the method further comprises:
performing word segmentation and part-of-speech tagging on the two texts, specifically using the jieba word segmentation tool in Python; this is not described in further detail in this embodiment.
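The preprocessing step might be wrapped as in the following sketch. It assumes the third-party jieba package is installed and uses its `jieba.posseg.cut` function, which yields word/flag pairs; the wrapper name `segment_and_tag` and the injectable `cut` parameter are hypothetical conveniences so that the function can also be exercised without jieba.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def segment_and_tag(
    text: str,
    cut: Optional[Callable[[str], Iterable]] = None,
) -> List[Tuple[str, str]]:
    """Return (word, part_of_speech_flag) pairs for one text.

    If no segmenter is supplied, jieba.posseg.cut is used; it is imported
    lazily so this module still loads where jieba is not installed.
    """
    if cut is None:
        import jieba.posseg as pseg  # third-party dependency (assumed installed)
        cut = pseg.cut
    # Each item unpacks into (word, flag), for jieba pairs and plain tuples alike.
    return [(word, flag) for word, flag in cut(text)]
```

Note that jieba uses its own tag set, so the mapping from its flags to the noun, verb and interrogative weights would need to be defined for the business scenario.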
130. Determine the comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity.
Illustratively, determining the comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity comprises:
determining the comprehensive similarity between the two texts according to the following formula:
$$Sim(A, B) = Sim_{wordpro}(A, B) \times Sim_{tf\text{-}idf}(A, B) \quad (4)$$
where $Sim(A, B)$ denotes the comprehensive similarity between text A and text B, $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, and $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B.
Continuing with the example above, suppose text A is: I want to go to Beijing to attend university;
Text B is: Beijing's universities are really fun;
After word segmentation and part-of-speech tagging of text A and text B, we obtain:
A = I/n want-to-go/adv Beijing/n attend/v university/n
B = Beijing/n 's/adv university/n really/adj fun/adj
The mappings of text A and text B in the TF-IDF vector space are, respectively:
$W_{ai}$ = {0.1, 0.2, 0.3, 0.1, 0.6, 0.1, 0.1, 0.1}
$W_{bi}$ = {0.1, 0.2, 0.5, 0.2, 0.6, 0.3, 0.4, 0.3}
The text similarity between text A and text B obtained from formula (3) is then $Sim_{tf\text{-}idf}(A, B) = 0.907$.
The text similarity $Sim_{tf\text{-}idf}(A, B) = \cos\theta$ between two texts takes values in [-1, 1]; the closer the calculated value is to 1, the higher the text similarity between the two texts, that is, the closer their meanings.
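The cosine similarity used for the text similarity can be checked against the example vectors with a minimal sketch. The function name is hypothetical, and returning 0 for an all-zero vector is an assumption of this sketch rather than something the embodiment specifies.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


# TF-IDF weight vectors of text A and text B from the example.
w_a = [0.1, 0.2, 0.3, 0.1, 0.6, 0.1, 0.1, 0.1]
w_b = [0.1, 0.2, 0.5, 0.2, 0.6, 0.3, 0.4, 0.3]

sim_tfidf = cosine_similarity(w_a, w_b)
print(round(sim_tfidf, 3))  # 0.907
```

The dot product is 0.68 and the squared norms are 0.54 and 1.04, giving 0.68 / sqrt(0.5616), which rounds to 0.907.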
The comprehensive similarity between text A and text B is then obtained from formula (4):
$$Sim(A, B) = Sim_{wordpro}(A, B) \times Sim_{tf\text{-}idf}(A, B) = 0.458 \times 0.907 = 0.415$$
The text similarity of 0.907 between text A and text B is very high. If the semantic similarity between text A and text B were judged by text similarity alone, the result would deviate considerably and accuracy would be poor. With the scheme of this embodiment, the comprehensive similarity between text A and text B is not very high, which indicates that their meanings are not very similar and is consistent with the actual situation. By evaluating the comprehensive similarity between two texts from both their part-of-speech similarity and their text similarity, this embodiment improves the accuracy of the semantic similarity calculated between two texts and, in turn, the accuracy of similar-text matching.
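The final combination step and its use for choosing a reply can be sketched as follows. The product follows formula (4); the `best_candidate` helper, the candidate names and the second candidate's scores are hypothetical and only illustrate how the comprehensive similarity could rank replies.

```python
from typing import List, Tuple


def comprehensive_similarity(sim_wordpro: float, sim_tfidf: float) -> float:
    """Formula (4): product of part-of-speech similarity and text similarity."""
    return sim_wordpro * sim_tfidf


def best_candidate(candidates: List[Tuple[str, float, float]]) -> str:
    """Return the candidate with the highest comprehensive similarity.

    Each candidate is (reply_text, sim_wordpro, sim_tfidf).
    """
    best = max(candidates, key=lambda c: comprehensive_similarity(c[1], c[2]))
    return best[0]


sim = comprehensive_similarity(0.458, 0.907)
print(round(sim, 3))  # 0.415

# Hypothetical candidates: the first uses the example's scores.
print(best_candidate([("reply_1", 0.458, 0.907), ("reply_2", 0.9, 0.6)]))
```

Because the two scores are multiplied, a candidate needs both a high part-of-speech similarity and a high text similarity to rank first.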
The text similarity calculation method provided by this embodiment calculates the part-of-speech similarity between two texts based on preset part-of-speech weights, calculates the text similarity between the two texts based on an improved TF-IDF algorithm, and determines the comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity. It thereby improves the accuracy of the semantic similarity calculated between two texts and, in turn, the accuracy of similar-text matching.
Embodiment two
Fig. 2 is a structural schematic diagram of a text similarity calculation apparatus provided by Embodiment two of the present invention. Referring to Fig. 2, the apparatus comprises a part-of-speech similarity calculation module 210, a text similarity calculation module 220 and a comprehensive similarity calculation module 230;
the part-of-speech similarity calculation module 210 is configured to calculate a part-of-speech similarity between two texts based on preset part-of-speech weights;
the text similarity calculation module 220 is configured to calculate a text similarity between the two texts based on an improved TF-IDF algorithm;
the comprehensive similarity calculation module 230 is configured to determine a comprehensive similarity between the two texts from the part-of-speech similarity and the text similarity.
Further, the part-of-speech similarity calculation module 210 is specifically configured to calculate the part-of-speech similarity between two texts according to the following formula:
where $Sim_{wordpro}(A, B)$ denotes the part-of-speech similarity between text A and text B, $g_i$ denotes the part-of-speech weight of the i-th word in text A, $g'_i$ denotes the part-of-speech weight of the i-th word in text B, $n$ denotes the total number of words in the set formed by the words in text A and the words in text B, $L_A$ denotes the total number of words in text A, and $L_B$ denotes the total number of words in text B.
Further, the text similarity calculation module 220 comprises:
a TF-IDF weight calculation unit, configured to calculate the TF-IDF weight of each word in each text according to the following formula:
where $W_{ij}$ denotes the TF-IDF weight of word j in text i, $tf_{ij}$ denotes the number of times word j occurs in text i, $N$ denotes the total number of texts in the text set, $n_j$ denotes the number of texts in the text set that contain word j, i is the text identifier, and j is the identifier of a word in the text;
a text similarity calculation unit, configured to calculate the text similarity between the two texts based on the TF-IDF weight of each word in the two texts.
Further, the text similarity calculation unit is specifically configured to:
calculate the text similarity between the two texts as the cosine similarity of their TF-IDF weight vectors:
$$Sim_{tf\text{-}idf}(A, B) = \frac{\sum_{i=1}^{n} W_{ai} W_{bi}}{\sqrt{\sum_{i=1}^{n} W_{ai}^{2}} \, \sqrt{\sum_{i=1}^{n} W_{bi}^{2}}}$$
where $Sim_{tf\text{-}idf}(A, B)$ denotes the text similarity between text A and text B, $W_{ai}$ denotes the TF-IDF weight of the i-th word in text A, $W_{bi}$ denotes the TF-IDF weight of the i-th word in text B, and $n$ denotes the total number of words in the set formed by the words in text A and the words in text B.
Further, the comprehensive similarity calculation module 230 is specifically configured to:
Determine the comprehensive similarity between the two texts according to the following formula:
Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)
Where Sim(A, B) denotes the comprehensive similarity between text A and text B, Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, and Sim_tf-idf(A, B) denotes the text similarity between text A and text B.
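As a small worked illustration of the product formula, the Python sketch below multiplies the two component scores. The function name and the sample values are hypothetical.

```python
def comprehensive_similarity(sim_wordpro, sim_tfidf):
    """Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)."""
    return sim_wordpro * sim_tfidf


# Example with made-up component scores: 0.8 * 0.5 gives 0.4.
example = comprehensive_similarity(0.8, 0.5)
```

Because the two scores are multiplied, a score of 0 in either component makes the comprehensive similarity 0, whatever the other component is.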
Further, the device also includes a processing module, configured to perform word segmentation and part-of-speech tagging on the two texts before the part-of-speech similarity between the two texts is calculated based on the preset part-of-speech weights and the text similarity between the two texts is calculated based on the improved TF-IDF algorithm.
Further, the processing module is specifically configured to perform word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.
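A minimal sketch of this preprocessing step follows. It relies on `jieba.posseg.cut`, which yields word/tag pairs that can be unpacked as `(word, flag)`; jieba is a third-party package and must be installed separately. The wrapper name `segment_and_tag` and its optional `tagger` parameter are hypothetical, added here so that another tagger can be substituted.

```python
def segment_and_tag(text, tagger=None):
    """Return a list of (word, part-of-speech tag) pairs for one text."""
    if tagger is None:
        # Imported lazily so the module loads even where jieba is absent.
        import jieba.posseg as pseg
        tagger = pseg.cut
    return [(word, flag) for word, flag in tagger(text)]


# Typical use, assuming jieba is installed:
#   tagged_a = segment_and_tag(text_a)
#   tagged_b = segment_and_tag(text_b)
```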
The text similarity calculation device provided in this embodiment calculates the part-of-speech similarity between two texts based on preset part-of-speech weights, calculates the text similarity between the two texts based on an improved TF-IDF algorithm, and determines the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity. These technical means improve the accuracy with which the semantic similarity between two texts is calculated, and in turn the accuracy with which similar texts are matched.
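The module structure of the device can be sketched in Python as a small class that holds modules 210 and 220 as pluggable functions and implements module 230 as their product. The class name, the field names and the type alias are hypothetical, and the two component functions are left to the caller because their formulas are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # one text as (word, part-of-speech tag) pairs


@dataclass
class TextSimilarityDevice:
    pos_similarity: Callable[[Tagged, Tagged], float]   # role of module 210
    text_similarity: Callable[[Tagged, Tagged], float]  # role of module 220

    def comprehensive_similarity(self, a: Tagged, b: Tagged) -> float:
        # Role of module 230: combine the two scores by multiplication.
        return self.pos_similarity(a, b) * self.text_similarity(a, b)
```

Keeping the two component calculations as injected functions lets either one be replaced without touching the combining step.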
Embodiment three
Fig. 3 is a schematic structural diagram of an electronic device provided by embodiment three of the present invention. As shown in Fig. 3, the electronic device includes a processor 670, a memory 671, and a computer program stored in the memory 671 and executable on the processor 670. There may be one or more processors 670; Fig. 3 takes one processor 670 as an example. When executing the computer program, the processor 670 implements the text similarity calculation method described in embodiment one above. As shown in Fig. 3, the electronic device may further include an input device 672 and an output device 673. The processor 670, the memory 671, the input device 672 and the output device 673 may be connected by a bus or in other ways; Fig. 3 takes connection by a bus as an example.
As a computer-readable storage medium, the memory 671 can be used to store software programs, computer-executable programs and modules, such as the text similarity calculation device/modules in the embodiments of the present invention (for example, the part-of-speech similarity calculation module 210, the text similarity calculation module 220 and the comprehensive similarity calculation module 230 in the text similarity calculation device). By running the software programs, instructions and modules stored in the memory 671, the processor 670 executes the various functional applications and data processing of the electronic device, thereby implementing the text similarity calculation method described above.
The memory 671 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created through use of the terminal, and so on. In addition, the memory 671 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 671 may further include memory located remotely from the processor 670, and such remote memory may be connected to the electronic device/storage medium through a network. Examples of such networks include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 672 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 673 may include a display device such as a display screen.
Embodiment four
Embodiment four of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a text similarity calculation method, the method comprising:
Calculating the part-of-speech similarity between two texts based on preset part-of-speech weights;
Calculating the text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;
Determining the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform operations related to the text similarity calculation provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course can also be implemented by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a storage medium, a network device, or the like) to execute the method described in each embodiment of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A text similarity calculation method, characterized by comprising:
Calculating the part-of-speech similarity between two texts based on preset part-of-speech weights;
Calculating the text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;
Determining the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.
2. The method according to claim 1, characterized in that calculating the part-of-speech similarity between two texts based on preset part-of-speech weights comprises:
Calculating the part-of-speech similarity between the two texts according to the following formula:
Where Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, g_i denotes the part-of-speech weight of the i-th word in text A, g_i' denotes the part-of-speech weight of the i-th word in text B, n denotes the total number of words in the set formed by the words in text A and the words in text B, L_A denotes the total number of words in text A, and L_B denotes the total number of words in text B.
3. The method according to claim 1, characterized in that calculating the text similarity between the two texts based on the improved TF-IDF algorithm comprises:
Calculating the TF-IDF weight of each word in each text according to the following formula:
Where W_ij denotes the TF-IDF weight of word j in text i, tf_ij denotes the number of times word j occurs in text i, N denotes the total number of texts in the text set, n_j denotes the number of texts in the text set that contain word j, i identifies the text, and j identifies the word within the text;
Calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts.
4. The method according to claim 3, characterized in that calculating the text similarity between the two texts based on the TF-IDF weight of each word in the two texts comprises:
Calculating the text similarity between the two texts according to the following formula:
Where Sim_tf-idf(A, B) denotes the text similarity between text A and text B, W_ai denotes the TF-IDF weight of the i-th word in text A, W_bi denotes the TF-IDF weight of the i-th word in text B, and n denotes the total number of words in the set formed by the words in text A and the words in text B.
5. The method according to any one of claims 1-4, characterized in that determining the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity comprises:
Determining the comprehensive similarity between the two texts according to the following formula:
Sim(A, B) = Sim_wordpro(A, B) * Sim_tf-idf(A, B)
Where Sim(A, B) denotes the comprehensive similarity between text A and text B, Sim_wordpro(A, B) denotes the part-of-speech similarity between text A and text B, and Sim_tf-idf(A, B) denotes the text similarity between text A and text B.
6. The method according to any one of claims 1-4, characterized in that, before calculating the part-of-speech similarity between two texts based on preset part-of-speech weights or calculating the text similarity between the two texts based on the improved TF-IDF algorithm, the method further comprises:
Performing word segmentation and part-of-speech tagging on the two texts.
7. The method according to claim 6, characterized in that performing word segmentation and part-of-speech tagging on the two texts comprises:
Performing word segmentation and part-of-speech tagging on the two texts using the jieba word segmentation tool in Python.
8. A text similarity calculation device, characterized in that the device comprises:
A part-of-speech similarity calculation module, configured to calculate the part-of-speech similarity between two texts based on preset part-of-speech weights;
A text similarity calculation module, configured to calculate the text similarity between the two texts based on an improved term frequency-inverse document frequency (TF-IDF) algorithm;
A comprehensive similarity calculation module, configured to determine the comprehensive similarity between the two texts according to the part-of-speech similarity and the text similarity.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the text similarity calculation method according to any one of claims 1-7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, implement the text similarity calculation method according to any one of claims 1-7.
CN201811067314.8A 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium Active CN109284490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811067314.8A CN109284490B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811067314.8A CN109284490B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109284490A true CN109284490A (en) 2019-01-29
CN109284490B CN109284490B (en) 2024-02-27

Family

ID=65180498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811067314.8A Active CN109284490B (en) 2018-09-13 2018-09-13 Text similarity calculation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109284490B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871540A (en) * 2019-02-21 2019-06-11 武汉斗鱼鱼乐网络科技有限公司 A kind of calculation method and relevant device of text similarity
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 A kind of operation method, system, server and the storage medium of the text similarity based on word coverage
CN110674363A (en) * 2019-08-30 2020-01-10 中国人民财产保险股份有限公司 Similarity matching method and device between interface services and electronic equipment
CN111160028A (en) * 2019-12-31 2020-05-15 东软集团股份有限公司 Method, device, storage medium and equipment for judging semantic similarity of two texts
CN111225227A (en) * 2020-01-03 2020-06-02 网易(杭州)网络有限公司 Bullet screen publishing method, bullet screen model generating method and bullet screen publishing device
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002054279A1 (en) * 2001-01-04 2002-07-11 Agency For Science, Technology And Research Improved method of text similarity measurement
CN101695082A (en) * 2009-09-30 2010-04-14 北京航空航天大学 Service organization method based on relation mining and device thereof
CN105677634A (en) * 2015-07-18 2016-06-15 孙维国 Method for extracting sentences with similar meanings and standard grammar from academic documents
JP5936698B2 (en) * 2012-08-27 2016-06-22 株式会社日立製作所 Word semantic relation extraction device
CN106776548A (en) * 2016-12-06 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of method and apparatus of the Similarity Measure of text
CN106934005A (en) * 2017-03-07 2017-07-07 重庆邮电大学 A kind of Text Clustering Method based on density

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002054279A1 (en) * 2001-01-04 2002-07-11 Agency For Science, Technology And Research Improved method of text similarity measurement
CN101695082A (en) * 2009-09-30 2010-04-14 北京航空航天大学 Service organization method based on relation mining and device thereof
JP5936698B2 (en) * 2012-08-27 2016-06-22 株式会社日立製作所 Word semantic relation extraction device
CN105677634A (en) * 2015-07-18 2016-06-15 孙维国 Method for extracting sentences with similar meanings and standard grammar from academic documents
CN106776548A (en) * 2016-12-06 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of method and apparatus of the Similarity Measure of text
CN106934005A (en) * 2017-03-07 2017-07-07 重庆邮电大学 A kind of Text Clustering Method based on density

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
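废柴社: "Computing text similarity using TF-IDF and the cosine formula", 《HTTPS://WWW.JIANSHU.COM/P/68B0B3126E8C》 *
Zhang Chao et al.: "A PST_LDA method for calculating Chinese text similarity", 《Application Research of Computers》 *
Chen Erjing et al.: "A survey of text similarity calculation methods", 《Data Analysis and Knowledge Discovery》 *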

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 A kind of operation method, system, server and the storage medium of the text similarity based on word coverage
CN109871540A (en) * 2019-02-21 2019-06-11 武汉斗鱼鱼乐网络科技有限公司 A kind of calculation method and relevant device of text similarity
CN109871540B (en) * 2019-02-21 2022-12-23 武汉斗鱼鱼乐网络科技有限公司 Text similarity calculation method and related equipment
CN110674363A (en) * 2019-08-30 2020-01-10 中国人民财产保险股份有限公司 Similarity matching method and device between interface services and electronic equipment
CN110674363B (en) * 2019-08-30 2022-04-22 中国人民财产保险股份有限公司 Similarity matching method and device between interface services and electronic equipment
CN111160028A (en) * 2019-12-31 2020-05-15 东软集团股份有限公司 Method, device, storage medium and equipment for judging semantic similarity of two texts
CN111160028B (en) * 2019-12-31 2023-05-16 东软集团股份有限公司 Method, device, storage medium and equipment for judging semantic similarity of two texts
CN111225227A (en) * 2020-01-03 2020-06-02 网易(杭州)网络有限公司 Bullet screen publishing method, bullet screen model generating method and bullet screen publishing device
CN113505196A (en) * 2021-06-30 2021-10-15 和美(深圳)信息技术股份有限公司 Part-of-speech-based text retrieval method and device, electronic equipment and storage medium
CN113505196B (en) * 2021-06-30 2024-01-30 和美(深圳)信息技术股份有限公司 Text retrieval method and device based on parts of speech, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109284490B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN109284490A (en) A kind of Text similarity computing method, apparatus, electronic equipment and storage medium
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
US10191892B2 (en) Method and apparatus for establishing sentence editing model, sentence editing method and apparatus
US8990065B2 (en) Automatic story summarization from clustered messages
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN108846138B (en) Question classification model construction method, device and medium fusing answer information
CN106407280B (en) Query target matching method and device
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
WO2021134524A1 (en) Data processing method, apparatus, electronic device, and storage medium
CN108241613A (en) A kind of method and apparatus for extracting keyword
CN108664465B (en) Method and related device for automatically generating text
CN110909120A (en) Resume searching/delivering method, device and system and electronic equipment
CN110895656B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN113342968A (en) Text abstract extraction method and device
CN110297897B (en) Question-answer processing method and related product
CN110430448B (en) Bullet screen processing method and device and electronic equipment
CN108268443B (en) Method and device for determining topic point transfer and acquiring reply text
CN109472032A (en) A kind of determination method, apparatus, server and the storage medium of entity relationship diagram
CN109255066B (en) Label marking method, device, server and storage medium for business object
KR102460595B1 (en) Method and apparatus for providing real-time chat service in game broadcasting
Parizi et al. Do Character-Level Neural Network Language Models Capture Knowledge of Multiword Expression Compositionality?
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium
CN110020429A (en) Method for recognizing semantics and equipment
KR102519955B1 (en) Apparatus and method for extracting of topic keyword
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231124

Address after: Room 205, Room 206, Room J1447, No. 1045 Tianyuan Road, Tianhe District, Guangzhou City, Guangdong Province, 510000

Applicant after: Guangzhou Caimeng Technology Co.,Ltd.

Address before: 11 / F, building B1, phase 4.1, software industry, No.1, Software Park East Road, Wuhan East Lake Development Zone, Wuhan City, Hubei Province, 430070

Applicant before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20240124

Address after: Room 503, 5th Floor, Building E, Runqinyuan, No. 62, Nanhu Road, Tianxin District, Changsha City, Hunan Province, 410000

Applicant after: Changsha Jinlv Network Technology Co.,Ltd.

Country or region after: China

Address before: Room 205, Room 206, Room J1447, No. 1045 Tianyuan Road, Tianhe District, Guangzhou City, Guangdong Province, 510000

Applicant before: Guangzhou Caimeng Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant