CN104462060A - Method and device for calculating text similarity and realizing search processing through computer - Google Patents

Info

Publication number
CN104462060A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410728432.4A
Other languages
Chinese (zh)
Other versions
CN104462060B (en)
Inventor
张军
吴先超
刘占一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410728432.4A priority Critical patent/CN104462060B/en
Publication of CN104462060A publication Critical patent/CN104462060A/en
Application granted granted Critical
Publication of CN104462060B publication Critical patent/CN104462060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a computer-implemented method and device for calculating text similarity and for search processing. The method comprises: acquiring a first text string and a second text string; decoding the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings; calculating a first semantic similarity value between each of the K translated text strings and the second text string; and calculating a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values. The method and device address long-distance dependency relations within sentences, so that the semantics of a search sentence are better expressed, the search sentence is better matched with web page titles, and the user obtains semantically matching search result entries, which improves the user's search experience.

Description

Computer-implemented method and device for calculating text similarity and for search processing
Technical field
The present invention relates to natural language processing technology, and in particular to a computer-implemented method and device for calculating text similarity and for search processing.
Background
In a search engine, in order to match the search word (or query) entered by a user against each field of a document (for example, the title or the content) as well as possible, matching is usually performed purely on the basis of words.
There are also methods that use a translation model. From a translation perspective, the title and the search word (for example, a query) are assumed to be written in different sub-languages, and semantic matching is achieved through phrase translation pairs, such as translating "effective" into "useful". However, this approach cannot solve the long-distance dependency problem in the target language and can only perform simple semantic matching. It therefore cannot truly capture and express the semantics of the search sentence, so the search sentence is matched to the wrong web page titles, which affects the display and ranking of search results and, in turn, the user experience. For example, the sentence "why did Guan Yu not kill Cao Cao back then" is matched to "why did Cao Cao not kill Guan Yu back then". In the original sentence (the query), "Guan Yu" is the subject and "Cao Cao" is the object; because the long-distance dependency problem is unresolved, the search sentence and the web page title are matched only word by word, and the actual dependency relations of the sentence are not reflected.
Summary of the invention
The object of the present invention is to provide a computer-implemented method and device for calculating text similarity and for search processing that better characterize non-local dependencies and resolve long-distance dependency relations, thereby achieving a better matching effect.
According to one aspect of the present invention, a computer-implemented method for calculating text similarity is provided, comprising: obtaining a first text string and a second text string; decoding the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings; calculating a first semantic similarity value between each of the K translated text strings and the second text string; and calculating a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values.
According to another aspect of the present invention, a search processing method is provided, comprising: receiving a search word; obtaining multiple search result entries according to the search word; calculating the semantic similarity value between the search word and the content title of each search result entry using the computer-implemented method for calculating text similarity described above; sorting the search result entries according to the calculated semantic similarity values; and sending the sorted search result entries.
According to a further aspect of the present invention, a device for calculating text similarity is provided, comprising: a text string acquiring unit, configured to obtain a first text string and a second text string; a text string decoding unit, configured to decode the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings; and a similarity value calculating unit, configured to calculate a first semantic similarity value between each of the K translated text strings and the second text string, and to calculate a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values.
According to a further aspect of the present invention, a search processing device is provided, comprising: a search word receiving unit, configured to receive a search word; a search result acquiring unit, configured to obtain multiple search result entries according to the search word; a semantic similarity value calculating unit, configured to calculate, by means of the device for calculating text similarity described above, the semantic similarity value between the search word and the content title of each search result entry; a sorting unit, configured to sort the search result entries according to the calculated semantic similarity values; and a sending unit, configured to send the sorted search result entries.
In the computer-implemented method and device for calculating text similarity and for search processing provided by the embodiments of the present invention, the first text string (such as a search keyword or query entered by a user) is decoded with a phrase translation model and a dependency structure model to obtain multiple translated text strings; a first semantic similarity value is calculated between each of these translated text strings and the second text string (such as the content title of a search result entry); and a second semantic similarity value between the first text string and the second text string is calculated from the calculated first semantic similarity values. This addresses the long-distance dependency problem in text strings and allows the similarity between text strings to be calculated comprehensively and accurately.
In search technology, performing the semantic similarity calculation described above between the search word and the content titles of the retrieved search result entries expresses the semantics of the search sentence better. The returned search results can then be sorted taking this similarity value and the first text string into account, so that the best search results are obtained for the user to view. In this way, the long-distance dependency problem in text strings is addressed, the search sentence is better matched with web page titles, semantically matching search result entries are provided to the user, and the user's search experience is improved.
Brief description of the drawings
Fig. 1 is a flowchart of a computer-implemented method for calculating text similarity according to an exemplary embodiment of the present invention.
Fig. 2 is an example diagram of the dependency relations of a sentence according to an exemplary embodiment of the present invention.
Fig. 3 is a flowchart of a search processing method according to an exemplary embodiment of the present invention.
Fig. 4 is a structural block diagram of a device for calculating text similarity according to an exemplary embodiment of the present invention.
Fig. 5 is a structural block diagram of a search processing device according to an exemplary embodiment of the present invention.
Detailed description of the embodiments
The basic idea of the present invention is to match semantic structure in information processing by adding a dependency structure model of the target language to the translation model. During text matching, the translation model and the dependency structure model are combined to decode a text string and produce the top K translated text strings. These translated text strings are then matched against the other text string to be compared, so that semantic structure is matched and semantic structural information is strengthened. Based on the calculated semantic similarity, web page titles that match the search sentence are pushed to the user.
When a conventional phrase translation model translates a search word into the top K titles, it can use an n-gram language model to check whether the translated titles conform to the linguistic regularities of the target language. In the present invention, a dependency structure model is additionally introduced in order to examine the dependency structure of the target language.
Specifically, the dependency relations of a sentence S = (w1, w2, ..., wn) describe modification relations between pairs of words: a dependency arc (wi, wj) indicates that wj modifies wi. In addition, so that chains of modification have a starting point, a special root node w0 is added, and an arc (w0, wi) denotes the initial relation.
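The arc notation above can be illustrated with a short sketch. The helper name `arcs_from_heads`, the `<root>` marker and the three-word example sentence are illustrative assumptions and do not come from the patent; each arc is stored as a (head, modifier) pair, following the convention that wj modifies wi in (wi, wj).

```python
ROOT = "<root>"  # stands in for the special root node w0 (the name is an assumption)


def arcs_from_heads(words, heads):
    """Build dependency arcs (wi, wj), each meaning "wj modifies wi".

    heads[k] is the 1-based position of the word that words[k] modifies,
    or 0 when words[k] attaches directly to the root node w0.
    """
    if len(words) != len(heads):
        raise ValueError("words and heads must have the same length")
    arcs = []
    for modifier, head_pos in zip(words, heads):
        head = ROOT if head_pos == 0 else words[head_pos - 1]
        arcs.append((head, modifier))
    return arcs


# Hypothetical sentence: "new" modifies "engineering", "engineering"
# modifies "starts", and "starts" attaches to the root.
example_arcs = arcs_from_heads(["new", "engineering", "starts"], [2, 3, 0])
```

Storing arcs as plain tuples keeps them hashable, which is convenient when arcs are later counted or compared.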
The dependency structure probability of a sentence S can be calculated by the following equation:
$$P(S) = \prod_{(w_i, w_j)} \frac{p(w_i, w_j)}{p(w_i)}$$
where p(wi, wj) is the probability of the dependency arc in which wj modifies wi, p(wi) is the probability that word wi occurs, and i and j are the positions at which the words occur in the sentence. Both p(wi, wj) and p(wi) can be obtained from statistics over a pre-stored dependency treebank or over large-scale data.
Fig. 2 is an example diagram of the dependency relations of a sentence according to an exemplary embodiment of the present invention. For example, given p(water conservancy) = 0.6 and p(water conservancy, engineering) = 0.5, the dependency structure probability of that arc can be calculated with the above equation.
By analogy, the dependency structure probability between each pair of related words in the sentence can be calculated in the same way, and multiplying these probabilities together gives the dependency structure probability of the sentence.
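As a worked illustration of this calculation, the sketch below multiplies one factor per dependency arc. It assumes the equation is read as a product of ratios p(wi, wj) / p(wi), which is an interpretation of the flattened formula in the source; the function name `dependency_probability` and the dictionary layout are likewise assumptions. Only the two probabilities 0.6 and 0.5 are taken from the text.

```python
from math import prod


def dependency_probability(arcs, arc_prob, word_prob):
    """Product over arcs (wi, wj) of p(wi, wj) / p(wi).

    arc_prob maps (wi, wj) to the arc probability, word_prob maps wi to
    the word probability. Assumed reading of the patent's equation.
    """
    return prod(arc_prob[(wi, wj)] / word_prob[wi] for wi, wj in arcs)


# The two probabilities quoted in the text.
word_prob = {"water conservancy": 0.6}
arc_prob = {("water conservancy", "engineering"): 0.5}

single_arc = dependency_probability(
    [("water conservancy", "engineering")], arc_prob, word_prob
)
print(round(single_arc, 4))  # 0.5 / 0.6, printed as 0.8333
```

For a full sentence, the same function would be called with every arc of the sentence, provided statistics for all of them are available.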
Based on the calculation of the dependency structure probability described above, the dependency structure model is trained on a large number of dependency trees. The training of the dependency structure model is not the improvement made by the present invention and is therefore not described in detail here.
The computer-implemented method and device for calculating text similarity and for search processing according to exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a computer-implemented method for calculating text similarity according to an exemplary embodiment of the present invention.
With reference to Fig. 1, in step S110, a first text string and a second text string are obtained. The first text string may be a search sentence (or query) entered by a user, and the second text string may be the web page title of a search result entry retrieved with that search sentence.
In step S120, the first text string is decoded according to a preset phrase translation model and a dependency structure model to obtain K translated text strings.
In natural language processing, statistical machine translation is one of the main machine translation methods. Its basic idea is to treat machine translation as an information transmission process and to decode the translation through a channel model. According to a preferred embodiment of the present invention, the first text string is decoded by a beam search decoder to obtain the top K translated text strings.
Specifically, in step S120, the translated text strings corresponding to the first text string can be computed with the phrase translation model, and the dependency structure between distant words can then be determined with the dependency structure model, so as to determine whether the first text string and a translated text string are semantically similar.
Preferably, the first text string is decoded according to the phrase translation model, the dependency structure model, an n-gram language model and a word-order distortion model to obtain the top K translated text strings. In natural language processing, the word-order distortion model examines the positional relation between a source-language phrase and the corresponding target-language phrase, and the conventional n-gram language model examines the probability that a sentence occurs. By scoring every candidate text string with the phrase translation model, the dependency structure model, the n-gram language model and the word-order distortion model, semantically closer top K translated text strings can be produced.
Preferably, a composite score Score(T) is calculated for any candidate text string T by the following formula:
$$\mathrm{Score}(T) = \lambda_1\,\mathrm{LM}(T) + \lambda_2\,\mathrm{TM}(Q, T) + \lambda_3\,D(Q, T) + \lambda_4\,\mathrm{DEP}(T)$$
where LM(T) is the score given to the translated text string T by the n-gram language model, TM(Q, T) is the probability score, under the phrase translation model, of translating the first text string Q into T, D(Q, T) is the score, under the word-order distortion model, of translating Q into T, DEP(T) is the score given to T by the dependency structure model, and λ1 to λ4 are the weights assigned to the scores of these four models. The K translated text strings are then selected from the candidate text strings according to the composite score.
Specifically, the beam search decoder sorts the candidate translated text strings by the composite score Score(T) and filters out the K highest-scoring (top-K) translated text strings (TOP1, TOP2, TOP3, ..., TOPK). For example, if the first text string is "hard", the translated text strings produced by the beam search decoder include "hard", "firm", "firm", "hard", "hard" and "solid"; similarly, if the first text string is "peach", the translated text strings may include "peach", "carambola", "honey peach", "honey peach" and "peach". The beam search decoder then filters out the K highest-scoring translated text strings from these according to their composite scores.
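The scoring and selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder itself: it covers only the final ranking by composite score, and the `Candidate` class, its field names and the functions `composite_score` and `top_k` are hypothetical. The four model scores are taken as given inputs, since the source does not specify how each model is implemented.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    lm: float   # LM(T): n-gram language model score
    tm: float   # TM(Q, T): phrase translation model score
    d: float    # D(Q, T): word-order distortion model score
    dep: float  # DEP(T): dependency structure model score


def composite_score(c, weights):
    """Score(T) as the weighted sum of the four model scores."""
    l1, l2, l3, l4 = weights
    return l1 * c.lm + l2 * c.tm + l3 * c.d + l4 * c.dep


def top_k(candidates, weights, k):
    """Return the k candidates with the highest composite score, best first."""
    return heapq.nlargest(k, candidates, key=lambda c: composite_score(c, weights))
```

`heapq.nlargest` avoids sorting the full candidate list when K is much smaller than the number of candidates; a plain `sorted(...)[:k]` would give the same result.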
In step S130, a first semantic similarity value is calculated between each of the K translated text strings and the second text string, and a second semantic similarity value between the first text string and the second text string is calculated from the K calculated first semantic similarity values.
Specifically, the calculation of the first semantic similarity values between the K translated text strings and the second text string preferably comprises the following.
First, at least one second dependency arc obtained by performing dependency analysis on the second text string is acquired. Because the second dependency arcs obtained from the second text string are used multiple times, they can be kept in a cache after the dependency analysis and reused, so that the dependency analysis does not have to be repeated each time.
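The caching of the second dependency arcs could be sketched with the standard library's `functools.lru_cache`. The parser here is a deliberately trivial, hypothetical stand-in (`parse_dependencies` with a toy attachment rule); a real system would call an actual dependency parser, which the source does not name.

```python
from functools import lru_cache


def parse_dependencies(text):
    """Hypothetical stand-in for a real dependency parser.

    Toy rule: every word modifies the word after it, and the last word
    attaches to the root. Replace with an actual parser in practice.
    """
    words = text.split()
    arcs = [(words[k + 1], words[k]) for k in range(len(words) - 1)]
    if words:
        arcs.append(("<root>", words[-1]))
    return tuple(arcs)  # immutable, so the cached value is safe to share


@lru_cache(maxsize=1024)
def second_arcs(title):
    """Dependency arcs of the second text string, parsed once per distinct title."""
    return parse_dependencies(title)
```

With this arrangement, comparing K translated text strings against the same title triggers one parse of the title rather than K.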
Then, for each translated text string, the following processing is performed: dependency analysis is performed on the translated text string to obtain at least one first dependency arc, and the first semantic similarity value between that translated text string and the second text string is calculated from the at least one first dependency arc and the at least one second dependency arc.
Preferably, the cosine similarity between the at least one first dependency arc and the at least one second dependency arc is calculated as the first semantic similarity value between that translated text string and the second text string.
For example, K translated text strings are obtained in step S120, the first dependency arcs and the second dependency arcs are obtained in step S130, and the cosine of the first and second dependency arcs is computed, yielding K first semantic similarity values. If the sets of dependency arcs of a translated text string t and of the second text string w are written as arcs(t) = {(t0, ti), (ti, tj), ...} and arcs(w) = {(w0, wi), (wi, wj), ...}, the cosine similarity (that is, the first semantic similarity value) Similarity(t, w) between t and w is calculated by the following equation:
$$\mathrm{Similarity}(t, w) = \cos\bigl(\mathrm{arcs}(t), \mathrm{arcs}(w)\bigr) = \frac{\sum_{(w_i, w_j) = (t_i, t_j)} \mathrm{numbersof}(w_i, w_j) \cdot \mathrm{numbersof}(t_i, t_j)}{\sqrt{\sum_{(w_i, w_j) \in \mathrm{arcs}(w)} \mathrm{numbersof}(w_i, w_j)^2} \times \sqrt{\sum_{(t_i, t_j) \in \mathrm{arcs}(t)} \mathrm{numbersof}(t_i, t_j)^2}}$$
where numbersof(wi, wj) and numbersof(ti, tj) are the numbers of occurrences of the dependency arcs (wi, wj) and (ti, tj), respectively.
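A minimal sketch of this cosine calculation is given below, treating each arc list as a bag of arcs counted with `collections.Counter`. It assumes the standard cosine form with square roots in the denominator; the function name `arc_cosine` and the convention of returning 0.0 for an empty arc list are choices made for the example.

```python
from collections import Counter
from math import sqrt


def arc_cosine(arcs_t, arcs_w):
    """Cosine similarity between two bags of dependency arcs.

    Counter plays the role of numbersof(...): it maps each arc to the
    number of times it occurs in the corresponding text string.
    """
    count_t, count_w = Counter(arcs_t), Counter(arcs_w)
    # Only arcs present in both strings contribute to the numerator.
    dot = sum(n * count_w[arc] for arc, n in count_t.items())
    norm = sqrt(sum(n * n for n in count_t.values())) * sqrt(
        sum(n * n for n in count_w.values())
    )
    return dot / norm if norm else 0.0


# Two hypothetical arc lists that share one of their two arcs.
shared_one = arc_cosine(
    [("<root>", "a"), ("a", "b")],
    [("<root>", "a"), ("a", "c")],
)
```

Because arcs are compared as whole (head, modifier) pairs, two strings that contain the same words in different dependency relations score lower than a pure word-overlap measure would give them.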
Preferably, the score given to each translated text string by the dependency structure model is used as a weight, and the K first semantic similarity values are combined in a weighted sum to obtain the second semantic similarity value between the first text string and the second text string.
For example, the second semantic similarity value is calculated by the following equation:
$$\mathrm{Similarity}(Q, w) = \frac{1}{K} \sum_{i=1}^{K} \mathrm{DEP}(t_i) \times \mathrm{Similarity}(t_i, w)$$
where t_i is the i-th translated text string, DEP(t_i) is the score given to t_i by the dependency structure model, which can be obtained by the dependency structure probability calculation described above, and K is the number of translated text strings.
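The weighted combination can be written in a few lines. In this sketch the function name `second_similarity` and the choice to return 0.0 when there are no translated text strings are assumptions; the inputs are the K dependency structure scores and the K first semantic similarity values in matching order.

```python
def second_similarity(dep_scores, first_similarities):
    """(1 / K) * sum of DEP(t_i) * Similarity(t_i, w) over the K translations."""
    if len(dep_scores) != len(first_similarities):
        raise ValueError("need one DEP score per first similarity value")
    k = len(dep_scores)
    if k == 0:
        return 0.0  # assumed convention when no translation is available
    return sum(d * s for d, s in zip(dep_scores, first_similarities)) / k


# Hypothetical values for K = 2 translated text strings.
combined = second_similarity([0.5, 1.0], [0.4, 0.2])
```

Translations that the dependency structure model considers more plausible therefore contribute more to the final similarity.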
Fig. 3 is a flowchart of a search processing method according to an exemplary embodiment of the present invention.
With reference to Fig. 3, in step S210, a search word is received, that is, the search keyword entered by a user in a search engine.
In step S220, multiple search result entries are obtained according to the search word. For example, if the search keyword received from the user in step S210 is "baby fever", the search result entries obtained may be "baby's cat fever", "baby's fever", "neonate has a high fever", "child's fever" or "baby's heating".
In step S230, the semantic similarity value between the search word and the content title of each search result entry is calculated according to the method for calculating text similarity described above.
In step S240, the search result entries are sorted according to the semantic similarity values calculated in step S230.
Continuing with the "baby fever" example, suppose the semantic similarity values (denoted Similarity) calculated in step S230 are Similarity(baby's cat fever, baby fever) = 0.87, Similarity(baby's fever, baby fever) = 0.71, Similarity(neonate has a high fever, baby fever) = 0.83, Similarity(child's fever, baby fever) = 0.65 and Similarity(baby's heating, baby fever) = 0.79. Sorting the similarity values from largest to smallest gives: Similarity(baby's cat fever, baby fever), Similarity(neonate has a high fever, baby fever), Similarity(baby's heating, baby fever), Similarity(baby's fever, baby fever), Similarity(child's fever, baby fever).
In step S250, the sorted search result entries are sent. For the "baby fever" example, the search result entries finally sent are, in order: "baby's cat fever", "neonate has a high fever", "baby's heating", "baby's fever" and "child's fever".
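The sorting in steps S240 and S250 can be reproduced with the similarity values quoted in the example. The function name `rank_results` and the dictionary layout are assumptions made for this sketch; the titles and values are those given in the text.

```python
def rank_results(similarities):
    """Return titles ordered by similarity to the search word, largest first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked]


# Similarity to the search word "baby fever", as quoted in the example.
scores = {
    "baby's cat fever": 0.87,
    "baby's fever": 0.71,
    "neonate has a high fever": 0.83,
    "child's fever": 0.65,
    "baby's heating": 0.79,
}

ordered_titles = rank_results(scores)
```

`sorted` is stable, so titles with equal similarity values keep the order in which they were retrieved.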
In the computer-implemented method for calculating text similarity and for search processing provided by the embodiments of the present invention, the first text string (such as a search keyword or query entered by a user) is decoded with a phrase translation model and a dependency structure model to obtain multiple translated text strings; a first semantic similarity value is calculated between each of these translated text strings and the second text string (such as the content title of a search result entry); and a second semantic similarity value between the first text string and the second text string is calculated from the calculated first semantic similarity values. This addresses the long-distance dependency problem in text strings and allows the similarity between text strings to be calculated comprehensively and accurately.
In search technology, performing the semantic similarity calculation described above between the search word and the content titles of the retrieved search result entries expresses the semantics of the search sentence better. The returned search results can then be sorted taking this similarity value and the first text string into account, so that the best search results are obtained for the user to view. In this way, the long-distance dependency problem in text strings is addressed, the search sentence is better matched with web page titles, semantically matching search result entries are provided to the user, and the user's search experience is improved.
Fig. 4 is a structural block diagram of a device for calculating text similarity according to an exemplary embodiment of the present invention.
As shown in Fig. 4, the device for calculating text similarity comprises a text string acquiring unit 310, a text string decoding unit 320 and a similarity value calculating unit 330.
The text string acquiring unit 310 is configured to obtain a first text string and a second text string.
For example, the first text string may be a search sentence entered by a user, and the second text string may be the web page title of a pre-stored document to be matched.
The text string decoding unit 320 is configured to decode the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings.
Preferably, the text string decoding unit 320 decodes the first text string according to the phrase translation model, the dependency structure model, an n-gram language model and a word-order distortion model to obtain the K translated text strings, the first text string being decoded by a beam search decoder.
Preferably, the text string decoding unit 320 calculates a composite score Score(T) for any candidate text string T by the following formula:
$$\mathrm{Score}(T) = \lambda_1\,\mathrm{LM}(T) + \lambda_2\,\mathrm{TM}(Q, T) + \lambda_3\,D(Q, T) + \lambda_4\,\mathrm{DEP}(T)$$
where LM(T) is the score given to the translated text string T by the n-gram language model, TM(Q, T) is the probability score, under the phrase translation model, of translating the first text string Q into T, D(Q, T) is the score, under the word-order distortion model, of translating Q into T, DEP(T) is the score given to T by the dependency structure model, and λ1 to λ4 are the weights assigned to the scores of these four models. The K translated text strings are selected from the candidate text strings according to the composite score.
The similarity value calculating unit 330 is configured to calculate a first semantic similarity value between each of the K translated text strings and the second text string, and to calculate a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values.
Preferably, the similarity value calculating unit 330 acquires at least one second dependency arc obtained by performing dependency analysis on the second text string and, for each translated text string, performs the following processing: dependency analysis is performed on the translated text string to obtain at least one first dependency arc, and the first semantic similarity value between that translated text string and the second text string is calculated from the at least one first dependency arc and the at least one second dependency arc.
Preferably, the similarity value calculating unit 330 calculates the cosine similarity between the at least one first dependency arc and the at least one second dependency arc as the first semantic similarity value between that translated text string and the second text string.
Preferably, the similarity value calculating unit 330 uses the score given to each translated text string by the dependency structure model as a weight and combines the K first semantic similarity values in a weighted sum to obtain the second semantic similarity value between the first text string and the second text string.
Fig. 5 is a structural block diagram of a search processing device according to an exemplary embodiment of the present invention.
With reference to Fig. 5, the search processing device comprises a search word receiving unit 410, a search result acquiring unit 420, a semantic similarity value calculating unit 430, a sorting unit 440 and a sending unit 450.
The search word receiving unit 410 is configured to receive a search word, that is, the search keyword entered by a user in a search engine.
The search result acquiring unit 420 is configured to obtain multiple search result entries according to the search word received by the search word receiving unit 410.
The semantic similarity value calculating unit 430 is configured to calculate, by means of the device for calculating text similarity described above, the semantic similarity value between the search word and the content title of each search result entry.
The sorting unit 440 is configured to sort the search result entries according to the calculated semantic similarity values.
The sending unit 450 is configured to send the sorted search result entries.
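One way to picture how these five units cooperate is a small class whose collaborators are injected. This is only a structural sketch under assumptions: the class name `SearchProcessor`, the method `handle` and the two callables are hypothetical, and the similarity function is left abstract rather than tied to any particular implementation of the text similarity device.

```python
from typing import Callable, Iterable, List


class SearchProcessor:
    """Chains the roles of units 410 to 450; retrieval and similarity are injected."""

    def __init__(
        self,
        retrieve: Callable[[str], Iterable[str]],
        similarity: Callable[[str, str], float],
    ):
        self.retrieve = retrieve      # role of search result acquiring unit 420
        self.similarity = similarity  # role of semantic similarity value calculating unit 430

    def handle(self, search_word: str) -> List[str]:
        """Receive a search word (410) and return titles in the order to be sent (450)."""
        titles = list(self.retrieve(search_word))
        scored = [(self.similarity(search_word, title), title) for title in titles]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # role of sorting unit 440
        return [title for _, title in scored]
```

Injecting the similarity function keeps the ranking logic independent of how the first and second semantic similarity values are actually computed.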
In the computer-implemented device for calculating text similarity and for search processing provided by the embodiments of the present invention, the first text string (such as a search keyword or query entered by a user) is decoded with a phrase translation model and a dependency structure model to obtain multiple translated text strings; a first semantic similarity value is calculated between each of these translated text strings and the second text string (such as the content title of a search result entry); and a second semantic similarity value between the first text string and the second text string is calculated from the calculated first semantic similarity values. This addresses the long-distance dependency problem in text strings and allows the similarity between text strings to be calculated comprehensively and accurately.
In search technology, performing the semantic similarity calculation described above between the search word and the content titles of the retrieved search result entries expresses the semantics of the search sentence better. The returned search results can then be sorted taking this similarity value and the first text string into account, so that the best search results are obtained for the user to view. In this way, the long-distance dependency problem in text strings is addressed, the search sentence is better matched with web page titles, semantically matching search result entries are provided to the user, and the user's search experience is improved.
It should be noted that, depending on implementation needs, each step described in this application can be split into more steps, and two or more steps or partial operations of steps can be combined into a new step, in order to achieve the object of the present invention.
The method according to the present invention described above can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network and stored in a local recording medium, so that the method described here can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It will be understood that a computer, processor, microprocessor controller or programmable hardware includes a storage component (for example, RAM, ROM or flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor or hardware, the processing method described here is implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown here, execution of the code converts the general-purpose computer into a special-purpose computer for performing that processing.
The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (16)

1. A computer-implemented method for calculating text similarity, characterized in that the method comprises:
obtaining a first text string and a second text string;
decoding the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings;
calculating a first semantic similarity value between each of the K translated text strings and the second text string, and calculating a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values.
2. The method according to claim 1, characterized in that calculating the first semantic similarity value between each of the K translated text strings and the second text string comprises:
obtaining at least one second dependency arc produced by dependency parsing of the second text string, and, for each translated text string, performing the following processing:
performing dependency parsing on the translated text string to obtain at least one first dependency arc,
calculating the first semantic similarity value between that translated text string and the second text string based on the at least one first dependency arc and the at least one second dependency arc.
3. The method according to claim 2, characterized in that calculating the first semantic similarity value between that translated text string and the second text string based on the at least one first dependency arc and the at least one second dependency arc comprises:
calculating the cosine similarity between the at least one first dependency arc and the at least one second dependency arc, and using it as the first semantic similarity value between that translated text string and the second text string.
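The cosine similarity over dependency arcs in claim 3 can be illustrated with a minimal sketch. The claims do not say how arcs are turned into vectors, so the bag-of-arcs representation below (each distinct arc is one dimension, its count is the coordinate) is an assumption, as are the function name `arc_cosine` and the `(head, dependent)` tuple encoding of an arc.

```python
from collections import Counter
from math import sqrt


def arc_cosine(arcs_a, arcs_b):
    """Cosine similarity between two bags of dependency arcs.

    Each arc is a hashable value, here a hypothetical (head, dependent) tuple.
    Returns 0.0 when either side has no arcs.
    """
    va, vb = Counter(arcs_a), Counter(arcs_b)
    # Dot product over the arcs that occur on both sides.
    dot = sum(va[arc] * vb[arc] for arc in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative arcs only; a real system would obtain them from a dependency parser.
translated_arcs = [("buy", "ticket"), ("ticket", "train")]
title_arcs = [("buy", "ticket"), ("ticket", "cheap")]
similarity = arc_cosine(translated_arcs, title_arcs)  # one shared arc out of two per side
```

Because whole arcs are compared rather than single words, two strings score highly only when they share head-dependent relations, which is how the measure can reflect long-distance dependencies.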
4. The method according to any one of claims 1 to 3, characterized in that calculating the second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values comprises:
using the score that the dependency structure model assigns to each translated text string as a weight, computing a weighted sum of the K first semantic similarity values to obtain the second semantic similarity value between the first text string and the second text string.
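The weighted summation of claim 4 can be sketched as follows. The function name `combined_similarity` and the sample numbers are hypothetical; the claim does not state whether the dependency structure model scores are normalized before being used as weights, so this sketch uses them as given.

```python
def combined_similarity(dep_scores, first_similarities):
    """Weighted sum of the K first semantic similarity values.

    dep_scores[i] is the dependency structure model score of the i-th
    translated text string, used directly as its weight (an assumption:
    no normalization is applied here).
    """
    if len(dep_scores) != len(first_similarities):
        raise ValueError("expected one weight per similarity value")
    return sum(w * s for w, s in zip(dep_scores, first_similarities))


# Illustrative values for K = 3 translated text strings.
dep_scores = [0.5, 0.3, 0.2]
first_similarities = [0.8, 0.5, 0.1]
second_similarity = combined_similarity(dep_scores, first_similarities)
```

If the weights are scaled to sum to 1, the result stays in the same range as the individual similarity values, which may be convenient when the value is later used for ranking.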
5. The method according to claim 4, characterized in that decoding the first text string according to the preset phrase translation model and the dependency structure model to obtain the K translated text strings comprises:
decoding the first text string according to the phrase translation model, the dependency structure model, an n-gram language model, and a word-order distortion model to obtain the K translated text strings.
6. The method according to claim 5, characterized in that, in decoding the first text string Q according to the phrase translation model, the dependency structure model, the n-gram language model, and the word-order distortion model to obtain the K translated text strings T,
a comprehensive score Score(T) is calculated for any candidate text string T by the following formula:
$$\mathrm{Score}(T) = \lambda_1\,\mathrm{LM}(T) + \lambda_2\,\mathrm{TM}(Q,T) + \lambda_3\,D(Q,T) + \lambda_4\,\mathrm{DEP}(T)$$
where LM(T) is the score of the translated text string T under the n-gram language model, TM(Q, T) is the probability score, under the phrase translation model, of translating the first text string Q into the translated text string T, D(Q, T) is the score, calculated under the word-order distortion model, of translating the first text string Q into the translated text string T, DEP(T) is the score of the translated text string T under the dependency structure model, and $\lambda_1$ to $\lambda_4$ are the weights assigned to the scores of these four models, respectively;
the K translated text strings are selected from among the candidate text strings by the comprehensive score.
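As a minimal sketch of the comprehensive score Score(T) and the selection of the K best candidates, the four model scores can be treated as precomputed numbers attached to each candidate. The `Candidate` class, the function names, and the sample values are hypothetical, and selecting the K highest scores assumes that a larger Score(T) is better.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    lm: float   # LM(T): language model score
    tm: float   # TM(Q, T): phrase translation model score
    d: float    # D(Q, T): word-order model score
    dep: float  # DEP(T): dependency structure model score


def comprehensive_score(c, weights):
    """Linear combination of the four model scores with weights lambda_1..lambda_4."""
    l1, l2, l3, l4 = weights
    return l1 * c.lm + l2 * c.tm + l3 * c.d + l4 * c.dep


def top_k(candidates, weights, k):
    """Return the k candidates with the highest comprehensive score, best first."""
    return heapq.nlargest(k, candidates, key=lambda c: comprehensive_score(c, weights))


# Illustrative log-domain scores; real values would come from trained models.
candidates = [
    Candidate("candidate a", -1.0, -1.0, 0.0, -1.0),
    Candidate("candidate b", -2.0, -2.0, 0.0, -2.0),
    Candidate("candidate c", -0.5, -0.5, 0.0, -0.5),
]
best_two = top_k(candidates, (1.0, 1.0, 1.0, 1.0), 2)
```

The weights are passed in as a tuple so that they can be tuned separately from the models that produce the individual scores.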
7. The method according to claim 6, characterized in that the first text string is decoded by a beam search decoder to obtain the K translated text strings.
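Assuming the decoder of claim 7 is a beam search decoder, the idea can be sketched with a deliberately simplified toy: translation is word by word and monotone, and the only score is a per-option log-probability taken from a hypothetical `phrase_table` dictionary. The language, word-order, and dependency structure models of the claimed method are left out, so this illustrates beam pruning and K-best output only.

```python
import heapq


def beam_search(source_tokens, phrase_table, beam_width, k):
    """Toy monotone beam search that returns up to k best (text, score) pairs.

    phrase_table maps a source token to a list of (target, log_prob) options.
    Tokens missing from the table are copied through with log_prob 0.0.
    """
    beam = [(0.0, [])]  # each hypothesis is (accumulated score, target words so far)
    for token in source_tokens:
        options = phrase_table.get(token, [(token, 0.0)])
        expanded = [
            (score + log_prob, words + [target])
            for score, words in beam
            for target, log_prob in options
        ]
        # Pruning step: keep only the beam_width best partial hypotheses.
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[0])
    best = heapq.nlargest(k, beam, key=lambda h: h[0])
    return [(" ".join(words), score) for score, words in best]


# Illustrative table; a real phrase table is learned from aligned text pairs.
phrase_table = {
    "a": [("x", -0.1), ("y", -1.0)],
    "b": [("p", -0.2), ("q", -0.5)],
}
k_best = beam_search(["a", "b"], phrase_table, beam_width=2, k=2)
```

Since at most `beam_width` hypotheses survive each step, the decoder can return at most `beam_width` results, so `k` should not exceed `beam_width`.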
8. A search processing method, characterized by comprising:
receiving a search term;
obtaining a plurality of search result entries according to the search term;
calculating, according to the method of any one of claims 1 to 7, a semantic similarity value between the search term and the content title of each of the plurality of search result entries;
sorting the plurality of search result entries according to the calculated semantic similarity values;
sending the sorted search result entries.
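The sorting step of claim 8 can be sketched as a re-ranking function that takes the similarity measure as a parameter. The function `rerank`, the dictionary layout of a search result entry, and `toy_similarity` (a plain word-overlap ratio standing in for the claimed semantic similarity value) are all assumptions made for illustration.

```python
def rerank(search_term, entries, similarity):
    """Sort search result entries by similarity between the term and each title.

    entries is a list of dicts with a "title" key; similarity is any
    callable (term, title) -> float where a higher value means closer.
    """
    return sorted(
        entries,
        key=lambda entry: similarity(search_term, entry["title"]),
        reverse=True,
    )


def toy_similarity(a, b):
    """Stand-in measure: Jaccard overlap of lowercased words (not the claimed method)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Illustrative entries; a real system would obtain them from a retrieval backend.
entries = [
    {"title": "Weather forecast today"},
    {"title": "Cheap train tickets online"},
    {"title": "Train timetable"},
]
ranked = rerank("cheap train tickets", entries, toy_similarity)
```

Passing the similarity function as an argument keeps the ranking logic independent of how the semantic similarity value is actually computed.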
9. A device for calculating text similarity, characterized in that the device comprises:
a text string acquiring unit, configured to obtain a first text string and a second text string;
a text string decoding unit, configured to decode the first text string according to a preset phrase translation model and a dependency structure model to obtain K translated text strings;
a similarity value computing unit, configured to calculate a first semantic similarity value between each of the K translated text strings and the second text string, and to calculate a second semantic similarity value between the first text string and the second text string from the K calculated first semantic similarity values.
10. The device according to claim 9, characterized in that the similarity value computing unit obtains at least one second dependency arc produced by dependency parsing of the second text string,
and, for each translated text string, performs the following processing:
performing dependency parsing on the translated text string to obtain at least one first dependency arc,
calculating the first semantic similarity value between that translated text string and the second text string based on the at least one first dependency arc and the at least one second dependency arc.
11. The device according to claim 10, characterized in that the similarity value computing unit calculates the cosine similarity between the at least one first dependency arc and the at least one second dependency arc, and uses it as the first semantic similarity value between that translated text string and the second text string.
12. The device according to any one of claims 9 to 11, characterized in that the similarity value computing unit uses the score that the dependency structure model assigns to each translated text string as a weight and computes a weighted sum of the K first semantic similarity values to obtain the second semantic similarity value between the first text string and the second text string.
13. The device according to claim 12, characterized in that the text string decoding unit decodes the first text string according to the phrase translation model, the dependency structure model, an n-gram language model, and a word-order distortion model to obtain the K translated text strings.
14. The device according to claim 13, characterized in that the text string decoding unit calculates a comprehensive score Score(T) for any candidate text string T by the following formula:
$$\mathrm{Score}(T) = \lambda_1\,\mathrm{LM}(T) + \lambda_2\,\mathrm{TM}(Q,T) + \lambda_3\,D(Q,T) + \lambda_4\,\mathrm{DEP}(T)$$
where LM(T) is the score of the translated text string T under the n-gram language model, TM(Q, T) is the probability score, under the phrase translation model, of translating the first text string Q into the translated text string T, D(Q, T) is the score, calculated under the word-order distortion model, of translating the first text string Q into the translated text string T, DEP(T) is the score of the translated text string T under the dependency structure model, and $\lambda_1$ to $\lambda_4$ are the weights assigned to the scores of these four models, respectively;
the K translated text strings are selected from among the candidate text strings by the comprehensive score.
15. The device according to claim 14, characterized in that the first text string is decoded by a beam search decoder to obtain the K translated text strings.
16. A search processing device, characterized by comprising:
a search term receiving unit, configured to receive a search term;
a search result acquiring unit, configured to obtain a plurality of search result entries according to the search term;
a semantic similarity value computing unit, configured to calculate, by the device according to any one of claims 9 to 15, a semantic similarity value between the search term and the content title of each of the plurality of search result entries;
a sorting unit, configured to sort the plurality of search result entries according to the calculated semantic similarity values;
a sending unit, configured to send the sorted search result entries.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410728432.4A CN104462060B (en) 2014-12-03 2014-12-03 Computer-implemented method and device for calculating text similarity and search processing

Publications (2)

Publication Number Publication Date
CN104462060A (en) 2015-03-25
CN104462060B (en) 2017-08-01

Family

ID=52908130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410728432.4A Active CN104462060B (en) 2014-12-03 2014-12-03 Computer-implemented method and device for calculating text similarity and search processing

Country Status (1)

Country Link
CN (1) CN104462060B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010029443A1 (en) * 2000-03-27 2001-10-11 International Business Machines Corporation Machine translation system, machine translation method, and storage medium storing program for executing machine translation method
CN101169780A (en) * 2006-10-25 2008-04-30 华为技术有限公司 Semantic ontology retrieval system and method
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN102567306A (en) * 2011-11-07 2012-07-11 苏州大学 Acquisition method and acquisition system for similarity of vocabularies between different languages
CN102637163A (en) * 2011-01-09 2012-08-15 华东师范大学 Method and system for controlling multi-level ontology matching based on semantemes
CN102737013A (en) * 2011-04-02 2012-10-17 三星电子(中国)研发中心 Device and method for identifying statement emotion based on dependency relation
EP2541435A1 (en) * 2010-02-26 2013-01-02 National Institute of Information and Communication Technology Relational information expansion device, relational information expansion method and program

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021346A (en) * 2016-05-09 2016-10-12 北京百度网讯科技有限公司 A retrieval processing method and device
CN106021346B (en) * 2016-05-09 2020-01-07 北京百度网讯科技有限公司 Retrieval processing method and device
CN106227771B (en) * 2016-07-15 2019-05-07 浙江大学 A kind of domain expert's discovery method based on socialization programming website
CN106227771A (en) * 2016-07-15 2016-12-14 浙江大学 A kind of domain expert based on socialization's programming website finds method
CN107784037B (en) * 2016-08-31 2022-02-01 北京搜狗科技发展有限公司 Information processing method and device, and device for information processing
CN107784037A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 Information processing method and device, the device for information processing
CN106503175A (en) * 2016-11-01 2017-03-15 上海智臻智能网络科技股份有限公司 The inquiry of Similar Text, problem extended method, device and robot
CN106503175B (en) * 2016-11-01 2019-03-29 上海智臻智能网络科技股份有限公司 Inquiry, problem extended method, device and the robot of Similar Text
CN106776782B (en) * 2016-11-21 2020-05-22 北京百度网讯科技有限公司 Semantic similarity obtaining method and device based on artificial intelligence
CN106776782A (en) * 2016-11-21 2017-05-31 北京百度网讯科技有限公司 Semantic similarity acquisition methods and device based on artificial intelligence
US11216844B2 (en) * 2017-03-29 2022-01-04 Ebay Inc. Generating keywords by associative context with input words
US11769173B2 (en) 2017-03-29 2023-09-26 Ebay Inc. Generating keywords by associative context with input words
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity
CN107729300B (en) * 2017-09-18 2021-12-24 百度在线网络技术(北京)有限公司 Text similarity processing method, device and equipment and computer storage medium
CN107885737A (en) * 2017-12-27 2018-04-06 传神语联网网络科技股份有限公司 A kind of human-computer interaction interpretation method and system
CN107885737B (en) * 2017-12-27 2021-04-27 传神语联网网络科技股份有限公司 Man-machine interactive translation method and system
CN111414531B (en) * 2020-03-20 2023-08-08 北京百度网讯科技有限公司 Event searching method and device and electronic equipment
CN111708942B (en) * 2020-06-12 2023-08-08 北京达佳互联信息技术有限公司 Multimedia resource pushing method, device, server and storage medium
CN111708942A (en) * 2020-06-12 2020-09-25 北京达佳互联信息技术有限公司 Multimedia resource pushing method, device, server and storage medium
CN111881669A (en) * 2020-06-24 2020-11-03 百度在线网络技术(北京)有限公司 Synonymy text acquisition method and device, electronic equipment and storage medium
CN112182348A (en) * 2020-11-09 2021-01-05 百度国际科技(深圳)有限公司 Semantic matching judgment method and device, electronic equipment and computer readable medium
CN112182348B (en) * 2020-11-09 2024-03-29 百度国际科技(深圳)有限公司 Semantic matching judging method, device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
CN104462060B (en) 2017-08-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180116

Address after: 2nd floor, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing 100085

Patentee after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Address before: 3rd floor, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing 100085

Patentee before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.