CN106484664A

CN106484664A - Method for calculating similarity between short texts

Info

Publication number: CN106484664A
Application number: CN201610920608.5A
Authority: CN
Inventors: 简仁贤; 陈秀龙
Original assignee: Intelligent Technology (shanghai) Co Ltd
Current assignee: Intelligent Technology (shanghai) Co Ltd
Priority date: 2016-10-21
Filing date: 2016-10-21
Publication date: 2017-03-08
Anticipated expiration: 2036-10-21
Also published as: CN106484664B

Abstract

The invention discloses a method for calculating the similarity between short texts. Corpus data is obtained and preprocessed to yield a training corpus. From the training corpus, a keyword extraction model is obtained, the corpus is segmented with a word segmentation tool, and a word vector set is trained with word2vec. The user input text and the question of a candidate question-answer pair are obtained, and a word segmentation result and a keyword extraction result are produced for each. Based on these results, the word vectors of the candidate question and of the user input text are obtained from the word vector set, a sentence vector is computed from the word vectors of each text, and the similarity between the two sentence vectors is calculated. The similarity is then corrected using information contained in the user input text and in the candidate question, giving a corrected similarity. The invention computes the cosine similarity between the sentence vectors of the user input and of the candidate question, and corrects that similarity according to sentence type, named entities and pronouns.

Description

Method for calculating similarity between short texts

Technical field

The present invention relates to the field of Internet technology, and more particularly to the field of intelligent human-computer dialogue.

Background art

With the continuing spread of information technology through society and the steady rise in the cost of manual service, people increasingly want to communicate with computers in natural language. Intelligent human-computer chat systems emerged against this background.

Existing human-computer dialogue systems are implemented mainly in two ways: retrieval models and generative models. A retrieval model treats one round of dialogue as one information retrieval operation. A certain volume of question-answer pairs (each consisting of one question and several answers) is prepared in advance, and an index is built over the questions. When the user enters a sentence or a few sentences, the input is treated as a query: the question semantically closest to it is found among all candidate question-answer pairs, and the answer to that question is returned to the user, completing one round of dialogue. The key to obtaining an appropriate answer is therefore finding the question that is semantically most similar to the user's input. Because both the user input and the candidate questions in a human-computer dialogue system are usually short texts of one or a few short sentences, the problem reduces to calculating the similarity between short texts.
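The retrieval loop described above can be sketched as follows. This is a minimal illustration only: the names retrieve_answers and word_overlap, the toy candidate data, and the token-overlap placeholder similarity are assumptions made for the example and do not come from the patent, whose actual similarity function is developed in the sections that follow.

```python
from typing import Callable, List, Tuple

# Hypothetical candidate store: each entry is (question, list of answers).
QAPair = Tuple[str, List[str]]


def retrieve_answers(user_text: str,
                     candidates: List[QAPair],
                     similarity: Callable[[str, str], float]) -> List[str]:
    """Return the answers of the candidate question most similar to user_text."""
    _, best_answers = max(candidates,
                          key=lambda pair: similarity(user_text, pair[0]))
    return best_answers


def word_overlap(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of whitespace-separated tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


candidates = [
    ("what is tasty in Beijing", ["Try the roast duck."]),
    ("how is the weather today", ["It is sunny."]),
]
answers = retrieve_answers("what food is tasty in Beijing", candidates, word_overlap)
```

Passing the similarity function as a parameter keeps the retrieval step independent of how similarity is computed, so a better measure can be swapped in without touching the loop.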

In the prior art, the similarity between short texts is calculated by converting the user input and the question of each candidate question-answer pair into sentence vectors of the same dimension, where each dimension holds the TF*IDF value of a word (segmented token) of the user input or of the candidate question. The similarity between the two is then measured, for example by cosine similarity, and used to rank all candidate question-answer pairs. This is the usual method in search engines. However, looking for the most similar question through the cosine similarity of TF*IDF vectors considers only the textual similarity between sentences, that is, it judges similarity by how many tokens are literally repeated. This is clearly insufficient. For example, "I am very tired" and "I want to sleep" have nearly the same meaning but share almost no words, and the method cannot handle this case. In addition, because a human-computer dialogue system usually deals with short sentences, TF is almost always 1 and contributes little, which further weakens the method.
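The weakness of the prior-art approach can be reproduced with a small sketch. The code below is an illustrative assumption, not the patent's implementation: it uses a smoothed IDF variant, English tokens in place of segmented Chinese words, and a third sentence ("I am very happy") added only for contrast.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF*IDF vectors for a list of tokenised sentences."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) so a word found in every sentence keeps some weight.
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf})
    return vectors


def sparse_cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    ["I", "am", "very", "tired"],
    ["I", "want", "to", "sleep"],
    ["I", "am", "very", "happy"],   # contrast sentence, not from the source
]
vecs = tfidf_vectors(docs)
sim_tired_sleep = sparse_cosine(vecs[0], vecs[1])
sim_tired_happy = sparse_cosine(vecs[0], vecs[2])
```

In this toy setting the literal-overlap measure rates "tired" closer to "happy" than to "sleep", because only shared tokens contribute to the dot product.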

The defect of the prior art is therefore that computing the cosine similarity between the TF*IDF vectors of the user input and of the candidate question considers only textual similarity: sentences are compared solely by the number of literally repeated tokens. The similarity judgment is consequently inaccurate, which directly makes the replies of the human-computer dialogue system inaccurate.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for calculating similarity between short texts. Word segmentation and keyword extraction are applied to the user input and to the question of a candidate question-answer pair, the corresponding word vectors are obtained, the corresponding sentence vectors are computed from the word vectors, and the similarity between the two sentence vectors is calculated. Finally the similarity is corrected according to sentence type, named entities and pronouns, which makes it more accurate and in turn improves the accuracy of replies in a human-computer dialogue system.

To solve the above technical problem, the technical solution provided by the present invention is as follows.

The present invention provides a method for calculating similarity between short texts, comprising:

Step S1: obtain corpus data and preprocess the corpus data to obtain a training corpus;

Step S2: obtain a keyword extraction model from the training corpus, segment the training corpus with a word segmentation tool, and train a word vector set with word2vec;

Step S3: obtain the user input text and the question of a candidate question-answer pair; segment the candidate question and the user input text separately with the word segmentation tool, and extract keywords from each with the keyword extraction model, obtaining the word segmentation result and keyword extraction result of the candidate question and the word segmentation result and keyword extraction result of the user input text;

Step S4: according to the word segmentation result and keyword extraction result of the candidate question, obtain the word vectors of the candidate question from the word vector set; according to the word segmentation result and keyword extraction result of the user input text, obtain the word vectors of the user input text from the word vector set;

Step S5: compute the sentence vector of the candidate question from its word vectors, and compute the sentence vector of the user input text from its word vectors;

Step S6: calculate the similarity between the sentence vector of the candidate question and the sentence vector of the user input text;

Step S7: starting from the similarity between the sentence vectors, correct the similarity using information contained in the user input text and in the candidate question, obtaining a corrected similarity.

In the technical solution of the present invention, corpus data is first obtained and preprocessed to yield a training corpus. From the training corpus a keyword extraction model is obtained, the training corpus is segmented with a word segmentation tool, and a word vector set is trained with word2vec. The user input text and the question of a candidate question-answer pair are then obtained; each is segmented with the word segmentation tool and passed through the keyword extraction model, giving a word segmentation result and a keyword extraction result for the candidate question and for the user input text. According to these results, the word vectors of the candidate question and of the user input text are obtained from the word vector set. The sentence vector of the candidate question is computed from its word vectors, and the sentence vector of the user input text from its word vectors. The similarity between the two sentence vectors is calculated. Finally, the similarity is corrected using information contained in the user input text and in the candidate question, giving the corrected similarity.

The method segments the user input text and the candidate question and extracts their keywords, obtains the word vectors of both texts from the segmented words and keywords, and computes a sentence vector for each text from its word vectors. The cosine similarity between the two sentence vectors is then calculated and further corrected using information contained in the user input text and in the candidate question. This yields a more accurate similarity and makes the replies of the human-computer dialogue system more accurate.

Further, the information contained in the user input text and in the question of the candidate question-answer pair is the sentence type, named entities and personal pronouns, where the named entities include place names and organization names.

The similarity calculated between the sentence vector of the user input text and that of the candidate question is accurate in most cases. When sentence type, named entities or personal pronouns matter, however, the similarity between the two sentence vectors alone is not an accurate basis for judging whether the texts are semantically similar, so the similarity must be corrected. The present invention therefore also analyzes the information in the user input text and in the candidate question, namely the sentence type, named entities and personal pronouns, and uses it to correct the similarity further, which improves the accuracy with which the human-computer dialogue system answers user questions.

Further, in step S2, obtaining the keyword extraction model comprises:

Step S21: obtain a keyword training corpus and segment it to obtain a word segmentation result;

Step S22: manually annotate the keywords in the word segmentation result, obtaining a manually annotated keyword training corpus;

Step S23: train a maximum entropy model on the manually annotated keyword training corpus to obtain the keyword extraction model.

The keyword extraction model picks out the keywords among the segmented words; the segmented words therefore include the keywords. Because keywords better represent the meaning of a text, calculating similarity from the segmented words together with the extracted keywords is more accurate than using the segmented words alone. To train the keyword extraction model, a keyword training corpus is first obtained, which may differ from the corpus used to train the word vectors. Keywords are then marked among the segmented words by manual annotation, and a model is trained with the maximum entropy method. For any new, unannotated input text, the model automatically outputs which words are keywords and which are not. The resulting keyword set helps improve the similarity between sentence vectors.

Further, the keyword extraction model is a binary classifier. This classifier predicts which words in a sentence are keywords, which improves the accuracy of keyword extraction.

Further, the word vector set is obtained by training a word2vec model. The word2vec training tool is a neural network model. With this training method, words that often occur together receive more similar word vectors; in other words, the semantic information in the word vectors is captured from word co-occurrence. Combining the word vectors trained with word2vec with the keyword information yields a more accurate sentence vector and therefore a more accurate similarity.

Further, the word vectors of the candidate question and the word vectors of the user input text have the same dimension. Identical dimensions make it straightforward to compute the mean of the word vectors of all segmented words and the mean of the keyword word vectors, then the sentence vector of the candidate question and the sentence vector of the user input text, and finally the similarity between the two sentence vectors. The word vectors obtained must therefore all have the same dimension.

Further, the named entities and personal pronouns are identified by a dictionary method. The similarity can be corrected using named entities and personal pronouns; among named entities the present invention mainly considers place names and organization names. The named-entity correction addresses cases in which a difference in, for example, a place name makes two sentences semantically dissimilar. "What is good to eat in Beijing" and "What is good to eat in Tianjin" differ only in the place name, so their computed similarity is high, and the similarity needs to be corrected using the named entities. Whether the place names or organization names occurring in the two sentences agree can be checked directly, which improves the similarity judgment between sentences. The present invention therefore uses a dictionary method: the dictionary file contains the main prefecture-level cities of China, and the place names are independent of one another, with no inclusion relation between them.

Further, the corpus data is obtained by web crawling. A web crawler is a program that fetches web page content automatically, and it is used to collect a large amount of corpus material. Posts and replies are crawled from post bars (tieba), question-and-answer communities, forums, microblogs, encyclopedias, news sites and the like, with a preference for content that is relatively long, semantically rich and at the same time fairly colloquial. This makes the corpus comprehensive and rich. The choice of corpus affects the quality of the trained models and ultimately the similarity.

Further, the similarity between the sentence vectors is calculated by cosine similarity. The cosine of the angle between the two sentence vectors is computed; the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two sentence vectors are. The cosine calculation is fast and simple, which benefits system performance.

Further, the word segmentation tool is the Chinese processing toolkit HanLP. The training corpus is segmented with this tool. HanLP (Han Language Processing) is a free, open-source Chinese language processing package, a Java toolkit made up of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation and related functions, as well as lexical analysis, syntactic analysis and semantic understanding. HanLP is feature-complete, efficient, clearly architected, trained on up-to-date corpora, and customizable. The present invention therefore selects HanLP as the word segmentation tool.

Brief description of the drawings

To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below.

Fig. 1 shows a flow chart of the method for calculating similarity between short texts provided by the first embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the technical solution of the present invention are described in detail below with reference to the drawing. The following embodiments serve only to illustrate the technical solution more clearly; they are examples and do not limit the scope of protection of the present invention.

Embodiment one

Fig. 1 shows a flow chart of the method for calculating similarity between short texts provided by the first embodiment of the present invention. As shown in Fig. 1, the method according to the first embodiment comprises:

Step S1: obtain corpus data and preprocess the corpus data to obtain a training corpus;

Step S2: obtain a keyword extraction model from the training corpus, segment the training corpus with a word segmentation tool, and train a word vector set with word2vec;

Step S3: obtain the user input text and the question of a candidate question-answer pair; segment the candidate question and the user input text separately with the word segmentation tool, and extract keywords from each with the keyword extraction model, obtaining the word segmentation result and keyword extraction result of the candidate question and the word segmentation result and keyword extraction result of the user input text;

Step S4: according to the word segmentation result and keyword extraction result of the candidate question, obtain the word vectors of the candidate question from the word vector set; according to the word segmentation result and keyword extraction result of the user input text, obtain the word vectors of the user input text from the word vector set;

Step S5: compute the sentence vector of the candidate question from its word vectors, and compute the sentence vector of the user input text from its word vectors;

Step S6: calculate the similarity between the sentence vector of the candidate question and the sentence vector of the user input text;

Step S7: starting from the similarity between the sentence vectors, correct the similarity using information contained in the user input text and in the candidate question, obtaining a corrected similarity.

In the technical solution of the present invention, corpus data is first obtained and preprocessed to yield a training corpus. From the training corpus a keyword extraction model is obtained, the training corpus is segmented with a word segmentation tool, and a word vector set is trained with word2vec. The user input text and the question of a candidate question-answer pair are obtained; each is segmented with the word segmentation tool and passed through the keyword extraction model, giving a word segmentation result and a keyword extraction result for the candidate question and for the user input text. The sentence vector of the candidate question is then computed from its word vectors, and the sentence vector of the user input text from its word vectors. The similarity between the two sentence vectors is calculated and then corrected using information contained in the user input text and in the candidate question, giving the corrected similarity.

The method segments the user input text and the candidate question and extracts their keywords, obtains the word vectors of both texts from the segmented words, the keywords and the word vector set, and computes a sentence vector for each text from its word vectors. The cosine similarity between the two sentence vectors is then calculated and further corrected using information contained in the user input text and in the candidate question. This yields a more accurate similarity and makes the replies of the human-computer dialogue system more accurate.

Specifically, the corpus data is obtained by web crawling. A web crawler is a program that fetches web page content automatically, and it is used to collect a large amount of corpus material. Posts and replies are crawled from post bars (tieba), question-and-answer communities, forums, microblogs, encyclopedias, news sites and the like, with a preference for content that is relatively long, semantically rich and at the same time fairly colloquial. This makes the corpus comprehensive and rich. The choice of corpus affects the quality of the trained models, directly affects word segmentation, keyword extraction and the word vector set, and ultimately affects the similarity.

The crawled corpus data must also be preprocessed to obtain the training corpus. Non-Chinese content, pornographic content, advertisements and the like are filtered out. The paragraphs belonging to the same piece of content are then joined into one line, traditional Chinese characters are converted to simplified characters, the text is segmented, and punctuation marks are removed and replaced with spaces.
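The preprocessing steps above could look roughly like the following sketch. Everything named here is an assumption for illustration: the four-character T2S table stands in for a full traditional-to-simplified conversion table, BLOCKLIST stands in for the advertising and pornography filter, and word segmentation is left out because it would be done by the segmentation tool.

```python
import re
from typing import Dict, List, Optional

# Hypothetical excerpt of a traditional-to-simplified table; a real one is far larger.
T2S: Dict[str, str] = {"語": "语", "這": "这", "麼": "么", "說": "说"}
# Hypothetical blocklist standing in for the advertising / pornography filter.
BLOCKLIST = ("广告", "促销")

CJK = re.compile(r"[\u4e00-\u9fff]")      # at least one Chinese character
PUNCTUATION = re.compile(r"[^\w\s]|_")    # anything that is not a word character or space


def preprocess(paragraphs: List[str]) -> Optional[str]:
    """Clean the paragraphs of one document; return one line, or None if filtered out."""
    # Join the paragraphs of the same content into a single line.
    text = " ".join(p.strip() for p in paragraphs if p.strip())
    # Convert traditional characters to simplified ones.
    text = "".join(T2S.get(ch, ch) for ch in text)
    # Drop non-Chinese content and blocklisted content.
    if not CJK.search(text) or any(word in text for word in BLOCKLIST):
        return None
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = PUNCTUATION.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess(["這是什麼，", "你說呢？"])
```

Returning None for filtered documents lets the caller skip them with a simple check while building the training corpus.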

After the corpus data has been preprocessed, the word vector of each word is obtained by training a word2vec model. The word2vec training tool is a neural network model, and the semantic information in the word vectors it produces is captured from word co-occurrence. Combining the word vector set trained with word2vec with the keyword information yields a more accurate sentence vector and therefore a more accurate similarity.

The keyword extraction model is obtained from the training corpus. The corpus is first segmented, the keywords in each sentence are then annotated manually (words left unmarked are non-keywords), and a binary classifier is trained with the maximum entropy method. After segmentation, the user input text and the candidate question are fed to the keyword extraction model, which performs a binary classification on each segmented word to predict whether it is a keyword; this yields the keyword set of each text. To improve system performance further, segmentation and keyword extraction for the questions of all question-answer pairs can be done in advance. The word segmentation tool and the keyword extraction model thus give the segmented words and keywords of the user input text and of the candidate questions. The word vector of each of these words is then looked up in the word vector set trained with word2vec.
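A per-token binary classifier of this kind can be sketched as below. A binary maximum entropy model is equivalent to logistic regression, so the sketch trains a small logistic regression by stochastic gradient ascent. The feature set, the stop-word list, the English stand-in tokens and the labelled examples are all hypothetical; the patent does not state which features or training procedure are used.

```python
import math
from typing import Dict, List, Set, Tuple

STOPWORDS: Set[str] = {"I", "am", "to", "the", "is"}   # hypothetical stop-word list


def features(token: str) -> Dict[str, float]:
    """Hypothetical features for one segmented word."""
    return {
        "bias": 1.0,
        "is_stopword": 1.0 if token in STOPWORDS else 0.0,
        "is_long": 1.0 if len(token) >= 5 else 0.0,
    }


def score(weights: Dict[str, float], feats: Dict[str, float]) -> float:
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train(labelled: List[Tuple[str, int]], epochs: int = 200, lr: float = 0.5) -> Dict[str, float]:
    """Fit a binary maximum entropy (logistic regression) model on labelled tokens."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for token, label in labelled:
            feats = features(token)
            p = 1.0 / (1.0 + math.exp(-score(weights, feats)))   # P(keyword | token)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + lr * (label - p) * value
    return weights


def extract_keywords(tokens: List[str], weights: Dict[str, float]) -> List[str]:
    """Keep the tokens whose predicted keyword probability exceeds 0.5."""
    return [t for t in tokens if score(weights, features(t)) > 0.0]


# Manually labelled tokens: 1 = keyword, 0 = non-keyword (toy data).
LABELLED = [("I", 0), ("want", 0), ("to", 0), ("sleep", 1), ("the", 0),
            ("weather", 1), ("is", 0), ("today", 1), ("very", 0), ("tired", 1)]
weights = train(LABELLED)
keywords = extract_keywords(["I", "am", "very", "sleepy"], weights)
```

Because the candidate questions are known ahead of time, extract_keywords can be run over them once and the results cached, which matches the precomputation the text suggests.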

The sentence vectors of the user input text and of the candidate question are computed in the same way: 0.8 * (mean of the word vectors of all segmented words) + 0.2 * (mean of the keyword word vectors). The mean of a set of vectors is obtained by adding the vectors dimension by dimension and dividing by the number of vectors. Because the segmented words already include the keywords, this calculation gives the keywords extra weight, which is appropriate since keywords better represent the meaning of the text. The weights 0.8 and 0.2 were arrived at through repeated experiments. Because the word vectors all have 300 dimensions, the sentence vectors of the user input and of the candidate question also have 300 dimensions.
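The weighted average above can be written out as a short sketch. The function names and the toy two-dimensional embeddings are assumptions; skipping words missing from the word vector set and using a zero vector when no keyword is found are also assumptions, since the source does not say how those cases are handled.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Dimension-wise mean; a zero vector when the input is empty (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sentence_vector(tokens: List[str],
                    keywords: List[str],
                    embeddings: Dict[str, Vector],
                    dim: int,
                    w_all: float = 0.8,
                    w_key: float = 0.2) -> Vector:
    """0.8 * mean(all word vectors) + 0.2 * mean(keyword word vectors)."""
    # Words absent from the word vector set are skipped (assumption).
    all_vecs = [embeddings[t] for t in tokens if t in embeddings]
    key_vecs = [embeddings[t] for t in keywords if t in embeddings]
    mean_all = mean_vector(all_vecs, dim)
    mean_key = mean_vector(key_vecs, dim)
    return [w_all * a + w_key * k for a, k in zip(mean_all, mean_key)]


# Toy 2-dimensional embeddings; the embodiment uses 300 dimensions.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
vec = sentence_vector(["a", "b"], ["b"], toy_embeddings, dim=2)
```

Since the keyword "b" appears in both means, its direction ends up with more weight in the result than that of the non-keyword "a".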

Specifically, the word segmentation tool is the Chinese processing toolkit HanLP. The training corpus is segmented with this tool. HanLP (Han Language Processing) is a free, open-source Chinese language processing package, a Java toolkit made up of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation and related functions, as well as lexical analysis, syntactic analysis and semantic understanding. HanLP is feature-complete, efficient, clearly architected, trained on up-to-date corpora, and customizable. The present invention therefore selects HanLP as the word segmentation tool.

Once the sentence vectors of the user input text and of the candidate question have been obtained, the similarity between them is calculated by cosine similarity, with values taken to lie in [0, 1]. The cosine of the angle between the two vectors is computed; the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two sentence vectors are. The cosine calculation is fast and simple, which benefits system performance.
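For reference, the cosine calculation can be sketched in a few lines. The function name and the choice to return 0.0 for a zero vector are assumptions. Note that cosine similarity in general ranges over [-1, 1]; the [0, 1] range given in the text holds when the two vectors never point in opposing directions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two dense vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
diagonal = cosine_similarity([1.0, 1.0], [1.0, 0.0])
```

The measure depends only on direction, not length, which is why the two parallel vectors in the example count as fully similar.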

Most importantly, the present invention corrects the similarity using information contained in the user input text and in the candidate question. This information mainly comprises the sentence type, named entities and personal pronouns; among named entities the present invention mainly considers place names and organization names. The sentence vector similarity described above does not take this information into account, so the information is used to correct it.

Based on everyday conversational experience and on experimental results, the similarity is corrected in the following three cases:

Case 1: the similarity is corrected according to the sentence type.

When the user input is a yes-no question, for example "Did you go to the Temple of Heaven yesterday?", or an A-not-A question, for example "Did you go to the Temple of Heaven or not?", its meaning usually differs considerably from that of a candidate question that is a declarative sentence. That is, if the user's short text is a yes-no question or an A-not-A question and the candidate question is a declarative sentence, the similarity obtained must be reduced. Likewise, if the user input is a declarative sentence and the candidate question is a yes-no question or an A-not-A question, the similarity must be reduced. The exact reduction ratio has to be determined experimentally for the question-answer corpus the system uses; based on the existing corpus and experiments, the present invention suggests reducing the similarity by about 30%.

Declarative sentences and negative sentences are handled similarly. In the implementation, the system determines the sentence type using linguistic rules and sentence templates.
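A rule-based sentence-type check and the resulting correction might be sketched as follows. The rules are hypothetical stand-ins for the linguistic rules and sentence templates mentioned above, which the source does not list; the Chinese example sentences are assumed renderings of the examples in the text; and "reduce by about 30%" is interpreted here as multiplying the similarity by 0.7.

```python
import re

YES_NO, A_NOT_A, DECLARATIVE, OTHER = "yes_no", "a_not_a", "declarative", "other"

# Hypothetical rule: the same character on both sides of 不 or 没 (是不是, 有没有, 去不去).
A_NOT_A_PATTERN = re.compile(r"(.)[不没]\1")


def sentence_type(text: str) -> str:
    """Classify a sentence with a few illustrative rules."""
    stripped = text.rstrip("？?。.！! ")
    if A_NOT_A_PATTERN.search(stripped):
        return A_NOT_A
    if stripped.endswith("吗"):
        return YES_NO
    if text.rstrip().endswith(("？", "?")):
        return OTHER          # some other kind of question, left uncorrected here
    return DECLARATIVE


def correct_for_sentence_type(similarity: float, user_text: str,
                              candidate_question: str, factor: float = 0.7) -> float:
    """Reduce the similarity when a yes-no or A-not-A question meets a declarative sentence."""
    questions = {YES_NO, A_NOT_A}
    t_user, t_cand = sentence_type(user_text), sentence_type(candidate_question)
    mismatch = ((t_user in questions and t_cand == DECLARATIVE)
                or (t_user == DECLARATIVE and t_cand in questions))
    return similarity * factor if mismatch else similarity


# "Did you go to the Temple of Heaven yesterday?" vs. "I went to the Temple of Heaven yesterday."
corrected = correct_for_sentence_type(0.9, "你昨天去了天坛吗？", "我昨天去了天坛。")
```

Keeping the reduction factor as a parameter reflects the text's point that the ratio should be tuned experimentally for each corpus.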

Case 2: the similarity is corrected according to named entities.

If the user input text and the candidate question each contain a named entity of the same type (for example, both contain a place name, or both contain an organization name), but the place names differ and there is no inclusion relation between them (Beijing and Haidian District, for instance, are in an inclusion relation), the similarity obtained must be reduced. The exact reduction ratio has to be determined experimentally for the corpus the system uses; based on the existing corpus and experiments, the present invention suggests reducing the similarity by about 50%. In the implementation, to bound the scope of the problem and keep the system efficient, the present invention uses a dictionary method: the dictionary file contains the main prefecture-level cities of China, and the place names are independent of one another, with no inclusion relation between them. This avoids an excessively high semantic similarity caused by place names being correlated in the training corpus. For example, in "What is good to eat in Beijing?" and "What is good to eat in Shanghai?", "Beijing" and "Shanghai" often appear together in the training corpus, so their word vectors are closely related, yet the two sentences differ considerably in meaning. The same reasoning applies to organization names.
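The dictionary-based place-name correction could be sketched as below. The three-city PLACE_NAMES set is a hypothetical excerpt of the dictionary file, the Chinese sentences are assumed renderings of the examples in the text, and "reduce by about 50%" is interpreted as multiplying by 0.5. Treating any difference between the two sets of matched names as a mismatch is also an assumption for texts that mention more than one place.

```python
from typing import Set

# Hypothetical excerpt of the place-name dictionary (mutually non-inclusive city names).
PLACE_NAMES: Set[str] = {"北京", "上海", "天津"}


def find_entities(text: str, dictionary: Set[str]) -> Set[str]:
    """Return the dictionary entries that occur in the text."""
    return {name for name in dictionary if name in text}


def correct_for_entities(similarity: float, user_text: str, candidate_question: str,
                         dictionary: Set[str] = PLACE_NAMES,
                         factor: float = 0.5) -> float:
    """Reduce the similarity when both texts name places and the places differ."""
    user_places = find_entities(user_text, dictionary)
    cand_places = find_entities(candidate_question, dictionary)
    if user_places and cand_places and user_places != cand_places:
        return similarity * factor
    return similarity


# "What is good to eat in Beijing?" vs. "What is good to eat in Shanghai?"
corrected_places = correct_for_entities(0.9, "北京有什么好吃的？", "上海有什么好吃的？")
```

Because the dictionary entries are chosen to have no inclusion relation, a plain set comparison is enough and no containment check is needed.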

In the third case, the similarity is corrected according to personal pronouns.

If the user input text and the question of the candidate question-answer pair each contain a pronoun, for example the user input text is "I am going to the Temple of Heaven to play today" and the question of the candidate question-answer pair is "He is going to the Temple of Heaven to play today", then the personal pronouns in the two sentences differ and the computed similarity needs to be reduced. The specific reduction ratio must be determined by experiment on the corpus used by the system; based on its existing corpus and experiments, the present invention suggests reducing the similarity by about 50%. The implementation again uses a dictionary method, and the dictionary file contains common pronouns.
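
The pronoun correction can be sketched in the same way. The example below works on word segmentation results rather than raw strings, so that a pronoun character inside a longer word (such as 他 in 其他) is not matched; the pronoun list and the function names are assumptions for illustration.

```python
# Hypothetical dictionary of common personal pronouns.
PRONOUNS = frozenset({"我", "你", "您", "他", "她", "它",
                      "我们", "你们", "他们", "她们", "咱们"})


def pronouns_in(tokens) -> set:
    """Return the personal pronouns found among the segmented tokens."""
    return {token for token in tokens if token in PRONOUNS}


def correct_for_pronouns(similarity: float, user_tokens, candidate_tokens,
                         reduction: float = 0.50) -> float:
    """Lower the similarity when both texts use different pronouns."""
    a = pronouns_in(user_tokens)
    b = pronouns_in(candidate_tokens)
    if a and b and a != b:
        return similarity * (1.0 - reduction)
    return similarity
```

For the example in the text, the token lists `["我", "今天", "去", "天坛", "玩"]` and `["他", "今天", "去", "天坛", "玩"]` contain different pronouns, so the similarity is reduced.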

It should be noted that, in addition to the three correction methods above, the present invention may also judge the semantics of the text in other ways to further correct the similarity.

With the above method, the present invention can accurately calculate the semantic similarity between short texts in an interactive system, make better use of limited question-answer data, and improve the user experience of the interactive system.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the claims and the description of the present invention.

Claims

1. A method for calculating similarity between short texts, characterized by comprising:

Step S1, obtaining corpus data and preprocessing the corpus data to obtain a corpus;

Step S2, obtaining a keyword extraction model according to the corpus, performing word segmentation on the corpus using a word segmentation tool, and training with word2vec to obtain a word vector set;

Step S3, obtaining a user input text and the question of a candidate question-answer pair; performing word segmentation on the question of the candidate question-answer pair and on the user input text respectively using the word segmentation tool, and performing keyword extraction on the question of the candidate question-answer pair and on the user input text respectively using the keyword extraction model, to obtain the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the word segmentation result and keyword extraction result of the user input text;

Step S4, obtaining word vectors from the word vector set according to the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and obtaining the word vectors of the user input text from the word vector set according to the word segmentation result of the user input text and the keyword extraction result of the question of the candidate question-answer pair;

Step S5, calculating the sentence vector of the question of the candidate question-answer pair according to the word vectors of the question of the candidate question-answer pair, and calculating the sentence vector of the user input text according to the word vectors of the user input text;

Step S6, calculating the similarity between the two sentence vectors according to the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;

Step S7, correcting the similarity between the sentence vectors according to the information contained in the text entered by the user and in the question of the candidate question-answer pair, to obtain a corrected similarity.

2. The method for calculating similarity between short texts according to claim 1, characterized in that

the information contained in the text entered by the user and in the question of the candidate question-answer pair is the sentence type of the text, named entities and personal pronouns, the named entities including place names and organization names.

3. The method for calculating similarity between short texts according to claim 1, characterized in that

in step S2, obtaining the keyword extraction model comprises:

Step S21, obtaining a keyword corpus and performing word segmentation on the keyword corpus to obtain a word segmentation result;

Step S22, marking the keywords in the word segmentation result by manual annotation to obtain a manually annotated keyword corpus;

Step S23, training on the manually annotated keyword corpus with the maximum entropy method to obtain the keyword extraction model.

4. The method for calculating similarity between short texts according to claim 1, characterized in that

the keyword extraction model is a binary classifier.

5. The method for calculating similarity between short texts according to claim 1, characterized in that

the word vector set is obtained by training a word2vec model.

6. The method for calculating similarity between short texts according to claim 1, characterized in that

the word vectors of the question of the candidate question-answer pair have the same dimension as the word vectors of the user input text.

7. The method for calculating similarity between short texts according to claim 2, characterized in that

the named entities and personal pronouns are obtained by a dictionary method.

8. The method for calculating similarity between short texts according to claim 1, characterized in that

the corpus data is obtained by web crawler technology.

9. The method for calculating similarity between short texts according to claim 1, characterized in that

the similarity between the sentence vectors is calculated by the cosine similarity method.

10. The method for calculating similarity between short texts according to claim 1, characterized in that

the word segmentation tool is the Chinese processing toolkit HanLP.
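
For illustration only, and not as part of the claims, steps S5 and S6 together with the cosine similarity of claim 9 can be sketched as follows. The claims do not state how a sentence vector is formed from the word vectors; the keyword-weighted average used here, the `keyword_weight` parameter, and the function names are assumptions made for the example.

```python
import math


def sentence_vector(tokens, keywords, word_vectors, keyword_weight=2.0):
    """Weighted average of word vectors (assumed scheme, see lead-in).

    tokens:        word segmentation result of one text
    keywords:      keyword extraction result for the same text
    word_vectors:  mapping from token to a list of floats, all of one dimension
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:          # out-of-vocabulary tokens are skipped
            continue
        weight = keyword_weight if token in keywords else 1.0
        for i, x in enumerate(vec):
            total[i] += weight * x
        weight_sum += weight
    if weight_sum == 0.0:
        return total             # all-zero vector when nothing was found
    return [x / weight_sum for x in total]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The value returned by `cosine_similarity` for the two sentence vectors is the uncorrected similarity of step S6, which step S7 then adjusts according to sentence type, named entities and personal pronouns.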