CN106484664B - Method for calculating similarity between short texts - Google Patents

Publication number
CN106484664B
Authority
CN
China
Prior art keywords
text
similarity
sentence
user
candidate question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610920608.5A
Other languages
Chinese (zh)
Other versions
CN106484664A (en)
Inventor
简仁贤
陈秀龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intelligent Technology (shanghai) Co Ltd
Original Assignee
Intelligent Technology (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelligent Technology (shanghai) Co Ltd
Priority to CN201610920608.5A
Publication of CN106484664A
Application granted
Publication of CN106484664B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Abstract

The invention discloses a method for calculating similarity between short texts. Corpus data is obtained and preprocessed to yield a training corpus. A keyword extraction model is obtained from the training corpus; the training corpus is segmented with a word segmentation tool, and a word vector set is trained with word2vec. The user input text and the question of a candidate question-answer pair are obtained, and a word segmentation result and a keyword extraction result are produced for each. From these results, the word vectors of the candidate question and of the user input text are obtained through the word vector set, sentence vectors are computed from the word vectors, and the similarity between the two sentence vectors is calculated. The similarity is then corrected using information contained in the user input text and in the candidate question, giving a corrected similarity. The invention calculates the cosine similarity between the sentence vector of the user input and the sentence vector of the candidate question, and corrects that similarity according to sentence type, named entities, and pronouns.

Description

Method for calculating similarity between short texts
Technical field
The present invention relates to the field of Internet technology, and in particular to the field of intelligent human-computer dialogue.
Background art
With the continuing advance of informatization in human society and the continuing rise in the cost of manual services, people increasingly wish to communicate with computers in natural language, and intelligent human-machine chat systems have emerged against this background.
Existing human-computer dialogue systems are implemented mainly in two ways: retrieval models and generative models. A retrieval model treats one round of human-computer dialogue as one information retrieval process. A certain volume of question-answer pairs (each consisting of one question and several answers) is prepared in advance, and the questions in these pairs are indexed. When the user inputs one or several sentences, the input is treated as a query: the question semantically closest to it is found among all candidate question-answer pairs, and the answer to that question is returned to the user, completing one round of dialogue. The key to obtaining an appropriate answer is therefore how to find the question most semantically similar to the user's input. Because the user input and the questions in the candidate question-answer pairs usually consist of one or a few short sentences, the problem reduces to calculating the similarity between short texts.
In the prior art, similarity between short texts is calculated by converting the user input and the question of each candidate question-answer pair into sentence vectors of the same dimension, where each dimension holds the TF*IDF value of a word (or token) in the user input or the candidate question. The similarity between the two is then measured, for example by cosine similarity, and used to rank all candidate question-answer pairs. This method is common in search engines. However, finding the most similar question by cosine similarity over TF*IDF vectors considers only the literal similarity between sentences, that is, it judges similarity by how many tokens are literally repeated. This is clearly insufficient. For example, "I am very tired" and "I want to sleep" have much the same meaning but share almost no words, and this method cannot handle such cases. In addition, because dialogue systems usually deal with short sentences, TF is essentially 1 and contributes little, which further weakens the method.
The defect of the prior art is therefore that calculating the cosine similarity between the TF*IDF vectors of the user input and of the candidate question considers only the literal similarity between sentences: similarity can be judged only by how many tokens are literally repeated. This makes the similarity judgment very inaccurate and directly causes the human-computer dialogue system to reply inaccurately to the user.
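The limitation described above can be reproduced with a few lines of Python. The following is a minimal sketch, not the prior-art implementation itself: the smoothed IDF formula, the English tokens, and the third sentence ("I am very happy") are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF*IDF vectors (dicts) for a list of tokenized sentences."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): words found in every sentence keep a nonzero weight.
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    ["i", "am", "very", "tired"],   # example sentence from the text
    ["i", "want", "to", "sleep"],   # similar meaning, almost no shared words
    ["i", "am", "very", "happy"],   # hypothetical: different meaning, many shared words
]
vecs = tfidf_vectors(docs)
sim_paraphrase = cosine(vecs[0], vecs[1])
sim_overlap = cosine(vecs[0], vecs[2])
print(sim_paraphrase, sim_overlap)
```

In this toy setting the sentence with the different meaning scores higher than the paraphrase, because the score depends only on shared tokens.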
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for calculating similarity between short texts. Word segmentation and keyword extraction are performed on the user input and on the question of a candidate question-answer pair to obtain the corresponding word vectors; the corresponding sentence vectors are calculated from the word vectors; the similarity between the two sentence vectors is then calculated; and finally the similarity is corrected according to sentence type, named entities, and pronouns. This makes the similarity more accurate and in turn improves the accuracy of replies to the user in a human-computer dialogue system.
To solve the above technical problem, the present invention provides the following technical solution:
The present invention provides a method for calculating similarity between short texts, comprising:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a training corpus;
Step S2: obtaining a keyword extraction model according to the training corpus, segmenting the training corpus with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining a user input text and the question of a candidate question-answer pair; segmenting the question of the candidate question-answer pair and the user input text separately with the word segmentation tool; and performing keyword extraction on each with the keyword extraction model, to obtain the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the word segmentation result and keyword extraction result of the user input text;
Step S4: obtaining, through the word vector set, the word vectors of the question of the candidate question-answer pair according to its word segmentation result and keyword extraction result, and the word vectors of the user input text according to its word segmentation result and keyword extraction result;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from its word vectors, and calculating the sentence vector of the user input text from its word vectors;
Step S6: calculating the similarity between the two sentence vectors according to the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: correcting the similarity between the sentence vectors using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
In the technical solution of the method for calculating similarity between short texts according to the present invention, corpus data is first obtained and preprocessed to obtain a training corpus. A keyword extraction model is obtained according to the training corpus, the training corpus is segmented with a word segmentation tool, and a word vector set is obtained by word2vec training. The user input text and the question of a candidate question-answer pair are obtained; each is segmented with the word segmentation tool and passed through the keyword extraction model, giving the word segmentation result and keyword extraction result of the question of the candidate question-answer pair and those of the user input text. Then, according to these results, the word vectors of the question of the candidate question-answer pair and the word vectors of the user input text are obtained through the word vector set. The sentence vector of the question of the candidate question-answer pair is calculated from its word vectors, and the sentence vector of the user input text is calculated from its word vectors. The similarity between the two sentence vectors is calculated. Finally, the similarity between the sentence vectors is corrected using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
The method of the present invention segments the user input text and the question of the candidate question-answer pair and extracts their keywords, then obtains the word vectors of both according to the tokens and keywords, then calculates a sentence vector for each from its word vectors, and finally calculates the cosine similarity between the two sentence vectors. The similarity is further corrected using information contained in the user input text and in the question of the candidate question-answer pair, giving a more accurate similarity, so that the answers returned to the user in a human-computer dialogue system are more accurate.
Further, the information contained in the user input text and in the question of the candidate question-answer pair is the sentence type of the text, named entities, and personal pronouns; the named entities include place names and organization names.
The similarity between two sentence vectors is calculated from the sentence vector of the user input text and the sentence vector of the question of the candidate question-answer pair. The similarity obtained this way is accurate in most cases. However, when sentence type, named entities, and personal pronouns need to be considered, relying only on the similarity between the two sentence vectors to judge whether two texts are semantically similar is still not accurate enough. The similarity is therefore corrected: the present invention also analyzes the information in the user input text and in the question of the candidate question-answer pair, namely the sentence type, named entities, and personal pronouns in the text, and uses it to correct the similarity further, thereby improving the accuracy with which the human-computer dialogue system replies to user questions.
Further, in step S2, obtaining the keyword extraction model comprises:
Step S21: obtaining a keyword training corpus and segmenting it to obtain a word segmentation result;
Step S22: marking the keywords in the word segmentation result by manual annotation, to obtain a manually annotated keyword training corpus;
Step S23: obtaining the keyword extraction model by maximum entropy training on the manually annotated keyword training corpus.
The keyword extraction model extracts the keywords from among the tokens, that is, the tokens include the keywords. Because keywords are more representative of the meaning of a text, extracting them and combining them with the tokens gives a more accurate similarity than using the tokens alone. To train the keyword extraction model, a keyword training corpus is first obtained; this corpus may differ from the training corpus used for the word vectors. Keywords among the tokens are then marked by manual annotation, and a model is trained by the maximum entropy method. When any new, unannotated text is input into this model, the model automatically outputs which tokens are keywords and which are not. A keyword set is obtained in this way, which helps improve the similarity between sentence vectors.
Further, the keyword extraction model is a binary classifier. This classifier predicts which words in a sentence are keywords, which improves the accuracy of keyword extraction.
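As an illustration of such a binary keyword classifier, the sketch below trains a two-class maximum entropy model, which is equivalent to logistic regression, with plain stochastic gradient ascent. The feature function `featurize`, the toy annotated tokens, and the hyperparameters are all hypothetical; the patent does not specify the features or the training tool.

```python
import math

def featurize(token):
    """Hypothetical features for one token: a bias term and the token identity."""
    return {"bias": 1.0, "word=" + token: 1.0}

def predict_proba(weights, feats):
    """Probability that the token is a keyword under the current weights."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=200, lr=0.5):
    """samples: list of (token, label), label 1 = keyword, 0 = non-keyword."""
    weights = {}
    for _ in range(epochs):
        for token, label in samples:
            feats = featurize(token)
            err = label - predict_proba(weights, feats)
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights

# Toy manually annotated corpus (hypothetical): tokens of two sentences with keyword labels.
annotated = [
    ("i", 0), ("am", 0), ("very", 0), ("tired", 1),
    ("i", 0), ("want", 0), ("to", 0), ("sleep", 1),
]
weights = train(annotated)

def extract_keywords(tokens, weights, threshold=0.5):
    """Binary classification of each token: keep those predicted to be keywords."""
    return [t for t in tokens if predict_proba(weights, featurize(t)) > threshold]

print(extract_keywords(["i", "am", "very", "tired"], weights))
```

A realistic feature set would likely include part of speech, position, and context words; token identity alone is used here only to keep the sketch short.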
Further, the word vector set is obtained by word2vec model training. The word2vec training tool is a neural network model. In the word vectors it produces, words that often occur together have more similar vectors; that is, the semantic information of the word vectors is captured from word co-occurrence. With word vectors obtained by word2vec training, a more accurate sentence vector can be calculated in combination with the keyword information, making the similarity more accurate.
Further, the word vectors of the question of the candidate question-answer pair and the word vectors of the user input text have the same dimension. The dimensions must match so that the average of the token word vectors and the average of the keyword word vectors can be calculated, from which the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text are computed, and finally the similarity between these two vectors.
Further, the named entities and personal pronouns are obtained by a dictionary method. The similarity can be corrected using named entities and personal pronouns; among named entities, the present invention mainly considers place names and organization names. Correction by named entity addresses cases where a difference in, for example, place name alone makes two sentences semantically dissimilar, as in "What is good to eat in Beijing" and "What is good to eat in Tianjin". Apart from the place name, the two sentences read the same, so the similarity needs to be corrected by named entity. Whether the place names or organization names appearing in the two sentences agree can be used directly in judging their similarity, which improves the judgment of similarity between sentences. The present invention therefore uses a dictionary method: the dictionary file contains the main prefecture-level cities of China, and the place names are mutually independent, with no containment relation between them.
Further, the corpus data is obtained by web crawler technology. A large volume of corpus data is obtained by crawling; a crawler is a program that automatically fetches web page content. Posts and replies from message boards, question-and-answer communities, forums, microblogs, encyclopedias, news sites, and the like are crawled as training corpus, especially those that are longer, semantically rich, and colloquial, so that the corpus is comprehensive and rich. The choice of corpus affects the quality of the trained models and ultimately the similarity.
Further, the similarity between the sentence vectors is calculated by cosine similarity. The cosine between the two sentence vectors is calculated; the closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two sentence vectors are. Cosine calculation is fast and simple and can improve system performance.
Further, the word segmentation tool is the Chinese language processing toolkit hanlp. The training corpus is segmented with a word segmentation tool, and the tool selected by the present invention is hanlp (Han Language Processing). hanlp is an open-source, free Chinese language processing package, a Java toolkit composed of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation, and other functions, and also offers lexical analysis, syntactic analysis, and semantic understanding. hanlp is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora, and customizability. The present invention therefore selects hanlp as the word segmentation tool.
Brief description of the drawings
To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below.
Fig. 1 shows a flowchart of the method for calculating similarity between short texts provided by the first embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the technical solution of the present invention are described in detail below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly; they are examples only and do not limit the scope of protection of the present invention.
Embodiment one
Fig. 1 shows a flowchart of the method for calculating similarity between short texts provided by the first embodiment of the present invention. As shown in Fig. 1, the method according to the first embodiment comprises:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a training corpus;
Step S2: obtaining a keyword extraction model according to the training corpus, segmenting the training corpus with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining a user input text and the question of a candidate question-answer pair; segmenting the question of the candidate question-answer pair and the user input text separately with the word segmentation tool; and performing keyword extraction on each with the keyword extraction model, to obtain the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the word segmentation result and keyword extraction result of the user input text;
Step S4: obtaining, through the word vector set, the word vectors of the question of the candidate question-answer pair according to its word segmentation result and keyword extraction result, and the word vectors of the user input text according to its word segmentation result and keyword extraction result;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from its word vectors, and calculating the sentence vector of the user input text from its word vectors;
Step S6: calculating the similarity between the two sentence vectors according to the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: correcting the similarity between the sentence vectors using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
In the technical solution of the method for calculating similarity between short texts according to the present invention, corpus data is first obtained and preprocessed to obtain a training corpus. A keyword extraction model is obtained according to the training corpus, the training corpus is segmented with a word segmentation tool, and a word vector set is obtained by word2vec training. The user input text and the question of a candidate question-answer pair are obtained; each is segmented with the word segmentation tool and passed through the keyword extraction model, giving the word segmentation result and keyword extraction result of the question of the candidate question-answer pair and those of the user input text. The sentence vector of the question of the candidate question-answer pair is then calculated from its word vectors, and the sentence vector of the user input text is calculated from its word vectors. The similarity between the two sentence vectors is calculated. The similarity is then corrected using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
The method of the present invention segments the user input text and the question of the candidate question-answer pair and extracts their keywords, then obtains the word vectors of both according to the tokens, the keywords, and the word vector set, then calculates a sentence vector for each from its word vectors, and finally calculates the cosine similarity between the two sentence vectors. The similarity is further corrected using information contained in the user input text and in the question of the candidate question-answer pair, giving a more accurate similarity, so that the answers returned to the user in a human-computer dialogue system are more accurate.
Specifically, the corpus data is obtained by web crawler technology. A large volume of corpus data is obtained by crawling; a crawler is a program that automatically fetches web page content. Posts and replies from message boards, question-and-answer communities, forums, microblogs, encyclopedias, news sites, and the like are crawled as training corpus, especially those that are longer, semantically rich, and colloquial, so that the corpus is comprehensive and rich. The choice of corpus affects the quality of the trained models, and so directly affects word segmentation, keyword extraction, and the word vector set, and ultimately the similarity.
The crawled corpus data is also preprocessed to obtain the training corpus. Non-Chinese content, pornographic content, advertisements, and the like are filtered out to some extent. Multi-paragraph text belonging to the same content is then joined into a single line, traditional Chinese characters are converted to simplified Chinese characters, the text is segmented, and punctuation marks are removed and replaced with spaces.
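Part of this preprocessing can be sketched in Python as follows. This is only an assumed, simplified version: the Chinese-character ratio threshold, the punctuation set, and the `blocklist` keyword filter are hypothetical stand-ins for the filtering the text describes, and traditional-to-simplified conversion and word segmentation are left out because they need external tools.

```python
import re
import string

# Characters in the basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")
# ASCII punctuation plus a few common full-width marks (illustrative, not exhaustive).
PUNCT = re.compile("[" + re.escape(string.punctuation + "，。！？、；：“”‘’（）《》…") + "]+")

def is_mostly_chinese(text, threshold=0.5):
    """Return True if at least `threshold` of the non-space characters are Chinese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(1 for c in chars if CJK.match(c)) / len(chars) >= threshold

def preprocess(paragraphs, blocklist=()):
    """Join the paragraphs of one post into a line, filter it, and strip punctuation.

    Returns the cleaned line, or None if the post is filtered out.
    """
    line = " ".join(p.strip() for p in paragraphs if p.strip())
    if not is_mostly_chinese(line):
        return None  # non-Chinese content
    if any(word in line for word in blocklist):
        return None  # hypothetical keyword filter for ads and similar content
    line = PUNCT.sub(" ", line)  # punctuation is replaced with spaces
    return re.sub(r"\s+", " ", line).strip()

print(preprocess(["今天天气很好，", "我想去天坛玩！"]))
```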
After the corpus data has been preprocessed, the word vector of each word is obtained by word2vec model training. The word2vec training tool is a neural network model, and the semantic information of the word vectors it produces is captured from word co-occurrence. With the word vector set obtained by word2vec training, a more accurate sentence vector can be calculated in combination with the keyword information, making the similarity more accurate.
The keyword extraction model is obtained from the training corpus. The training corpus is first segmented, the keywords in each sentence are then marked manually (unmarked tokens are non-keywords), and a binary classifier is trained by maximum entropy. After segmentation, the user input text and the question of the candidate question-answer pair are fed into the keyword extraction model, which makes a binary decision for each token, predicting whether it is a keyword; the respective keyword sets are obtained in this way. To improve system performance further, word segmentation and keyword extraction for the questions of all question-answer pairs can be carried out in advance. The word segmentation tool and the keyword extraction model give the tokens and keywords of the user input text and of the questions in the question-answer pairs. The word vector of each word in the user input text and in the questions is then looked up in the word2vec word vector set.
The sentence vectors of the user input text and of the questions in the question-answer pairs are then calculated, in both cases as: 0.8 * (average of the word vectors of all tokens) + 0.2 * (average of the keyword word vectors). The average of a set of vectors is obtained by adding the vectors dimension by dimension and dividing by the number of vectors. Because the tokens already include the keywords, this calculation gives the keywords extra weight, since keywords are more representative of the meaning of the text. The weights 0.8 and 0.2 were determined through repeated experiments. Since the word vectors all have 300 dimensions, the sentence vectors of the user input and of the questions in the candidate question-answer pairs also have 300 dimensions.
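The weighted average can be written out directly. The sketch below follows the 0.8/0.2 formula from the text; the function names, the 2-dimensional toy vectors, and the fallback used when no keyword has a vector are assumptions, since the text does not say how that case is handled.

```python
def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sentence_vector(tokens, keywords, word_vectors, w_all=0.8, w_key=0.2):
    """Compute w_all * mean(token vectors) + w_key * mean(keyword vectors)."""
    token_vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not token_vecs:
        return None  # no known words in the sentence
    key_vecs = [word_vectors[k] for k in keywords if k in word_vectors]
    all_mean = mean_vector(token_vecs)
    # Assumption: without usable keywords, fall back to the plain token average.
    key_mean = mean_vector(key_vecs) if key_vecs else all_mean
    return [w_all * a + w_key * k for a, k in zip(all_mean, key_mean)]

# Toy 2-dimensional word vectors (the text uses 300 dimensions).
toy_vectors = {"i": [1.0, 0.0], "tired": [0.0, 1.0]}
vec = sentence_vector(["i", "tired"], ["tired"], toy_vectors)
print(vec)
```

With these toy vectors the token average is [0.5, 0.5] and the keyword average is [0.0, 1.0], so the sentence vector is pulled toward the keyword, to approximately [0.4, 0.6].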
Specifically, the word segmentation tool is the Chinese language processing toolkit hanlp. The training corpus is segmented with a word segmentation tool, and the tool selected by the present invention is hanlp (Han Language Processing). hanlp is an open-source, free Chinese language processing package, a Java toolkit composed of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation, and other functions, and also offers lexical analysis, syntactic analysis, and semantic understanding. hanlp is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora, and customizability. The present invention therefore selects hanlp as the word segmentation tool.
After the sentence vectors of the user input text and of the questions in the question-answer pairs are obtained, the similarity between the sentence vectors is calculated by cosine similarity, whose value range is stated as [0, 1]. The cosine between the two vectors is calculated; the closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two sentence vectors are. Cosine calculation is fast and simple and can improve system performance.
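A minimal sketch of the cosine calculation on dense vectors follows; the function name and the zero-vector handling are assumptions. Note that for arbitrary real-valued vectors the cosine lies in [-1, 1]; it falls in [0, 1] only when the vectors never point in opposing directions, so an implementation may want to clip negative values if it relies on the stated range.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart
diagonal = cosine_similarity([1.0, 1.0], [1.0, 0.0])    # 45 degrees apart
print(same, orthogonal, diagonal)
```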
The present invention is most importantly the information in the problem for inputting text and candidate question and answer pair by user included to similar Degree is modified, this information mainly includes text sentence pattern, name entity and personal pronoun, and present invention primarily contemplates ground for name entity Name and organization names etc..These information do not consider in the sentence vector similarity in described, it is therefore desirable to using this information into Row amendment.
From the point of view of every-day language experience and experimental result, similarity is modified in conjunction with following three kinds of situations:
In the first case, the similarity is corrected according to the sentence pattern of the text.
When the text entered by the user is a "yes-no question", for example "Did you go to the Temple of Heaven yesterday?", or an "A-not-A question", for example "Did you or did you not go to the Temple of Heaven?", it usually differs considerably in meaning from a question of the "declarative sentence" type in a question-answer pair. That is, if the short text entered by the user is a yes-no question or an A-not-A question and the question in the candidate question-answer pair is a declarative sentence, the computed similarity needs to be reduced. Likewise, if the user input is a declarative sentence and the question in the candidate question-answer pair is a yes-no question or an A-not-A question, the computed similarity needs to be reduced. (The specific reduction ratio should be determined experimentally from the question-answer corpus used by the system; based on the existing corpus and experiments, the present invention suggests reducing the similarity by about 30%.)
The same applies to "affirmative sentences" and "negative sentences". In the implementation, the system determines the sentence type using linguistic rules and sentence templates.
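As a non-limiting illustration, the sentence-pattern correction may be sketched as follows. The two regular-expression templates, the type labels, and the function names are hypothetical; an actual system would use a fuller set of linguistic rules and sentence templates. The 30% default is the ratio suggested above. The Chinese example strings are renderings of the Temple of Heaven examples.

```python
import re

# Hypothetical templates, for illustration only.
A_NOT_A = re.compile(r"(.)[不没]\1")   # e.g. 有没有, 是不是, 去不去
YES_NO = re.compile(r"吗[?？]?$")      # sentence-final particle 吗

QUESTION_TYPES = {"yes_no", "a_not_a"}

def sentence_type(text):
    """Classify a sentence with simple template rules."""
    text = text.strip()
    if A_NOT_A.search(text):
        return "a_not_a"
    if YES_NO.search(text):
        return "yes_no"
    if text.endswith(("?", "？")):
        return "other_question"
    return "declarative"

def correct_by_sentence_type(similarity, user_text, candidate_question,
                             reduction=0.30):
    """Reduce similarity when one side is a yes-no or A-not-A question
    and the other side is a declarative sentence."""
    t_user = sentence_type(user_text)
    t_cand = sentence_type(candidate_question)
    mismatch = (
        (t_user in QUESTION_TYPES and t_cand == "declarative")
        or (t_user == "declarative" and t_cand in QUESTION_TYPES)
    )
    return similarity * (1.0 - reduction) if mismatch else similarity

# "Did you go to the Temple of Heaven yesterday?" vs.
# "I went to the Temple of Heaven yesterday"
print(correct_by_sentence_type(0.9, "你昨天去天坛了吗?", "我昨天去了天坛"))
```

Passing the reduction ratio as a parameter reflects the statement above that the ratio should be tuned on the corpus actually used by the system.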
In the second case, the similarity is corrected according to named entities.
If the user input text and the question in the candidate question-answer pair each contain one named entity of the same type (for example, each contains a place name, or each contains an organization name), but the two place names differ and neither contains the other (Beijing and Haidian District, for example, are in a containment relationship), the computed similarity needs to be reduced. (The specific reduction ratio should be determined experimentally from the corpus used by the system; based on the existing corpus and experiments, the present invention suggests reducing the similarity by about 50%.) In the implementation, to bound the scope of the problem and keep the system efficient, the present invention uses a dictionary method: the dictionary file contains the main prefecture-level cities of China, and the place names are mutually independent, with no containment relationship between them. This prevents the computed semantic similarity from being too high merely because place names are correlated in the training corpus. For example, in "What is good to eat in Beijing?" and "What is good to eat in Shanghai?", "Beijing" and "Shanghai" often occur together in the training corpus, so their word vectors are highly correlated, yet the two sentences differ greatly in meaning. The same reasoning applies to organization names.
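As a non-limiting illustration, the dictionary-based place-name correction may be sketched as follows, assuming both texts have already been segmented into tokens. The dictionary contents, the function name, and the token lists are hypothetical; the 50% default is the ratio suggested above.

```python
# Hypothetical dictionary of place names with no containment relationship
# between entries: Beijing, Shanghai, Guangzhou, Shenzhen.
PLACE_NAMES = {"北京", "上海", "广州", "深圳"}

def correct_by_place_name(similarity, user_tokens, candidate_tokens,
                          reduction=0.50):
    """Reduce similarity when each side contains exactly one place name
    from the dictionary and the two place names differ."""
    user_places = {t for t in user_tokens if t in PLACE_NAMES}
    cand_places = {t for t in candidate_tokens if t in PLACE_NAMES}
    if (len(user_places) == 1 and len(cand_places) == 1
            and user_places != cand_places):
        return similarity * (1.0 - reduction)
    return similarity

# "What is good to eat in Beijing?" vs. "What is good to eat in Shanghai?"
user = ["北京", "有", "什么", "好吃", "的"]
candidate = ["上海", "有", "什么", "好吃", "的"]
print(correct_by_place_name(0.8, user, candidate))
```

Because the dictionary entries do not contain one another, a simple inequality test suffices here; supporting containment (such as Beijing and Haidian District) would require an additional lookup that this sketch omits.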
In the third case, the similarity is corrected according to personal pronouns.
If the user input text and the question in the candidate question-answer pair each contain a pronoun, for example the user input text is "I am going to the Temple of Heaven today" and the question in the candidate question-answer pair is "He is going to the Temple of Heaven today", the personal pronouns in the two sentences differ, and the computed similarity needs to be reduced. (The specific reduction ratio should be determined experimentally from the corpus used by the system; based on the existing corpus and experiments, the present invention suggests reducing the similarity by about 50%.) The implementation again uses a dictionary method; the dictionary file contains common pronouns.
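As a non-limiting illustration, the pronoun correction may be sketched in the same dictionary-based manner, again assuming pre-segmented tokens. The pronoun list, the function name, and the token lists are hypothetical; the 50% default is the ratio suggested above.

```python
# Hypothetical dictionary of common personal pronouns:
# I, you, he, she, we, you (plural), they.
PRONOUNS = {"我", "你", "他", "她", "我们", "你们", "他们"}

def correct_by_pronoun(similarity, user_tokens, candidate_tokens,
                       reduction=0.50):
    """Reduce similarity when each side contains exactly one pronoun
    from the dictionary and the two pronouns differ."""
    user_pronouns = {t for t in user_tokens if t in PRONOUNS}
    cand_pronouns = {t for t in candidate_tokens if t in PRONOUNS}
    if (len(user_pronouns) == 1 and len(cand_pronouns) == 1
            and user_pronouns != cand_pronouns):
        return similarity * (1.0 - reduction)
    return similarity

# "I am going to the Temple of Heaven today" vs.
# "He is going to the Temple of Heaven today"
user_tokens = ["我", "今天", "去", "天坛", "玩"]
candidate_tokens = ["他", "今天", "去", "天坛", "玩"]
print(correct_by_pronoun(0.9, user_tokens, candidate_tokens))
```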
It should be noted that, in addition to correcting the similarity in the three ways described above, the present invention may also judge the semantics of the text by other means and further correct the similarity.
Through the above method, the present invention improves the accuracy of the semantic similarity computed between short texts in an interactive system, makes fuller use of limited question-answer pair data, and improves the user experience of the interactive system.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the claims and the description of the present invention.

Claims (9)

1. A method for calculating similarity between short texts, characterized by comprising:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a training corpus;
Step S2: obtaining a keyword extraction model from the training corpus, segmenting the training corpus into words with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining the user input text and the question of a candidate question-answer pair, segmenting the question of the candidate question-answer pair and the user input text separately with the word segmentation tool, and extracting keywords from the question of the candidate question-answer pair and from the user input text separately with the keyword extraction model, to obtain the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the word segmentation result and keyword extraction result of the user input text;
Step S4: according to the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, obtaining the word vectors of the question of the candidate question-answer pair from the word vector set, and according to the word segmentation result and keyword extraction result of the user input text, obtaining the word vectors of the user input text from the word vector set;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from the word vectors of the question of the candidate question-answer pair, and calculating the sentence vector of the user input text from the word vectors of the user input text;
Step S6: calculating the similarity between the two sentence vectors according to the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: on the basis of the similarity between the sentence vectors, correcting the similarity using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
2. The method for calculating similarity between short texts according to claim 1, characterized in that
The information contained in the user input text and in the question of the candidate question-answer pair comprises the sentence pattern of the text, named entities, and personal pronouns, the named entities including place names and organization names.
3. The method for calculating similarity between short texts according to claim 1, characterized in that
In step S2, obtaining the keyword extraction model comprises:
Step S21: obtaining a keyword training corpus and segmenting the keyword training corpus into words to obtain a word segmentation result;
Step S22: marking the keywords in the word segmentation result by manual annotation to obtain a manually annotated keyword training corpus;
Step S23: training with the maximum entropy method on the manually annotated keyword training corpus to obtain the keyword extraction model.
4. The method for calculating similarity between short texts according to claim 1, characterized in that
The keyword extraction model is a binary classifier.
5. The method for calculating similarity between short texts according to claim 1, characterized in that
The word vectors of the question of the candidate question-answer pair and the word vectors of the user input text have the same dimension.
6. The method for calculating similarity between short texts according to claim 2, characterized in that
The named entities and personal pronouns are obtained by a dictionary method.
7. The method for calculating similarity between short texts according to claim 1, characterized in that
The corpus data is obtained by web crawler technology.
8. The method for calculating similarity between short texts according to claim 1, characterized in that
The similarity between the sentence vectors is calculated by the cosine similarity method.
9. The method for calculating similarity between short texts according to claim 1, characterized in that
The word segmentation tool is the Chinese language processing toolkit HanLP.
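As a non-limiting illustration of steps S4 and S5 of claim 1, a sentence vector may be computed from the word vectors of the extracted keywords roughly as follows. The claims in this section do not state how the word vectors are combined, so the element-wise mean used here is an assumption; the function name and the toy word vector table are likewise hypothetical.

```python
def sentence_vector(keywords, word_vectors, dim):
    """Look up each keyword in the word vector set (step S4) and combine
    the vectors into one sentence vector (step S5).

    Assumption: the combination is an element-wise mean, and keywords
    missing from the word vector set are skipped.
    """
    vectors = [word_vectors[w] for w in keywords if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy word vector set with dimension 2, for illustration only.
toy_vectors = {
    "天坛": [1.0, 0.0],   # Temple of Heaven
    "玩": [0.0, 1.0],     # to visit for fun
}
print(sentence_vector(["天坛", "玩", "未知词"], toy_vectors, dim=2))
```

Because every word vector in the set has the same dimension, as claim 5 requires, the resulting sentence vectors of the two texts can be compared directly in step S6.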
CN201610920608.5A 2016-10-21 2016-10-21 Similarity calculating method between a kind of short text Active CN106484664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610920608.5A CN106484664B (en) 2016-10-21 2016-10-21 Similarity calculating method between a kind of short text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610920608.5A CN106484664B (en) 2016-10-21 2016-10-21 Similarity calculating method between a kind of short text

Publications (2)

Publication Number Publication Date
CN106484664A CN106484664A (en) 2017-03-08
CN106484664B true CN106484664B (en) 2019-03-01

Family

ID=58271016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610920608.5A Active CN106484664B (en) 2016-10-21 2016-10-21 Similarity calculating method between a kind of short text

Country Status (1)

Country Link
CN (1) CN106484664B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2833271A1 (en) * 2012-05-14 2015-02-04 Huawei Technologies Co., Ltd Multimedia question and answer system and method
CN105095444A (en) * 2015-07-24 2015-11-25 百度在线网络技术(北京)有限公司 Information acquisition method and device
CN105426354A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Sentence vector fusion method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218356B (en) * 2012-01-18 2017-12-08 深圳市世纪光速信息技术有限公司 A kind of enquirement quality judging method and system towards open platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant