CN106484664A - Method for calculating similarity between short texts - Google Patents
- Publication number
- CN106484664A (application CN201610920608.5A)
- Authority
- CN
- China
- Prior art keywords
- question
- user input
- similarity
- candidate
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for calculating similarity between short texts. Corpus data is obtained and preprocessed to yield a training corpus. From the training corpus a keyword extraction model is obtained, the corpus is segmented with a word segmentation tool, and a word vector set is trained with word2vec. The user input text and the question of a candidate question-answer pair are obtained, and a segmentation result and a keyword extraction result are produced for each. Based on these results, the word vectors of the candidate question and of the user input text are obtained from the word vector set, a sentence vector is calculated from the word vectors of each text, and the similarity between the two sentence vectors is calculated. The similarity is then corrected using information contained in the user input text and in the candidate question, giving a corrected similarity. The invention calculates the cosine similarity between the sentence vectors of the user input and of the candidate question, and corrects it according to sentence pattern, named entities, and pronouns.
Description
Technical field
The present invention relates to the field of Internet technology, and more particularly to the field of intelligent human-machine dialogue.
Background art
As human society becomes increasingly information-driven and the cost of manual service keeps rising, people increasingly wish to communicate with computers in natural language. Intelligent human-machine chat systems emerged against this background.
Existing human-machine dialogue systems are implemented in two main ways: retrieval models and generative models. A retrieval model treats one round of dialogue as one information retrieval process. A certain number of question-answer pairs (each consisting of one question and several answers) are prepared in advance, and an index is built over their questions. When the user inputs one or several sentences, the input is treated as a query: the question semantically closest to it is found among all candidate question-answer pairs, and the answer to that question is returned to the user, completing one round of dialogue. The key to obtaining an appropriate answer is therefore finding the question that is semantically most similar to the user's input. Because both the user input and the candidate questions in a dialogue system are usually short texts of one or a few short sentences, the problem reduces to calculating similarity between short texts.
In the prior art, similarity between short texts is calculated by converting the user input and the question of each candidate question-answer pair into sentence vectors of the same dimension, where each dimension holds the TF*IDF value of a word (also called a token) of the user input or of the candidate question. The similarity between the two is then measured, for example by cosine similarity, and used to rank all candidate question-answer pairs. This method is common in search engines. However, finding the most similar question by the cosine similarity of TF*IDF vectors considers only the textual similarity between sentences, that is, it judges similarity by how many tokens literally repeat. This is clearly insufficient. For example, "I am very tired" and "I want to sleep" mean much the same but share almost no words, and the method cannot handle this case. In addition, because dialogue systems mostly use short sentences, TF is almost always 1 and contributes little, which further weakens the method.
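The limitation described above can be reproduced with a minimal Python sketch. It builds TF*IDF vectors for three whitespace-tokenized sentences and compares them by cosine similarity. The English tokenization, the smoothed IDF formula, and the third sentence ("I am very hungry") are assumptions added for illustration; they are not part of the prior-art method as described.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: tf*idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) so terms found in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    ["I", "am", "very", "tired"],
    ["I", "want", "to", "sleep"],
    ["I", "am", "very", "hungry"],  # hypothetical extra sentence for contrast
]
vecs = tfidf_vectors(docs)
sim_tired_sleep = cosine(vecs[0], vecs[1])
sim_tired_hungry = cosine(vecs[0], vecs[2])
print(sim_tired_sleep, sim_tired_hungry)
```

Under these assumptions, "I am very tired" scores higher against "I am very hungry" than against "I want to sleep", because the score rewards only shared tokens.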
The defect of the prior art is therefore that calculating the cosine similarity between the user input and a candidate question from the TF*IDF values of their tokens considers only the textual similarity between sentences. Judging sentence similarity solely by the number of literally repeated tokens makes the judgment very inaccurate, which directly causes the dialogue system to return inaccurate replies to the user.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for calculating similarity between short texts. The user input and the question of a candidate question-answer pair are segmented and their keywords extracted; the corresponding word vectors are obtained; sentence vectors are calculated from the word vectors; the similarity between the two sentence vectors is calculated; and finally the similarity is corrected according to sentence pattern, named entities, and pronouns. This makes the similarity more accurate and in turn improves the accuracy of replies to the user in a human-machine dialogue system.
To solve the above technical problem, the present invention provides the following technical solution:
The present invention provides a method for calculating similarity between short texts, comprising:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a training corpus;
Step S2: obtaining a keyword extraction model according to the training corpus, segmenting the training corpus with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining a user input text and the question of a candidate question-answer pair; segmenting the question of the candidate question-answer pair and the user input text respectively with the word segmentation tool; extracting keywords from each with the keyword extraction model; and thereby obtaining the segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the segmentation result and keyword extraction result of the user input text;
Step S4: obtaining the word vectors of the question of the candidate question-answer pair from the word vector set according to its segmentation result and keyword extraction result, and obtaining the word vectors of the user input text from the word vector set according to its segmentation result and keyword extraction result;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from its word vectors, and calculating the sentence vector of the user input text from its word vectors;
Step S6: calculating the similarity between the two sentence vectors, namely the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: correcting the similarity between the sentence vectors using the information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
In the technical solution of the method for calculating similarity between short texts according to the present invention, corpus data is first obtained and preprocessed to obtain a training corpus. According to the training corpus, a keyword extraction model is obtained, the training corpus is segmented with a word segmentation tool, and a word vector set is obtained by training with word2vec. The user input text and the question of a candidate question-answer pair are then obtained; each is segmented with the word segmentation tool and has its keywords extracted by the keyword extraction model, giving the segmentation result and keyword extraction result of the candidate question and of the user input text. Next, according to the segmentation result and keyword extraction result of the candidate question, its word vectors are obtained from the word vector set, and the word vectors of the user input text are obtained in the same way. The sentence vector of the candidate question is calculated from its word vectors, and the sentence vector of the user input text is calculated from its word vectors. The similarity between the two sentence vectors is then calculated. Finally, this similarity is corrected using the information contained in the user input text and in the candidate question, giving the corrected similarity.
The method thus segments the user input text and the question of the candidate question-answer pair and extracts their keywords, obtains the word vectors of both from the tokens and keywords, calculates a sentence vector for each from its word vectors, and calculates the cosine similarity between the two sentence vectors. The similarity is then corrected using the information contained in the user input text and in the candidate question, giving a more accurate similarity and making the dialogue system's replies to the user more accurate.
Further, the information contained in the user input text and in the question of the candidate question-answer pair comprises the sentence pattern of the text, named entities, and personal pronouns; the named entities include place names and organization names.
Calculating the similarity between the sentence vectors of the user input text and of the candidate question gives an accurate result in most cases. However, when sentence pattern, named entities, or personal pronouns matter, relying only on the similarity between the two sentence vectors to judge whether the texts are semantically similar is still not accurate enough, so the similarity must be corrected. The present invention therefore also analyzes the information in the user input text and in the candidate question, namely the sentence pattern, named entities, and personal pronouns in the text, and uses it to correct the similarity further, thereby improving the accuracy with which the human-machine dialogue system answers the user's questions.
Further, in step S2, obtaining the keyword extraction model comprises:
Step S21: obtaining a keyword training corpus and segmenting it to obtain a segmentation result;
Step S22: manually labeling the keywords in the segmentation result to obtain a manually labeled keyword training corpus;
Step S23: obtaining the keyword extraction model by maximum entropy training on the manually labeled keyword training corpus.
The keyword extraction model extracts the keywords among the tokens; that is, the tokens include the keywords. Because keywords better represent the meaning of a text, combining the extracted keywords with the tokens gives a more accurate similarity than using the tokens alone. To train the keyword extraction model, a keyword training corpus is first obtained; this corpus may differ from the corpus used to train the word vectors. Keywords among the tokens are then labeled manually, and a model is trained by the maximum entropy method. For any new, unlabeled input text, the model automatically outputs which tokens are keywords and which are not, yielding a keyword set that helps improve the similarity between sentence vectors.
Further, the keyword extraction model is a binary classifier. With this classifier it is possible to predict which words in a sentence are keywords, which improves the accuracy of keyword extraction.
Further, the word vector set is obtained by training a word2vec model. The word2vec training tool is a neural network model. In the word vectors it produces, words that frequently occur together are closer in meaning; that is, the semantic information of the word vectors is captured from word co-occurrence. With word vectors trained by word2vec, combined with keyword information, more accurate sentence vectors can be calculated, which makes the similarity more accurate.
Further, the word vectors of the question of the candidate question-answer pair and the word vectors of the user input text have the same dimension. Equal dimensions make it straightforward to calculate the mean of the token word vectors and the mean of the keyword word vectors, then the sentence vector of the candidate question and the sentence vector of the user input text, and finally the similarity between the two sentence vectors. The word vectors obtained must therefore all have the same dimension.
Further, the named entities and personal pronouns are obtained by a dictionary method. The similarity can be corrected using named entities and personal pronouns; among named entities, the present invention mainly considers place names and organization names. Named-entity correction addresses the case where a difference in place name alone makes two sentences semantically dissimilar. For example, "What is good to eat in Beijing" and "What is good to eat in Tianjin" differ only in the place name, so their sentence vectors are very similar even though the sentences ask about different places. The similarity therefore needs to be corrected using named entities: whether the place names or organization names appearing in the two sentences are the same can be used directly to judge their similarity, which improves the similarity judgment between sentences. The present invention therefore uses a dictionary method. The dictionary file contains the major prefecture-level cities of China, and the place names are mutually independent, with no inclusion relation between them.
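The dictionary method can be illustrated with a short Python sketch. The three-entry `PLACE_DICT`, the `penalty` factor of 0.5, and the choice to apply the penalty only when the two sentences share no place name are all assumptions made for illustration; the text above gives neither the dictionary contents nor a reduction ratio for named entities.

```python
# Hypothetical miniature place-name dictionary; a real file would list
# the major prefecture-level cities, with no entry contained in another.
PLACE_DICT = {"北京", "天津", "上海"}

def find_places(tokens, place_dict=PLACE_DICT):
    """Return the set of tokens that appear in the place-name dictionary."""
    return {token for token in tokens if token in place_dict}

def correct_for_places(similarity, user_tokens, candidate_tokens, penalty=0.5):
    """Scale the similarity down when both sentences name places but share none.

    `penalty` is a placeholder value, not a figure from the patent.
    """
    user_places = find_places(user_tokens)
    candidate_places = find_places(candidate_tokens)
    if user_places and candidate_places and user_places.isdisjoint(candidate_places):
        return similarity * penalty
    return similarity

beijing = ["北京", "有", "什么", "好吃", "的"]
tianjin = ["天津", "有", "什么", "好吃", "的"]
print(correct_for_places(0.9, beijing, tianjin))  # different places: reduced
print(correct_for_places(0.9, beijing, beijing))  # same place: unchanged
```

Because the dictionary entries are mutually independent, the sketch needs no separate check for inclusion relations between place names.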
Further, the corpus data is obtained by web crawler technology. A crawler is a program that automatically retrieves web page content, and it is used here to collect a large corpus. Posts and replies are crawled from discussion boards, question-and-answer communities, forums, microblogs, encyclopedias, news sites, and the like, in particular content that is fairly long, semantically rich, and at the same time colloquial, and are used as the training corpus so that the corpus is comprehensive and rich. The choice of corpus affects the quality of the trained models and ultimately the similarity.
Further, the similarity between the sentence vectors is calculated by cosine similarity. The cosine of the angle between the two sentence vectors is calculated; the closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two sentence vectors are. The cosine calculation is fast and simple and can improve system performance.
Further, the word segmentation tool is the Chinese processing toolkit HanLP. The training corpus is segmented with a word segmentation tool, and the tool selected by the present invention is HanLP (Han Language Processing). HanLP is a free, open-source Chinese language processing package, a Java toolkit composed of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation, and other functions, as well as lexical analysis, syntactic analysis, and semantic understanding. HanLP is feature-complete, efficient, clearly structured, trained on up-to-date corpora, and customizable. The present invention therefore selects HanLP as the word segmentation tool.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below.
Fig. 1 shows a flow chart of a method for calculating similarity between short texts provided by the first embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the technical solution of the present invention are described in detail below with reference to the drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples only and must not be used to limit the scope of protection of the present invention.
Embodiment 1
Fig. 1 shows a flow chart of a method for calculating similarity between short texts provided by the first embodiment of the present invention. As shown in Fig. 1, the method according to the first embodiment comprises:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a training corpus;
Step S2: obtaining a keyword extraction model according to the training corpus, segmenting the training corpus with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining a user input text and the question of a candidate question-answer pair; segmenting the question of the candidate question-answer pair and the user input text respectively with the word segmentation tool; extracting keywords from each with the keyword extraction model; and thereby obtaining the segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the segmentation result and keyword extraction result of the user input text;
Step S4: calculating the word vectors of the question of the candidate question-answer pair from the word vector set according to its segmentation result and keyword extraction result, and calculating the word vectors of the user input text from the word vector set according to its segmentation result and keyword extraction result;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from its word vectors, and calculating the sentence vector of the user input text from its word vectors;
Step S6: calculating the similarity between the two sentence vectors, namely the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: correcting the similarity between the sentence vectors using the information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
In the technical solution of the method for calculating similarity between short texts according to the present invention, corpus data is first obtained and preprocessed to obtain a training corpus. According to the training corpus, a keyword extraction model is obtained, the training corpus is segmented with a word segmentation tool, and a word vector set is obtained by training with word2vec. The user input text and the question of a candidate question-answer pair are obtained; each is segmented with the word segmentation tool and has its keywords extracted by the keyword extraction model, giving the segmentation result and keyword extraction result of the candidate question and of the user input text. The sentence vector of the candidate question is then calculated from its word vectors, and the sentence vector of the user input text is calculated from its word vectors. The similarity between the two sentence vectors is calculated. Finally, this similarity is corrected using the information contained in the user input text and in the candidate question, giving the corrected similarity.
The method thus segments the user input text and the question of the candidate question-answer pair and extracts their keywords, obtains the word vectors of both from the tokens, the keywords, and the word vector set, calculates a sentence vector for each from its word vectors, and calculates the cosine similarity between the two sentence vectors. The similarity is then corrected using the information contained in the user input text and in the candidate question, giving a more accurate similarity and making the dialogue system's replies to the user more accurate.
Specifically, the corpus data is obtained by web crawler technology. A crawler is a program that automatically retrieves web page content, and it is used here to collect a large corpus. Posts and replies are crawled from discussion boards, question-and-answer communities, forums, microblogs, encyclopedias, news sites, and the like, in particular content that is fairly long, semantically rich, and at the same time colloquial, and are used as the training corpus so that the corpus is comprehensive and rich. The choice of corpus affects the quality of the trained models; it therefore directly affects word segmentation, keyword extraction, and the word vector set, and ultimately the similarity.
The crawled corpus data must also be preprocessed to obtain the training corpus. This mainly filters out non-Chinese content, pornographic content, advertisements, and the like. After that, multiple paragraphs belonging to the same content are joined into one line, traditional Chinese characters are converted to simplified characters, word segmentation is performed, and punctuation marks are removed and replaced with spaces.
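The preprocessing steps above can be sketched in Python roughly as follows. The function name `preprocess`, the 0.5 Chinese-character ratio, the keyword `blocklist` standing in for the pornography and advertisement filter, and the pluggable `to_simplified` hook are all assumptions for illustration; word segmentation is left to a separate tool and is not shown.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
PUNCTUATION = re.compile(r"[^\w\s]|_")

def preprocess(paragraphs, blocklist=("广告",), min_chinese_ratio=0.5,
               to_simplified=lambda text: text):
    """Clean one crawled post; return a single line, or None if it is filtered out."""
    # Join the paragraphs of the same post into one line.
    text = " ".join(p.strip() for p in paragraphs if p.strip())
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return None
    # Drop posts that are mostly non-Chinese (threshold is an assumption).
    ratio = sum(1 for ch in visible if CJK_CHAR.match(ch)) / len(visible)
    if ratio < min_chinese_ratio:
        return None
    # Drop posts containing blocked terms (stand-in for the content filter).
    if any(term in text for term in blocklist):
        return None
    # Traditional-to-simplified conversion is delegated to a caller-supplied
    # function; the default leaves the text unchanged.
    text = to_simplified(text)
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())

print(preprocess(["北京有什么好吃的？", "我想去天坛。"]))
```

In practice the conversion hook would wrap a dedicated traditional-to-simplified converter, and the cleaned line would then be passed to the word segmentation tool.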
After the corpus data has been preprocessed, the word vector of each word is obtained by training a word2vec model. The word2vec training tool is a neural network model, and the semantic information of the word vectors it produces is captured from word co-occurrence. With the word vector set trained by word2vec, combined with keyword information, more accurate sentence vectors can be calculated, which makes the similarity more accurate.
The keyword extraction model is obtained from the training corpus. The training corpus is first segmented, then the keywords in each sentence are labeled manually (unlabeled tokens are non-keywords), and a binary classifier is trained by maximum entropy. After segmentation, the user input text and the question of the candidate question-answer pair are fed to the keyword extraction model, which performs a binary classification on each token to predict whether it is a keyword, yielding the keyword set of each text. To further improve system performance, segmentation and keyword extraction for the questions of all question-answer pairs can be carried out in advance. The word segmentation tool and the keyword extraction model thus give the tokens and keywords of the user input text and of the question in the question-answer pair, and the word2vec word vector set then gives the word vector of each word in both.
The sentence vectors of the user input text and of the question in the question-answer pair are then calculated, both by the same formula: 0.8 * (mean of the word vectors of all tokens) + 0.2 * (mean of the keyword word vectors). The mean of a set of vectors is obtained by adding the values in each dimension and dividing by the number of vectors. Because the tokens already include the keywords, this calculation gives the keywords extra weight, since keywords better represent the meaning of a text. The weights 0.8 and 0.2 were determined through repeated experiments. Because the word vectors are all 300-dimensional, the sentence vectors of the user input and of the candidate question are also 300-dimensional.
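The weighted average described above can be written as a short Python sketch. The function names, the two-dimensional toy vectors, the skipping of out-of-vocabulary words, and the fallback used when a text has no keyword vector are assumptions for illustration; only the 0.8 and 0.2 weights come from the text.

```python
def mean_vector(vectors):
    """Element-wise mean: sum each dimension, then divide by the number of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def sentence_vector(tokens, keywords, word_vectors,
                    token_weight=0.8, keyword_weight=0.2):
    """Return 0.8 * mean(token vectors) + 0.2 * mean(keyword vectors)."""
    # Assumption: words missing from the word vector set are skipped.
    token_vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not token_vecs:
        raise ValueError("no token has a word vector")
    keyword_vecs = [word_vectors[k] for k in keywords if k in word_vectors]
    token_mean = mean_vector(token_vecs)
    # Assumption: with no usable keyword, the token mean stands in for the keyword mean.
    keyword_mean = mean_vector(keyword_vecs) if keyword_vecs else token_mean
    return [token_weight * a + keyword_weight * b
            for a, b in zip(token_mean, keyword_mean)]

# Toy 2-dimensional vectors; the embodiment uses 300 dimensions.
toy_vectors = {"北京": [1.0, 0.0], "好吃": [0.0, 1.0]}
vec = sentence_vector(["北京", "好吃"], ["北京"], toy_vectors)
print(vec)
```

Since "北京" is counted both as a token and as a keyword, the resulting vector leans toward its direction, which is the extra weighting of keywords that the formula is meant to provide.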
Specifically, the word segmentation tool is the Chinese processing toolkit HanLP. The training corpus is segmented with a word segmentation tool, and the tool selected by the present invention is HanLP (Han Language Processing). HanLP is a free, open-source Chinese language processing package, a Java toolkit composed of a series of models and algorithms. It provides Chinese word segmentation, keyword extraction, index segmentation, and other functions, as well as lexical analysis, syntactic analysis, and semantic understanding. HanLP is feature-complete, efficient, clearly structured, trained on up-to-date corpora, and customizable. The present invention therefore selects HanLP as the word segmentation tool.
After the sentence vectors of the user input text and of the question in the question-answer pair have been obtained, the similarity between them is calculated by cosine similarity, whose value range is taken here to be [0, 1]. The cosine of the angle between the two vectors is calculated; the closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two sentence vectors are. The cosine calculation is fast and simple and can improve system performance.
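For reference, the cosine calculation can be sketched in Python as below. The function name and the choice to return 0.0 for a zero vector are assumptions. Note that for arbitrary real vectors the cosine lies in [-1, 1]; the sketch does not clamp the result.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as dissimilar to everything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```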
Most importantly, the present invention corrects the similarity using the information contained in the user input text and in the question of the candidate question-answer pair. This information mainly comprises the sentence pattern of the text, named entities, and personal pronouns; for named entities, the invention mainly considers place names and organization names. The sentence vector similarity described above does not take this information into account, so the information must be used to correct it.
From the point of view of every-day language experience and experimental result, in conjunction with following three kinds of situations, similarity is modified:
Case 1: the similarity is corrected according to sentence pattern.
When the user input is a yes-no question, for example "Did you go to the Temple of Heaven yesterday?", or an A-not-A question, for example "Have you been to the Temple of Heaven or not?", it usually differs considerably in meaning from a question-answer pair whose question is a declarative sentence. That is, if the user's short text is a yes-no or A-not-A question and the question of the candidate question-answer pair is a declarative sentence, the similarity must be further reduced; likewise, if the user input is a declarative sentence and the question of the candidate question-answer pair is a yes-no or A-not-A question, the similarity must be reduced. (The exact reduction ratio must be determined experimentally from the question-answer corpus the system uses; based on existing corpora and experiments, the present invention suggests reducing the similarity by about 30%.) Declarative and negative sentences are handled similarly. In the implementation, the system determines the sentence pattern using linguistic rules and sentence templates.
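The sentence-pattern correction can be sketched as below. The sketch assumes the sentence pattern of each text has already been determined (the text says by linguistic rules and sentence templates, which are not reproduced here). The label strings, the constant names and the reading of "reduce by about 30%" as multiplying the similarity by 0.7 are assumptions made for illustration.

```python
# Hypothetical sentence-pattern labels; the rule-based detection is not shown.
MISMATCHED_PATTERNS = {
    frozenset(("yes_no", "declarative")),
    frozenset(("a_not_a", "declarative")),
    frozenset(("declarative", "negative")),
}

SENTENCE_PATTERN_PENALTY = 0.30  # suggested value; to be tuned on the corpus


def correct_for_sentence_pattern(similarity: float,
                                 user_pattern: str,
                                 candidate_pattern: str,
                                 penalty: float = SENTENCE_PATTERN_PENALTY) -> float:
    """Reduce the similarity when the two sentence patterns conflict."""
    if frozenset((user_pattern, candidate_pattern)) in MISMATCHED_PATTERNS:
        return similarity * (1.0 - penalty)
    return similarity
```

Storing each conflicting pair as a `frozenset` makes the check symmetric, so it does not matter which of the two texts is the user input and which is the candidate question.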
Case 2: the similarity is corrected according to named entities.
If the user input text and the question of the candidate question-answer pair each contain a named entity of the same type (for example, both contain a place name, or both contain an organization name), but the place names differ and neither contains the other (Beijing and Haidian District, for instance, are in a containment relation), the similarity must be reduced. (The exact reduction ratio must be determined experimentally from the corpus the system uses; based on existing corpora and experiments, the present invention suggests reducing the similarity by about 50%.) In the implementation, to bound the problem being solved and keep the system efficient, the present invention uses a dictionary method: the dictionary file contains the main prefecture-level cities of China, and the place names are mutually independent, with no containment relation between them. This avoids an overly high semantic similarity caused by place names being correlated in the corpus. For example, in "What is good to eat in Beijing?" and "What is good to eat in Shanghai?", "Beijing" and "Shanghai" often occur together in the corpus, so their word vectors are highly correlated, yet the two sentences differ greatly in meaning. The same reasoning applies to organization names.
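A minimal sketch of the dictionary-based place-name correction, operating on already segmented words. The three-entry dictionary, the English tokens, the function names and the reading of "reduce by about 50%" as multiplying by 0.5 are assumptions for illustration; a real dictionary file would list the prefecture-level cities mentioned in the text. The sketch also treats the place names as "different" only when the two texts share no place name at all, which is one possible interpretation.

```python
from typing import Iterable, Set

# Hypothetical stand-in for the place-name dictionary file.
PLACE_NAMES: Set[str] = {"Beijing", "Shanghai", "Guangzhou"}

PLACE_NAME_PENALTY = 0.50  # suggested value; to be tuned on the corpus


def find_place_names(words: Iterable[str],
                     dictionary: Set[str] = PLACE_NAMES) -> Set[str]:
    """Return the segmented words that appear in the place-name dictionary."""
    return {w for w in words if w in dictionary}


def correct_for_place_names(similarity: float,
                            user_words: Iterable[str],
                            candidate_words: Iterable[str],
                            penalty: float = PLACE_NAME_PENALTY) -> float:
    """Reduce the similarity when both texts name places but share none."""
    user_places = find_place_names(user_words)
    candidate_places = find_place_names(candidate_words)
    if user_places and candidate_places and user_places.isdisjoint(candidate_places):
        return similarity * (1.0 - penalty)
    return similarity
```

Because the dictionary entries are mutually independent by construction, no containment check is needed here. The same pattern would apply to organization names with a second dictionary.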
Case 3: the similarity is corrected according to personal pronouns.
If the user input text and the question of the candidate question-answer pair each contain a personal pronoun, for example the user input text is "I am going to visit the Temple of Heaven today" and the question of the candidate question-answer pair is "He is going to visit the Temple of Heaven today", the personal pronouns in the two sentences differ and the similarity must be further reduced. (The exact reduction ratio must be determined experimentally from the corpus the system uses; based on existing corpora and experiments, the present invention suggests reducing the similarity by about 50%.) The implementation again uses a dictionary method; the dictionary file contains common pronouns.
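The pronoun correction follows the same dictionary pattern and can be sketched as below. The pronoun list, the English tokens, the function name and the multiplicative reading of "reduce by about 50%" are assumptions for illustration only. Here the pronouns count as different whenever the two sets of pronouns are not identical.

```python
from typing import Iterable, Set

# Hypothetical stand-in for the dictionary file of common pronouns.
PERSONAL_PRONOUNS: Set[str] = {"I", "you", "he", "she", "we", "they"}

PRONOUN_PENALTY = 0.50  # suggested value; to be tuned on the corpus


def correct_for_pronouns(similarity: float,
                         user_words: Iterable[str],
                         candidate_words: Iterable[str],
                         penalty: float = PRONOUN_PENALTY) -> float:
    """Reduce the similarity when both texts use pronouns and they differ."""
    user_pronouns = {w for w in user_words if w in PERSONAL_PRONOUNS}
    candidate_pronouns = {w for w in candidate_words if w in PERSONAL_PRONOUNS}
    if user_pronouns and candidate_pronouns and user_pronouns != candidate_pronouns:
        return similarity * (1.0 - penalty)
    return similarity
```

For the example in the text, the segmented inputs contain "I" and "he" respectively, so a similarity of 0.9 would drop to 0.45 under the suggested ratio.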
It should be noted that, in addition to the three correction methods above, the present invention may also judge the semantics of the text by other means to further correct the similarity.
With the above method, the present invention can accurately calculate the semantic similarity between short texts in an interactive system, make fuller use of limited question-answer data, and improve the user experience of the interactive system.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the claims and the description of the present invention.
Claims (10)
1. A method for calculating similarity between short texts, characterized by comprising:
Step S1: obtaining corpus data and preprocessing the corpus data to obtain a corpus;
Step S2: obtaining a keyword extraction model according to the corpus, segmenting the corpus with a word segmentation tool, and training with word2vec to obtain a word vector set;
Step S3: obtaining a user input text and the question of a candidate question-answer pair; segmenting the question of the candidate question-answer pair and the user input text separately with the word segmentation tool, and extracting keywords from each separately with the keyword extraction model, to obtain the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and the word segmentation result and keyword extraction result of the user input text;
Step S4: obtaining word vectors from the word vector set according to the word segmentation result and keyword extraction result of the question of the candidate question-answer pair, and obtaining the word vectors of the user input text from the word vector set according to the word segmentation result of the user input text and the keyword extraction result of the question of the candidate question-answer pair;
Step S5: calculating the sentence vector of the question of the candidate question-answer pair from the word vectors of that question, and calculating the sentence vector of the user input text from the word vectors of the user input text;
Step S6: calculating the similarity between the two sentence vectors from the sentence vector of the question of the candidate question-answer pair and the sentence vector of the user input text;
Step S7: on the basis of the similarity between the sentence vectors, correcting the similarity using information contained in the user input text and in the question of the candidate question-answer pair, to obtain a corrected similarity.
2. The method for calculating similarity between short texts according to claim 1, characterized in that
the information contained in the user input text and in the question of the candidate question-answer pair is the sentence pattern of the text, named entities and personal pronouns, the named entities including place names and organization names.
3. The method for calculating similarity between short texts according to claim 1, characterized in that
in step S2, obtaining the keyword extraction model comprises:
Step S21: obtaining a keyword corpus and segmenting the keyword corpus to obtain a word segmentation result;
Step S22: marking the keywords in the word segmentation result by manual annotation, to obtain a manually annotated keyword corpus;
Step S23: obtaining the keyword extraction model by maximum entropy training on the manually annotated keyword corpus.
4. The method for calculating similarity between short texts according to claim 1, characterized in that
the keyword extraction model is a binary classifier.
5. The method for calculating similarity between short texts according to claim 1, characterized in that
the word vector set is obtained by training a word2vec model.
6. The method for calculating similarity between short texts according to claim 1, characterized in that
the word vectors of the question of the candidate question-answer pair have the same dimension as the word vectors of the user input text.
7. The method for calculating similarity between short texts according to claim 2, characterized in that
the named entities and personal pronouns are obtained by a dictionary method.
8. The method for calculating similarity between short texts according to claim 1, characterized in that
the corpus data is obtained by web crawler technology.
9. The method for calculating similarity between short texts according to claim 1, characterized in that
the similarity between the sentence vectors is calculated by cosine similarity.
10. The method for calculating similarity between short texts according to claim 1, characterized in that
the word segmentation tool is the Chinese language processing toolkit hanlp.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610920608.5A CN106484664B (en) | 2016-10-21 | 2016-10-21 | Similarity calculating method between a kind of short text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484664A (en) | 2017-03-08 |
CN106484664B (en) | 2019-03-01 |
Family
ID=58271016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610920608.5A Active CN106484664B (en) | 2016-10-21 | 2016-10-21 | Similarity calculating method between a kind of short text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484664B (en) |
Cited By (61)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776559A (en) * | 2016-12-14 | 2017-05-31 | 东软集团股份有限公司 | The method and device of text semantic Similarity Measure |
CN107066621A (en) * | 2017-05-11 | 2017-08-18 | 腾讯科技(深圳)有限公司 | A kind of search method of similar video, device and storage medium |
CN107229753A (en) * | 2017-06-29 | 2017-10-03 | 济南浪潮高新科技投资发展有限公司 | A kind of article classification of countries method based on word2vec models |
CN107329949A (en) * | 2017-05-24 | 2017-11-07 | 北京捷通华声科技股份有限公司 | A kind of semantic matching method and system |
CN107391614A (en) * | 2017-07-04 | 2017-11-24 | 重庆智慧思特大数据有限公司 | A kind of Chinese question and answer matching process based on WMD |
CN107577708A (en) * | 2017-07-31 | 2018-01-12 | 北京北信源软件股份有限公司 | Class base construction method and system based on SparkMLlib document classifications |
CN107577658A (en) * | 2017-07-18 | 2018-01-12 | 阿里巴巴集团控股有限公司 | Term vector processing method, device and electronic equipment |
CN107688604A (en) * | 2017-07-26 | 2018-02-13 | 阿里巴巴集团控股有限公司 | Data answering processing method, device and server |
CN107729322A (en) * | 2017-11-06 | 2018-02-23 | 广州杰赛科技股份有限公司 | Segmenting method and device, establish sentence vector generation model method and device |
CN107980130A (en) * | 2017-11-02 | 2018-05-01 | 深圳前海达闼云端智能科技有限公司 | It is automatic to answer method, apparatus, storage medium and electronic equipment |
CN108305057A (en) * | 2018-01-22 | 2018-07-20 | 平安科技(深圳)有限公司 | Dispensing apparatus, method and the computer readable storage medium of electronics red packet |
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN108388559A (en) * | 2018-02-26 | 2018-08-10 | 中译语通科技股份有限公司 | Name entity recognition method and system, computer program of the geographical space under |
CN108427735A (en) * | 2018-02-28 | 2018-08-21 | 东华大学 | Clinical knowledge map construction method based on electronic health record |
CN108664465A (en) * | 2018-03-07 | 2018-10-16 | 珍岛信息技术(上海)股份有限公司 | One kind automatically generating text method and relevant apparatus |
CN108920604A (en) * | 2018-06-27 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Voice interactive method and equipment |
CN108932066A (en) * | 2018-06-13 | 2018-12-04 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and the computer storage medium of input method acquisition expression packet |
CN109062977A (en) * | 2018-06-29 | 2018-12-21 | 厦门快商通信息技术有限公司 | A kind of automatic question answering text matching technique, automatic question-answering method and system based on semantic similarity |
CN109086303A (en) * | 2018-06-21 | 2018-12-25 | 深圳壹账通智能科技有限公司 | The Intelligent dialogue method, apparatus understood, terminal are read based on machine |
CN109241240A (en) * | 2018-08-17 | 2019-01-18 | 国家电网有限公司客户服务中心 | Power failure repairing information automatically forwarding method |
CN109522394A (en) * | 2018-10-12 | 2019-03-26 | 北京奔影网络科技有限公司 | Knowledge base question and answer system and method for building up |
CN109582966A (en) * | 2018-12-03 | 2019-04-05 | 北京容联易通信息技术有限公司 | A kind of information matching method and device |
CN109739956A (en) * | 2018-11-08 | 2019-05-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, device, equipment and medium |
CN109815996A (en) * | 2019-01-07 | 2019-05-28 | 北京首钢自动化信息技术有限公司 | It is a kind of based on the scene of Recognition with Recurrent Neural Network from adaptation method and device |
CN109871437A (en) * | 2018-11-30 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Method and device for the processing of customer problem sentence |
CN109902159A (en) * | 2019-01-29 | 2019-06-18 | 华融融通(北京)科技有限公司 | A kind of intelligent O&M statement similarity matching process based on natural language processing |
CN110020189A (en) * | 2018-06-29 | 2019-07-16 | 武汉掌游科技有限公司 | A kind of article recommended method based on Chinese Similarity measures |
CN110135551A (en) * | 2019-05-15 | 2019-08-16 | 西南交通大学 | A kind of robot chat method of word-based vector sum Recognition with Recurrent Neural Network |
CN110245219A (en) * | 2019-04-25 | 2019-09-17 | 义语智能科技(广州)有限公司 | A kind of answering method and equipment based on automatic extension Q & A database |
CN110275946A (en) * | 2019-05-14 | 2019-09-24 | 闽江学院 | A kind of FAQ automatic question-answering method and device |
CN110287295A (en) * | 2019-05-14 | 2019-09-27 | 闽江学院 | Question and answer robot construction method and system based on small routine |
CN110309278A (en) * | 2019-05-23 | 2019-10-08 | 泰康保险集团股份有限公司 | Keyword retrieval method, apparatus, medium and electronic equipment |
WO2019200923A1 (en) * | 2018-04-19 | 2019-10-24 | 京东方科技集团股份有限公司 | Pinyin-based semantic recognition method and device and human-machine conversation system |
CN110543636A (en) * | 2019-09-06 | 2019-12-06 | 出门问问(武汉)信息科技有限公司 | training data selection method of dialogue system |
CN110597966A (en) * | 2018-05-23 | 2019-12-20 | 北京国双科技有限公司 | Automatic question answering method and device |
CN110674273A (en) * | 2019-09-17 | 2020-01-10 | 安徽信息工程学院 | Intelligent question-answering robot training method for word segmentation |
CN110727769A (en) * | 2018-06-29 | 2020-01-24 | 优视科技(中国)有限公司 | Corpus generation method and device, and man-machine interaction processing method and device |
CN110866095A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text similarity determination method and related equipment |
CN110889285A (en) * | 2018-08-16 | 2020-03-17 | 阿里巴巴集团控股有限公司 | Method, apparatus, device and medium for determining core word |
CN111046147A (en) * | 2018-10-11 | 2020-04-21 | 马上消费金融股份有限公司 | Question answering method and device and terminal equipment |
CN111144112A (en) * | 2019-12-30 | 2020-05-12 | 广州广电运通信息科技有限公司 | Text similarity analysis method and device and storage medium |
CN111191465A (en) * | 2018-10-25 | 2020-05-22 | 中国移动通信有限公司研究院 | Question-answer matching method, device, equipment and storage medium |
CN111209373A (en) * | 2020-01-07 | 2020-05-29 | 北京启明星辰信息安全技术有限公司 | Sensitive text recognition method and device based on natural semantics |
CN111241239A (en) * | 2020-01-07 | 2020-06-05 | 科大讯飞股份有限公司 | Method for detecting repeated questions, related device and readable storage medium |
CN111401042A (en) * | 2020-03-26 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Method and system for training text key content extraction model |
CN111428486A (en) * | 2019-01-08 | 2020-07-17 | 北京沃东天骏信息技术有限公司 | Article information data processing method, apparatus, medium, and electronic device |
CN111460081A (en) * | 2020-03-30 | 2020-07-28 | 招商局金融科技有限公司 | Answer generation method based on deep learning, electronic device and readable storage medium |
CN111460783A (en) * | 2020-03-30 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
CN112084310A (en) * | 2019-06-12 | 2020-12-15 | 阿里巴巴集团控股有限公司 | Reply information generation and automatic reply method and device |
CN112182193A (en) * | 2020-10-19 | 2021-01-05 | 山东旗帜信息有限公司 | Log obtaining method, device and medium in traffic industry |
CN112257410A (en) * | 2020-10-15 | 2021-01-22 | 江苏卓易信息科技股份有限公司 | Similarity calculation method for unbalanced text |
CN112507097A (en) * | 2020-12-17 | 2021-03-16 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112559658A (en) * | 2020-12-08 | 2021-03-26 | 中国科学技术大学 | Address matching method and device |
CN112836062A (en) * | 2021-01-13 | 2021-05-25 | 哈尔滨工程大学 | Relation extraction method of text corpus |
CN112883165A (en) * | 2021-03-16 | 2021-06-01 | 山东亿云信息技术有限公司 | Intelligent full-text retrieval method and system based on semantic understanding |
CN112988970A (en) * | 2021-03-11 | 2021-06-18 | 浙江康旭科技有限公司 | Text matching algorithm serving intelligent question-answering system |
CN113240485A (en) * | 2021-05-10 | 2021-08-10 | 北京沃东天骏信息技术有限公司 | Training method of text generation model, and text generation method and device |
CN113343708A (en) * | 2021-06-11 | 2021-09-03 | 北京声智科技有限公司 | Method and device for realizing statement generalization based on semantics |
CN114936277A (en) * | 2022-01-28 | 2022-08-23 | 中国银联股份有限公司 | Similarity problem matching method and user similarity problem matching system |
CN116932726A (en) * | 2023-08-04 | 2023-10-24 | 重庆邮电大学 | Open domain dialogue generation method based on controllable multi-space feature decoupling |
CN118520929A (en) * | 2024-07-25 | 2024-08-20 | 国家计算机网络与信息安全管理中心 | Training method of text similarity determination model and text similarity calculation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013107345A1 (en) * | 2012-01-18 | 2013-07-25 | 腾讯科技(深圳)有限公司 | User question processing method and system |
EP2833271A1 (en) * | 2012-05-14 | 2015-02-04 | Huawei Technologies Co., Ltd | Multimedia question and answer system and method |
CN105095444A (en) * | 2015-07-24 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Information acquisition method and device |
CN105426354A (en) * | 2015-10-29 | 2016-03-23 | 杭州九言科技股份有限公司 | Sentence vector fusion method and apparatus |
Cited By (84)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776559A (en) * | 2016-12-14 | 2017-05-31 | 东软集团股份有限公司 | The method and device of text semantic Similarity Measure |
CN106776559B (en) * | 2016-12-14 | 2020-08-11 | 东软集团股份有限公司 | Text semantic similarity calculation method and device |
CN107066621A (en) * | 2017-05-11 | 2017-08-18 | 腾讯科技(深圳)有限公司 | A kind of search method of similar video, device and storage medium |
CN107329949A (en) * | 2017-05-24 | 2017-11-07 | 北京捷通华声科技股份有限公司 | A kind of semantic matching method and system |
CN107229753A (en) * | 2017-06-29 | 2017-10-03 | 济南浪潮高新科技投资发展有限公司 | A kind of article classification of countries method based on word2vec models |
CN107391614A (en) * | 2017-07-04 | 2017-11-24 | 重庆智慧思特大数据有限公司 | A kind of Chinese question and answer matching process based on WMD |
CN107577658A (en) * | 2017-07-18 | 2018-01-12 | 阿里巴巴集团控股有限公司 | Term vector processing method, device and electronic equipment |
CN107688604A (en) * | 2017-07-26 | 2018-02-13 | 阿里巴巴集团控股有限公司 | Data answering processing method, device and server |
CN107577708A (en) * | 2017-07-31 | 2018-01-12 | 北京北信源软件股份有限公司 | Class base construction method and system based on SparkMLlib document classifications |
CN107980130A (en) * | 2017-11-02 | 2018-05-01 | 深圳前海达闼云端智能科技有限公司 | It is automatic to answer method, apparatus, storage medium and electronic equipment |
WO2019084867A1 (en) * | 2017-11-02 | 2019-05-09 | 深圳前海达闼云端智能科技有限公司 | Automatic answering method and apparatus, storage medium, and electronic device |
CN107729322A (en) * | 2017-11-06 | 2018-02-23 | 广州杰赛科技股份有限公司 | Segmenting method and device, establish sentence vector generation model method and device |
CN108305057A (en) * | 2018-01-22 | 2018-07-20 | 平安科技(深圳)有限公司 | Dispensing apparatus, method and the computer readable storage medium of electronics red packet |
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN108388559A (en) * | 2018-02-26 | 2018-08-10 | 中译语通科技股份有限公司 | Name entity recognition method and system, computer program of the geographical space under |
CN108427735A (en) * | 2018-02-28 | 2018-08-21 | 东华大学 | Clinical knowledge map construction method based on electronic health record |
CN108664465A (en) * | 2018-03-07 | 2018-10-16 | 珍岛信息技术(上海)股份有限公司 | One kind automatically generating text method and relevant apparatus |
CN108664465B (en) * | 2018-03-07 | 2023-06-27 | 珍岛信息技术(上海)股份有限公司 | Method and related device for automatically generating text |
WO2019200923A1 (en) * | 2018-04-19 | 2019-10-24 | 京东方科技集团股份有限公司 | Pinyin-based semantic recognition method and device and human-machine conversation system |
US11100921B2 (en) | 2018-04-19 | 2021-08-24 | Boe Technology Group Co., Ltd. | Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog |
CN110597966A (en) * | 2018-05-23 | 2019-12-20 | 北京国双科技有限公司 | Automatic question answering method and device |
CN108932066A (en) * | 2018-06-13 | 2018-12-04 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and the computer storage medium of input method acquisition expression packet |
CN109086303A (en) * | 2018-06-21 | 2018-12-25 | 深圳壹账通智能科技有限公司 | The Intelligent dialogue method, apparatus understood, terminal are read based on machine |
US10984793B2 (en) | 2018-06-27 | 2021-04-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice interaction method and device |
CN108920604A (en) * | 2018-06-27 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Voice interactive method and equipment |
CN110727769B (en) * | 2018-06-29 | 2024-04-19 | 阿里巴巴(中国)有限公司 | Corpus generation method and device and man-machine interaction processing method and device |
CN110020189A (en) * | 2018-06-29 | 2019-07-16 | 武汉掌游科技有限公司 | A kind of article recommended method based on Chinese Similarity measures |
CN109062977A (en) * | 2018-06-29 | 2018-12-21 | 厦门快商通信息技术有限公司 | A kind of automatic question answering text matching technique, automatic question-answering method and system based on semantic similarity |
CN110727769A (en) * | 2018-06-29 | 2020-01-24 | 优视科技(中国)有限公司 | Corpus generation method and device, and man-machine interaction processing method and device |
CN110889285B (en) * | 2018-08-16 | 2023-06-16 | 阿里巴巴集团控股有限公司 | Method, device, equipment and medium for determining core word |
CN110889285A (en) * | 2018-08-16 | 2020-03-17 | 阿里巴巴集团控股有限公司 | Method, apparatus, device and medium for determining core word |
CN109241240A (en) * | 2018-08-17 | 2019-01-18 | 国家电网有限公司客户服务中心 | Power failure repairing information automatically forwarding method |
CN111046147A (en) * | 2018-10-11 | 2020-04-21 | 马上消费金融股份有限公司 | Question answering method and device and terminal equipment |
CN109522394A (en) * | 2018-10-12 | 2019-03-26 | 北京奔影网络科技有限公司 | Knowledge base question and answer system and method for building up |
CN111191465A (en) * | 2018-10-25 | 2020-05-22 | 中国移动通信有限公司研究院 | Question-answer matching method, device, equipment and storage medium |
CN111191465B (en) * | 2018-10-25 | 2023-05-09 | 中国移动通信有限公司研究院 | Question-answer matching method, device, equipment and storage medium |
CN109739956B (en) * | 2018-11-08 | 2020-04-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, apparatus, device and medium |
CN109739956A (en) * | 2018-11-08 | 2019-05-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, device, equipment and medium |
CN109871437A (en) * | 2018-11-30 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Method and device for the processing of customer problem sentence |
CN109871437B (en) * | 2018-11-30 | 2023-04-21 | 阿里巴巴集团控股有限公司 | Method and device for processing user problem statement |
CN109582966A (en) * | 2018-12-03 | 2019-04-05 | 北京容联易通信息技术有限公司 | A kind of information matching method and device |
CN109815996A (en) * | 2019-01-07 | 2019-05-28 | 北京首钢自动化信息技术有限公司 | It is a kind of based on the scene of Recognition with Recurrent Neural Network from adaptation method and device |
CN109815996B (en) * | 2019-01-07 | 2021-05-04 | 北京首钢自动化信息技术有限公司 | Scene self-adaptation method and device based on recurrent neural network |
CN111428486A (en) * | 2019-01-08 | 2020-07-17 | 北京沃东天骏信息技术有限公司 | Article information data processing method, apparatus, medium, and electronic device |
CN111428486B (en) * | 2019-01-08 | 2023-06-23 | 北京沃东天骏信息技术有限公司 | Article information data processing method, device, medium and electronic equipment |
CN109902159A (en) * | 2019-01-29 | 2019-06-18 | 华融融通(北京)科技有限公司 | A kind of intelligent O&M statement similarity matching process based on natural language processing |
CN110245219A (en) * | 2019-04-25 | 2019-09-17 | 义语智能科技(广州)有限公司 | A kind of answering method and equipment based on automatic extension Q & A database |
CN110275946A (en) * | 2019-05-14 | 2019-09-24 | 闽江学院 | A kind of FAQ automatic question-answering method and device |
CN110287295A (en) * | 2019-05-14 | 2019-09-27 | 闽江学院 | Question and answer robot construction method and system based on small routine |
CN110135551A (en) * | 2019-05-15 | 2019-08-16 | 西南交通大学 | A kind of robot chat method of word-based vector sum Recognition with Recurrent Neural Network |
CN110309278B (en) * | 2019-05-23 | 2021-11-16 | 泰康保险集团股份有限公司 | Keyword retrieval method, device, medium and electronic equipment |
CN110309278A (en) * | 2019-05-23 | 2019-10-08 | 泰康保险集团股份有限公司 | Keyword retrieval method, apparatus, medium and electronic equipment |
CN112084310A (en) * | 2019-06-12 | 2020-12-15 | 阿里巴巴集团控股有限公司 | Reply information generation and automatic reply method and device |
CN110543636A (en) * | 2019-09-06 | 2019-12-06 | 出门问问(武汉)信息科技有限公司 | training data selection method of dialogue system |
CN110674273A (en) * | 2019-09-17 | 2020-01-10 | 安徽信息工程学院 | Intelligent question-answering robot training method for word segmentation |
CN110866095A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text similarity determination method and related equipment |
CN111144112A (en) * | 2019-12-30 | 2020-05-12 | 广州广电运通信息科技有限公司 | Text similarity analysis method and device and storage medium |
CN111241239A (en) * | 2020-01-07 | 2020-06-05 | 科大讯飞股份有限公司 | Method for detecting repeated questions, related device and readable storage medium |
CN111241239B (en) * | 2020-01-07 | 2022-12-02 | 科大讯飞股份有限公司 | Method for detecting repeated questions, related device and readable storage medium |
CN111209373A (en) * | 2020-01-07 | 2020-05-29 | 北京启明星辰信息安全技术有限公司 | Sensitive text recognition method and device based on natural semantics |
CN111401042B (en) * | 2020-03-26 | 2023-04-14 | 支付宝(杭州)信息技术有限公司 | Method and system for training text key content extraction model |
CN111401042A (en) * | 2020-03-26 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Method and system for training text key content extraction model |
CN111460783B (en) * | 2020-03-30 | 2021-07-27 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
CN111460081B (en) * | 2020-03-30 | 2023-04-07 | 招商局金融科技有限公司 | Answer generation method based on deep learning, electronic device and readable storage medium |
CN111460783A (en) * | 2020-03-30 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
CN111460081A (en) * | 2020-03-30 | 2020-07-28 | 招商局金融科技有限公司 | Answer generation method based on deep learning, electronic device and readable storage medium |
CN112257410A (en) * | 2020-10-15 | 2021-01-22 | 江苏卓易信息科技股份有限公司 | Similarity calculation method for unbalanced text |
CN112182193B (en) * | 2020-10-19 | 2023-01-13 | 山东旗帜信息有限公司 | Log obtaining method, device and medium in traffic industry |
CN112182193A (en) * | 2020-10-19 | 2021-01-05 | 山东旗帜信息有限公司 | Log obtaining method, device and medium in traffic industry |
CN112559658B (en) * | 2020-12-08 | 2022-12-30 | 中国科学技术大学 | Address matching method and device |
CN112559658A (en) * | 2020-12-08 | 2021-03-26 | 中国科学技术大学 | Address matching method and device |
CN112507097B (en) * | 2020-12-17 | 2022-11-18 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112507097A (en) * | 2020-12-17 | 2021-03-16 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112836062A (en) * | 2021-01-13 | 2021-05-25 | 哈尔滨工程大学 | Relation extraction method of text corpus |
CN112836062B (en) * | 2021-01-13 | 2022-05-13 | 哈尔滨工程大学 | Relation extraction method of text corpus |
CN112988970A (en) * | 2021-03-11 | 2021-06-18 | 浙江康旭科技有限公司 | Text matching algorithm serving intelligent question-answering system |
CN112883165A (en) * | 2021-03-16 | 2021-06-01 | 山东亿云信息技术有限公司 | Intelligent full-text retrieval method and system based on semantic understanding |
CN113240485A (en) * | 2021-05-10 | 2021-08-10 | 北京沃东天骏信息技术有限公司 | Training method of text generation model, and text generation method and device |
CN113240485B (en) * | 2021-05-10 | 2024-09-20 | 北京沃东天骏信息技术有限公司 | Training method of text generation model, text generation method and device |
CN113343708A (en) * | 2021-06-11 | 2021-09-03 | 北京声智科技有限公司 | Method and device for realizing statement generalization based on semantics |
CN114936277A (en) * | 2022-01-28 | 2022-08-23 | 中国银联股份有限公司 | Similarity problem matching method and user similarity problem matching system |
CN116932726A (en) * | 2023-08-04 | 2023-10-24 | 重庆邮电大学 | Open domain dialogue generation method based on controllable multi-space feature decoupling |
CN116932726B (en) * | 2023-08-04 | 2024-05-10 | 重庆邮电大学 | Open domain dialogue generation method based on controllable multi-space feature decoupling |
CN118520929A (en) * | 2024-07-25 | 2024-08-20 | 国家计算机网络与信息安全管理中心 | Training method of text similarity determination model and text similarity calculation method |
Also Published As
Publication number | Publication date |
---|---|
CN106484664B (en) | 2019-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106484664B (en) | A method for calculating similarity between short texts | |
CN108287922B (en) | Text data viewpoint abstract mining method fusing topic attributes and emotional information | |
CN104765769B (en) | A short-text query expansion and retrieval method based on word vectors | |
CN105528437B (en) | A question answering system construction method based on structured text knowledge extraction | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN107818164A (en) | An intelligent question answering method and system | |
CN106354710A (en) | Neural network relation extraction method | |
CN103207860B (en) | Entity relation extraction method and apparatus for public opinion events | |
CN104008166B (en) | Dialogue short text clustering method based on form and semantic similarity | |
CN107305539A (en) | A text sentiment orientation analysis method based on Word2Vec discovery of new online sentiment words | |
CN107025299B (en) | A financial public opinion perception method based on a weighted LDA topic model | |
CN110020189A (en) | An article recommendation method based on Chinese similarity calculation | |
CN107133214A (en) | A method for mining product demand preference features from review information and evaluating their quality | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
DE112013004082T5 (en) | Emotion entity search system for microblogs | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN107862087A (en) | Sentiment analysis method, apparatus and storage medium based on big data and deep learning | |
CN103631859A (en) | Intelligent review expert recommending method for science and technology projects | |
CN108681574A (en) | A non-factoid question answer selection method and system based on text summarization | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN106446147A (en) | Sentiment analysis method based on structured features | |
CN111460158B (en) | Microblog topic public emotion prediction method based on emotion analysis | |
CN112100365A (en) | Two-stage text summarization method | |
CN105930509A (en) | Method and system for automatic extraction and refinement of domain concept based on statistics and template matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||