CN103455535B - The method building knowledge base based on historical consultation data - Google Patents
- Publication number
- CN103455535B CN103455535B CN201310168964.2A CN201310168964A CN103455535B CN 103455535 B CN103455535 B CN 103455535B CN 201310168964 A CN201310168964 A CN 201310168964A CN 103455535 B CN103455535 B CN 103455535B
- Authority
- CN
- China
- Prior art keywords
- answer
- sentence
- similarity
- question
- knowledge base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active - Reinstated
Abstract
The present invention discloses a method of building a knowledge base from historical consultation data. The knowledge base is constructed automatically by segmenting the historical consultation data into consultation scenes, extracting the question-answer pairs of each scene, computing answer similarity, filtering out answers whose similar-answer frequency is low, extracting the questions corresponding to high-frequency answers, extracting the core keyword sequence rules of the question set, and storing the knowledge. By building the knowledge base automatically from historical consultation data, the present invention reduces the manual workload of constructing the knowledge base and maintaining a near-synonym lexicon.
Description
Technical field
The present invention relates to the field of computer knowledge bases and question-answering systems, and in particular to a method of building a knowledge base from historical consultation data.
Background art
In an automatic question-answering system, the knowledge base is the principal data source and plays a very important role in the system as a whole: a high-quality knowledge base can greatly improve both the efficiency and the accuracy of the question-answering system.
At present, knowledge bases are generally constructed in one of two ways in the industry.

The first is manual construction. Because each industry domain is specialised, much of the knowledge is built entirely by hand, usually by technical staff with expertise in that particular industry. Manual construction is inefficient, of uneven quality, and difficult to maintain.

The second uses semantic matching to build the knowledge base. Question-answering systems typically realise this with HowNet or a near-synonym table, but both HowNet and near-synonym tables are compiled manually, which takes an enormous amount of work and still yields narrow coverage.
Suppose the knowledge base of an automatic question-answering system contains the following many-to-one FAQ entries, each comprising natural-language sentences entered by users (hereinafter "questions") and a system response (hereinafter the "answer"), as in the example below, a question-answer group common in Taobao's apparel business:

Question: Will these jeans fade or not?

Question: Will this baby fade or not?

Question: They really won't fade?

Question: How many washes before the jeans fade?

Question: Do the jeans you sell fade?

Answer: They won't fade, dear.

When the user enters "How many washes before the jeans fade?", the system finds this FAQ group and returns its answer to the user. But when the user enters "How many washes before this baby fades?", technical staff must first manually record, in HowNet or the near-synonym table, that "baby" (e-commerce sellers commonly call a product a "baby") is associated with "jeans" and that "fades" is associated with "fade". Only then can the system return the answer to the user accurately; otherwise the system cannot compute the real answer. Staff must not only maintain these near-synonym associations but also collect every one of the n phrasings that correspond to the answer "They won't fade, dear." This approach is unacceptable, whether judged by workload or by system effectiveness.
Summary of the invention
The object of the present invention is to provide a method of building a knowledge base from historical consultation data, solving the inefficiency of existing knowledge-base construction.
To achieve this goal, the present invention provides a method of building a knowledge base from historical consultation data, comprising the following steps:
1) reading the historical consultation data;
2) segmenting the data into consultation scenes;
3) extracting the question-answer pairs of each scene;
4) computing answer similarity;
5) filtering out answers whose similar-answer frequency is low;
6) extracting the questions corresponding to high-frequency answers;
7) extracting the core keyword sequence rules of the question set;
8) storing the knowledge.
Wherein, in said step 2), the data are cut into scenes by consultant, yielding a plurality of consultation scenes each between a single customer-service agent and a single consultant.
Wherein, in said step 3), the question-answer pairs are extracted according to the identities of the customer-service agent and the consultant: content spoken by the agent is taken as the answer, and content spoken by the consultant is taken as the question.
Wherein, in said step 4), computing answer similarity means computing the similarity values among the answers of the question-answer pairs of all scenes: each answer is first segmented into words, stop words are then filtered out, and the similarity between every pair of answers is finally computed.
Wherein, the similarity of said answers comprises word similarity, sentence-length similarity and word-order similarity, related by

SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y),

where SentenceSim(X, Y) denotes the similarity of answer X and answer Y, WordSim(X, Y) the similarity between the words of answer X and the words of answer Y, LenSim(X, Y) the similarity between the sentence lengths of answer X and answer Y, and OrderSim(X, Y) the similarity between the word orders of answer X and answer Y; λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1.
Wherein, the formula for said WordSim(X, Y) is:

the formula for said LenSim(X, Y) is:

and the formula for said word-order similarity is:

where SameWC(X, Y) denotes the number of words answer X and answer Y have in common, Len(X) and Len(Y) denote the lengths of answer X and answer Y respectively, abs denotes taking the absolute value, Onews(X, Y) denotes the set of words that occur in both answer X and answer Y and occur exactly once in each, and Reword(X, Y) denotes the number of inversions among the adjacent words of that set.
Wherein, in said step 5), the similar-answer frequency refers to the proportion each answer accounts for in the whole historical consultation data; low-frequency answers are then filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
Wherein, said step 6) looks up, for each high-frequency answer, the questions corresponding to that answer.
Wherein, said step 7) uses statistics: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that obey a sequence rule are extracted.
Wherein, the content stored in said step 8) comprises the answers, the questions and the core keyword sequence rules, and the relations among the answers, the questions and the core keyword sequence rules are many-to-many.
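The eight steps above can be sketched end to end. The sketch below is a toy illustration that assumes the question-answer pairs have already been read and extracted (steps 1-3); exact-match answer similarity and whitespace segmentation stand in for the weighted sentence similarity and word segmentation the method actually uses, and all names are illustrative rather than the patent's.

```python
# Toy end-to-end sketch of steps 4)-8), using exact-match answer
# "similarity" and whitespace "segmentation" as simplifying stand-ins.
from collections import Counter

def build_knowledge_base(pairs, R=2, r2=1):
    """pairs: (question, answer) tuples extracted from consultation scenes."""
    freq = Counter(a for _, a in pairs)                    # steps 4)-5)
    knowledge = []
    for answer, count in freq.items():
        if count < R:                                      # drop low-frequency answers
            continue
        questions = [q for q, a in pairs if a == answer]   # step 6)
        words = Counter(w for q in questions for w in q.split())
        seq = [w for w in questions[0].split() if words[w] > r2]  # step 7)
        knowledge.append({"answer": answer,                # step 8)
                          "questions": questions,
                          "keyword_sequence": seq})
    return knowledge

pairs = [("will these jeans fade", "won't fade, dear"),
         ("will this baby fade", "won't fade, dear"),
         ("do you ship today", "ships at 5 p.m.")]
print(build_knowledge_base(pairs))
```

On this input, only the answer occurring twice survives the frequency filter, and the shared keyword sequence of its two questions is extracted.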
Beneficial effects of the present invention: based on historical consultation data, the present invention can construct a knowledge base quickly, and the knowledge base contains not only many-to-many FAQ entries but also the core keyword sequence sets. Given historical consultation data as a premise, this construction method can replace a traditional HowNet or near-synonym table and save a great deal of manual maintenance, letting technical staff construct knowledge bases rapidly.
For a further understanding of the features and technical content of the invention, refer to the detailed description and accompanying drawings below; the drawings are provided for reference and explanation only and are not intended to limit the invention.
Brief description of the drawings
The technical solution of the present invention and its other beneficial effects will become apparent from the detailed description of specific embodiments below, taken together with the accompanying drawings.
In the drawings,
Fig. 1 is a flow diagram of the present invention;
Fig. 2 shows the scene format of the historical consultation data of the present invention.
Detailed description of the invention
The technical means adopted by the present invention and their effects are further described in detail below with reference to preferred embodiments of the invention and the accompanying drawings.
The implementation environment of the present invention takes Taobao shopping consultations as a case study, with the consultation data between buyer and seller as the data source for building the question-answering knowledge base.
Referring to Fig. 1, an embodiment of the present invention provides a method of building a knowledge base from historical consultation data. The method analyses the shopping chat history, extracts the questions and responses of both parties, and stores them as FAQ entries. Similarity computation is applied to the answers in the FAQ: when a similarity value reaches a threshold the answers are deemed similar, and the similar-answer frequency is accumulated. When the similar-answer frequency reaches a threshold, the set of questions corresponding to the answer is extracted. Finally the questions are segmented into words and the high-frequency keyword strings are extracted. The method is described in detail below with reference to Fig. 1.
Step 101: start.
Step 102: read the historical consultation data.
Historical consultation data are typically provided by the customer-service system and are usually read through an API or a file import.
Step 103: segment the data into consultation scenes.
Historical consultation data typically interleave multiple customer-service agents with multiple consultants, so the data must be cut into scenes by consultant. A scene here is one complete dialogue between a single agent and a single consultant; in Fig. 2, for example, the full dialogue between the buyer "abc" and the women's-wear agent is one scene.
Step 104: extract the questions and answers in each scene.
Each scene typically contains multiple questions and answers, so the question-answer pairs must be extracted according to the identities of the agent and the consultant: content spoken by the agent is usually the answer, and content spoken by the consultant is the question.
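Steps 103 and 104 can be sketched as follows. The record format (role, speaker, partner, text) is an assumption for illustration, not the patent's log format.

```python
# Sketch of steps 103-104: cut a chat log into scenes (one per
# agent/consultant pair) and pair each consultant question with the
# agent answer that follows it.
from collections import OrderedDict

def cut_scenes(log):
    """Group messages into scenes keyed by (agent, consultant)."""
    scenes = OrderedDict()
    for role, name, partner, text in log:
        key = (name, partner) if role == "agent" else (partner, name)
        scenes.setdefault(key, []).append((role, text))
    return scenes

def extract_qa_pairs(scene):
    """Consultant utterances become questions, agent replies become answers."""
    pairs, question = [], None
    for role, text in scene:
        if role == "consultant":
            question = text
        elif role == "agent" and question is not None:
            pairs.append((question, text))
            question = None
    return pairs

log = [
    ("consultant", "abc", "womens_wear", "Will these jeans fade?"),
    ("agent", "womens_wear", "abc", "They won't fade, dear."),
]
print(extract_qa_pairs(next(iter(cut_scenes(log).values()))))
```

A real log would also carry timestamps and session boundaries; the (agent, consultant) key above is the minimal version of the patent's per-consultant scene cut.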
Step 105: compute answer similarity.
Three or more months of consultation data typically contain n consultation scenes, each with m question-answer pairs. The answer set of the n·m question-answer pairs is extracted, and the similarity between every pair of answers in the set is computed: each answer is first segmented into words and stop words are filtered out, then the similarity is computed. Sentence similarity is determined by word similarity, sentence-length similarity and word-order similarity; word similarity is the main factor, sentence-length similarity is secondary, and word-order similarity matters least. When a similarity value exceeds the threshold r (in this embodiment r may be set to 0.9), the two answers are deemed similar.
The similarity algorithm is as follows:

Word similarity: a sentence S can be viewed as an ordered collection of words and special symbols (hereinafter simply "words"). The length of S, written Len(S) here, is the number of words in S. SameWC(X, Y) denotes the number of words answers X and Y have in common; when a word occurs a different number of times in X and Y, the smaller count is used. The word similarity WordSim(X, Y) of answers X and Y is determined by the following formula:

where WordSim(X, Y) ∈ [0, 1]. Intuitively, the more identical words two answers share, the more similar they are.
Sentence-length similarity: Len(X) and Len(Y) denote the lengths of answers X and Y respectively, i.e. the numbers of words in the two answers. The length similarity LenSim(X, Y) is determined by the following formula:

where LenSim(X, Y) ∈ [0, 1]. Intuitively, the closer the lengths of two sentences, the more similar they are.
Word-order similarity: Onews(X, Y) denotes the set of words that occur in both X and Y and occur exactly once in each; Reword(X, Y) denotes the number of inversions among the adjacent words of that set. OrderSim(X, Y) is determined accordingly, with OrderSim(X, Y) ∈ [0, 1]. Defining word-order similarity this way has the advantage that when a clause or phrase is moved some distance as a whole, the sentence remains very similar to the original. The computation is also fast: its complexity is O(m), where m = |Onews(X, Y)|.
The similarity SentenceSim(X, Y) of sentences X and Y is determined by the following formula:

SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y)

where λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1; clearly SentenceSim(X, Y) ∈ [0, 1]. In sentence similarity, word similarity plays the main role while sentence-length similarity and word-order similarity play secondary roles, so the values should satisfy λ1 > λ2 > λ3; the current defaults are λ1 = 0.8, λ2 = 0.15, λ3 = 0.05. A threshold can further be set as the similarity condition: when the similarity of two sentences exceeds the threshold, the two are considered similar. In this embodiment the threshold is set to 0.9, and answers scoring above 0.9 are deemed similar.
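A runnable sketch of the similarity computation follows. The patent's formula images are not reproduced in this text, so each component below is a common reconstruction consistent with the stated definitions and ranges (each component lies in [0, 1]), not the patent's exact formulas; the weights are the defaults λ1 = 0.8, λ2 = 0.15, λ3 = 0.05.

```python
# Sketch of step 105's sentence similarity on pre-segmented word lists.
def word_sim(x, y):
    """WordSim: share of common words; a word occurring a different number
    of times in x and y is counted with the smaller count (SameWC)."""
    same = sum(min(x.count(w), y.count(w)) for w in set(x) & set(y))
    return 2.0 * same / (len(x) + len(y)) if x or y else 1.0

def len_sim(x, y):
    """LenSim: the closer the lengths, the higher the similarity."""
    if not x and not y:
        return 1.0
    return 1.0 - abs(len(x) - len(y)) / (len(x) + len(y))

def order_sim(x, y):
    """OrderSim: built from Onews (words occurring exactly once in each
    sentence) and the number of adjacent inversions of their positions."""
    onews = [w for w in x if x.count(w) == 1 and y.count(w) == 1]
    if len(onews) <= 1:
        return 1.0
    pos_in_y = [y.index(w) for w in onews]
    inversions = sum(a > b for a, b in zip(pos_in_y, pos_in_y[1:]))
    return 1.0 - inversions / (len(onews) - 1)

def sentence_sim(x, y, l1=0.8, l2=0.15, l3=0.05):
    """SentenceSim = l1*WordSim + l2*LenSim + l3*OrderSim, with l1+l2+l3 = 1."""
    return l1 * word_sim(x, y) + l2 * len_sim(x, y) + l3 * order_sim(x, y)

# Two segmented answers with identical words in a different order:
a = ["won't", "fade", "dear"]
b = ["dear", "won't", "fade"]
print(round(sentence_sim(a, b), 3))   # → 0.975
```

The example illustrates the weighting: identical word sets and lengths give full word and length similarity, while the reordering only dents the lightly weighted word-order term.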
Step 106: filter out answers whose similar-answer frequency is low.
The similar-answer frequency of an answer is the number of times it (or an answer similar to it) occurs among all answers in the historical consultation data after the similarity computation. Filtering uses a threshold R, which can be set as circumstances require (in this embodiment R may be set to 2); answers below R are regarded as low-frequency. For example, in Fig. 2 the answers "Dear, orders placed before 5 p.m. are delivered the same day; orders after 5 p.m. are sent the next day. Thanks." and "They won't fade, dear." each occur exactly 2 times.
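Step 106 can be sketched as clustering answers by similarity and dropping small clusters. `sim` below is any similarity function returning a score in [0, 1] (the weighted SentenceSim in the real method; an exact-match stand-in here); r = 0.9 and R = 2 are the embodiment's default thresholds.

```python
# Sketch of step 106: group answers into similar-answer clusters, then
# keep only clusters whose frequency reaches R (the high-quality answers).
def filter_low_frequency(answers, sim, r=0.9, R=2):
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if sim(ans, cluster[0]) >= r:     # similar to an existing cluster
                cluster.append(ans)
                break
        else:
            clusters.append([ans])            # otherwise start a new cluster
    return [c for c in clusters if len(c) >= R]

exact = lambda a, b: 1.0 if a == b else 0.0   # stand-in similarity
answers = ["won't fade, dear", "won't fade, dear", "ships the same day"]
print(filter_low_frequency(answers, exact))
```

Comparing each answer only against a cluster's first member keeps the sketch linear in the number of clusters; the patent leaves the exact grouping strategy open.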
Step 107: extract the questions corresponding to high-frequency answers.
What remains after the filtering of step 106 are the high-frequency answers, and the questions corresponding to a high-frequency answer are generally high-quality questions; they usually contain near-synonyms of one another. For example, for the answer "They won't fade, dear." in Fig. 2, the corresponding questions "Will this garment fade?" and "Will this baby fade?" need only be stored as a many-to-one FAQ entry, doing away with the huge near-synonym table.
Step 108: extract the core keyword sequence rules of the question set.
A question set is the group of questions corresponding to one answer; it contains some number of questions. A core keyword sequence rule is a set of core keywords contained in the question set together with a fixed word order. The method first segments all the questions into words, then uses statistics to extract the keywords whose frequency exceeds r2 (r2 is greater than 1 in this embodiment) and that obey a sequence rule.
For example, the answer in Fig. 2 "Dear, orders placed before 5 p.m. are delivered the same day; orders after 5 p.m. are sent the next day. Thanks." corresponds to the question set:

When can you deliver?
About when can you deliver to me?

After word segmentation and stop-word filtering, the segmentation results are:

when / you / can / deliver
about / when / can / deliver / me

From the segmented questions, the keywords whose frequency reaches 2 and that obey a sequence rule can be quickly counted: "when … deliver".
Step 109: store the knowledge.
Each knowledge-base entry is stored as a number, an answer, the questions, and the core keyword sequence rule; the relation between answers and questions is many-to-many, as is the relation between questions and core keyword sequence rules, as in Table 1 below:

Table 1
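One way the stored entries of step 109 could serve a new question is an ordered-subsequence match against the keyword sequence rule. The field names and the matching rule below are illustrative assumptions, not the patent's storage schema.

```python
# Sketch: return an entry's answer when the segmented question contains
# the entry's keyword sequence in order (an ordered-subsequence test).
def matches(words, sequence):
    it = iter(words)
    return all(k in it for k in sequence)     # each `in` consumes the iterator

def lookup(knowledge, segmented_question):
    for entry in knowledge:
        if matches(segmented_question, entry["keyword_sequence"]):
            return entry["answer"]
    return None

knowledge = [{
    "answer": "Orders before 5 p.m. are delivered the same day.",
    "questions": ["when can you deliver", "about when can you deliver to me"],
    "keyword_sequence": ["when", "deliver"],
}]
print(lookup(knowledge, ["so", "when", "do", "you", "deliver"]))
```

Because the keyword sequence, not the full question text, drives the match, new phrasings of an old question hit the stored answer without any near-synonym table.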
Step 110: end.
Through the steps above, a knowledge base can be constructed quickly from historical consultation data, and the knowledge base contains not only many-to-many FAQ entries but also the core keyword sequence sets. Given historical consultation data as a premise, this construction method can replace a traditional HowNet or near-synonym table and save a great deal of manual maintenance, letting technical staff construct knowledge bases rapidly.
The embodiments above build the knowledge base automatically with Chinese as the object language, but the method is also applicable to other languages. Likewise, the embodiment is described in terms of the e-commerce industry represented by Taobao, but the method is not limited to e-commerce and can be applied to other industries. Nor is the similarity computation of the present case limited to the algorithm above; other similarity-computation approaches may be used in its place.
A person of ordinary skill in the art can make various other corresponding changes and variations according to the technical solution and concept of the present invention, and all such changes and variations shall fall within the protection scope of the claims of the present invention.
Claims (6)
1. A method of building a knowledge base from historical consultation data, characterised by comprising the following steps:
1) reading the historical consultation data;
2) segmenting the data into consultation scenes;
3) extracting the question-answer pairs of each scene;
4) computing answer similarity;
5) filtering out answers whose similar-answer frequency is low;
6) extracting the questions corresponding to high-frequency answers;
7) extracting the core keyword sequence rules of the question set;
8) storing the knowledge;
wherein in said step 4), computing answer similarity means computing the similarity values among the answers of the question-answer pairs of all scenes: each answer is first segmented into words, stop words are then filtered out, and the similarity between every pair of answers is finally computed;
the similarity of said answers comprises word similarity, sentence-length similarity and word-order similarity, related by SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y), where SentenceSim(X, Y) denotes the similarity of answer X and answer Y, WordSim(X, Y) the similarity between the words of answer X and the words of answer Y, LenSim(X, Y) the similarity between the sentence lengths of answer X and answer Y, and OrderSim(X, Y) the similarity between the word orders of answer X and answer Y; λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1;
the formula for said WordSim(X, Y) is:
the formula for said LenSim(X, Y) is:
and the formula for said word-order similarity is:
where SameWC(X, Y) denotes the number of words answer X and answer Y have in common, Len(X) and Len(Y) denote the lengths of answer X and answer Y respectively, abs denotes taking the absolute value, Onews(X, Y) denotes the set of words that occur in both answer X and answer Y and occur exactly once in each, and Reword(X, Y) denotes the number of inversions among the adjacent words of that set;
wherein in said step 5), the similar-answer frequency refers to the proportion each answer accounts for in the whole historical consultation data; low-frequency answers are then filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
2. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that in said step 2), the data are cut into scenes by consultant, yielding a plurality of consultation scenes each between a single customer-service agent and a single consultant.
3. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that in said step 3), the question-answer pairs are extracted according to the identities of the customer-service agent and the consultant: content spoken by the agent is taken as the answer, and content spoken by the consultant is taken as the question.
4. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that said step 6) looks up, for each high-frequency answer, the questions corresponding to that answer.
5. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that said step 7) uses statistics: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that obey a sequence rule are extracted.
6. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that the content stored in said step 8) comprises the answers, the questions and the core keyword sequence rules, and the relations among the answers, the questions and the core keyword sequence rules are many-to-many.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201310168964.2A (granted as CN103455535B) | 2013-05-08 | | The method building knowledge base based on historical consultation data
Publications (2)
Publication Number | Publication Date |
---|---|
CN103455535A (en) | 2013-12-18
CN103455535B (en) | 2016-11-30
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1928864A (en) * | 2006-09-22 | 2007-03-14 | 浙江大学 | FAQ based Chinese natural language ask and answer method |
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN102637192A (en) * | 2012-02-17 | 2012-08-15 | 清华大学 | Method for answering with natural language |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161130; termination date: 20190508
| RR01 | Reinstatement of patent right | Former decision: termination of patent right due to unpaid annual fee; former decision publication date: 20200424