CN103455535B - Method for building a knowledge base from historical consultation data - Google Patents


Publication number
CN103455535B
Authority
CN
China
Prior art keywords
answer
sentence
similarity
question
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active - Reinstated
Application number
CN201310168964.2A
Other languages
Chinese (zh)
Other versions
CN103455535A (en)
Inventor
冯梓洋
刁应君
卢铄波
胡欢
刘洋
杨大川
宋战
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Mingtang Communication Co Ltd
Original Assignee
Shenzhen Mingtang Communication Co Ltd
Filing date
Publication date
Application filed by Shenzhen Mingtang Communication Co Ltd
Priority to CN201310168964.2A
Publication of CN103455535A
Application granted
Publication of CN103455535B
Legal status: Active - Reinstated
Anticipated expiration

Abstract

The invention discloses a method for automatically building a knowledge base from historical consultation data. The method comprises: segmenting the historical consultation data into consultation scenes; extracting the question-answer pairs of each scene; calculating answer similarity; filtering out answers with a low similar-answer frequency; extracting the question sentences corresponding to the high-frequency answers; extracting the core keyword order rules of each question set; and storing the knowledge. Because the knowledge base is built automatically from historical consultation data, the manual effort of building the knowledge base and maintaining a synonym table is reduced.

Description

Method for building a knowledge base from historical consultation data
Technical field
The invention relates to knowledge bases in the field of computing and question answering systems, and in particular to a method for building a knowledge base from historical consultation data.
Background
In an automatic question answering system, the knowledge base is the key data source and plays a very important role in the overall system; a high-quality knowledge base can greatly improve the efficiency and accuracy of the question answering system.
At present, knowledge bases are generally built in one of two ways in the industry:
The first is manual construction. Because of industry-specific constraints, much of the knowledge is built purely by hand, usually by technical staff specialized in a particular industry. This is inefficient, yields low quality, and is hard to maintain.
The second builds the knowledge base using semantic matching. Question answering systems typically implement this with HowNet or a synonym table, but both HowNet and synonym tables are compiled manually, so the workload is enormous and the coverage is narrow.
Suppose the knowledge base of an automatic question answering system contains the following many-to-one FAQ, consisting of natural language sentences input by users (hereinafter, question sentences) and the system response (hereinafter, answer sentence), as in the example below:
Common question and answer sentences in the apparel industry on Taobao:
Question: Will these jeans fade?
Question: Will this item fade?
Question: Will it really not fade?
Question: Will the jeans fade after a few washes?
Question: Do the jeans you sell fade?
Answer: No, it won't fade, dear.
When a user inputs "Will the jeans fade after a few washes?", the system can find this FAQ group and return the answer sentence to the user. However, when a user inputs "Will this item lose its color after a few washes?", a technician must first manually associate "item" (the e-commerce industry commonly uses "item" in place of the product name) with "jeans", and "lose its color" with "fade", in HowNet or the synonym table. Only then can the system return the correct answer sentence; otherwise it cannot find the right answer. Moreover, the technician must not only maintain these synonym associations but also collect all n phrasings that correspond to the answer sentence "No, it won't fade, dear." This approach is unacceptable in terms of both workload and system efficiency.
Summary of the invention
The object of the invention is to provide a method for building a knowledge base from historical consultation data, solving the inefficiency of existing knowledge base construction.
To achieve this object, the invention provides a method for building a knowledge base from historical consultation data, comprising the following steps:
1) read the historical consultation data;
2) segment consultation scenes;
3) extract the question-answer pairs of each scene;
4) calculate answer similarity;
5) filter out answers with a low similar-answer frequency;
6) extract the question sentences corresponding to the high-frequency answers;
7) extract the core keyword order rules of the question set;
8) store the knowledge.
In step 2), scenes are segmented by consultant, yielding multiple consultation scenes, each between a single customer service agent and a single consultant.
In step 3), question-answer pairs are extracted according to the identities of the customer service agent and the consultant: what the agent says is set as the answer, and what the consultant says is set as the question.
In step 4), calculating answer similarity means calculating the similarity values among the answers of the question-answer pairs of all scenes: each answer sentence is first segmented into words, stop words are then filtered out, and finally the similarity between every pair of answer sentences is calculated.
The similarity of answer sentences combines word similarity, sentence length similarity, and word order similarity, related by
$$\mathrm{SentenceSim}(X, Y) = \lambda_1 \cdot \mathrm{WordSim}(X, Y) + \lambda_2 \cdot \mathrm{LenSim}(X, Y) + \lambda_3 \cdot \mathrm{OrderSim}(X, Y)$$
where SentenceSim(X, Y) is the similarity between answer sentence X and answer sentence Y, WordSim(X, Y) is the similarity between the words of X and the words of Y, LenSim(X, Y) is the similarity between the sentence lengths of X and Y, OrderSim(X, Y) is the similarity between the word orders of X and Y, and λ1, λ2, λ3 are constants satisfying λ1 + λ2 + λ3 = 1.
WordSim(X, Y) is calculated as:
$$\mathrm{WordSim}(X, Y) = \frac{\mathrm{SameWC}(X, Y)}{\max(\mathrm{Len}(X), \mathrm{Len}(Y))}$$
LenSim(X, Y) is calculated as:
$$\mathrm{LenSim}(X, Y) = 1 - \frac{\mathrm{abs}(\mathrm{Len}(X) - \mathrm{Len}(Y))}{\mathrm{Len}(X) + \mathrm{Len}(Y)}$$
The word order similarity OrderSim(X, Y) is calculated from Onews(X, Y) and Reword(X, Y).
Here SameWC(X, Y) is the number of identical words shared by answer sentences X and Y; Len(X) and Len(Y) are the lengths of X and Y respectively; abs denotes the absolute value; Onews(X, Y) is the set of words that occur in both X and Y, exactly once in each; and Reword(X, Y) is the inversion count between adjacent words.
In step 5), the similar-answer frequency is the proportion of the whole historical consultation data accounted for by each answer sentence. Low-frequency answers are filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
In step 6), the question sentences corresponding to each high-frequency answer are looked up.
In step 7), statistical principles are applied: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that follow an order rule are extracted.
In step 8), the stored content includes answers, question sentences, and core keyword order rules, and the relations among them are many-to-many.
Beneficial effects of the invention: based on historical consultation data, the invention can quickly build a knowledge base that contains not only many-to-many FAQs but also the core keyword sequence sets. Given historical consultation data, this construction method can replace the traditional HowNet and synonym table, saves substantial manual maintenance cost, and lets technical staff build a knowledge base quickly.
For a further understanding of the features and technical content of the invention, refer to the following detailed description and the accompanying drawings. The drawings are provided for reference and illustration only and are not intended to limit the invention.
Brief description of the drawings
The technical solution of the invention and its other beneficial effects will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
In the drawings,
Fig. 1 is a schematic flow chart of the invention;
Fig. 2 shows the scene format of the historical consultation data of the invention.
Detailed description
To further explain the technical means adopted by the invention and their effects, a preferred embodiment is described in detail below with reference to the drawings.
The embodiment uses shopping consultations on Taobao as a case study, with the consultation data between buyers and sellers as the data source for building the knowledge base of the question answering system.
Referring to Fig. 1, an embodiment of the invention provides a method for building a knowledge base from historical consultation data. The method analyzes historical shopping consultation data, extracts the question-and-response exchanges between buyer and seller, and stores them as FAQs. Similarity is calculated among the answer sentences in the FAQs; answers whose similarity reaches a threshold are judged to be similar answers, and the frequency of similar answers is counted. When the similar-answer frequency reaches a threshold, the set of question sentences corresponding to the answer is extracted. Finally, the question sentences are segmented into words and the high-frequency keyword strings are extracted. The method is described in detail below with reference to Fig. 1.
Step 101: start.
Step 102: read the historical consultation data.
Historical consultation data is generally provided by the customer service system and is usually read through an API or imported.
Step 103: segment consultation scenes.
Historical consultation data generally mixes together the conversations of multiple customer service agents with multiple consultants, so scenes must be segmented by consultant. A scene here is one complete dialogue between a single customer service agent and a single consultant. For example, in Fig. 2 the complete dialogue between "buyer abc" and "women's wear customer service" is one scene.
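The scene segmentation of step 103 can be sketched as a simple grouping operation. This is a minimal illustration only: the message fields `agent`, `consultant`, `speaker`, and `text` are hypothetical names, since the source does not specify the log format.

```python
from collections import defaultdict

def split_scenes(messages):
    """Group a mixed message log into scenes keyed by (agent, consultant).

    Each message is assumed to be a dict with hypothetical keys
    'agent', 'consultant', 'speaker', and 'text'.
    """
    scenes = defaultdict(list)
    for msg in messages:
        scenes[(msg["agent"], msg["consultant"])].append(msg)
    return dict(scenes)

log = [
    {"agent": "womenswear_cs", "consultant": "buyer_abc",
     "speaker": "consultant", "text": "Will this item fade?"},
    {"agent": "womenswear_cs", "consultant": "buyer_xyz",
     "speaker": "consultant", "text": "When can you ship?"},
    {"agent": "womenswear_cs", "consultant": "buyer_abc",
     "speaker": "agent", "text": "No, it won't fade, dear."},
]
scenes = split_scenes(log)
```

Keying on the (agent, consultant) pair keeps each scene to a single customer service agent and a single consultant, and appending in log order preserves the dialogue order within a scene.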
Step 104: extract the question sentences and answer sentences in each scene.
Each scene generally contains multiple question sentences and answer sentences, so question-answer pairs must be extracted according to the identities of the customer service agent and the consultant: what the agent says is usually the answer, and what the consultant says is the question.
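One possible way to pair questions with answers inside a scene is sketched below. The source only says that the consultant's utterances are questions and the agent's are answers; pairing each agent reply with the most recent unanswered consultant message is an assumption made for illustration, as are the `speaker` and `text` field names.

```python
def extract_qa_pairs(scene):
    """Pair each agent reply with the latest unanswered consultant message.

    The pairing rule is an assumption; the source does not define it.
    """
    pairs = []
    pending_question = None
    for msg in scene:
        if msg["speaker"] == "consultant":
            pending_question = msg["text"]
        elif msg["speaker"] == "agent" and pending_question is not None:
            pairs.append((pending_question, msg["text"]))
            pending_question = None  # this question is now answered
    return pairs

scene = [
    {"speaker": "consultant", "text": "Will this item fade?"},
    {"speaker": "agent", "text": "No, it won't fade, dear."},
    {"speaker": "agent", "text": "Anything else?"},
    {"speaker": "consultant", "text": "When can you ship?"},
    {"speaker": "agent", "text": "Orders before 17:00 ship the same day."},
]
qa_pairs = extract_qa_pairs(scene)
```

Under this rule an agent message that follows an already answered question (such as "Anything else?") produces no pair.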
Step 105: calculate answer similarity.
Consultation data covering more than three months typically contains n consultation scenes, each with m question-answer pairs. The answers of the n*m question-answer pairs are collected into an answer set, and the similarity between answer sentences in the set is calculated. Each answer sentence is first segmented into words and its stop words are filtered out; similarity is then calculated. Sentence similarity is determined by word similarity, sentence length similarity, and word order similarity: word similarity plays the main role, sentence length similarity a secondary role, and word order similarity the smallest role. When the similarity exceeds a threshold r (r may be set to 0.9 in this embodiment, for example), the answers are regarded as similar.
The similarity algorithm is as follows:
Word similarity: a sentence S can be regarded as an ordered set of words and special symbols (hereinafter, words). The length of S is the number of words in S, denoted Len(S). SameWC(X, Y) is the number of identical words in answer sentences X and Y; when a word occurs a different number of times in X and Y, the smaller count is used. The word similarity WordSim(X, Y) of answer sentences X and Y is given by:
$$\mathrm{WordSim}(X, Y) = \frac{\mathrm{SameWC}(X, Y)}{\max(\mathrm{Len}(X), \mathrm{Len}(Y))}$$
where WordSim(X, Y) ∈ [0, 1]. Meaning: the more words two answer sentences share, the more similar they are.
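A minimal sketch of the word similarity follows, assuming each answer sentence has already been segmented into a list of tokens with stop words removed. Returning 0.0 for two empty sentences is an assumption, since the source does not cover that case.

```python
from collections import Counter

def same_wc(x, y):
    """Number of shared words; a repeated word counts with its smaller count."""
    return sum((Counter(x) & Counter(y)).values())

def word_sim(x, y):
    """WordSim(X, Y) = SameWC(X, Y) / max(Len(X), Len(Y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0  # assumption: empty sentences are treated as dissimilar
    return same_wc(x, y) / longest

sim = word_sim(["will", "not", "fade", "dear"], ["will", "not", "fade"])
```

The multiset intersection `Counter(x) & Counter(y)` keeps the minimum count of each word, which matches the rule that the smaller occurrence count is used.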
Sentence length similarity: Len(X) and Len(Y) are the lengths of answer sentences X and Y, i.e. the number of words in each. The length similarity LenSim(X, Y) is given by:
$$\mathrm{LenSim}(X, Y) = 1 - \frac{\mathrm{abs}(\mathrm{Len}(X) - \mathrm{Len}(Y))}{\mathrm{Len}(X) + \mathrm{Len}(Y)}$$
where LenSim(X, Y) ∈ [0, 1]. Meaning: the closer the lengths of two sentences, the more similar they are.
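The length similarity can be sketched directly from the definition of Len given above. The handling of two empty sentences is again an assumption.

```python
def len_sim(x, y):
    """LenSim(X, Y) = 1 - abs(Len(X) - Len(Y)) / (Len(X) + Len(Y))."""
    total = len(x) + len(y)
    if total == 0:
        return 0.0  # assumption: undefined case treated as dissimilar
    return 1.0 - abs(len(x) - len(y)) / total

length_score = len_sim(["will", "not", "fade", "dear"], ["will", "not", "fade"])
```

For sentences of 4 and 3 words this gives 1 - 1/7, and two sentences of equal length always score 1.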
Word order similarity: Onews(X, Y) is the set of words that occur in both X and Y, exactly once in each. Reword(X, Y) is the inversion count between adjacent words. The word order similarity OrderSim(X, Y) is defined in terms of Onews(X, Y) and Reword(X, Y).
Here OrderSim(X, Y) ∈ [0, 1]. The advantage of defining word order similarity this way is that when a clause or phrase is moved as a whole over a distance, the result is still very similar to the original sentence. It is also fast to compute: the algorithmic complexity is O(m), where m = |Onews(X, Y)|.
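The word order formula itself is not reproduced in this text, so the sketch below rests on an assumption: it uses a form commonly paired with these definitions, OrderSim = 1 - Reword / (m - 1) for m > 1, with 1 when m = 1 and 0 when m = 0. It should be read as one plausible instance, not as the patent's exact formula.

```python
from collections import Counter

def order_sim(x, y):
    """Word order similarity under an ASSUMED formula (see lead-in).

    once: words occurring exactly once in both x and y, in x's order.
    inversions: adjacent pairs of those words whose order is reversed in y.
    """
    cx, cy = Counter(x), Counter(y)
    once = [w for w in x if cx[w] == 1 and cy[w] == 1]
    m = len(once)
    if m == 0:
        return 0.0
    if m == 1:
        return 1.0
    pos_in_y = {w: i for i, w in enumerate(y)}
    seq = [pos_in_y[w] for w in once]
    inversions = sum(1 for a, b in zip(seq, seq[1:]) if a > b)
    return 1.0 - inversions / (m - 1)

moved = order_sim(["a", "b", "c", "d"], ["c", "d", "a", "b"])
```

Moving the phrase "c d" to the front as a whole creates only one adjacent inversion, so the score stays high, which is the property the text describes. The single pass over adjacent pairs is consistent with the stated O(m) complexity.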
The similarity SentenceSim(X, Y) of sentences X and Y is given by:
$$\mathrm{SentenceSim}(X, Y) = \lambda_1 \cdot \mathrm{WordSim}(X, Y) + \lambda_2 \cdot \mathrm{LenSim}(X, Y) + \lambda_3 \cdot \mathrm{OrderSim}(X, Y)$$
where λ1, λ2, λ3 are constants satisfying λ1 + λ2 + λ3 = 1, so clearly SentenceSim(X, Y) ∈ [0, 1]. In sentence similarity, word similarity plays the main role while sentence length similarity and word order similarity play secondary roles, so the values should satisfy λ1 > λ2 > λ3; the current defaults are λ1 = 0.8, λ2 = 0.15, λ3 = 0.05. A threshold can be set as the similarity condition: when the similarity of two sentences exceeds the threshold, they are considered similar. In this embodiment the threshold is 0.9, and answers with a similarity above 0.9 are regarded as similar answers.
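The weighted combination and the threshold test can be sketched as follows. To keep the block self-contained, the three component scores are passed in as plain numbers; the function names are illustrative and not from the source.

```python
import math

DEFAULT_WEIGHTS = (0.8, 0.15, 0.05)  # lambda1, lambda2, lambda3 from the embodiment

def sentence_sim(word, length, order, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the word, length, and word order similarities."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    l1, l2, l3 = weights
    return l1 * word + l2 * length + l3 * order

def is_similar(score, r=0.9):
    """Two answers count as similar when the score exceeds the threshold r."""
    return score > r

score = sentence_sim(0.75, 6 / 7, 1.0)
```

With a word similarity of 0.75 the combined score is about 0.78, below the 0.9 threshold, which shows how strongly the default weights favor shared words over length and order.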
Step 106: filter out answers with a low similar-answer frequency.
The similar-answer frequency is the number of times an answer sentence, after similarity calculation, occurs among all answers in the historical consultation data. Filtering uses a threshold R (R can be set as appropriate; in this embodiment it may be set to 2, for example), and answers with a frequency below R are regarded as low-frequency. For example, in Fig. 2 the answers "Dear, orders placed before 17:00 ship the same day, and orders placed after 17:00 ship the next day. Thank you." and "No, it won't fade, dear." each occur exactly 2 times.
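A rough sketch of the frequency filter is shown below. The greedy grouping strategy (compare each answer with the first member of each existing group) is an assumption, as the source does not say how similar answers are clustered, and the `exact` similarity function is only a stand-in for SentenceSim.

```python
def group_similar(answers, sim, r=0.9):
    """Greedily group answers whose similarity to a group's first member exceeds r."""
    groups = []
    for answer in answers:
        for group in groups:
            if sim(group[0], answer) > r:
                group.append(answer)
                break
        else:
            groups.append([answer])  # no similar group found: start a new one
    return groups

def filter_low_frequency(groups, R=2):
    """Keep groups with at least R members; smaller groups are low-frequency."""
    return [g for g in groups if len(g) >= R]

def exact(a, b):
    """Stand-in similarity for illustration: 1.0 only for identical strings."""
    return 1.0 if a == b else 0.0

answers = ["No, it won't fade, dear.", "Hello!", "No, it won't fade, dear."]
groups = group_similar(answers, exact)
high_frequency = filter_low_frequency(groups)
```

Here the group size plays the role of the similar-answer frequency, so with R = 2 the answer that occurs twice is kept and the one-off greeting is dropped.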
Step 107: extract the question sentences corresponding to the high-frequency answers.
What remains after the filtering of step 106 are the high-frequency answers. The question sentences corresponding to a high-frequency answer are generally high-quality questions, and they usually contain synonyms. For example, for the answer sentence in Fig. 2 "No, it won't fade, dear.", the corresponding questions are "Will this garment fade?" and "Will this item fade?". Simply storing the many-to-one FAQ makes a huge synonym table unnecessary.
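Collecting the question set for each high-frequency answer group might look like the sketch below. It assumes question-answer pairs are `(question, answer)` tuples and that each answer group is a list of answer strings; both shapes are illustrative choices, not specified by the source.

```python
def questions_for_answers(qa_pairs, answer_groups):
    """Return (representative answer, question list) for each answer group."""
    result = []
    for group in answer_groups:
        members = set(group)
        questions = [q for q, a in qa_pairs if a in members]
        result.append((group[0], questions))
    return result

qa_pairs = [
    ("Will this garment fade?", "No, it won't fade, dear."),
    ("When can you ship?", "Orders before 17:00 ship the same day."),
    ("Will this item fade?", "No, it won't fade, dear."),
]
answer_groups = [["No, it won't fade, dear.", "No, it won't fade, dear."]]
faq = questions_for_answers(qa_pairs, answer_groups)
```

The result is the many-to-one FAQ described above: several differently worded questions attached to one answer, with no synonym table involved.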
Step 108: extract the core keyword order rules of the question set.
A question set is the group of question sentences corresponding to an answer; it contains several question sentences. A core keyword order rule means that the question set contains a set of core keywords that occur in a certain word order. The method first segments all the question sentences into words and then applies statistical principles to extract the keywords whose frequency exceeds r2 (r2 is greater than 1 in this embodiment) and that follow an order rule.
For example, take the question set corresponding to the answer sentence in Fig. 2 "Dear, orders placed before 17:00 ship the same day, and orders placed after 17:00 ship the next day. Thank you.":
When can you ship?
About when can you ship it to me?
After word segmentation and stop-word filtering, the segmentation results are:
you  when  can  ship
about  when  can  ship  to  me
After segmentation, a suitable algorithm can quickly determine that the keywords with a frequency of 2 or more that follow an order rule are "when ... ship".
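The source leaves the extraction algorithm open ("a certain algorithm"), so the following is only one simple possibility, stated under assumptions: a keyword is frequent when it appears in more than `r2` questions, and the rule is accepted only if every question lists its frequent keywords in the same relative order as the first question. The English tokens are stand-ins for the segmented Chinese words, with stop words assumed to be already removed.

```python
from collections import Counter

def frequent_in_order(tokens, frequent):
    """Frequent keywords of one question, in order of first appearance."""
    seen = []
    for w in tokens:
        if w in frequent and w not in seen:
            seen.append(w)
    return seen

def keyword_order_rule(tokenized_questions, r2=1):
    """Return the ordered frequent keywords, or [] if their order is inconsistent."""
    if not tokenized_questions:
        return []
    doc_freq = Counter()
    for tokens in tokenized_questions:
        doc_freq.update(set(tokens))  # count each word once per question
    frequent = {w for w, c in doc_freq.items() if c > r2}
    rule = frequent_in_order(tokenized_questions[0], frequent)
    for tokens in tokenized_questions:
        present = frequent_in_order(tokens, frequent)
        if present != [w for w in rule if w in present]:
            return []  # keywords appear in a different order somewhere
    return rule

questions = [["you", "when", "ship"], ["about", "when", "ship", "me"]]
rule = keyword_order_rule(questions)
pattern = " ... ".join(rule)
```

A limitation of this sketch is that it takes the candidate order from the first question only, so a frequent keyword missing from that question is not included in the rule.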
Step 109: store the knowledge.
The knowledge base is stored in the format: number, answer, question sentence, core keyword order rule. The relation between answers and question sentences is many-to-many, as is the relation between question sentences and core keyword order rules, as shown in Table 1 below:
Table 1
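The contents of Table 1 are not reproduced in this text, so the record layout below is a hypothetical sketch that only follows the four fields named above (number, answer, question sentence, core keyword order rule). Lists are used for each field to allow the many-to-many relations; the JSON serialization is an illustrative choice, not something the source prescribes.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class KnowledgeEntry:
    """One knowledge base record; field names are illustrative."""
    number: int
    answers: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    keyword_rules: list = field(default_factory=list)

entry = KnowledgeEntry(
    number=1,
    answers=["Orders placed before 17:00 ship the same day."],
    questions=["When can you ship?", "About when can you ship it to me?"],
    keyword_rules=[["when", "ship"]],
)
serialized = json.dumps(asdict(entry), ensure_ascii=False)
restored = KnowledgeEntry(**json.loads(serialized))
```

Passing `ensure_ascii=False` keeps Chinese text readable in the stored JSON, which matters because the embodiment targets Chinese consultation data.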
Step 110: end.
Through the above steps, a knowledge base can be built quickly from historical consultation data. The knowledge base contains not only many-to-many FAQs but also the core keyword sequence sets. Given historical consultation data, this construction method can replace the traditional HowNet and synonym table, saves substantial manual maintenance cost, and lets technical staff build a knowledge base quickly.
The above embodiment provides automatic knowledge base construction with Chinese as the main target language, and it is also applicable to other languages. The method is described for the e-commerce industry represented by Taobao, but it is not limited to e-commerce and can be applied in other industries. The similarity calculation in this embodiment is likewise not limited to the algorithm described; other similarity calculation methods may be used.
Based on the above, a person of ordinary skill in the art can make various other corresponding changes and modifications according to the technical solution and technical concept of the invention, and all such changes and modifications fall within the scope of protection of the claims of the invention.

Claims (6)

1. A method for building a knowledge base from historical consultation data, characterized by comprising the following steps:
1) reading the historical consultation data;
2) segmenting consultation scenes;
3) extracting the question-answer pairs of each scene;
4) calculating answer similarity;
5) filtering out answers with a low similar-answer frequency;
6) extracting the question sentences corresponding to the high-frequency answers;
7) extracting the core keyword order rules of the question set;
8) storing the knowledge;
wherein in step 4), calculating answer similarity means calculating the similarity values among the answers of the question-answer pairs of all scenes: each answer sentence is first segmented into words, stop words are then filtered out, and finally the similarity between every pair of answer sentences is calculated;
the similarity of the answer sentences comprises word similarity, sentence length similarity, and word order similarity, related by
$$\mathrm{SentenceSim}(X, Y) = \lambda_1 \cdot \mathrm{WordSim}(X, Y) + \lambda_2 \cdot \mathrm{LenSim}(X, Y) + \lambda_3 \cdot \mathrm{OrderSim}(X, Y)$$
where SentenceSim(X, Y) is the similarity between answer sentence X and answer sentence Y, WordSim(X, Y) is the similarity between the words of X and the words of Y, LenSim(X, Y) is the similarity between the sentence lengths of X and Y, OrderSim(X, Y) is the similarity between the word orders of X and Y, and λ1, λ2, λ3 are constants satisfying λ1 + λ2 + λ3 = 1;
WordSim(X, Y) is calculated as:
$$\mathrm{WordSim}(X, Y) = \frac{\mathrm{SameWC}(X, Y)}{\max(\mathrm{Len}(X), \mathrm{Len}(Y))}$$
LenSim(X, Y) is calculated as:
$$\mathrm{LenSim}(X, Y) = 1 - \frac{\mathrm{abs}(\mathrm{Len}(X) - \mathrm{Len}(Y))}{\mathrm{Len}(X) + \mathrm{Len}(Y)}$$
the word order similarity OrderSim(X, Y) is calculated from Onews(X, Y) and Reword(X, Y);
where SameWC(X, Y) is the number of identical words shared by answer sentences X and Y, Len(X) and Len(Y) are the lengths of X and Y respectively, abs denotes the absolute value, Onews(X, Y) is the set of words that occur in both X and Y, exactly once in each, and Reword(X, Y) is the inversion count between adjacent words;
in step 5), the similar-answer frequency is the proportion of the whole historical consultation data accounted for by each answer sentence; low-frequency answers are filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
2. The method for building a knowledge base from historical consultation data according to claim 1, characterized in that in step 2), scenes are segmented by consultant, yielding multiple consultation scenes, each between a single customer service agent and a single consultant.
3. The method for building a knowledge base from historical consultation data according to claim 1, characterized in that in step 3), question-answer pairs are extracted according to the identities of the customer service agent and the consultant: what the agent says is set as the answer, and what the consultant says is set as the question.
4. The method for building a knowledge base from historical consultation data according to claim 1, characterized in that in step 6), the question sentences corresponding to each high-frequency answer are looked up.
5. The method for building a knowledge base from historical consultation data according to claim 1, characterized in that step 7) applies statistical principles: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that follow an order rule are extracted.
6. The method for building a knowledge base from historical consultation data according to claim 1, characterized in that the content stored in step 8) includes answers, question sentences, and core keyword order rules, and the relations among the answers, question sentences, and core keyword order rules are many-to-many.
CN201310168964.2A 2013-05-08 The method building knowledge base based on historical consultation data Active - Reinstated CN103455535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310168964.2A CN103455535B (en) 2013-05-08 The method building knowledge base based on historical consultation data

Publications (2)

Publication Number Publication Date
CN103455535A (en) 2013-12-18
CN103455535B (en) 2016-11-30

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN102637192A (en) * 2012-02-17 2012-08-15 清华大学 Method for answering with natural language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161130

Termination date: 20190508

RR01 Reinstatement of patent right

Former decision: termination of patent right due to unpaid annual fee

Former decision publication date: 20200424