CN103455535B - The method building knowledge base based on historical consultation data - Google Patents
- Publication number
- CN103455535B CN103455535B CN201310168964.2A CN201310168964A CN103455535B CN 103455535 B CN103455535 B CN 103455535B CN 201310168964 A CN201310168964 A CN 201310168964A CN 103455535 B CN103455535 B CN 103455535B
- Authority
- CN
- China
- Prior art keywords
- answer
- sentence
- similarity
- question
- knowledge base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active - Reinstated
Abstract
The present invention discloses a method of building a knowledge base from historical consultation data. The knowledge base is constructed automatically by segmenting the historical consultation data into consultation scenes, extracting the question-answer pairs of each scene, computing answer similarity, filtering out answers whose similar-answer frequency is low, extracting the questions corresponding to high-frequency answers, extracting the core keyword sequence rules of the question set, and storing the knowledge. By building the knowledge base automatically from historical consultation data, the present invention reduces the manual workload of constructing the knowledge base and maintaining a near-synonym lexicon.
Description
Technical field
The present invention relates to the field of computer knowledge bases and question-answering systems, and in particular to a method of building a knowledge base from historical consultation data.
Background art
In an automatic question-answering system, the knowledge base is the principal data source and plays a very important role in the system as a whole: a high-quality knowledge base can greatly improve both the efficiency and the accuracy of the question-answering system.
At present, knowledge bases are generally constructed in one of two ways in the industry.

The first is manual construction. Because each industry domain is specialised, much of the knowledge is built entirely by hand, usually by technical staff with expertise in that particular industry. Manual construction is inefficient, of uneven quality, and difficult to maintain.

The second uses semantic matching to build the knowledge base. Question-answering systems typically realise this with HowNet or a near-synonym table, but both HowNet and near-synonym tables are compiled manually, which takes an enormous amount of work and still yields narrow coverage.
Suppose the knowledge base of an automatic question-answering system contains the following many-to-one FAQ entries, each comprising natural-language sentences entered by users (hereinafter "questions") and a system response (hereinafter the "answer"), as in the example below, a question-answer group common in Taobao's apparel business:

Question: Will these jeans fade or not?

Question: Will this baby fade or not?

Question: They really won't fade?

Question: How many washes before the jeans fade?

Question: Do the jeans you sell fade?

Answer: They won't fade, dear.

When the user enters "How many washes before the jeans fade?", the system finds this FAQ group and returns its answer to the user. But when the user enters "How many washes before this baby fades?", technical staff must first manually record, in HowNet or the near-synonym table, that "baby" (e-commerce sellers commonly call a product a "baby") is associated with "jeans" and that "fades" is associated with "fade". Only then can the system return the answer to the user accurately; otherwise the system cannot compute the real answer. Staff must not only maintain these near-synonym associations but also collect every one of the n phrasings that correspond to the answer "They won't fade, dear." This approach is unacceptable, whether judged by workload or by system effectiveness.
Summary of the invention
The object of the present invention is to provide a method of building a knowledge base from historical consultation data, solving the inefficiency of existing knowledge-base construction.
To achieve this goal, the present invention provides a method of building a knowledge base from historical consultation data, comprising the following steps:
1) reading the historical consultation data;
2) segmenting the data into consultation scenes;
3) extracting the question-answer pairs of each scene;
4) computing answer similarity;
5) filtering out answers whose similar-answer frequency is low;
6) extracting the questions corresponding to high-frequency answers;
7) extracting the core keyword sequence rules of the question set;
8) storing the knowledge.
Wherein, in said step 2), the data are cut into scenes by consultant, yielding a plurality of consultation scenes each between a single customer-service agent and a single consultant.
Wherein, in said step 3), the question-answer pairs are extracted according to the identities of the customer-service agent and the consultant: content spoken by the agent is taken as the answer, and content spoken by the consultant is taken as the question.
Wherein, in said step 4), computing answer similarity means computing the similarity values among the answers of the question-answer pairs of all scenes: each answer is first segmented into words, stop words are then filtered out, and the similarity between every pair of answers is finally computed.
Wherein, the similarity of said answers comprises word similarity, sentence-length similarity and word-order similarity, related by

SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y),

where SentenceSim(X, Y) denotes the similarity of answer X and answer Y, WordSim(X, Y) the similarity between the words of answer X and the words of answer Y, LenSim(X, Y) the similarity between the sentence lengths of answer X and answer Y, and OrderSim(X, Y) the similarity between the word orders of answer X and answer Y; λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1.
Wherein, the formula for said WordSim(X, Y) is:

the formula for said LenSim(X, Y) is:

and the formula for said word-order similarity is:

where SameWC(X, Y) denotes the number of words answer X and answer Y have in common, Len(X) and Len(Y) denote the lengths of answer X and answer Y respectively, abs denotes taking the absolute value, Onews(X, Y) denotes the set of words that occur in both answer X and answer Y and occur exactly once in each, and Reword(X, Y) denotes the number of inversions among the adjacent words of that set.
Wherein, in said step 5), the similar-answer frequency refers to the proportion each answer accounts for in the whole historical consultation data; low-frequency answers are then filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
Wherein, said step 6) looks up, for each high-frequency answer, the questions corresponding to that answer.
Wherein, said step 7) uses statistics: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that obey a sequence rule are extracted.
Wherein, the content stored in said step 8) comprises the answers, the questions and the core keyword sequence rules, and the relations among the answers, the questions and the core keyword sequence rules are many-to-many.
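The eight steps above can be sketched end to end. The sketch below is a toy illustration that assumes the question-answer pairs have already been read and extracted (steps 1-3); exact-match answer similarity and whitespace segmentation stand in for the weighted sentence similarity and word segmentation the method actually uses, and all names are illustrative rather than the patent's.

```python
# Toy end-to-end sketch of steps 4)-8), using exact-match answer
# "similarity" and whitespace "segmentation" as simplifying stand-ins.
from collections import Counter

def build_knowledge_base(pairs, R=2, r2=1):
    """pairs: (question, answer) tuples extracted from consultation scenes."""
    freq = Counter(a for _, a in pairs)                    # steps 4)-5)
    knowledge = []
    for answer, count in freq.items():
        if count < R:                                      # drop low-frequency answers
            continue
        questions = [q for q, a in pairs if a == answer]   # step 6)
        words = Counter(w for q in questions for w in q.split())
        seq = [w for w in questions[0].split() if words[w] > r2]  # step 7)
        knowledge.append({"answer": answer,                # step 8)
                          "questions": questions,
                          "keyword_sequence": seq})
    return knowledge

pairs = [("will these jeans fade", "won't fade, dear"),
         ("will this baby fade", "won't fade, dear"),
         ("do you ship today", "ships at 5 p.m.")]
print(build_knowledge_base(pairs))
```

On this input, only the answer occurring twice survives the frequency filter, and the shared keyword sequence of its two questions is extracted.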
Beneficial effects of the present invention: based on historical consultation data, the present invention can construct a knowledge base quickly, and the knowledge base contains not only many-to-many FAQ entries but also the core keyword sequence sets. Given historical consultation data as a premise, this construction method can replace a traditional HowNet or near-synonym table and save a great deal of manual maintenance, letting technical staff construct knowledge bases rapidly.
For a further understanding of the features and technical content of the invention, refer to the detailed description and accompanying drawings below; the drawings are provided for reference and explanation only and are not intended to limit the invention.
Brief description of the drawings
The technical solution of the present invention and its other beneficial effects will become apparent from the detailed description of specific embodiments below, taken together with the accompanying drawings.
In the drawings,
Fig. 1 is a flow diagram of the present invention;
Fig. 2 shows the scene format of the historical consultation data of the present invention.
Detailed description of the invention
The technical means adopted by the present invention and their effects are further described in detail below with reference to preferred embodiments of the invention and the accompanying drawings.
The implementation environment of the present invention takes Taobao shopping consultations as a case study, with the consultation data between buyer and seller as the data source for building the question-answering knowledge base.
Referring to Fig. 1, an embodiment of the present invention provides a method of building a knowledge base from historical consultation data. The method analyses the shopping chat history, extracts the questions and responses of both parties, and stores them as FAQ entries. Similarity computation is applied to the answers in the FAQ: when a similarity value reaches a threshold the answers are deemed similar, and the similar-answer frequency is accumulated. When the similar-answer frequency reaches a threshold, the set of questions corresponding to the answer is extracted. Finally the questions are segmented into words and the high-frequency keyword strings are extracted. The method is described in detail below with reference to Fig. 1.
Step 101: start.
Step 102: read the historical consultation data.
Historical consultation data are typically provided by the customer-service system and are usually read through an API or a file import.
Step 103: segment the data into consultation scenes.
Historical consultation data typically interleave multiple customer-service agents with multiple consultants, so the data must be cut into scenes by consultant. A scene here is one complete dialogue between a single agent and a single consultant; in Fig. 2, for example, the full dialogue between the buyer "abc" and the women's-wear agent is one scene.
Step 104: extract the questions and answers in each scene.
Each scene typically contains multiple questions and answers, so the question-answer pairs must be extracted according to the identities of the agent and the consultant: content spoken by the agent is usually the answer, and content spoken by the consultant is the question.
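Steps 103 and 104 can be sketched as follows. The record format (role, speaker, partner, text) is an assumption for illustration, not the patent's log format.

```python
# Sketch of steps 103-104: cut a chat log into scenes (one per
# agent/consultant pair) and pair each consultant question with the
# agent answer that follows it.
from collections import OrderedDict

def cut_scenes(log):
    """Group messages into scenes keyed by (agent, consultant)."""
    scenes = OrderedDict()
    for role, name, partner, text in log:
        key = (name, partner) if role == "agent" else (partner, name)
        scenes.setdefault(key, []).append((role, text))
    return scenes

def extract_qa_pairs(scene):
    """Consultant utterances become questions, agent replies become answers."""
    pairs, question = [], None
    for role, text in scene:
        if role == "consultant":
            question = text
        elif role == "agent" and question is not None:
            pairs.append((question, text))
            question = None
    return pairs

log = [
    ("consultant", "abc", "womens_wear", "Will these jeans fade?"),
    ("agent", "womens_wear", "abc", "They won't fade, dear."),
]
print(extract_qa_pairs(next(iter(cut_scenes(log).values()))))
```

A real log would also carry timestamps and session boundaries; the (agent, consultant) key above is the minimal version of the patent's per-consultant scene cut.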
Step 105: compute answer similarity.
Three or more months of consultation data typically contain n consultation scenes, each with m question-answer pairs. The answer set of the n·m question-answer pairs is extracted, and the similarity between every pair of answers in the set is computed: each answer is first segmented into words and stop words are filtered out, then the similarity is computed. Sentence similarity is determined by word similarity, sentence-length similarity and word-order similarity; word similarity is the main factor, sentence-length similarity is secondary, and word-order similarity matters least. When a similarity value exceeds the threshold r (in this embodiment r may be set to 0.9), the two answers are deemed similar.
The similarity algorithm is as follows:

Word similarity: a sentence S can be viewed as an ordered collection of words and special symbols (hereinafter simply "words"). The length of S, written Len(S) here, is the number of words in S. SameWC(X, Y) denotes the number of words answers X and Y have in common; when a word occurs a different number of times in X and Y, the smaller count is used. The word similarity WordSim(X, Y) of answers X and Y is determined by the following formula:

where WordSim(X, Y) ∈ [0, 1]. Intuitively, the more identical words two answers share, the more similar they are.
Sentence-length similarity: Len(X) and Len(Y) denote the lengths of answers X and Y respectively, i.e. the numbers of words in the two answers. The length similarity LenSim(X, Y) is determined by the following formula:

where LenSim(X, Y) ∈ [0, 1]. Intuitively, the closer the lengths of two sentences, the more similar they are.
Word-order similarity: Onews(X, Y) denotes the set of words that occur in both X and Y and occur exactly once in each; Reword(X, Y) denotes the number of inversions among the adjacent words of that set. OrderSim(X, Y) is determined accordingly, with OrderSim(X, Y) ∈ [0, 1]. Defining word-order similarity this way has the advantage that when a clause or phrase is moved some distance as a whole, the sentence remains very similar to the original. The computation is also fast: its complexity is O(m), where m = |Onews(X, Y)|.
The similarity SentenceSim(X, Y) of sentences X and Y is determined by the following formula:

SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y)

where λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1; clearly SentenceSim(X, Y) ∈ [0, 1]. In sentence similarity, word similarity plays the main role while sentence-length similarity and word-order similarity play secondary roles, so the values should satisfy λ1 > λ2 > λ3; the current defaults are λ1 = 0.8, λ2 = 0.15, λ3 = 0.05. A threshold can further be set as the similarity condition: when the similarity of two sentences exceeds the threshold, the two are considered similar. In this embodiment the threshold is set to 0.9, and answers scoring above 0.9 are deemed similar.
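A runnable sketch of the similarity computation follows. The patent's formula images are not reproduced in this text, so each component below is a common reconstruction consistent with the stated definitions and ranges (each component lies in [0, 1]), not the patent's exact formulas; the weights are the defaults λ1 = 0.8, λ2 = 0.15, λ3 = 0.05.

```python
# Sketch of step 105's sentence similarity on pre-segmented word lists.
def word_sim(x, y):
    """WordSim: share of common words; a word occurring a different number
    of times in x and y is counted with the smaller count (SameWC)."""
    same = sum(min(x.count(w), y.count(w)) for w in set(x) & set(y))
    return 2.0 * same / (len(x) + len(y)) if x or y else 1.0

def len_sim(x, y):
    """LenSim: the closer the lengths, the higher the similarity."""
    if not x and not y:
        return 1.0
    return 1.0 - abs(len(x) - len(y)) / (len(x) + len(y))

def order_sim(x, y):
    """OrderSim: built from Onews (words occurring exactly once in each
    sentence) and the number of adjacent inversions of their positions."""
    onews = [w for w in x if x.count(w) == 1 and y.count(w) == 1]
    if len(onews) <= 1:
        return 1.0
    pos_in_y = [y.index(w) for w in onews]
    inversions = sum(a > b for a, b in zip(pos_in_y, pos_in_y[1:]))
    return 1.0 - inversions / (len(onews) - 1)

def sentence_sim(x, y, l1=0.8, l2=0.15, l3=0.05):
    """SentenceSim = l1*WordSim + l2*LenSim + l3*OrderSim, with l1+l2+l3 = 1."""
    return l1 * word_sim(x, y) + l2 * len_sim(x, y) + l3 * order_sim(x, y)

# Two segmented answers with identical words in a different order:
a = ["won't", "fade", "dear"]
b = ["dear", "won't", "fade"]
print(round(sentence_sim(a, b), 3))   # → 0.975
```

The example illustrates the weighting: identical word sets and lengths give full word and length similarity, while the reordering only dents the lightly weighted word-order term.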
Step 106: filter out answers whose similar-answer frequency is low.
The similar-answer frequency of an answer is the number of times it (or an answer similar to it) occurs among all answers in the historical consultation data after the similarity computation. Filtering uses a threshold R, which can be set as circumstances require (in this embodiment R may be set to 2); answers below R are regarded as low-frequency. For example, in Fig. 2 the answers "Dear, orders placed before 5 p.m. are delivered the same day; orders after 5 p.m. are sent the next day. Thanks." and "They won't fade, dear." each occur exactly 2 times.
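Step 106 can be sketched as clustering answers by similarity and dropping small clusters. `sim` below is any similarity function returning a score in [0, 1] (the weighted SentenceSim in the real method; an exact-match stand-in here); r = 0.9 and R = 2 are the embodiment's default thresholds.

```python
# Sketch of step 106: group answers into similar-answer clusters, then
# keep only clusters whose frequency reaches R (the high-quality answers).
def filter_low_frequency(answers, sim, r=0.9, R=2):
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if sim(ans, cluster[0]) >= r:     # similar to an existing cluster
                cluster.append(ans)
                break
        else:
            clusters.append([ans])            # otherwise start a new cluster
    return [c for c in clusters if len(c) >= R]

exact = lambda a, b: 1.0 if a == b else 0.0   # stand-in similarity
answers = ["won't fade, dear", "won't fade, dear", "ships the same day"]
print(filter_low_frequency(answers, exact))
```

Comparing each answer only against a cluster's first member keeps the sketch linear in the number of clusters; the patent leaves the exact grouping strategy open.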
Step 107: extract the questions corresponding to high-frequency answers.
What remains after the filtering of step 106 are the high-frequency answers, and the questions corresponding to a high-frequency answer are generally high-quality questions; they usually contain near-synonyms of one another. For example, for the answer "They won't fade, dear." in Fig. 2, the corresponding questions "Will this garment fade?" and "Will this baby fade?" need only be stored as a many-to-one FAQ entry, doing away with the huge near-synonym table.
Step 108: extract the core keyword sequence rules of the question set.
A question set is the group of questions corresponding to one answer; it contains some number of questions. A core keyword sequence rule is a set of core keywords contained in the question set together with a fixed word order. The method first segments all the questions into words, then uses statistics to extract the keywords whose frequency exceeds r2 (r2 is greater than 1 in this embodiment) and that obey a sequence rule.
For example, the answer in Fig. 2 "Dear, orders placed before 5 p.m. are delivered the same day; orders after 5 p.m. are sent the next day. Thanks." corresponds to the question set:

When can you deliver?
About when can you deliver to me?

After word segmentation and stop-word filtering, the segmentation results are:

when / you / can / deliver
about / when / can / deliver / me

From the segmented questions, the keywords whose frequency reaches 2 and that obey a sequence rule can be quickly counted: "when … deliver".
Step 109: store the knowledge.
Each knowledge-base entry is stored as a number, an answer, the questions, and the core keyword sequence rule; the relation between answers and questions is many-to-many, as is the relation between questions and core keyword sequence rules, as in Table 1 below:

Table 1
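One way the stored entries of step 109 could serve a new question is an ordered-subsequence match against the keyword sequence rule. The field names and the matching rule below are illustrative assumptions, not the patent's storage schema.

```python
# Sketch: return an entry's answer when the segmented question contains
# the entry's keyword sequence in order (an ordered-subsequence test).
def matches(words, sequence):
    it = iter(words)
    return all(k in it for k in sequence)     # each `in` consumes the iterator

def lookup(knowledge, segmented_question):
    for entry in knowledge:
        if matches(segmented_question, entry["keyword_sequence"]):
            return entry["answer"]
    return None

knowledge = [{
    "answer": "Orders before 5 p.m. are delivered the same day.",
    "questions": ["when can you deliver", "about when can you deliver to me"],
    "keyword_sequence": ["when", "deliver"],
}]
print(lookup(knowledge, ["so", "when", "do", "you", "deliver"]))
```

Because the keyword sequence, not the full question text, drives the match, new phrasings of an old question hit the stored answer without any near-synonym table.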
Step 110: end.
Through the steps above, a knowledge base can be constructed quickly from historical consultation data, and the knowledge base contains not only many-to-many FAQ entries but also the core keyword sequence sets. Given historical consultation data as a premise, this construction method can replace a traditional HowNet or near-synonym table and save a great deal of manual maintenance, letting technical staff construct knowledge bases rapidly.
The embodiments above build the knowledge base automatically with Chinese as the object language, but the method is also applicable to other languages. Likewise, the embodiment is described in terms of the e-commerce industry represented by Taobao, but the method is not limited to e-commerce and can be applied to other industries. Nor is the similarity computation of the present case limited to the algorithm above; other similarity-computation approaches may be used in its place.
A person of ordinary skill in the art can make various other corresponding changes and variations according to the technical solution and concept of the present invention, and all such changes and variations shall fall within the protection scope of the claims of the present invention.
Claims (6)
1. A method of building a knowledge base from historical consultation data, characterised by comprising the following steps:
1) reading the historical consultation data;
2) segmenting the data into consultation scenes;
3) extracting the question-answer pairs of each scene;
4) computing answer similarity;
5) filtering out answers whose similar-answer frequency is low;
6) extracting the questions corresponding to high-frequency answers;
7) extracting the core keyword sequence rules of the question set;
8) storing the knowledge;
wherein in said step 4), computing answer similarity means computing the similarity values among the answers of the question-answer pairs of all scenes: each answer is first segmented into words, stop words are then filtered out, and the similarity between every pair of answers is finally computed;
the similarity of said answers comprises word similarity, sentence-length similarity and word-order similarity, related by SentenceSim(X, Y) = λ1·WordSim(X, Y) + λ2·LenSim(X, Y) + λ3·OrderSim(X, Y), where SentenceSim(X, Y) denotes the similarity of answer X and answer Y, WordSim(X, Y) the similarity between the words of answer X and the words of answer Y, LenSim(X, Y) the similarity between the sentence lengths of answer X and answer Y, and OrderSim(X, Y) the similarity between the word orders of answer X and answer Y; λ1, λ2 and λ3 are constants satisfying λ1 + λ2 + λ3 = 1;
the formula for said WordSim(X, Y) is:
the formula for said LenSim(X, Y) is:
and the formula for said word-order similarity is:
where SameWC(X, Y) denotes the number of words answer X and answer Y have in common, Len(X) and Len(Y) denote the lengths of answer X and answer Y respectively, abs denotes taking the absolute value, Onews(X, Y) denotes the set of words that occur in both answer X and answer Y and occur exactly once in each, and Reword(X, Y) denotes the number of inversions among the adjacent words of that set;
wherein in said step 5), the similar-answer frequency refers to the proportion each answer accounts for in the whole historical consultation data; low-frequency answers are then filtered out by a threshold, and answers above the threshold are regarded as high-frequency, i.e. high-quality, answers.
2. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that in said step 2), the data are cut into scenes by consultant, yielding a plurality of consultation scenes each between a single customer-service agent and a single consultant.
3. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that in said step 3), the question-answer pairs are extracted according to the identities of the customer-service agent and the consultant: content spoken by the agent is taken as the answer, and content spoken by the consultant is taken as the question.
4. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that said step 6) looks up, for each high-frequency answer, the questions corresponding to that answer.
5. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that said step 7) uses statistics: the question set is segmented into words, and the keywords whose frequency exceeds a threshold and that obey a sequence rule are extracted.
6. The method of building a knowledge base from historical consultation data as claimed in claim 1, characterised in that the content stored in said step 8) comprises the answers, the questions and the core keyword sequence rules, and the relations among the answers, the questions and the core keyword sequence rules are many-to-many.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201310168964.2A (granted as CN103455535B) | 2013-05-08 | | The method building knowledge base based on historical consultation data
Publications (2)
Publication Number | Publication Date |
---|---|
CN103455535A (en) | 2013-12-18
CN103455535B (en) | 2016-11-30
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1928864A (en) * | 2006-09-22 | 2007-03-14 | 浙江大学 | FAQ based Chinese natural language ask and answer method |
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN102637192A (en) * | 2012-02-17 | 2012-08-15 | 清华大学 | Method for answering with natural language |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161130; termination date: 20190508
| RR01 | Reinstatement of patent right | Former decision: termination of patent right due to unpaid annual fee; former decision publication date: 20200424