CN105843897A

CN105843897A - Vertical domain-oriented intelligent question and answer system

Info

Publication number: CN105843897A
Application number: CN201610167602.5A
Authority: CN
Inventors: 张振峰; 于忠清; 刘晓强
Original assignee: Qingdao Haiersoft Co Ltd
Current assignee: QINGDAO PENGHAI SOFTWARE Co.,Ltd.
Priority date: 2016-03-23
Filing date: 2016-03-23
Publication date: 2016-08-10
Anticipated expiration: 2036-03-23
Also published as: CN105843897B

Abstract

The invention discloses a vertical domain-oriented intelligent question and answer system. The system comprises a question asking module (1), a preprocessing module (2), a word segmentation and vocabulary standardization module (3), a word purification module (4), a synonym expansion module (5), a vocabulary expansion or deletion module (6), a sentence similarity calculation module (7) and an answer output module (8). The system calculates the similarity of question sentences of a user through domain ontology construction and depends on a word segmentation technology, domain ontology construction and ontology similarity calculation. The system has the advantages that a question asking intention of the user can be understood more accurately by applying a domain ontology technology through a sentence similarity algorithm, the sentence similarity can be calculated, and the accuracy of the question and answer system can be improved.

Description

An intelligent question answering system for vertical domains

Technical field

The present invention relates to an intelligent question answering system for vertical domains, which significantly improves the accuracy of semantic analysis in a vertical domain.

Background technology

By implementation technique, question answering systems can be divided into: systems based on frequently asked questions (FAQ), systems based on information retrieval, systems based on question classification, and systems based on Resource Description Framework (RDF) queries.

An FAQ-based question answering system builds a set of FAQ question-answer pairs; its implementation depends on computing the similarity between the user's question and the questions in the FAQ. In developing an FAQ question answering system, the intent of the user's question must be identified and the similarity between two sentences computed in order to return a query result. The typical processing flow of existing FAQ question answering systems is: after preprocessing a sentence (word segmentation, stop-word removal, word standardization), build an inverted index table, then use the VSM or TF-IDF algorithm to compute the similarity between the word arrays of the two sentences.

In a question answering system based on information retrieval, the information source is typically documents on the network, and the returned answer is extracted directly from those documents.

A question answering system based on question classification usually builds a dedicated template for each class of question, which strengthens the understanding of the question and improves the accuracy of the system.

The core of a question answering system based on RDF queries (RDF, the Resource Description Framework, is a markup language for describing Web resources) is converting a natural language question into a standard RDF query language, usually SPARQL, the query language specified by the W3C. Words in the natural language question are mapped to classes, instances, or properties in the ontology.

However, when computing word similarity, the prior art includes similarity calculation methods based on HowNet, which lack sufficient semantic analysis for specialized vertical domains. Moreover, when computing sentence similarity, the prior art does not consider the weight of domain vocabulary, so the vocabulary of specialized vertical domains does not receive sufficient semantic analysis.

Explanation of the technical terms used in the present invention:

Domain ontology: a domain ontology provides the basic terms and relations that make up the vocabulary of a domain, together with the rules, built from these terms and relations, that define extensions of the vocabulary.

Word segmentation: word segmentation identifies the words in a sentence and tags each with its part of speech.

HowNet: HowNet is a fairly detailed semantic knowledge dictionary. It is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description, and whose basic content is the relations between concepts and between the attributes of concepts.

Inverted index table: a table built over words that records the positions of the questions in which each word occurs. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, it is called an inverted index.
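The inverted index table can be illustrated with a minimal Python sketch. The FAQ entries, the `build_inverted_index` helper, and the pre-segmented word lists are hypothetical and serve only to show the word-to-question mapping; they are not taken from the patent.

```python
from collections import defaultdict

def build_inverted_index(questions):
    """Map each word to the sorted list of question ids that contain it."""
    index = defaultdict(set)
    for qid, words in questions.items():
        for word in words:
            index[word].add(qid)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical FAQ questions, already segmented into words.
faq = {
    0: ["rice flour", "off-flavor", "reason"],
    1: ["milk powder", "place of origin"],
    2: ["rice flour", "storage"],
}
index = build_inverted_index(faq)
print(index["rice flour"])  # [0, 2]
```

Looking up a word returns the questions that contain it directly, without scanning every question.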

VSM: the vector space model (Vector Space Model) reduces the processing of text content to vector operations in a vector space; the similarity of two vectors is taken as the semantic similarity of the two sentences.

TF-IDF: the term frequency-inverse document frequency method builds on the VSM algorithm, determining the weight of each word from its frequency and then computing the similarity of two sentences.
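The two definitions above can be sketched together in Python: each segmented sentence becomes a TF-IDF weighted vector, and cosine similarity compares two vectors. The smoothed IDF variant, the function names, and the sample sentences are assumptions made for illustration; the patent does not specify which TF-IDF variant is used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: tf * idf} dict per segmented document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed idf (an assumed variant) so every word keeps a positive weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical segmented sentences.
docs = [
    ["rice flour", "off-flavor", "reason"],
    ["rice flour", "off-flavor", "handling"],
    ["milk powder", "place of origin"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Sentences that share no words score 0, and rarer shared words contribute more to the score than common ones.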

Summary of the invention

The present invention organically combines FAQ-based and RDF-query-based implementation techniques and proposes a new question answering processing flow, in order to strengthen the semantic analysis capability of the intelligent question answering system and improve the accuracy of the automatic question answering system.

The technical solution of the present invention is as follows: the invention computes the similarity of user questions by building a domain ontology, relying on word segmentation, domain ontology construction, and ontology-based similarity calculation.

The advantage of the present invention is that, through this sentence similarity algorithm, ontology technology is applied to understand the intent of the user's question more accurately and to compute sentence similarity, improving the accuracy of the question answering system.

Brief description of the drawings

Fig. 1 is a block diagram of the system of the present invention;

Fig. 2 is a flow chart of the basic working procedure of the present invention;

Fig. 3 is a schematic diagram of an embodiment of the class hierarchy of the ontology of the present invention;

Fig. 4 is a schematic diagram of the structure of a specific ontology property of the present invention;

Fig. 5 is a flow chart of one embodiment of the working procedure of the present invention;

Fig. 6 is a schematic diagram of the infant age classification structure in the ontology of the present invention.

Detailed description of the invention

Referring to Fig. 1, the intelligent question answering system for vertical domains of the present invention is based mainly on a computer system and comprises the following components:

(1) Question input module 1: used to input (pose) a question to the system. Keyboard input, voice input, handwriting (tablet) input, or input via an image acquisition device may be used.

(2) Preprocessing module 2: includes the vertical domain ontology (database); used to add the names of the classes, properties, and instances in the ontology to the word segmentation dictionary and tag them with the corresponding part of speech.

(3) Word segmentation and vocabulary standardization module 3: used to segment the question into words and standardize them, tagging each word with its part of speech and its classification in the ontology.

(4) Word purification module 4: used to remove stop words from the segmented word set, removing modal particles and greeting words that carry no practical meaning.

(5) Synonym expansion module 5: used to compile a Chinese synonym thesaurus for the vertical domain and expand word senses.

(6) Ontology expansion module 6: used to examine the segmented word set. If a word is in the ontology, the relations between the words are analyzed, words are added or deleted accordingly, and the weight of the word in the sentence is set; if a word is not in the ontology, it is handled by ordinary word similarity calculation.

(7) Sentence similarity calculation module 7: combining the word weights in the sentence described above, computes the sentence similarity between the candidate questions in the FAQ library and the user's question.

(8) Answer output module 8: used to output the answer to the question.

Referring to Fig. 2, the basic workflow of the present invention comprises:

(1) Preprocessing: build the vertical domain ontology, add the names of the classes, properties, and instances in the ontology to the word segmentation dictionary, and tag them with the corresponding part of speech.

(2) Segment the question into words and standardize them, tagging each word with its part of speech and its classification in the ontology.

(3) Remove stop words from the segmented word set, removing modal particles and greeting words that carry no practical meaning.

(4) Compile a Chinese synonym thesaurus for the vertical domain and expand word senses.

(5) Examine the segmented word set. If a word is in the ontology, analyze the relations between the words, add or delete words accordingly, and set the weight of the word in the sentence; if a word is not in the ontology, handle it by ordinary word similarity calculation.

(6) Combining the word weights in the sentence, compute the sentence similarity between the candidate questions in the FAQ library and the user's question.

(7) Output the answer: sort by similarity from high to low, and choose the answer of the most similar question as the answer.
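Steps (6) and (7) can be sketched as follows. The patent does not state the exact sentence similarity formula at this point, so the weighted cosine below is an assumption, as are the function names, the sample weights, and the placeholder answers.

```python
import math

def weighted_similarity(query, candidate, weights):
    """Cosine similarity over binary word vectors scaled by per-word weights.

    Words missing from `weights` default to 1.0 (ordinary vocabulary).
    """
    def w(word):
        return weights.get(word, 1.0)
    dot = sum(w(x) ** 2 for x in set(query) & set(candidate))
    nq = math.sqrt(sum(w(x) ** 2 for x in set(query)))
    nc = math.sqrt(sum(w(x) ** 2 for x in set(candidate)))
    return dot / (nq * nc) if nq and nc else 0.0

def best_answer(query, faq, weights):
    """Sort FAQ entries by similarity, highest first, and return the top answer."""
    ranked = sorted(faq, key=lambda qa: weighted_similarity(query, qa[0], weights), reverse=True)
    return ranked[0][1]

# Hypothetical domain-word weights and FAQ entries (segmented question, answer).
weights = {"Kao diapers": 1.6, "Japanese original": 1.6}
faq = [
    (["Kao diapers", "size"], "answer about sizes"),
    (["Kao diapers", "Japanese original", "have"], "answer about origin"),
]
print(best_answer(["Kao diapers", "Japanese original"], faq, weights))  # answer about origin
```

Because domain words carry larger weights, a candidate that matches them outranks one that matches only ordinary words.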

The system and workflow of the present invention are further described below with reference to Figs. 3 to 6.

1. Construction of the vertical domain ontology library:

The knowledge of the vertical domain is classified, and the relations between concepts and their attributes are analyzed, to achieve a representation of domain knowledge.

Classes, instances, and properties in the domain ontology: classes and instances are analogous to classes and objects in object-oriented programming; properties describe the relations between classes or instances.

In Fig. 4, "Place" is a class with "Suzhou" as an instance. There is also an instance of the Wyeth Gold series, "Wyeth_Promil Gold milk powder stage 2 400g", whose place of origin is Suzhou. The property "place of origin" connects the two instances.
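The class, instance, and property relations in this example can be written down as a minimal in-memory structure. This is a hypothetical sketch: the dictionary layout and the `objects` helper are assumptions, and a real system would more likely store the ontology in an RDF or OWL store.

```python
# Hypothetical in-memory representation of the Fig. 4 example.
PRODUCT = "Wyeth_Promil Gold milk powder stage 2 400g"

# instance -> class
instance_of = {
    "Suzhou": "Place",
    PRODUCT: "Gold series",
}

# (subject, predicate, object) triples; the property links two instances.
triples = [
    (PRODUCT, "place of origin", "Suzhou"),
]

def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(PRODUCT, "place of origin"))  # ['Suzhou']
```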

2. Calculation of word similarity in the ontology:

A word corresponds to a class, instance, or property in the ontology. All concepts form a directed graph. The distance between a parent class and its direct subclass is defined as 1, the distance between a class and its instance is 1, and the distance between a property and each of its domain and range is 1; the distance between words W1 and W2 is accumulated according to these definitions. W0 is the nearest common parent node of W1 and W2. The semantic similarity of the two words is then given by:

$$Sim(W_1, W_2) = \frac{2 \cdot Depth(W_0)}{Depth(W_1) + Depth(W_2)}$$

As in Fig. 3: with "Thing" as the root node at depth 0, the depth of "Wyeth_Promil Gold milk powder stage 2 400g" is 5 and the depth of "Wyeth_Progress Gold milk powder stage 3 400g" is 5. The depth of their nearest common parent node, "Gold series", is 4, so their similarity is $\frac{2 \times 4}{5 + 5} = 0.80$.
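A minimal sketch of this calculation, assuming the similarity takes the depth-based form 2·Depth(W0) / (Depth(W1) + Depth(W2)), which reproduces the 0.80 of the example. The parent map is a guess at the Fig. 3 hierarchy: the "Product" level at depth 1 is a placeholder, and only the depths quoted in the text are taken from the source.

```python
P2 = "Wyeth_Promil Gold milk powder stage 2 400g"
P3 = "Wyeth_Progress Gold milk powder stage 3 400g"

# Child -> parent links; "Product" is an assumed placeholder level.
PARENT = {
    "Product": "Thing",
    "milk powder": "Product",
    "Wyeth": "milk powder",
    "Gold series": "Wyeth",
    P2: "Gold series",
    P3: "Gold series",
}

def ancestors(node):
    """Return the path from `node` up to the root, starting with `node`."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth of `node`, with the root at depth 0."""
    return len(ancestors(node)) - 1

def nearest_common_parent(w1, w2):
    """First node on the path above `w2` that also lies on the path above `w1`."""
    seen = set(ancestors(w1))
    return next(n for n in ancestors(w2) if n in seen)

def depth_similarity(w1, w2):
    if w1 == w2:
        return 1.0  # also avoids 0/0 when both words are the root
    w0 = nearest_common_parent(w1, w2)
    return 2 * depth(w0) / (depth(w1) + depth(w2))

print(depth_similarity(P2, P3))  # 0.8
```

Words whose common parent sits deeper in the hierarchy score higher, so two products of the same series are more similar than two products that share only the "milk powder" class.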

Alternatively:

$$Sim(W_1, W_2) = \frac{\alpha}{Dist + \alpha}$$

where Dist is the distance from the words to their nearest common parent node W0, and α is an adjustable parameter equal to the distance from the common parent node at which the similarity of the two words is 0.5.

As in Fig. 3: set α = 1.6, with "Thing" as the root node at depth 0. For "Wyeth_Promil Gold milk powder stage 2 400g" and "Wyeth_Progress Gold milk powder stage 3 400g", the distance from each to the nearest common parent node "Gold series" is 1, so their similarity is:

$$\frac{1.6}{1 + 1.6} = 0.62$$
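A minimal sketch of the distance-based alternative, assuming the similarity has the form α / (Dist + α). That form matches both the description of α (similarity 0.5 when the distance equals α) and the 0.62 of the example; the function name is hypothetical.

```python
def distance_similarity(distance, alpha=1.6):
    """Similarity that decays with distance; equals 0.5 when distance == alpha."""
    return alpha / (distance + alpha)

print(round(distance_similarity(1), 2))  # 0.62
```

A larger α makes the similarity fall off more slowly as the distance to the common parent node grows.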

Finally, the candidates are sorted by similarity from high to low, and the answer corresponding to the first (most similar) question is chosen as the final answer to the question asked and is output by the answer output module.

3. Determination of word weights in the question:

Different words in the user's question carry different weights. For example, in the question "May I ask whether Kao diapers come as Japanese originals?", the weights of "Kao diapers" and "Japanese original" are higher than those of "may I ask", "have", "not have", and similar function words. The weights are determined as follows:

1) Maintain a stop-word list; words without semantic content, such as particles, are excluded and not counted in the sentence similarity calculation.

2) When a subclass appears adjacent to its parent class in the question, delete the parent class. Chinese often exhibits semantic repetition. For example, in "Wyeth milk powder" in Fig. 3, Wyeth is a subclass of milk powder; the information of the subclass covers that of the parent class and is more specific and detailed, so in this case only the information carried by the subclass needs to be considered.
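Rule 2) can be sketched as a filter over the segmented word list. The parent map and the `drop_adjacent_parents` helper are hypothetical; the sketch checks only the direct parent, as the rule is worded.

```python
# Hypothetical direct-parent map for the classes named in the text.
PARENT = {"Wyeth": "milk powder", "milk powder": "Thing"}

def drop_adjacent_parents(words):
    """Remove a word when a neighbouring word is one of its direct subclasses."""
    kept = []
    for i, word in enumerate(words):
        # The words immediately before and after position i.
        neighbours = words[max(i - 1, 0):i] + words[i + 1:i + 2]
        if not any(PARENT.get(n) == word for n in neighbours):
            kept.append(word)
    return kept

print(drop_adjacent_parents(["Wyeth", "milk powder", "price"]))  # ['Wyeth', 'price']
```

"milk powder" is dropped because its subclass "Wyeth" sits next to it; a parent class that appears elsewhere in the sentence, away from its subclass, is kept.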

3) Analyze the dependency relations between words: if W1 and W2 are in a modifier relation and correspond to a subject-predicate relation in the ontology, the corresponding object is added to the word set.

As in Fig. 4: in the subject-predicate-object triple shown in the ontology, "Wyeth_Promil Gold milk powder stage 2 400g - place of origin - Suzhou", "Wyeth_Promil Gold milk powder stage 2 400g" is the subject, "place of origin" is the predicate, and "Suzhou" plays the role of the object.

Example 1: for the question "Where is the place of origin of Wyeth_Promil Gold milk powder stage 2 400g?", the Chinese construction "B of A" has A modifying B, so "Suzhou" is added to the word set.
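Example 1 can be sketched as a lookup against the ontology triples. The `modifier_pairs` argument stands in for the output of a dependency parser, which is assumed here and not implemented; the names and data layout are hypothetical.

```python
PRODUCT = "Wyeth_Promil Gold milk powder stage 2 400g"

# (subject, predicate) -> object, taken from the Fig. 4 triple.
TRIPLES = {(PRODUCT, "place of origin"): "Suzhou"}

def expand_with_objects(words, modifier_pairs):
    """Append the ontology object for each (modifier, head) pair that matches a triple.

    `modifier_pairs` is assumed to come from a dependency parser.
    """
    expanded = list(words)
    for subject, predicate in modifier_pairs:
        obj = TRIPLES.get((subject, predicate))
        if obj is not None and obj not in expanded:
            expanded.append(obj)
    return expanded

words = [PRODUCT, "place of origin", "where"]
expanded = expand_with_objects(words, [(PRODUCT, "place of origin")])
print(expanded[-1])  # Suzhou
```

Adding the object lets the question match FAQ entries that mention "Suzhou" even though the user never typed it.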

4) Concepts in the domain ontology are the words most relevant to the system, and as the depth of a concept increases, the information it carries becomes more detailed. Therefore, words in the domain knowledge are weighted higher than ordinary words, and the weight of a word increases with its depth.

$$Weight_{W_0} = 1 + \alpha \cdot \frac{Depth(W_0)}{Depth_{max}}$$

where Depth(W0) is the depth of the concept W0, Depth_max is the maximum depth of the ontology, and α is an adjustable parameter that regulates the weight of a concept. Here α is set to 1, so the weight of a concept in the domain ontology lies between 1 and 2.

In Fig. 3, with "Thing" as the root node at depth 0, the depth of "Wyeth" is 3, that is, Depth(W0) = 3 and Depth_max = 5, so the weight of the word "Wyeth" is $Weight_{Wyeth} = 1 + \frac{3}{5} = 1.6$.
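A minimal sketch of the depth-based weight, assuming the weight has the form 1 + α · depth / maximum depth, which reproduces the 1.6 of the example with α = 1. The function and parameter names are hypothetical.

```python
def concept_weight(depth, max_depth, alpha=1.0):
    """Weight of an ontology concept; grows linearly with its depth."""
    return 1 + alpha * depth / max_depth

print(concept_weight(3, 5))  # 1.6
```

With α = 1 the root gets weight 1 and the deepest concepts get weight 2, so every ontology concept weighs at least as much as an ordinary word.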

5) Compute according to the character length of the word:

Example 2: the effect on semantic analysis of ontology information built for the relation between milk powder and infant age.

As in Fig. 6, a corresponding domain ontology is built from the stage number of the milk powder and the age of the infants it suits. The system can recognize that "four months" falls within the range "0-6 months", and thus finds the questions in the standard answers that contain "0-6 months". The content displayed on the answer output module is as follows:

As in Fig. 5, the user enters the question "Why does the rice flour have a rancid taste?". The system standardizes words while segmenting, standardizing "rancid taste" to "off-flavor". It removes stop words such as "have". It looks up the inverted index table in the database for questions containing [reason, rice flour, off-flavor], sorts the questions by the number of keywords they contain, and takes the first 15 as candidate questions. It then uses the VSM algorithm to compute, in turn, the similarity between [reason, rice flour, off-flavor] and the segmented, stop-word-free form of each of the 15 candidate questions, and sorts the results. The answer of the question with the highest similarity is returned.
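The Fig. 5 flow can be sketched end to end. The stop-word list, the standardization table (including the mapping of "why" to "reason"), the FAQ entries, and the function names are assumptions for illustration. Candidate retrieval uses a linear scan here in place of the inverted index table, and word segmentation is assumed to have been done already.

```python
import math
from collections import Counter

STOP_WORDS = {"have"}  # assumed stop-word list
STANDARD = {"rancid taste": "off-flavor", "why": "reason"}  # assumed standardization table

def normalize(words):
    """Standardize words, then drop stop words."""
    mapped = [STANDARD.get(w, w) for w in words]
    return [w for w in mapped if w not in STOP_WORDS]

def cosine(a, b):
    """VSM similarity between two word lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def answer(query_words, faq, top_k=15):
    """faq: list of (segmented question, answer) pairs."""
    keys = normalize(query_words)
    # Candidate retrieval: rank by number of shared keywords, keep the top_k.
    candidates = sorted(faq, key=lambda qa: len(set(keys) & set(qa[0])), reverse=True)[:top_k]
    # Re-rank the candidates with the vector space model and return the best answer.
    return max(candidates, key=lambda qa: cosine(keys, qa[0]))[1]

# Hypothetical FAQ library with placeholder answers.
faq = [
    (["rice flour", "off-flavor", "reason"], "A"),
    (["rice flour", "storage", "method", "time"], "B"),
    (["milk powder", "place of origin"], "C"),
]
print(answer(["why", "rice flour", "have", "rancid taste"], faq))  # A
```

The two-stage design keeps the more expensive vector comparison limited to a small candidate set, which is the role of the top-15 cutoff in the text.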

Claims

1. An intelligent question answering system for vertical domains, characterized in that it comprises the following components:

(1) a question input module: used to input a question to the system;

(2) a preprocessing module: includes the vertical domain ontology; used to add the names of the classes, properties, and instances in the ontology to the word segmentation dictionary and tag them with the corresponding part of speech;

(3) a word segmentation and vocabulary standardization module: used to segment the question into words, tagging each word with its part of speech and its classification in the ontology;

(4) a word purification module: used to remove stop words from the segmented word set, removing modal particles and greeting words that carry no practical meaning;

(5) a synonym expansion module: used to compile a Chinese synonym thesaurus for the vertical domain and expand word senses;

(6) an ontology expansion module: used to examine the segmented word set; if a word is in the ontology, the relations between the words are analyzed, words are added or deleted accordingly, and the weight of the word in the sentence is set; if a word is not in the ontology, it is handled by ordinary word similarity calculation;

(7) a sentence similarity calculation module: combining the said word weights in the sentence, computes the sentence similarity between the candidate questions in the FAQ library and the user's question;

(8) an answer output module: used to output the answer to the question.

2. The intelligent question answering system for vertical domains according to claim 1, characterized in that the question input module uses a keyboard, voice, handwriting, or an image acquisition device for input, and the answer output module uses a display, a speaker, or a printer.

3. The intelligent question answering system for vertical domains according to claim 1, characterized in that the workflow of the system comprises:

(1) preprocessing: build the vertical domain ontology, add the names of the classes, properties, and instances in the ontology to the word segmentation dictionary, and tag them with the corresponding part of speech;

(2) segment the question into words and standardize them, tagging each word with its part of speech and its classification in the ontology;

(3) remove stop words from the segmented word set, removing modal particles and greeting words that carry no practical meaning;

(4) compile a Chinese synonym thesaurus for the vertical domain and expand word senses;

(5) examine the segmented word set; if a word is in the ontology, analyze the relations between the words, add or delete words accordingly, and set the weight of the word in the sentence; if a word is not in the ontology, handle it by ordinary word similarity calculation;

(6) combining the word weights in the sentence, compute the sentence similarity between the candidate questions in the FAQ library and the user's question;

(7) output the answer: sort by similarity from high to low, and choose the answer corresponding to the most similar question as the answer to the question.