CN109947921A

CN109947921A - An intelligent question answering system based on natural language processing

Info

Publication number: CN109947921A
Application number: CN201910207884.0A
Authority: CN
Inventors: 陈婧怡; 陈慧萍; 杜鹏; 丁翰雯
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2019-06-28
Anticipated expiration: 2039-03-19
Also published as: CN109947921B

Abstract

The invention discloses an intelligent question answering system based on natural language processing, comprising a knowledge base construction module, a question-answer pair management module, and a question-answer matching module. The knowledge base construction module includes a document preprocessing module, a document structure tree construction module, and a question-answer pair construction module. The question-answer pair management module includes a task management module, a document management module, a keyword management module, and a question-answer pair operation module. The question-answer matching module matches the question asked by a user against the question-answer pairs created by the knowledge base generation module. The invention extracts as many high-quality question-answer pairs as possible from documents, thereby improving the retrieval efficiency and accuracy of the knowledge base.

Description

An intelligent question answering system based on natural language processing

Technical field

The invention belongs to the technical field of intelligent customer service, and in particular relates to an intelligent question answering system based on natural language processing.

Background technique

With the widespread adoption of personal computers and the rapid development of the Internet, more and more messages and data are published as electronic documents over the Hypertext Transfer Protocol. As a result, the speed and capability of data retrieval face a huge challenge. How to obtain the information a user needs, accurately and in time, from a vast sea of information has become a major problem in the development of the Internet.

Search engine technology is a relatively mature information retrieval technology. However, with the explosive growth of Internet data, the shortcomings of search engines have gradually become apparent. Traditional search engines such as Baidu, Google, and Bing can usually accept only keywords as input. Ordinary users often find it difficult to condense their query intent into a few keywords that state it accurately. In addition, what a search engine returns is not a concise, accurate answer but a list of web page snippets. These snippets usually contain a large amount of noise, and the user still has to read them, or even the corresponding original pages, to find the answer needed.

To improve the user experience of information retrieval, researchers began to study question answering systems that take natural language directly as input and output, so that users can express a query in natural language by text or voice. After understanding the user's query intent, the question answering system performs a series of retrieval, analysis, and processing steps and returns an accurate answer, stated in natural language, directly to the user. For users, a question answering system is therefore a more convenient, friendly, and accurate service.

For enterprises that staff human customer service, a question answering system can save a great deal of manpower, and it is also more stable and efficient. For example, the traditional customer service channels of China Mobile include transfer to a human agent via 10086 and staffed service windows in business halls. These channels incur costs such as communication fees, training fees, and human resources, and are constrained by time (24-hour service cannot be provided) and place (centralized customer service offices). As the number of enterprise customers grows, the huge demand for consultation often overwhelms the customer service team.

Therefore, amid the trend toward modernized, informatized, and intelligent business, intelligent question answering systems emerged.

At present, English question answering systems abroad mainly include the START system developed at the Massachusetts Institute of Technology, the AnswerBus system developed at the University of Michigan, the AskMSR system from Microsoft, and the AskJeeves question answering system. Beyond the systems represented by English, there are also efforts such as the cross-language question answering evaluation CLEF.

Compared with the progress of question answering systems abroad, research in China started late. Work on Chinese question answering systems began only after 1970, and it was not until 1980 that a language institute of the Chinese Academy of Sciences developed China's first Chinese-based human-machine dialogue system. Currently, Tsinghua University, Fudan University, Beijing Language and Culture University, and others have achieved many results in Chinese natural language research. Examples include the EasyNav campus navigation system of Tsinghua University and a question answering system about character relationships in A Dream of Red Mansions developed by a computing institute of the Chinese Academy of Sciences.

The knowledge base is one of the keys to the competitiveness of an intelligent question answering system, and building a high-quality knowledge base has always been one of the industry's hard problems. Building a knowledge base manually in the traditional way is time-consuming and laborious, and its coverage is narrow. At present, unstructured data is generally converted into a structured knowledge graph for storage, which requires substantial human resources and technical support; knowledge graph storage is inflexible and structurally complex, and the efficiency and accuracy of knowledge base queries are not high enough. As a result, existing intelligent question answering systems can answer only a small number of common questions and cannot answer precisely. There is therefore an urgent need for an automated scheme that builds a high-quality knowledge base automatically from given documents (such as product manuals, case documents, and user guides), making the question answering system more intelligent.

Summary of the invention

To solve the problems in the prior art, the present invention provides an intelligent question answering system based on natural language processing that extracts as many high-quality question-answer pairs as possible from documents, thereby improving the retrieval efficiency and accuracy of the knowledge base.

The technical problem addressed by the present invention is solved by the following technical solution:

An intelligent question answering system based on natural language processing, comprising a knowledge base construction module, a question-answer pair management module, and a question-answer matching module; the knowledge base construction module includes a document preprocessing module, a document structure tree construction module, and a question-answer pair construction module; the question-answer pair management module includes a task management module, a document management module, a keyword management module, and a question-answer pair operation module; the question-answer matching module is used to match the question asked by a user against the question-answer pairs created by the knowledge base generation module.

Further, the document preprocessing module is used to filter useless information out of the documents, and the filtering process includes:

Filtering the useless information in the received documents with regular expressions and outputting file set OUT1;

Removing the repeated parts in file set OUT1 with a longest common subsequence algorithm to obtain file set OUT2;

Classifying file set OUT2 at a set granularity and removing the part common to the documents in each class to obtain file set OUT3, which contains the table of contents and the body text;

Classifying file set OUT3 with a longest common substring algorithm and removing the part common to the documents in each class to obtain body-text set OUT4.
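The first filtering step can be illustrated with a minimal Python sketch. The patterns and the name `filter_noise` are assumptions made for illustration; the text only names regular expressions as the tool, and the embodiment lists links, CSS, JS scripts, comments, and empty tag pairs as the noise to remove. "Links" is read here as `<link>` tags, which is also an assumption.

```python
import re

# Hypothetical patterns: the source names the noise categories, not the expressions.
NOISE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),                                # comments
    re.compile(r"<script\b.*?</script\s*>", re.DOTALL | re.IGNORECASE),  # JS scripts
    re.compile(r"<style\b.*?</style\s*>", re.DOTALL | re.IGNORECASE),    # CSS
    re.compile(r"<link\b[^>]*>", re.IGNORECASE),                         # <link> tags
]
# An opening tag immediately followed by its own closing tag.
EMPTY_PAIR = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")


def filter_noise(html: str) -> str:
    """Return the HTML with the noise categories above removed."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    # Removing one empty pair can expose another, so repeat until stable.
    while True:
        cleaned = EMPTY_PAIR.sub("", html)
        if cleaned == html:
            return cleaned
        html = cleaned
```

Regular expressions are fragile on malformed HTML, so a sketch like this is best treated as a first pass before the later de-duplication steps.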

Further, the document structure tree construction module is used to construct a document structure tree, and the construction process includes:

1) parsing the HTML source code of the body text and building an HTML tree by depth-first traversal;

2) adjusting the structure of the HTML tree so that the leaf nodes of the tree directly form the answer part of a question-answer pair, generating the document structure tree;

3) traversing the document structure tree depth-first to generate the question keyword structure tree.

The rules for generating the question keyword structure tree are as follows:

A) a leaf node is reached during traversal;

B) a child node contains punctuation indicating a complete sentence;

C) a child node has branches, and the following criteria are met:

C1) the child nodes are semantically similar;

C2) the subtrees of the child nodes have the same structure.
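Step 1) above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the classes `Node` and `TreeBuilder` and the helpers `build_tree` and `leaf_texts` are hypothetical names, the parser visits tags in document order (which corresponds to a depth-first, pre-order walk), and the structure adjustment of step 2) and rules A) to C) are not implemented.

```python
from html.parser import HTMLParser


class Node:
    """One element of the HTML tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    # Tags that never have a closing tag, so the builder must not descend into them.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_texts(node):
    """Depth-first list of the text held by leaf nodes (answer candidates)."""
    if not node.children:
        return [node.text] if node.text else []
    texts = []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts
```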

Further, the question-answer pair construction module is used to construct question-answer pairs, and the construction process includes:

1) the question-answer pair construction module traverses the resulting document structure tree depth-first, takes the keyword set on each resulting path as the candidate question keywords, and traverses the parent of each leaf node, removing the parent information, to form the answer, generating a set of keyword-group and answer pairs;

2) after a question is generated, when the question-answer pair is built, the pair is discarded if any of the keyword, the question, or the answer is empty;

3) duplicate questions are removed to obtain preliminary question-answer pairs, with the root node as the keyword; if the keyword does not match the question, a keyword for the pair is generated by word segmentation and named entity extraction;

4) when a pure question sentence is encountered during traversal, it does not enter the question generation flow; the sentence is used directly as the question and its subordinate nodes as the answer to form a question-answer pair, and named entity extraction is applied to the question to produce the keyword.
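Steps 1) to 3) can be sketched in Python as a depth-first walk that turns each root-to-leaf path into one pair. The sketch makes several assumptions that are not in the source: the document structure tree is modeled as a nested `dict` whose leaf values are the answer text, the question is a plain join of the path keywords (a placeholder for the template-based generation), and the function name `build_pairs` is hypothetical. Step 4) and named entity extraction are omitted.

```python
def build_pairs(tree):
    """Walk a nested dict depth-first and return question-answer pairs.

    Inner dict keys are headings (keywords); leaf values are answer text.
    """
    pairs = []
    seen_questions = set()

    def walk(node, path):
        for title, child in node.items():
            keywords = path + [title]
            if isinstance(child, dict):
                walk(child, keywords)
                continue
            # Placeholder question: the real system applies semantic templates.
            question = " ".join(keywords).strip()
            answer = (child or "").strip()
            # Step 2): discard the pair if any part is empty.
            if not keywords[0] or not question or not answer:
                continue
            # Step 3): drop duplicate questions; the root node is the keyword.
            if question in seen_questions:
                continue
            seen_questions.add(question)
            pairs.append(
                {"keyword": keywords[0], "question": question, "answer": answer}
            )

    walk(tree, [])
    return pairs
```

A nested `dict` keeps the example short; a tree built from real HTML would need its own node type.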

Further, a question is generated as follows: Chinese word segmentation is performed on the question keyword structure tree and custom lexicons are built, and then questions are generated with a semantic template method. The question keyword structure tree is produced by removing the leaf nodes from the document structure tree. First, it is determined whether a subtree node contains a keyword from a custom lexicon; if the node contains or exactly matches it, the word is deleted. Next, it is determined whether the subtree node contains a keyword from the verb lexicon or the attribute modifier lexicon, and a syntactic transformation is applied by category to produce the final question.

Further, the task management module is used to manage task publication and task status monitoring; the document management module is used to manage file upload, file decompression, and document group queries; the question-answer pair operation module is used to manage the addition, deletion, modification, and query of question-answer pairs.

Further, question matching by the question-answer matching module includes:

Receiving the user's question Q1;

Performing an inverted-index lookup on Q1 using the keyword set;

Finding the longest common subsequence of Q1 and each keyword in the question-answer pairs;

Calculating the matching rate between Q1 and each keyword in the keyword library, and taking the keyword with the highest matching rate as the keyword of Q1;

Using the keyword of Q1 to index the set of questions in the database that have the same keyword;

Computing the short-text similarity between Q1 and each question in the question set, and returning the answer of the question with the highest similarity to the user as the answer to Q1.

Further, keyword extraction includes: traversing all document titles and computing the frequency of every separator; choosing the most frequent special character as the separator, splitting the keywords with it, and generating a word-frequency map; filtering out the phrases with higher frequency; and then segmenting the question and extracting its nouns or gerunds as the keywords of the question.
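The separator-based part of keyword extraction can be sketched as follows. The candidate separator set, the frequency cutoff `max_count`, and the function names are all assumptions introduced for illustration; the source does not specify them. The final segmentation and part-of-speech step needs a Chinese NLP toolkit and is left out.

```python
from collections import Counter

# Hypothetical set of characters that may act as title separators.
CANDIDATE_SEPARATORS = "-_|/>:"


def split_titles(titles):
    """Pick the most frequent separator and split every title with it."""
    counts = Counter(
        ch for title in titles for ch in title if ch in CANDIDATE_SEPARATORS
    )
    if not counts:
        parts = [[title.strip()] for title in titles]
        return None, parts, Counter(p for ps in parts for p in ps)
    separator = counts.most_common(1)[0][0]
    parts = [
        [piece.strip() for piece in title.split(separator) if piece.strip()]
        for title in titles
    ]
    phrase_freq = Counter(piece for pieces in parts for piece in pieces)
    return separator, parts, phrase_freq


def drop_frequent(parts, phrase_freq, max_count):
    """Filter out phrases that occur more than max_count times across titles."""
    return [[p for p in pieces if phrase_freq[p] <= max_count] for pieces in parts]
```

Phrases that appear in almost every title, such as a site-wide prefix, carry little information about a single document, which is presumably why the high-frequency phrases are filtered.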

The beneficial effects include:

(1) Highly automated: after the user uploads documents, the whole process, from parsing and extracting the body text to building question-answer pairs and completing the knowledge base, can be fully automated.

(2) Flexible storage: information in existing knowledge bases is usually stored in structured form, which is hard to extend. This method instead stores information as question-answer pairs, which are easy to extend, store, retrieve, and query, and can be exported directly as an FAQ (frequently asked questions and their answers).

(3) High extraction accuracy: question-answer pairs are extracted from the document structure tree, so as long as the structure tree is of high quality, the extraction accuracy can in theory reach 100%.

(4) Efficient knowledge base queries: question-answer matching first looks up the keyword and then looks up the questions under that keyword, which greatly improves query efficiency.

Brief description of the drawings

Fig. 1 is a structural diagram of the present invention;

Fig. 2 is a system architecture diagram of the present invention;

Fig. 3 is a workflow diagram of the present invention.

Specific embodiment

To further describe the technical features and effects of the present invention, the invention is described further below with reference to the drawings and specific embodiments.

By analyzing the input documents, the present invention constructs a document structure tree and extracts as many high-quality question-answer pairs as possible from the documents, automating rule-based question-answer generation. This provides a reliable solution for building and managing a knowledge base conveniently and efficiently, greatly improves the retrieval efficiency and accuracy of the knowledge base, and promotes more efficient and widespread use of intelligent question answering systems.

As shown in Figs. 1-3, an intelligent question answering system based on natural language processing comprises a knowledge base construction module, a question-answer pair management module, and a question-answer matching module; the knowledge base construction module includes a document preprocessing module, a document structure tree construction module, and a question-answer pair construction module; the question-answer pair management module includes a task management module, a document management module, a keyword management module, and a question-answer pair operation module; the question-answer matching module is used to match the question asked by a user against the question-answer pairs created by the knowledge base generation module.

After the knowledge base is created, it is stored in a database in the form of question-answer pairs. The back end of the invention uses a Tomcat server and a MySQL database, and the front end can be presented on a PC or a mobile phone, switching freely between the two.

In practice, the operator sends a ZIP archive of the files to be parsed to the intelligent question answering system. The system decompresses the archive and passes the path of the decompressed files and the task ID to the document preprocessing module. The document preprocessing module extracts the body text from the files, the body text is built into a document structure tree according to the rules, and the question-answer pair construction module traverses the document structure tree, extracts keywords, builds question-answer pairs, and finally stores them in the database.
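The intake step can be illustrated with a short Python sketch built on the standard `zipfile` module. The function name `unpack_task`, the per-task directory layout, and the choice to collect only `.html` and `.htm` files are assumptions for illustration; the source states only that the archive is decompressed and that the file path and task ID are handed to the preprocessing module.

```python
import zipfile
from pathlib import Path


def unpack_task(zip_path, work_dir, task_id):
    """Extract an uploaded ZIP into a per-task folder and list its HTML files."""
    target = Path(work_dir) / str(task_id)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    html_files = sorted(
        path for path in target.rglob("*") if path.suffix.lower() in {".html", ".htm"}
    )
    # The task ID and paths are what a preprocessing step would receive next.
    return task_id, target, html_files
```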

Specifically, when building question-answer pairs, the documents must first be preprocessed, that is, the useful information must first be extracted from the original HTML files. The original files contain a large amount of interfering information. To remove it, regular expressions are first used to filter the useless information in the received documents (mainly links, CSS, JS scripts, comments, and empty tag pairs), outputting file set OUT1;

Then, the repeated parts in file set OUT1 are removed with a longest common subsequence algorithm to obtain file set OUT2;

Next, file set OUT2 is classified at a set granularity and the part common to the documents in each class is removed to obtain file set OUT3, which contains the table of contents and the body text;

Finally, file set OUT3 is classified with a longest common substring algorithm and the part common to the documents in each class is removed to obtain body-text set OUT4.

(Note: the longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (usually two). A sequence that is a subsequence of two or more given sequences, and is the longest of all sequences satisfying this condition, is called the longest common subsequence of the given sequences.)
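The two string algorithms named in the preprocessing steps can be sketched in Python with textbook dynamic programming. These are generic implementations offered for illustration, not the patent's own code; how the system applies them to whole file sets is not specified in the text, and the function names are hypothetical.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Longest sequence of characters appearing in both strings in order."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    out = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def longest_common_substring(a: str, b: str) -> str:
    """Longest run of consecutive characters appearing in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

The difference matters for de-duplication: a subsequence may skip characters, while a substring must be contiguous, so the substring variant is the stricter test for shared boilerplate such as a common page header.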

After preprocessing is complete, construction of the document structure tree begins, including:

1) parsing the HTML source code of the body text (OUT4) and building an HTML tree by depth-first traversal;

2) adjusting the structure of the HTML tree so that the leaf nodes of the tree directly form the answer part of a question-answer pair, generating the document structure tree; (Because the documents may contain mistakes introduced during authoring, or may not have been authored according to the intended display conventions, a small fraction of the generated document structure trees may be inaccurate, and the question-answer pairs generated from them require manual review and filtering.)

A) a leaf node is reached during traversal;

C) a child node has branches, and the following criteria are met:

C1) the child nodes are semantically similar; (judged through Baidu's short-text similarity interface)

C2) the subtrees of the child nodes have the same structure.

Next, question-answer pairs are built on this basis, specifically:

2) after a question is generated, when the question-answer pair is built, the pair is discarded if any of the keyword, the question, or the answer is empty;

The method of generating a question is as follows:

Chinese word segmentation is performed on the question keyword structure tree and custom lexicons are built, and then questions are generated with a semantic template method. The question keyword structure tree is produced by removing the leaf nodes from the document structure tree. First, it is determined whether a subtree node contains a keyword from the custom lexicons ACML or BCML; if the node contains or exactly matches it, the word is deleted. Next, it is determined whether the subtree node contains a keyword from the verb lexicon VL or the attribute modifier lexicon AL, and a syntactic transformation is applied by category to generate the question.

The lexicons ACML, BCML, VL, and AL are built by performing Chinese word segmentation with Stanford CoreNLP (an open-source word segmentation toolkit from Stanford University) and then manually screening the words within a certain threshold range as the content of the corresponding lexicon.

The detailed method for generating the question sentence is as follows:

S0. For each node of the question keyword structure tree, perform Chinese word segmentation with Stanford CoreNLP, then manually screen the words within a certain threshold range and build the custom lexicons: the Class A meaningless lexicon (ACML), the Class B meaningless lexicon (BCML), the verb lexicon (VL), and the attribute modifier lexicon (AL). The Class A meaningless lexicon contains words such as "user guide", "welcome to use", and "understanding"; when a node contains such a word, the redundant part must be removed by deleting the word. The Class B meaningless lexicon contains words such as "Help Center" and "welcome to download"; the whole node contributes nothing to question generation, and the entire node must be deleted.

S1. Set the valid question keyword node granularity to 4 (the value must be greater than 2) and select the first subtree.

S2. Pruning: traverse each node of the subtree. If the node contains Chinese punctuation or a word from the Class A meaningless lexicon ACML (such as "Help Center" or "user guide"), delete the node directly. If the node contains a word from the Class B meaningless lexicon BCML (such as "user guide", "welcome to use", or "understanding"), keep the node and delete the word. In all other cases, do nothing.

S3. Branch cutting: determine whether the depth of the subtree obtained after pruning in S2 is greater than the valid question keyword node granularity. If it is greater, return null and go to S8; otherwise continue with S4.

S4. Branch on the subtree depth: if the depth is 1, execute S5; if the depth is 2, execute S6; otherwise execute S7.

S5. Perform syntactic analysis on the current subtree. If the words in the node include a word from the verb lexicon VL, the generated question structure Stc51 is:

"how" + <VL> + <the node with the verb removed, word order unchanged>

Otherwise, the generated question structure Stc52 is:

<node 1> + "what is"

Go to S8;

S6. The generated question structure Stc6 is: <node 1> + " " + <node 2> + "what is"

Go to S8;

S7. Determine whether the last node is a word contained in the attribute lexicon AL (such as "normal" or "abnormal"). If it is, the generated question structure Stc71 is:

<node 1> + <node 2> + ... + <node (length-2)> + " " + <AL> + <node (length-1)> + "what is"

Otherwise, the generated question structure Stc72 is:

<node 1> + <node 2> + ... + <node (length-1)> + " " + <node (length)> + "what is"

Go to S8;

S8. If the next subtree is not empty, select the next subtree and go to S2; otherwise the algorithm is complete and exits.

Note: the lexicon names are defined as follows:

Class A meaningless lexicon: A Class of Meaningless Lexicon (ACML)

Class B meaningless lexicon: B Class of Meaningless Lexicon (BCML)

Verb lexicon: Verb Lexicon (VL)

Attribute lexicon: Attribute Lexicon (AL)
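Steps S2 to S7 can be sketched in Python for a single path of the question keyword structure tree. Several things here are assumptions rather than the patent's method: a subtree is modeled as a list of node strings and its depth as the list length, the two pruning lexicons are passed in under the neutral names `drop_node_words` (the node is deleted) and `strip_words` (only the word is deleted), the template literals are the translated strings "how" and "what is", pieces are joined with a single space, and a path that is empty after pruning returns `None`.

```python
HOW = "how"          # literal used by template Stc51
WHAT_IS = "what is"  # literal used by templates Stc52, Stc6, Stc71, Stc72
CJK_PUNCTUATION = "，。！？；：、"


def prune(path, drop_node_words, strip_words):
    """S2: drop nodes with punctuation or drop-words; strip words elsewhere."""
    kept = []
    for node in path:
        if any(ch in CJK_PUNCTUATION for ch in node):
            continue
        if any(word in node for word in drop_node_words):
            continue
        for word in strip_words:
            node = node.replace(word, "")
        node = node.strip()
        if node:
            kept.append(node)
    return kept


def generate_question(path, verbs=(), attributes=(), drop_node_words=(),
                      strip_words=(), granularity=4):
    nodes = prune(path, drop_node_words, strip_words)
    # S3: give up when the pruned path is deeper than the granularity.
    if not nodes or len(nodes) > granularity:
        return None
    if len(nodes) == 1:  # S5
        node = nodes[0]
        verb = next((v for v in verbs if v in node), None)
        if verb is None:
            return f"{node} {WHAT_IS}"                             # Stc52
        rest = node.replace(verb, "", 1).strip()
        return " ".join(part for part in (HOW, verb, rest) if part)  # Stc51
    if len(nodes) == 2:  # S6
        return f"{nodes[0]} {nodes[1]} {WHAT_IS}"                   # Stc6
    if nodes[-1] in attributes:  # S7, last node is an attribute word
        return " ".join(nodes[:-2] + [nodes[-1], nodes[-2], WHAT_IS])  # Stc71
    return " ".join(nodes + [WHAT_IS])                              # Stc72
```

For example, `generate_question(["router", "indicator", "abnormal"], attributes=["normal", "abnormal"])` moves the attribute word in front of the second-to-last node, following Stc71. The word order mirrors the translated templates and would read naturally only with the original Chinese literals.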

The process of matching a question asked by the user against the question-answer pairs in the knowledge base is as follows:

S1. Receive the user's question Q1;

S2. Perform an inverted-index lookup on Q1 using the keyword set resident in memory;

S2.1 Find the longest common subsequence (LCS, Longest Common Subsequence) of the question and each keyword t;

S2.2 Calculate the matching rate between the question and each keyword in the keyword library as (character length of lcs(t, Q1)) / (character length of t), and take the keyword with the maximum matching rate as the keyword of Q1;

S3. Use the keyword of Q1 to index the set of questions in the knowledge base that have the same keyword;

S4. Compute the short-text similarity between Q1 and each question in the set, and return the answer of the question with the highest similarity to the user as the answer.
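The matching flow S1 to S4 can be sketched in Python as below. The keyword index is modeled as a `dict` from keyword to a list of (question, answer) tuples, which is an assumption about the storage layout. The embodiment relies on Baidu's short-text similarity interface for semantic similarity; this sketch swaps in `difflib.SequenceMatcher` from the standard library as a purely lexical stand-in, so its scores are not comparable to the real service. Returning `None` when no keyword matches at all is also an assumption.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == other else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_rate(keyword: str, question: str) -> float:
    """S2.2: len(lcs(t, Q1)) / len(t)."""
    return lcs_length(keyword, question) / len(keyword) if keyword else 0.0


def answer(question, index):
    """index maps keyword -> list of (stored_question, answer) tuples."""
    if not index:
        return None
    # S2: pick the keyword with the highest matching rate.
    keyword = max(index, key=lambda k: match_rate(k, question))
    candidates = index[keyword]
    if match_rate(keyword, question) == 0 or not candidates:
        return None
    # S3 and S4: compare only against questions stored under that keyword.
    best = max(
        candidates,
        key=lambda qa: SequenceMatcher(None, qa[0], question).ratio(),
    )
    return best[1]
```

Restricting the similarity comparison to one keyword's questions is what keeps the lookup cheap: the expensive pairwise comparison runs over a small bucket rather than the whole knowledge base.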

The present invention has a wide range of applications and can provide highly automated intelligent customer service systems for all industries. Take company information queries. Externally, finding information on a typical corporate website means navigating many levels of menus, which is tedious and makes it hard for visitors to get the information they need in time. Internally, when employees need information from another department or want to look up a question about their own department, they can only browse materials and documents, which is time-consuming and inconvenient. With this system, a company can answer customers' information needs accurately, saving website visitors query time and bringing the company more potential customers; it can also build a knowledge base from internal documents and materials so that employees can learn and look things up easily. In healthcare, patients often do not know enough about a hospital to get treated in time, do not know which department or specialist to register with, and have questions about medication. The system can provide medical consultation, giving patients timely information about the hospital and about medication, making visits more convenient, helping keep order in the hospital, and reducing the misunderstandings and conflicts caused by poor communication. In distance education, building a knowledge base for an education platform is costly and, when done purely by hand, time-consuming and laborious. The system can build a knowledge base for the platform more conveniently, making it easier for students to acquire specialized subject knowledge.

The above embodiments do not limit the present invention in any form; all technical solutions obtained by equivalent substitution or equivalent transformation fall within the scope of protection of the present invention.

Claims

1. An intelligent question answering system based on natural language processing, characterized by comprising a knowledge base construction module, a question-answer pair management module, and a question-answer matching module; the knowledge base construction module includes a document preprocessing module, a document structure tree construction module, and a question-answer pair construction module; the question-answer pair management module includes a task management module, a document management module, a keyword management module, and a question-answer pair operation module; the question-answer matching module is used to match the question asked by a user against the question-answer pairs created by the knowledge base generation module.

2. The intelligent question answering system based on natural language processing according to claim 1, characterized in that the document preprocessing module is used to filter useless information out of the documents, and the filtering process includes:

Classifying file set OUT2 at a set granularity and removing the part common to the documents in each class to obtain file set OUT3, which contains the table of contents and the body text;

Classifying file set OUT3 with a longest common substring algorithm and removing the part common to the documents in each class to obtain body-text set OUT4.

3. The intelligent question answering system based on natural language processing according to claim 2, characterized in that the document structure tree construction module is used to construct a document structure tree, and the construction process includes:

2) adjusting the structure of the HTML tree so that the leaf nodes of the tree directly form the answer part of a question-answer pair, generating the document structure tree;

4. The method for automatically constructing question-answer pairs based on a document structure tree according to claim 3, characterized in that the rules for generating the question keyword structure tree are as follows:

A) a leaf node is reached during traversal;

C) a child node has branches, and the following criteria are met:

C1) the child nodes are semantically similar;

C2) the subtrees of the child nodes have the same structure.

5. The intelligent question answering system based on natural language processing according to claim 3, characterized in that the question-answer pair construction module is used to construct question-answer pairs, and the construction process includes:

1) the question-answer pair construction module traverses the resulting document structure tree depth-first, takes the keyword set on each resulting path as the candidate question keywords, and traverses the parent of each leaf node, removing the parent information, to form the answer, generating a set of keyword-group and answer pairs;

2) after a question is generated, when the question-answer pair is built, the pair is discarded if any of the keyword, the question, or the answer is empty;

3) duplicate questions are removed to obtain preliminary question-answer pairs, with the root node as the keyword; if the keyword does not match the question, a keyword for the pair is generated by word segmentation and named entity extraction;

4) when a pure question sentence is encountered during traversal, it does not enter the question generation flow; the sentence is used directly as the question and its subordinate nodes as the answer to form a question-answer pair, and named entity extraction is applied to the question to produce the keyword.

6. The intelligent question answering system based on natural language processing according to claim 5, characterized in that a question is generated as follows: Chinese word segmentation is performed on the question keyword structure tree and custom lexicons are built, and then questions are generated with a semantic template method; the question keyword structure tree is produced by removing the leaf nodes from the document structure tree; first, it is determined whether a subtree node contains a keyword from a custom lexicon, and if the node contains or exactly matches it, the word is deleted; next, it is determined whether the subtree node contains a keyword from the verb lexicon or the attribute modifier lexicon, and a syntactic transformation is applied by category to produce the final question.

7. The intelligent question answering system based on natural language processing according to claim 1, characterized in that the task management module is used to manage task publication and task status monitoring; the document management module is used to manage file upload, file decompression, and document group queries; the question-answer pair operation module is used to manage the addition, deletion, modification, and query of question-answer pairs.

8. The intelligent question answering system based on natural language processing according to claim 1, characterized in that question matching by the question-answer matching module includes:

Receiving the user's question Q1;

Performing an inverted-index lookup on Q1 using the keyword set;

Calculating the matching rate between Q1 and each keyword in the keyword library, and taking the keyword with the highest matching rate as the keyword of Q1;

Computing the short-text similarity between Q1 and each question in the question set, and returning the answer of the question with the highest similarity to the user as the answer to Q1.

9. The intelligent question answering system based on natural language processing according to claim 3, characterized in that keyword extraction includes: traversing all document titles and computing the frequency of every separator; choosing the most frequent special character as the separator, splitting the keywords with it, and generating a word-frequency map; filtering out the phrases with higher frequency; and then segmenting the question and extracting its nouns or gerunds as the keywords of the question.