CN113157885A - Efficient intelligent question-answering system for knowledge in artificial intelligence field - Google Patents

Publication number
CN113157885A
Authority
CN
China
Prior art keywords
question
module
knowledge
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110392744.2A
Other languages
Chinese (zh)
Other versions
CN113157885B (en)
Inventor
曲晨帆
金连文
林上港
马骏
谭濯
刘振鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110392744.2A
Publication of CN113157885A
Application granted
Publication of CN113157885B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to an efficient intelligent question-answering system for knowledge in the field of artificial intelligence, comprising a preparation module and a question-answering module. The preparation module comprises a data collection module, a model training module and a question-answering system knowledge structure construction module; the question-answering module comprises an input preprocessing module, a knowledge-base-based question-answering module, a text-base-based question-answering module and a knowledge-base-based question recommendation module. Through the preparation module and the question-answering module, the invention greatly improves the word segmentation accuracy for user questions, knowledge base questions and text base questions, which in turn improves the overall accuracy of the question-answering system and the user experience, providing a low-cost, efficient knowledge question-answering service.

Description

Efficient intelligent question-answering system for knowledge in artificial intelligence field
Technical Field
The invention relates to the technical field of artificial intelligence and natural language processing, in particular to an efficient intelligent question-answering system for knowledge in the field of artificial intelligence.
Background
In recent years, artificial intelligence technology has developed rapidly and has very broad application prospects in education, medicine, agriculture, transportation and other fields. However, acquiring knowledge in the field of artificial intelligence requires a certain professional foundation, and practitioners in many industries lack a convenient and accurate way to acquire it. This makes artificial intelligence technology difficult to popularize in many fields and implicitly hinders the development of social productivity. Unstructured text in the field of artificial intelligence carries a large amount of domain knowledge; a knowledge question-answering system based on understanding such text would give people an efficient and convenient way to acquire knowledge and would promote the further development of artificial intelligence technology.
Existing knowledge question-answering systems have the following problems. First, the information extraction model lacks support for entity names and entity aliases: the former causes professional terms to be segmented incorrectly, which degrades search engine performance, and the latter means synonyms are not understood, so that subsequent search results are one-sided. Both adversely affect the overall performance of the question-answering system. Second, machine reading comprehension is a complex natural language processing task with high complexity and a large amount of computation, while knowledge base construction depends on unstructured text and is time-consuming and labor-intensive if done manually, making it difficult to build a knowledge base of sufficient scale; both factors restrict the actual deployment of question-answering systems. Finally, existing question-answering systems still lack the ability to efficiently obtain accurate and comprehensive answers from different types of text across paragraphs, documents and forms, and lack the ability to guide users to further explore related knowledge in the field.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an efficient intelligent question-answering system for knowledge in the field of artificial intelligence.
The invention is realized by adopting the following technical scheme: an efficient intelligent question-answering system for artificial intelligence field knowledge, comprising: a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module and a question-answering system knowledge structure construction module; the question-answering module comprises an input preprocessing module, a question-answering module based on a knowledge base, a question-answering module based on a text base and a question recommending module based on the knowledge base;
the preparation module labels the collected unstructured knowledge text paragraphs in the artificial intelligence field through the data collection module, and trains the information extraction model and the machine reading comprehension model in the model training module; it collects or defines synonymous and non-synonymous questions in the artificial intelligence field to train a short text matching model; it uses the question-answering system knowledge structure construction module to extract knowledge triples with the trained information extraction model and form question-answer pairs, uses the extracted entity names and aliases to assist search, provides semantic information to the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs a knowledge base keyword index;
the question-answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer with the knowledge-base-based question-answering module; if an answer is found, it is prepared for return; otherwise, the preprocessed question is sent to the text-base-based question-answering module, and the answer found there is prepared for return; the knowledge-base-based question recommendation module then selects questions to recommend, and finally the answer and the recommended questions are returned to the user.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention supplements the dictionary of the jieba word segmentation tool with the entity names and aliases extracted by the information extraction model, which greatly improves the word segmentation accuracy for user questions, knowledge base questions and text base questions, and thereby the overall accuracy of the question-answering system and the user experience.
2. Using the correspondence between entity names and aliases extracted by the information extraction model, together with a synonym dictionary acquired from the Internet, the improved BM25 knowledge base rough recall module ranks content matching any of the synonymous keywords in a single retrieval without increasing inference time, and is more robust to variations in topic-word frequency and document length across document paragraphs, which improves the retrieval effect.
3. The invention constructs the question-answer knowledge base from unstructured text paragraphs by means of information extraction, and also makes use of other obtainable semi-structured and structured text as a supplement, so that knowledge acquisition channels are more diverse and flexible. A match can be completed whenever the question the user expresses in natural language is semantically consistent with knowledge already in the knowledge base, and the richness of the answers is enhanced.
4. The question-answering system can recommend related questions to the user, guiding the user to further explore the knowledge system and prompting further questions, and thus has high social value and strong practical significance.
5. The invention has low computing resource requirements and consumption.
Drawings
FIG. 1 is a system block diagram of the present invention;
FIG. 2 is a block diagram of the data collection module of the preparation module of the present invention;
FIG. 3 is a diagram of a model training module in the preparation module of the present invention;
FIG. 4 is a flow chart of HBT model training in the preparation module of the present invention;
FIG. 5 is a flow chart of ESIM model training in the preparation module of the present invention;
FIG. 6 is a diagram of the RoBERTa-QA model of the present invention;
FIG. 7 is a diagram of a knowledge structure building block for the question and answer system in the preparation block of the present invention;
FIG. 8 is a diagram of a construction module of the inverted index of the knowledge base in the knowledge structure construction module of the question-answering system in the preparation module according to the present invention;
FIG. 9 is a diagram of a knowledge base keyword index building module in the knowledge structure building module of the question and answer system in the preparation module of the present invention;
FIG. 10 is a diagram of a construction module of the inverted index of the text base in the knowledge structure construction module of the question-answering system in the preparation module according to the present invention;
FIG. 11 is an overall flow diagram of the question-answering module of the present invention;
FIG. 12 is a flow diagram of the coreference resolution method in the preprocessing module in the question-answering module of the present invention;
FIG. 13 is a flow chart of a recall module in the knowledge base based question answering module of the present invention;
FIG. 14 is a flowchart of a method for determining synonymy of question sentences using ESIM in a knowledge-based question-answering module of the question-answering module according to the present invention;
FIG. 15 is a flow chart of a rough recall module in a text-based question-answering module in the question-answering module of the present invention;
FIG. 16 is a diagram of a knowledge base based question recommendation module in the question-answering module of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in fig. 1, the present embodiment is an efficient intelligent question-answering system for knowledge in the field of artificial intelligence, which includes a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module and a question-answering system knowledge structure construction module; the question-answering module comprises an input preprocessing module, a question-answering module based on a knowledge base, a question-answering module based on a text base and a question recommending module based on the knowledge base;
the preparation module labels the collected unstructured knowledge text paragraphs in the artificial intelligence field through the data collection module, and trains the information extraction model and the machine reading comprehension model in the model training module; it collects or defines synonymous and non-synonymous questions in the artificial intelligence field to train a short text matching model; it uses the question-answering system knowledge structure construction module to extract knowledge triples with the trained information extraction model and form question-answer pairs, uses the extracted entity names and aliases to make search more efficient, provides semantic information to the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs a knowledge base keyword index to support question recommendation;
the question-answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer with the knowledge-base-based question-answering module; if an answer is found, it is prepared for return; otherwise, the preprocessed question is sent to the text-base-based question-answering module, and the answer found there is prepared for return; the knowledge-base-based question recommendation module then selects questions to recommend, and finally the answer and the recommended questions are returned to the user.
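The control flow of the question-answering module can be sketched as follows. This is a minimal illustration only: the function name and the four callables passed in are hypothetical stand-ins for the modules named above, not interfaces defined by the invention.

```python
def answer_question(question, preprocess, kb_answer, text_answer, recommend):
    """Knowledge base first, text base as fallback, then recommendations.

    All four callables are hypothetical placeholders:
      preprocess(question)  -> preprocessed question
      kb_answer(processed)  -> answer, or None when the knowledge base has no match
      text_answer(processed)-> answer found in the text base
      recommend(processed)  -> list of recommended questions
    """
    processed = preprocess(question)
    answer = kb_answer(processed)
    if answer is None:
        # No sufficiently good match in the knowledge base: fall back to the text base.
        answer = text_answer(processed)
    # The answer and the recommended questions are returned together.
    return answer, recommend(processed)
```

Keeping the four stages as separate callables mirrors the module split in FIG. 1 and lets each stage be replaced or tested on its own.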
As shown in fig. 2, in this embodiment, the data collection module is implemented as follows:
S21, collecting unstructured knowledge text paragraphs from scientific publications, literature, online popular science material and other sources related to the artificial intelligence field, and splitting overlong text paragraphs at periods; specifically, the length of each text paragraph is limited to no more than 480 characters;
S22, for the information extraction model, defining the types of key information triples to be extracted. Using a common relation definition method, the following triple types are defined: entity-description-content, entity-presenter-content, entity-contains-content, entity-application-content and entity-alias-content. Triples of the defined types are labeled uniformly in subject-predicate-object form: using the brat text annotation tool, all subjects in the collected unstructured knowledge text paragraphs in the artificial intelligence field are first labeled by selecting them, all objects are then labeled by selecting them, and finally every correspondence between a subject and an object in a text paragraph is labeled by drawing a connecting line;
S23, for the machine reading comprehension model, taking the unstructured knowledge text paragraphs collected in step S21: if a paragraph has corresponding questions, marking the start and end positions in the paragraph of the answer to each question; otherwise, defining diverse questions about artificial intelligence knowledge based on part of the paragraph content, simulating real user scenarios, and marking the start and end positions in the paragraph of the answer to each question;
S24, for the short text matching model, directly collecting synonymous and non-synonymous question pairs in the field, each question being within 30 characters; a pair of synonymous questions has label 1, and a pair of non-synonymous questions has label 0.
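The paragraph splitting in step S21 can be illustrated with a short sketch. The function name, the greedy packing strategy and the use of the Chinese full stop as the split character are assumptions made for illustration; the text only states that overlong paragraphs are split at periods and limited to 480 characters.

```python
def split_paragraph(text, max_len=480, period="。"):
    """Split text at periods so that each chunk holds whole sentences
    and, where possible, at most max_len characters."""
    # First cut the text into sentences, keeping the period with each sentence.
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch == period:
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)

    # Then greedily pack consecutive sentences into chunks.
    chunks, chunk = [], ""
    for sentence in sentences:
        if chunk and len(chunk) + len(sentence) > max_len:
            chunks.append(chunk)
            chunk = ""
        chunk += sentence
    if chunk:
        chunks.append(chunk)
    # Note: a single sentence longer than max_len is kept whole in this sketch.
    return chunks
```

Splitting only at sentence boundaries keeps every sentence in a paragraph complete, at the cost of leaving an unusually long single sentence unsplit.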
As shown in fig. 3, in this embodiment, the implementation process of the model training module is as follows:
S31, building the information extraction model with the HBT model, initializing the model parameters from the RoBERTa pre-trained model, and training with the labeled triple data. The HBT training process is shown in FIG. 4. First, a special symbol [CLS] is added at the beginning of the input text paragraph and a special symbol [SEP] at the end, and the input is padded to 512 characters with [PAD]. The characters are converted to their corresponding IDs through the vocabulary of the RoBERTa pre-trained model and mapped to 756-dimensional feature vectors by its Embedding layer; the output feature vector is obtained after passing through all Transformer layers of the RoBERTa pre-trained model. On the one hand, the output feature vector is passed through a fully connected layer that maps it to 2 channels, followed by Sigmoid activation, to obtain the predicted probabilities of the subject start point and end point. On the other hand, for a randomly selected subject in the paragraph, the output feature vectors at its start point and end point are taken out, copied, and combined by addition in the fully connected feature fusion layer; the result is input to a fully connected layer that maps it to 10 channels, corresponding to the start points and end points of the 5 predicate types, and Sigmoid activation then gives the predicted start and end probabilities of the predicate-object. The errors of the predicted start and end probabilities for the subject and for the predicate-object are computed against the labeled data with binary cross-entropy loss, the model parameters are adjusted by back-propagating the errors, and the process is iterated.
S32, building an ESIM model and training it as the short text matching model, as shown in FIG. 5. First, Chinese character vectors are trained with word2vec on the full corpus of Chinese Wikipedia. The ESIM model is then pre-trained on a Chinese translation of the Quora Question Pairs data set, and the pre-trained model is fine-tuned on the open large-scale Chinese short text matching data set LCMC together with the collected artificial intelligence field short text matching data set; in this process the artificial intelligence field data set is resampled to a scale close to that of the LCMC data set. During training, binary cross-entropy loss is computed between the ESIM model output and the label of 0 or 1.
S33, building a RoBERTa-QA model and training it as the machine reading comprehension model. The RoBERTa-QA model is shown in FIG. 6. Its input is the concatenation of a text paragraph and a question: a special character [CLS] is added at the start, a special character [SEP] is added between the text paragraph and the question, another [SEP] is added at the end of the question, the input is padded to 512 characters with [PAD], and the characters are then encoded according to the RoBERTa vocabulary. On top of RoBERTa, the RoBERTa-QA model adds a fully connected layer with 756 input channels and 2 output channels, followed by a Softmax layer, to predict the probabilities of the start and end positions of the answer in the text paragraph. Starting from the Chinese pre-trained RoBERTa model, it is trained on the large-scale open data set DuReader and then fine-tuned on the collected machine reading comprehension annotation data in the field of artificial intelligence; errors are computed with binary cross-entropy loss during training.
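The input layout described in step S33 can be sketched as below. The sketch works on character-level tokens and stops before vocabulary lookup; the function name and the attention mask it returns are assumptions added for illustration and are not described in the text.

```python
MAX_LEN = 512  # padded input length stated in steps S31 and S33

def build_qa_input(paragraph, question, max_len=MAX_LEN):
    """Lay out [CLS] paragraph [SEP] question [SEP] and pad with [PAD].

    Returns the token list and a 0/1 mask marking the non-padding positions
    (the mask is an assumption of this sketch).
    """
    tokens = ["[CLS]"] + list(paragraph) + ["[SEP]"] + list(question) + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("paragraph plus question exceeds max_len")
    n_real = len(tokens)
    tokens = tokens + ["[PAD]"] * (max_len - n_real)
    mask = [1] * n_real + [0] * (max_len - n_real)
    return tokens, mask
```

In a real system the token list would next be mapped to IDs through the vocabulary of the pre-trained model, as the step describes.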
As shown in fig. 7, in this embodiment, the implementation process of the knowledge structure building module of the question answering system is as follows:
S41, collecting a large amount of unstructured knowledge text in the field of artificial intelligence; specifically, unstructured text paragraphs are collected from scientific publications, literature, online popular science material and other sources related to the field of artificial intelligence, and overlong paragraphs are split at periods, so that each paragraph is no more than 480 characters long and every sentence in a paragraph is complete.
S42, extracting triples from the large number of unstructured knowledge text paragraphs in the artificial intelligence field using the information extraction model trained in step S31. A special symbol [CLS] is added at the beginning of the input text paragraph and a special symbol [SEP] at the end, and the input is padded to 512 characters with [PAD]. The input is converted to the corresponding IDs through the RoBERTa vocabulary and mapped to 756-dimensional feature vectors by the RoBERTa Embedding layer; the output feature vector is obtained after passing through all Transformer layers of RoBERTa. A fully connected layer followed by Sigmoid activation then gives the predicted probabilities of the subject start and end points. Positions with probability greater than 0.5 are taken as start points and end points, and each start point is paired with the nearest end point that follows it to form a subject position.
For each subject, the output feature vectors at its start point and end point are taken out, copied, and combined by addition in the fully connected feature fusion layer; the result is input to the fully connected predicate-object predictor, which maps it to a 10-channel feature vector whose channels correspond to the start points and end points of the 5 predicate types. Sigmoid activation then gives the predicted start and end probabilities of the predicate-object. Positions with probability greater than 0.5 are taken as start points and end points, and each start point is paired with the nearest end point that follows it to form an object position. The subject content and the corresponding predicate-object content are located in the text paragraph from these start and end points and combined into triples of the form subject-predicate-object.
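The span decoding rule used for both subjects and objects in step S42 (threshold at 0.5, then pair each start with the nearest following end) can be sketched as follows. The function name is hypothetical, and allowing the end to coincide with the start (a one-character span) is an assumption of this sketch.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Turn per-position start/end probabilities into (start, end) index pairs."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        # Candidate ends are those at or after the start position.
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, min(following)))  # nearest following end
    return spans


# Example: two spans are detected, covering positions 1..2 and 4..5.
spans = decode_spans([0.1, 0.9, 0.2, 0.1, 0.8, 0.1],
                     [0.1, 0.2, 0.7, 0.1, 0.1, 0.95])
```

A start point with no end point after it is simply dropped in this sketch.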
S43, on the one hand, the entity names and entity aliases in the triples are extracted separately to supplement the dictionary of the jieba word segmentation tool, which greatly improves word segmentation accuracy; on the other hand, the extracted triples are turned into question-answer key-value pairs according to fixed rules, for example: entity-description-content forms "What is {entity}?": {content}; entity-application-content forms "What are the applications of {entity}?": {content}; entity-contains-content forms "What does {entity} include?": {content}; entity-presenter-content forms "Who proposed {entity}?": {content}; entity-alias-content forms "What is the alias of {entity}?": {content}. Triple knowledge is then crawled from encyclopedia sites and Wikipedia and turned into question-answer pairs in the format "{attribute} of {term name}": {content} to supplement the knowledge base. If existing high-quality question-answer pairs in the field can be found, they are also added to the knowledge base as a supplement. The question-answer pairs in the knowledge base are stored as key-value pairs with the question as the key and the answer as the value.
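The rule-based conversion of triples into question-answer pairs in step S43 can be sketched as a template lookup. The English template wording and the short predicate keys below are illustrative assumptions; the text only fixes the five triple types and the key-value storage format.

```python
# Hypothetical question templates, one per triple type.
QUESTION_TEMPLATES = {
    "description": "What is {entity}?",
    "application": "What are the applications of {entity}?",
    "contains": "What does {entity} include?",
    "presenter": "Who proposed {entity}?",
    "alias": "What is the alias of {entity}?",
}

def triples_to_qa(triples):
    """Build a question -> answer dict from (entity, predicate, content) triples."""
    knowledge_base = {}
    for entity, predicate, content in triples:
        template = QUESTION_TEMPLATES.get(predicate)
        if template is None:
            continue  # unknown predicate type: skipped in this sketch
        # A later triple producing the same question overwrites the earlier answer.
        knowledge_base[template.format(entity=entity)] = content
    return knowledge_base


kb = triples_to_qa([
    ("CNN", "alias", "convolutional neural network"),
    ("CNN", "application", "image classification"),
])
```

Storing the result as a plain dictionary matches the key-value layout described in the step, with the question as key and the answer as value.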
S44, establishing an inverted index and a keyword index over all the questions in the knowledge base, as shown in FIG. 8 and FIG. 9 respectively. To build the inverted index, the jieba word segmentation tool is used to segment the question text of each question-answer pair in the knowledge base and stop words are removed, giving a group of words; the union of the word groups obtained from all questions in the knowledge base is the set of all words in the knowledge base. First, an index is built with every word in the knowledge base as a key and the frequency of that word in each segment as the value. The index is then adjusted according to the synonym dictionary as follows: if word A and word B are synonyms of each other, the value for the key of word A in the adjusted index is obtained by finding, in the index before adjustment, every segment in which word A or word B appears; if only one of the two words appears in a segment, the value is the frequency of that word in the segment; if both appear in the segment, the value is the sum of the frequencies of word A and word B in the segment. That is, the word frequency $f_i$ of a word $c_i$ in a given paragraph of the text library is:
$$f_i = \mathrm{freq}(c_i) + \sum_{p_i} \mathrm{freq}(p_i)$$
where freq is the frequency of a word in a given paragraph of the text library, and $p_i$ ranges over the synonyms of $c_i$;
when the keyword index of the knowledge base is constructed, each question in the knowledge base is traversed, and it is determined whether the words in the question intersect the set of entity names and aliases extracted by the information extraction model; if so, the question is added to the value stored under the corresponding entity name or alias key in the knowledge base keyword index.
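The keyword index construction just described can be sketched as follows. The function name and the input format (each question already segmented into a word list) are assumptions made for illustration.

```python
from collections import defaultdict

def build_keyword_index(question_words, entity_names):
    """Map each entity name or alias to the questions that mention it.

    question_words: dict mapping question text -> list of its segmented words
    entity_names:   set of entity names and aliases from information extraction
    """
    index = defaultdict(list)
    for question, words in question_words.items():
        # Intersection of the question's words with the entity/alias set.
        for name in set(words) & entity_names:
            index[name].append(question)
    return dict(index)


keyword_index = build_keyword_index(
    {
        "What is CNN?": ["what", "is", "CNN"],
        "Who proposed ResNet?": ["who", "proposed", "ResNet"],
        "Is CNN used in ResNet?": ["is", "CNN", "used", "in", "ResNet"],
    },
    {"CNN", "ResNet"},
)
```

Looking up an entity in this index returns every stored question about it, which is the information the question recommendation module needs.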
S45, directly storing the large amount of unstructured knowledge text in the field of artificial intelligence as the text library and establishing an inverted index for it, as shown in FIG. 10. The jieba word segmentation tool is used to segment each text paragraph in the text library and stop words are removed, giving a group of words; the union of the word groups obtained from all text paragraphs in the text library is the set of all words in the text library. First, an index is built with every word in the text library as a key and the frequency of that word in each segment as the value. The index is then adjusted according to the synonym dictionary as follows: if word A and word B are synonyms of each other, the value for the key of word A in the adjusted index is obtained by finding, in the index before adjustment, every segment in which word A or word B appears; if only one of the two words appears in a segment, the value is the frequency of that word in the segment; if both appear in the segment, the value is the sum of the frequencies of word A and word B in the segment. That is, the word frequency $f_i$ of a word $c_i$ in a given paragraph of the text library is:
$$f_i = \mathrm{freq}(c_i) + \sum_{p_i} \mathrm{freq}(p_i)$$
where freq is the frequency of a word in a given paragraph of the text library, and $p_i$ ranges over the synonyms of $c_i$.
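The synonym-adjusted inverted index used in steps S44 and S45 can be sketched as below. The function name and input format are assumptions: paragraphs are taken as already segmented with stop words removed, and the synonym dictionary is a plain mapping from a word to its synonyms.

```python
from collections import Counter, defaultdict

def build_inverted_index(segmented_paragraphs, synonyms):
    """Return word -> {paragraph_id: adjusted frequency}.

    The adjusted frequency of a word in a paragraph is its own frequency
    plus the frequencies of its synonyms in that paragraph.
    """
    # Index before adjustment: plain per-paragraph word frequencies.
    raw = defaultdict(dict)
    for pid, words in enumerate(segmented_paragraphs):
        for word, count in Counter(words).items():
            raw[word][pid] = count

    # Adjustment: fold each synonym's postings into the word's postings.
    adjusted = {}
    for word, postings in raw.items():
        merged = dict(postings)
        for syn in synonyms.get(word, ()):
            for pid, count in raw.get(syn, {}).items():
                merged[pid] = merged.get(pid, 0) + count
        adjusted[word] = merged
    return adjusted


index = build_inverted_index(
    [["CNN", "网络", "CNN"], ["卷积神经网络", "网络"], ["CNN", "卷积神经网络"]],
    {"CNN": ["卷积神经网络"], "卷积神经网络": ["CNN"]},
)
```

Because the merge happens once at index-building time, a query for either synonym reaches the same paragraphs without any extra work at retrieval time.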
As shown in fig. 11, in this embodiment, the specific processing procedure of the question answering module is as follows:
S51, using the input preprocessing module to remove stop words from the question input by the user, performing word segmentation and part-of-speech tagging with the jieba word segmentation tool supplemented by the new words obtained from the information extraction model, and performing coreference resolution based on the segmentation and tagging results and the LTP language model from Harbin Institute of Technology. For coreference resolution, as shown in FIG. 12, the part-of-speech tagging of jieba and the syntactic analysis of the LTP model are used to obtain the segmentation result, the parts of speech and the dependency arc relations. First, it is determined whether a subject exists. If not, the pronoun is located in the part-of-speech array, and if its dependency arc relation is not a verb-object relation (VOB) or an attributive relation (ATT), the pronoun is replaced with the most recently recorded subject. If a subject exists, the pronoun is located and it is determined whether its dependency arc relation is a coordinate relation (COO); if so, the pronoun is replaced with the input subject. If a subject exists and no pronoun exists, the input subject is recorded for later coreference resolution.
S52, the knowledge-base-based question-answering module receives the result of the preprocessing module and first quickly searches the knowledge base for a question identical to the user's question; if one exists, the corresponding answer is obtained directly. This shortcut applies when the user accepts a question recommended by the system, and it greatly reduces the system's response time and use of computing resources. When no exact match is found, the improved BM25 knowledge base rough recall module performs a rough retrieval of the preprocessed user question over all question texts in the knowledge base; an ESIM model then performs fine semantic matching between the preprocessed user question and the roughly retrieved questions to find the semantically most similar result. If the matching score of that result exceeds a set threshold, the corresponding answer is taken as the answer to be returned to the user. As shown in fig. 14, the ESIM model judges whether two questions are synonymous: its input is a batch of question pairs, and its output is the synonymy probability, i.e., the matching score, of each pair in the batch. If the BM25 knowledge base rough recall module and the ESIM model find no result whose matching score exceeds the threshold, the preprocessed user question is passed to the text-base-based question-answering module.
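The control flow of step S52 (exact match, then rough recall, then fine matching against a threshold, then fallback) can be sketched as follows. This is an illustration under stated assumptions: the BM25 recall and the ESIM model are replaced by trivial stand-ins (`overlap_recall` and `jaccard`), the threshold value 0.5 is arbitrary because the text does not state one, and all function names are hypothetical.

```python
def answer_from_knowledge_base(question, kb, coarse_recall, match_score, threshold=0.5):
    """Return an answer from the knowledge base, or None to signal fallback.

    kb:            dict mapping question text to answer text.
    coarse_recall: callable(question, candidates) -> list of candidate questions.
    match_score:   callable(q1, q2) -> synonymy probability in [0, 1].
    """
    # 1. Exact match: no model inference is needed.
    if question in kb:
        return kb[question]
    # 2. Rough recall over all question texts in the knowledge base.
    candidates = coarse_recall(question, list(kb))
    if not candidates:
        return None
    # 3. Fine semantic matching: keep the most similar candidate.
    best = max(candidates, key=lambda c: match_score(question, c))
    if match_score(question, best) > threshold:
        return kb[best]
    # 4. No confident match: the caller falls back to the text base module.
    return None


def overlap_recall(question, candidates):
    """Stand-in for BM25 recall: keep candidates sharing at least one word."""
    words = set(question.split())
    return [c for c in candidates if words & set(c.split())]


def jaccard(a, b):
    """Stand-in for the ESIM matching score: word-level Jaccard similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


kb = {"what is bm25": "answer about bm25", "what is esim": "answer about esim"}
exact = answer_from_knowledge_base("what is bm25", kb, overlap_recall, jaccard)
fuzzy = answer_from_knowledge_base("what is the bm25", kb, overlap_recall, jaccard)
missing = answer_from_knowledge_base("define transformers", kb, overlap_recall, jaccard)
```

Returning `None` rather than raising an error keeps the hand-off to the text-base-based module explicit for the caller.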
Specifically, as shown in fig. 13, the improved BM25 knowledge base rough recall module first checks whether each word of the preprocessed user question exists in the set of all words in the knowledge base. If it does, the word is retained as is. If it does not, the module checks whether any synonym of the word exists in that set: if so, the word is replaced with the synonym and retained; if the word has no synonym, or none of its synonyms is in the set, the word is removed. This yields the further processed user question. Finally, the relevance score between each retained word and each question in the knowledge base is computed, and the per-word scores are summed to give the relevance score between the query and each question in the knowledge base.
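The word filtering described above can be illustrated by a short Python sketch. The function name `normalize_query` is hypothetical, and because the text does not say which synonym to pick when several are in the vocabulary, the sketch simply takes the first in sorted order; that tie-breaking rule is an assumption.

```python
def normalize_query(words, vocabulary, synonyms):
    """Keep, substitute, or drop each query word.

    words:      segmented user question (stop words already removed).
    vocabulary: set of all words in the knowledge base.
    synonyms:   dict mapping a word to the set of its synonyms (assumed format).
    """
    kept = []
    for word in words:
        if word in vocabulary:
            kept.append(word)  # the word is in the vocabulary: retain it
            continue
        # Otherwise look for a synonym that is in the vocabulary.
        replacement = next(
            (s for s in sorted(synonyms.get(word, ())) if s in vocabulary), None
        )
        if replacement is not None:
            kept.append(replacement)  # substitute the synonym
        # No usable synonym: the word is dropped.
    return kept


vocabulary = {"cnn", "image"}
synonyms = {"convnet": {"cnn"}}
processed = normalize_query(["convnet", "image", "banana"], vocabulary, synonyms)
```

Here "convnet" is replaced by "cnn", "image" is kept, and "banana" is removed because neither it nor any synonym occurs in the vocabulary.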
The relevance score between the further processed user input question Q and the jth question d_j in the knowledge base is:
[Equation image BDA0003017388390000081 in the original publication]
where q_i is the ith word of the further processed user input question; W_ij is the weight of the ith word of the user input question in the jth question of the knowledge base; IDF(q_i) is the inverse document frequency of the ith word of the further processed user input question; n is the number of words in the user input question; and i indexes the words of the user input question.
Specifically, IDF(q_i) is given by:
[Equation image BDA0003017388390000082 in the original publication]
where N is the total number of question-answer pairs in the knowledge base, and n(q_i) is the number of question-answer pairs in the knowledge base that contain the ith word of the further processed user input question.
Specifically, W_ij is given by:
[Equation image BDA0003017388390000083 in the original publication]
where k_1 is the term-frequency saturation parameter, here set to 1.5; b is the length normalization parameter, here set to 0.75; dl_j is the string length of the jth question in the knowledge base; and avgdl is the average string length of all questions in the knowledge base.
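The three equations appear only as images in the original publication and are not recoverable from this text. For orientation, the standard Okapi BM25 formulation that matches the symbols defined above is shown below. It is an assumption that the publication uses exactly this form, in particular the 0.5 smoothing terms in the IDF; the symbol f_ij, denoting the (synonym-adjusted) frequency of q_i in d_j, is introduced here for illustration.

```latex
\begin{align}
\mathrm{Score}(Q, d_j) &= \sum_{i=1}^{n} \mathrm{IDF}(q_i)\, W_{ij} \\
\mathrm{IDF}(q_i) &= \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \\
W_{ij} &= \frac{f_{ij}\,(k_1 + 1)}{f_{ij} + k_1 \left(1 - b + b\,\dfrac{dl_j}{avgdl}\right)}
\end{align}
```

With k_1 = 1.5 and b = 0.75, a word's contribution saturates as f_ij grows, and questions longer than avgdl are penalized relative to shorter ones.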
S53, if the knowledge-base-based question-answering module finds no answer to the user's question, the text-base-based question-answering module is used. The improved BM25 text base rough recall module retrieves and roughly recalls knowledge text paragraphs from the text base for the further processed user question. The rough recall results and the preprocessed user question are then input into a RoBERTa-QA model, which predicts, in parallel for each recalled paragraph, the start and end positions of the answer to the user question and the corresponding probabilities. The probability that a paragraph contains the answer is the product of the probabilities at its most probable start position and its most probable end position. If this probability exceeds a set threshold for any paragraph, the text between the most probable start position and the most probable end position in that paragraph is returned to the user as the answer.
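The answer selection rule of step S53 can be sketched without the model itself. The sketch below assumes the per-position start and end probabilities have already been produced for each recalled paragraph and, for simplicity, treats positions as character offsets rather than model tokens. The function name `select_answer`, the threshold 0.5, the `start <= end` guard, and the choice of the highest-scoring paragraph when several exceed the threshold are all assumptions not stated in the text.

```python
def select_answer(paragraphs, start_probs, end_probs, threshold=0.5):
    """Pick an answer span from the recalled paragraphs, or return None.

    paragraphs:  list of paragraph strings.
    start_probs: list of per-position start probabilities, one list per paragraph.
    end_probs:   list of per-position end probabilities, one list per paragraph.
    """
    best = None  # (confidence, answer text)
    for text, starts, ends in zip(paragraphs, start_probs, end_probs):
        s = max(range(len(starts)), key=starts.__getitem__)  # most probable start
        e = max(range(len(ends)), key=ends.__getitem__)      # most probable end
        confidence = starts[s] * ends[e]  # probability the paragraph has an answer
        if confidence > threshold and s <= e:
            if best is None or confidence > best[0]:
                best = (confidence, text[s:e + 1])
    return None if best is None else best[1]


def peaked(length, position, probability):
    """Helper for the demo: a distribution with one dominant position."""
    probs = [0.0] * length
    probs[position] = probability
    return probs


paragraphs = ["BM25 is a ranking function.", "ESIM matches sentences."]
span = "a ranking function"
start = paragraphs[0].index(span)
end = start + len(span) - 1
start_probs = [peaked(len(paragraphs[0]), start, 0.9), peaked(len(paragraphs[1]), 0, 0.3)]
end_probs = [peaked(len(paragraphs[0]), end, 0.8), peaked(len(paragraphs[1]), 3, 0.2)]
answer = select_answer(paragraphs, start_probs, end_probs)
```

The first paragraph scores 0.9 × 0.8 = 0.72 and passes the threshold; the second scores 0.06 and is ignored.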
As shown in fig. 15, in this embodiment, the improved BM25 text base rough recall module computes the relevance score between each word of the further processed user input question and every text paragraph in the text base, and sums the per-word scores to obtain the relevance score between the further processed user question and every text paragraph. That is, the relevance score between the further processed user input question Q and the jth paragraph d_j in the text base is:
[Equation image BDA0003017388390000091 in the original publication]
where q_i is the ith word of the further processed user input question; W_ij is the weight of the ith word of the user input question in the jth paragraph of the text base; IDF(q_i) is the inverse document frequency of the ith word of the further processed user input question; n is the number of words in the user input question; and i indexes the words of the user input question.
Specifically, IDF(q_i) is given by:
[Equation image BDA0003017388390000092 in the original publication]
where N is the total number of paragraphs in the text base, and n(q_i) is the number of paragraphs in the text base that contain the ith word of the further processed user input question.
Specifically, W_ij is given by:
[Equation image BDA0003017388390000093 in the original publication]
where, in W_ij, k_1 is the term-frequency saturation parameter, here set to 1.5; b is the length normalization parameter, here set to 0.75; dl_j is the string length of the jth paragraph in the text base; and avgdl is the average string length of all paragraphs in the text base.
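A rough recall scorer of this kind can be sketched in Python on top of an inverted index of the form word -> {paragraph id: frequency}. Since the equation images are not reproduced in this text, the sketch uses the standard Okapi BM25 weighting with the stated parameters k_1 = 1.5 and b = 0.75; whether the publication's formulas match this form exactly is an assumption, and the function name `bm25_scores` is hypothetical.

```python
import math


def bm25_scores(query_words, index, doc_lengths, k1=1.5, b=0.75):
    """Score every paragraph against the query with standard Okapi BM25.

    query_words: further processed user question (list of words).
    index:       inverted index, word -> {paragraph id: frequency}.
    doc_lengths: string length dl_j of each paragraph, indexed by paragraph id.
    """
    n_docs = len(doc_lengths)
    avgdl = sum(doc_lengths) / n_docs
    scores = [0.0] * n_docs
    for word in query_words:
        postings = index.get(word, {})
        if not postings:
            continue  # the word occurs in no paragraph
        # Inverse document frequency of the word (assumed 0.5 smoothing).
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for j, freq in postings.items():
            # Length normalization: long paragraphs are penalized.
            norm = k1 * (1 - b + b * doc_lengths[j] / avgdl)
            scores[j] += idf * freq * (k1 + 1) / (freq + norm)
    return scores


index = {"bm25": {0: 2}, "recall": {0: 1, 1: 1}}
doc_lengths = [20, 20, 20, 20, 20]
scores = bm25_scores(["bm25", "recall"], index, doc_lengths)
top_paragraph = max(range(len(scores)), key=scores.__getitem__)
```

Paragraph 0 contains both query words and ranks first; paragraph 1 contains only the more common word and ranks second; the remaining paragraphs score zero and would not be recalled.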
S54, the knowledge-base-based question recommendation module is used, as shown in fig. 16. The module first uses the knowledge base keyword index built in step S44 to search the knowledge base for questions whose entity names or aliases coincide with those in the preprocessed user input question. If such questions are found, one of them is selected at random; otherwise, a question is selected at random from the whole knowledge base. After the selection, the module checks whether the selected question is identical to the preprocessed user question, or whether its answer is identical to the answer about to be returned to the user. If either is the case, the selection steps are repeated until a question that differs in both respects is obtained. The answer to the user's question and the recommended question are then returned to the user in turn.
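The recommendation step can be sketched as follows. Note one deliberate substitution: instead of repeating the random draw until a valid question comes up, the sketch filters out invalid questions first and then draws once, which gives the same set of possible results but always terminates. Falling back to the whole knowledge base when every related question is invalid is an additional assumption, and the function name `recommend_question` is hypothetical.

```python
import random


def recommend_question(user_question, answer, entities, kb, keyword_index, rng):
    """Pick a question to recommend, or None if no valid question exists.

    kb:            dict mapping question text to answer text.
    keyword_index: dict mapping entity name or alias -> list of questions.
    entities:      entity names and aliases found in the user question.
    rng:           random.Random instance (injected for reproducibility).
    """
    def valid(question):
        # Reject the user's own question and questions with the same answer.
        return question != user_question and kb[question] != answer

    related = {q for e in entities for q in keyword_index.get(e, ())}
    pool = sorted(q for q in related if valid(q))
    if not pool:
        # No related question qualifies: draw from the whole knowledge base.
        pool = sorted(q for q in kb if valid(q))
    return rng.choice(pool) if pool else None


kb = {
    "what is bm25": "answer A",
    "who proposed bm25": "answer B",
    "what is esim": "answer C",
}
keyword_index = {"bm25": ["what is bm25", "who proposed bm25"], "esim": ["what is esim"]}
related_pick = recommend_question(
    "what is bm25", "answer A", ["bm25"], kb, keyword_index, random.Random(0)
)
fallback_pick = recommend_question(
    "what is bm25", "answer A", [], kb, keyword_index, random.Random(0)
)
```

Passing the random generator as a parameter makes the draw reproducible in tests while leaving production code free to use an unseeded generator.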
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. An efficient intelligent question-answering system for knowledge in the field of artificial intelligence, characterized by comprising a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module, and a question-answering system knowledge structure construction module; the question-answering module comprises an input preprocessing module, a knowledge-base-based question-answering module, a text-base-based question-answering module, and a knowledge-base-based question recommendation module;
the preparation module labels the collected unstructured knowledge text paragraphs in the field of artificial intelligence through the data collection module, trains an information extraction model and a machine reading understanding model in the model training module, and collects or defines synonymous and non-synonymous questions in the field of artificial intelligence to train a short text matching model; it uses the question-answering system knowledge structure construction module to extract knowledge triples with the trained information extraction model and form question-answer pairs, uses the extracted entity names and aliases to assist search, provides semantics for the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs a knowledge base keyword index;
the question-answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer using the knowledge-base-based question-answering module; if an answer exists, it is prepared for return; otherwise, the preprocessed user input question is sent to the text-base-based question-answering module, which searches for an answer and prepares it for return; the knowledge-base-based question recommendation module then recommends a question to the user, and finally the answer and the recommended question are returned to the user.
2. The system of claim 1, wherein the data collection module is implemented as follows:
S21, collecting unstructured knowledge text paragraphs from scientific publications, literature, and online popular-science material in the field of artificial intelligence, the text being split into paragraphs at periods;
S22, defining the types of key-information triples to be extracted by the information extraction model, using a common relation definition method, as: entity-description-content, entity-presenter-content, entity-containing-content, entity-application-content, and entity-alias-content; and labeling the defined triple types in the collected unstructured knowledge text paragraphs in the field of artificial intelligence with the brat text annotation tool, by selecting and labeling all subjects, selecting and labeling all objects, and labeling every relation between a subject and an object in a text paragraph by connecting them with a line;
S23, for the machine reading understanding model: after the unstructured knowledge text paragraphs of scientific publications, literature, and online popular-science material in the field of artificial intelligence are collected in step S21, if a text paragraph has a corresponding question, marking the start and end positions in the paragraph of the content that answers the question; otherwise, defining diverse questions about artificial intelligence knowledge directly from part of the paragraph's content, simulating real user scenarios, and marking the start and end positions of the corresponding answers in the paragraph;
S24, for the short text matching model: directly collecting synonymous and non-synonymous questions, wherein two synonymous questions form a pair with label 1, and two non-synonymous questions form a pair with label 0.
3. The system of claim 1, wherein the model training module is implemented as follows:
S31, building the information extraction model with an HBT model, initializing the model parameters with a RoBERTa pre-trained model, and training with the labeled triple data;
S32, building an ESIM model as the short text matching model: training Chinese character vectors with word2vec on the full Chinese Wikipedia corpus, pre-training the ESIM model on a Chinese translation of the Quora Question Pairs data set, and then fine-tuning the parameters of the pre-trained ESIM model on the large-scale open Chinese short text matching data set LCQMC and the collected short text matching data set for the field of artificial intelligence;
S33, building a RoBERTa-QA model as the machine reading understanding model: further pre-training the Chinese pre-trained model RoBERTa on the open data set DuReader, and then fine-tuning the parameters on the collected machine reading understanding annotation data for the field of artificial intelligence.
4. The system of claim 1, wherein the question-answering system knowledge structure construction module is implemented as follows:
S41, collecting unstructured knowledge text paragraphs in the field of artificial intelligence;
S42, extracting triples from the unstructured knowledge text paragraphs in the field of artificial intelligence using the information extraction model trained in step S31;
S43, separately extracting the entity names and entity alias contents from the triples; forming question-answer key-value pairs from the extracted triples; crawling triple knowledge from encyclopedia and Wikipedia and forming question-answer pairs to add to the knowledge base; the question-answer pairs in the knowledge base are stored as key-value pairs with the question as the key and the answer as the value;
S44, building an inverted index and a keyword index for all questions in the knowledge base; when building the inverted index, using the jieba word segmentation tool to segment the question text of each question-answer pair in the knowledge base and remove stop words, obtaining a group of words, and then collecting the set of words obtained from all questions processed in this way, which is the set of all words in the knowledge base; when building the knowledge base keyword index, traversing each question in the knowledge base and, if the words of the question intersect the set of entity names and aliases extracted by the information extraction model, adding the question to the value of the corresponding entity name or alias key in the knowledge base keyword index;
S45, storing the unstructured knowledge text paragraphs in the field of artificial intelligence directly as a text base and building an inverted index for the text base; using the jieba word segmentation tool to segment each text paragraph in the text base and remove stop words, obtaining a group of words; the set of words obtained from all text paragraphs processed in this way is the set of all words in the text base.
5. The system of claim 1, wherein the input preprocessing module is implemented as follows: removing stop words from the user input question, performing word segmentation and part-of-speech tagging using the jieba word segmentation tool together with the new words obtained by the information extraction model, and performing reference resolution based on the segmentation and part-of-speech tagging results and the LTP language model of Harbin Institute of Technology.
6. The system of claim 1, wherein the knowledge-base-based question-answering module is implemented as follows: after receiving the result of the input preprocessing module, searching the knowledge base for a question identical to the user question, and if one is found, obtaining the corresponding answer directly; if no exact match is found, performing a rough retrieval of the preprocessed user question over all question texts in the knowledge base with the improved BM25 knowledge base rough recall module, performing fine semantic matching between the preprocessed user question and the roughly retrieved questions with an ESIM model to find a result, and, if the matching score of the result exceeds a set threshold, taking the corresponding answer as the answer to be returned to the user; if the BM25 knowledge base rough recall module and the ESIM model find no result whose matching score exceeds the threshold, inputting the preprocessed user question into the text-base-based question-answering module.
7. The system of claim 1, wherein the text-base-based question-answering module is implemented as follows: first, retrieving and roughly recalling knowledge text paragraphs from the text base for the further processed user question with the improved BM25 text base rough recall module; inputting the rough recall results and the preprocessed user question into a RoBERTa-QA model, which predicts, in parallel for each recalled paragraph, the start and end positions of the answer to the user question and the corresponding probabilities; the product of the probabilities at the most probable start position and the most probable end position is the probability that the paragraph contains the answer; and, if this probability exceeds a set threshold for any paragraph, taking the text between the most probable start position and the most probable end position in that paragraph as the answer to be returned to the user.
8. The system of claim 1, wherein the knowledge-base-based question recommendation module is implemented as follows: using the knowledge base keyword index built in step S44, searching the knowledge base for questions whose entity names or aliases coincide with those in the preprocessed user input question; if such questions are found, randomly selecting one of them to recommend to the user; otherwise, randomly selecting a question directly from the whole knowledge base; after the selection, checking whether the selected question is identical to the preprocessed user question or whether its answer is identical to the answer to be returned to the user; if so, repeating the selection steps until a question satisfying the non-identity condition is obtained; and returning and displaying to the user, in turn, the answer to the user's question and the recommended question.
CN202110392744.2A 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field Active CN113157885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392744.2A CN113157885B (en) 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110392744.2A CN113157885B (en) 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field

Publications (2)

Publication Number Publication Date
CN113157885A true CN113157885A (en) 2021-07-23
CN113157885B CN113157885B (en) 2023-07-18

Family

ID=76890104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392744.2A Active CN113157885B (en) 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field

Country Status (1)

Country Link
CN (1) CN113157885B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757184A (en) * 2022-04-11 2022-07-15 中国航空综合技术研究所 Method and system for realizing knowledge question answering in aviation field
CN115292469A (en) * 2022-09-28 2022-11-04 之江实验室 Question-answering method combining paragraph search and machine reading understanding

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101076184A (en) * 2006-07-31 2007-11-21 腾讯科技(深圳)有限公司 Method and system for realizing automatic reply
CN107220380A (en) * 2017-06-27 2017-09-29 北京百度网讯科技有限公司 Question and answer based on artificial intelligence recommend method, device and computer equipment
US20170308531A1 (en) * 2015-01-14 2017-10-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, system and storage medium for implementing intelligent question answering
CN110727778A (en) * 2019-10-15 2020-01-24 大连中河科技有限公司 Intelligent question-answering system for tax affairs
CN110737763A (en) * 2019-10-18 2020-01-31 成都华律网络服务有限公司 Chinese intelligent question-answering system and method integrating knowledge map and deep learning
CN110990527A (en) * 2019-11-26 2020-04-10 泰康保险集团股份有限公司 Automatic question answering method and device, storage medium and electronic equipment
CN111324721A (en) * 2020-03-16 2020-06-23 云南电网有限责任公司信息中心 Method for constructing intelligent question-answering knowledge base
CN111414461A (en) * 2020-01-20 2020-07-14 福州大学 Intelligent question-answering method and system fusing knowledge base and user modeling
CN111611361A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligent reading, understanding, question answering system of extraction type machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘阳: "基于基础教育知识图谱的问答系统研究与实现", 《CNKI优秀硕士学位论文库》 *
刘阳: "基于基础教育知识图谱的问答系统研究与实现", 《CNKI优秀硕士学位论文库》, 15 February 2021 (2021-02-15), pages 1 - 45 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757184A (en) * 2022-04-11 2022-07-15 中国航空综合技术研究所 Method and system for realizing knowledge question answering in aviation field
CN114757184B (en) * 2022-04-11 2023-11-10 中国航空综合技术研究所 Method and system for realizing knowledge question and answer in aviation field
CN115292469A (en) * 2022-09-28 2022-11-04 之江实验室 Question-answering method combining paragraph search and machine reading understanding

Also Published As

Publication number Publication date
CN113157885B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN109271505B (en) Question-answering system implementation method based on question-answer pairs
CN109684448B (en) Intelligent question and answer method
CN110737763A (en) Chinese intelligent question-answering system and method integrating knowledge map and deep learning
CN111291188B (en) Intelligent information extraction method and system
CN115292469B (en) Question-answering method combining paragraph search and machine reading understanding
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112307171B (en) Institutional standard retrieval method and system based on power knowledge base and readable storage medium
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN114020862A (en) Retrieval type intelligent question-answering system and method for coal mine safety regulations
CN116127095A (en) Question-answering method combining sequence model and knowledge graph
CN113157885B (en) Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field
CN115470338B (en) Multi-scenario intelligent question answering method and system based on multi-path recall
CN116166782A (en) Intelligent question-answering method based on deep learning
CN114357127A (en) Intelligent question-answering method based on machine reading understanding and common question-answering model
CN112036178A (en) Distribution network entity related semantic search method
CN111767325A (en) Multi-source data deep fusion method based on deep learning
CN113221530A (en) Text similarity matching method and device based on circle loss, computer equipment and storage medium
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
CN113392265A (en) Multimedia processing method, device and equipment
CN115761753A (en) Retrieval type knowledge prefix guide visual question-answering method fused with knowledge graph
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN112632250A (en) Question and answer method and system under multi-document scene
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant