CN113157885B - Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field - Google Patents

Publication number
CN113157885B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110392744.2A
Other languages
Chinese (zh)
Other versions
CN113157885A (en)
Inventor
曲晨帆
金连文
林上港
马骏
谭濯
刘振鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Priority to CN202110392744.2A
Publication of CN113157885A
Application granted
Publication of CN113157885B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to an efficient intelligent question-answering system for knowledge in the field of artificial intelligence, comprising a preparation module and a question-answering module. The preparation module comprises a data collection module, a model training module and a question-answering system knowledge structure construction module; the question-answering module comprises an input preprocessing module, a knowledge-base-based question-answering module, a text-base-based question-answering module and a knowledge-base-based question recommendation module. Through the preparation module and the question-answering module, the invention greatly enhances the word segmentation accuracy for user questions, knowledge base questions and text base questions, which raises the overall accuracy of the whole question-answering system and improves the user experience, thereby realizing a low-cost, efficient knowledge question-answering service.

Description

Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field
Technical Field
The invention relates to the technical field of artificial intelligence and natural language processing, in particular to a high-efficiency intelligent question-answering system oriented to knowledge in the field of artificial intelligence.
Background
In recent years, artificial intelligence technology has developed rapidly and has very broad application prospects in fields such as education, medical care, agriculture and transportation. However, understanding knowledge in the artificial intelligence field requires a certain professional foundation, and practitioners in many industries lack a convenient and accurate way to acquire it. As a result, artificial intelligence technology is difficult to popularize in many fields, which indirectly hinders the development of social productivity. Unstructured text in the artificial intelligence field carries a large amount of domain knowledge. If a knowledge question-answering system based on understanding such text can be built, it will give people an efficient and convenient way to acquire knowledge and promote the further development of artificial intelligence technology.
Existing knowledge question-answering systems have the following problems. First, the information extraction model lacks support for entity names and entity aliases. The former causes technical terms to be segmented incorrectly, which degrades search engine performance; the latter means that synonymous expressions are not understood, so subsequent search results are one-sided. Both adversely affect the overall performance of the question-answering system. Second, machine reading comprehension is a complex natural language processing task with high complexity and a large computational cost, while construction of the knowledge base depends on unstructured text: building it manually is time-consuming and labor-intensive, and it is difficult to reach a sufficient scale. Both factors restrict the practical deployment of a question-answering system. Finally, existing question-answering systems still lack the ability to efficiently obtain accurate and comprehensive answers from different types of text across paragraphs, documents and forms, and lack the ability to guide users to further explore related knowledge in the field.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an efficient intelligent question-answering system for knowledge in the artificial intelligence field. Through the preparation module and the question-answering module, the word segmentation accuracy for user questions, knowledge base questions and text base questions is greatly enhanced, which improves the overall accuracy of the whole question-answering system and the user experience, and realizes a low-cost, efficient knowledge question-answering service with a good user experience.
The invention is realized by adopting the following technical scheme: an artificial intelligence domain knowledge oriented efficient intelligent question-answering system, comprising: a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module and a knowledge structure construction module of a question-answering system; the question-answering module comprises an input preprocessing module, a question-answering module based on a knowledge base, a question-answering module based on a text base and a question recommending module based on the knowledge base;
the preparation module annotates, through the data collection module, the collected unstructured knowledge text paragraphs in the artificial intelligence field, trains the information extraction model and the machine reading comprehension model of the model training module, and collects or defines synonymous and non-synonymous questions in the artificial intelligence field to train a short text matching model; the question-answering system knowledge structure construction module uses the trained information extraction model to extract knowledge triples and form question-answer pairs, uses the extracted entity names and aliases to assist searching, provides semantics for the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs a knowledge base keyword index;
the question-answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer using the knowledge-base-based question-answering module; if an answer exists, it is prepared for return; otherwise, the preprocessed question is sent to the text-base-based question-answering module, which searches for an answer and prepares it for return; the knowledge-base-based question recommendation module then recommends questions to the user, and finally the answer and the recommended questions are returned to the user.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention supplements the dictionary of the jieba word segmentation tool with the entity names and aliases extracted by the information extraction model, so that the word segmentation accuracy for user questions, knowledge base questions and text base questions is greatly enhanced, which improves the overall accuracy of the whole question-answering system and the user experience.
2. With the improved BM25 knowledge base coarse recall module, a single search ranks content containing any of the synonymous keywords at the same time with almost no increase in inference time, and the ranking is more robust to variations in topic word frequency and differences in length among document paragraphs, so the search effect is improved.
3. The invention constructs the question-answer knowledge base from unstructured text paragraphs using information extraction technology, and preferably uses other available semi-structured and structured related text as a supplement, so that knowledge acquisition channels are more diverse and flexible; a match can be completed whenever a question expressed by the user in natural language is semantically consistent with knowledge in the knowledge base, and the richness of answers is enhanced.
4. The question-answering system can recommend related questions to the user, guiding the user to further explore the knowledge system and inspiring the user to ask questions, which has high social value and strong practical significance.
5. The invention has a low demand for and consumption of computing resources.
Drawings
FIG. 1 is a system block diagram of the present invention;
FIG. 2 is a diagram of a data collection module in a preparation module of the present invention;
FIG. 3 is a diagram of a model training module in the preparation module of the present invention;
FIG. 4 is a flow chart of HBT model training in the preparation module of the present invention;
FIG. 5 is a flow chart of ESIM model training in the preparation module of the present invention;
FIG. 6 is a RoBERTa-QA model diagram of the present invention;
FIG. 7 is a block diagram of the question-answering system knowledge structure construction module in the preparation module of the present invention;
FIG. 8 is a block diagram of knowledge base inverted index construction in the question-answering system knowledge structure construction module in the preparation module of the present invention;
FIG. 9 is a block diagram of knowledge base keyword index construction in the question-answering system knowledge structure construction module in the preparation module of the present invention;
FIG. 10 is a block diagram of text base inverted index construction in the question-answering system knowledge structure construction module in the preparation module of the present invention;
FIG. 11 is an overall flow chart of the question-answering module of the present invention;
FIG. 12 is a flow chart of the coreference resolution method in the preprocessing module in the question-answering module of the present invention;
FIG. 13 is a flow chart of the coarse recall module in the knowledge-base-based question-answering module of the present invention;
FIG. 14 is a flow chart of the method for judging whether questions are synonymous using ESIM in the knowledge-base-based question-answering module in the question-answering module of the present invention;
FIG. 15 is a flow chart of the coarse recall module in the text-base-based question-answering module of the present invention;
FIG. 16 is a diagram of the knowledge-base-based question recommendation module in the question-answering module of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
As shown in fig. 1, the efficient intelligent question-answering system for knowledge in the artificial intelligence field of the embodiment includes a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module and a knowledge structure construction module of a question-answering system; the question-answering module comprises an input preprocessing module, a question-answering module based on a knowledge base, a question-answering module based on a text base and a question recommending module based on the knowledge base;
the preparation module annotates, through the data collection module, the collected unstructured knowledge text paragraphs in the artificial intelligence field, trains the information extraction model and the machine reading comprehension model of the model training module, and collects or defines synonymous and non-synonymous questions in the artificial intelligence field to train a short text matching model; the question-answering system knowledge structure construction module uses the trained information extraction model to extract knowledge triples and form question-answer pairs, uses the extracted entity names and aliases to help realize more efficient searching, provides semantics for the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs a knowledge base keyword index to help realize question recommendation;
the question-answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer using the knowledge-base-based question-answering module; if an answer exists, it is prepared for return; otherwise, the preprocessed question is sent to the text-base-based question-answering module, which searches for an answer and prepares it for return; the knowledge-base-based question recommendation module then recommends questions to the user, and finally the answer and the recommended questions are returned to the user.
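The control flow of the question-answering module described above can be sketched as follows. This is only an illustrative sketch: the function name `answer_question` and the four callables passed to it are hypothetical stand-ins for the modules named in the text, not part of the patent.

```python
from typing import Callable, List, Optional, Tuple


def answer_question(
    raw_question: str,
    preprocess: Callable[[str], str],
    kb_answer: Callable[[str], Optional[str]],
    text_answer: Callable[[str], Optional[str]],
    recommend: Callable[[str], List[str]],
) -> Tuple[Optional[str], List[str]]:
    """Return (answer, recommended questions) for one user question.

    All four callables are hypothetical placeholders for the input
    preprocessing, knowledge-base QA, text-base QA and question
    recommendation modules.
    """
    question = preprocess(raw_question)
    answer = kb_answer(question)        # try the knowledge base first
    if answer is None:
        answer = text_answer(question)  # otherwise fall back to the text base
    return answer, recommend(question)  # recommendations are always attached
```

The knowledge base is consulted first because a stored question-answer pair is cheaper to return than an answer extracted from text by a reading comprehension model.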
As shown in fig. 2, in this embodiment, the implementation procedure of the data collection module is as follows:
S21, collecting unstructured knowledge text paragraphs from scientific publications, literature, online popular science material and the like related to the artificial intelligence field, and splitting overlong text paragraphs at periods; specifically, the length of each text paragraph is limited to no more than 480 characters;
S22, defining the types of key information triples to be extracted by the information extraction model. Using a common relation definition method, the following triples are defined: entity-description-content, entity-proposer-content, entity-inclusion-content, entity-application-content and entity-alias-content. The triples, collectively referred to as subject-predicate-object, are then annotated according to the defined types: using the coat text annotation tool, all subjects in the collected unstructured knowledge text paragraphs in the artificial intelligence field are marked by selection, all objects are marked in the same way, and finally every correspondence between a subject and an object in a text paragraph is marked with a connecting line;
S23, for the machine reading comprehension model, collecting unstructured knowledge text paragraphs from scientific publications, literature, online popular science material and the like related to the artificial intelligence field; if a question corresponding to a text paragraph already exists, the start and end points of the answer content for that question are marked in the text paragraph; otherwise, questions related to scientific knowledge in the artificial intelligence field are defined directly from part of the content of the text paragraph by simulating real user scenarios, and the start and end positions of the corresponding answers are marked in the text paragraph.
S24, for the short text matching model, directly collecting synonymous and non-synonymous questions in the field, each within 30 characters; synonymous questions are grouped in pairs with a label of 1, and non-synonymous questions are grouped in pairs with a label of 0.
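The paragraph splitting in step S21 could be done along the following lines. This is a minimal sketch under assumptions not stated in the text: sentences are delimited by the Chinese full stop "。", sentences are packed greedily, and a single sentence longer than the limit is kept whole as its own chunk. The name `split_paragraph` is hypothetical.

```python
import re
from typing import List

MAX_CHARS = 480  # length limit stated in step S21


def split_paragraph(text: str, max_chars: int = MAX_CHARS, period: str = "。") -> List[str]:
    """Split text at periods into chunks of at most max_chars characters."""
    # Split after each period so every sentence keeps its own period.
    pattern = "(?<=" + re.escape(period) + ")"
    sentences = [s for s in re.split(pattern, text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps every sentence in a chunk complete, which matters later because answer spans are marked by start and end positions inside a paragraph.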
As shown in fig. 3, in this embodiment, the implementation process of the model training module is as follows:
S31, constructing the information extraction model using an HBT model, initializing the model parameters with a RoBERTa pre-trained model, and training with the annotated triple data. The HBT model training process is shown in FIG. 4. First, the special symbol [CLS] is added at the beginning of the input text paragraph, the special symbol [SEP] is added at the end of the text paragraph, and [PAD] is used to pad the input to 512 characters. After the characters are converted into the corresponding numbers through the vocabulary of the RoBERTa pre-trained model, they are mapped to vectors with a feature dimension of 756 by the Embedding layer of the RoBERTa pre-trained model, and the output feature vectors are obtained after the vectors pass through all Transformer layers of the RoBERTa pre-trained model. On one hand, the output feature vectors are reduced to 2 channels by a fully connected layer with Sigmoid activation to obtain the predicted start-point and end-point probabilities of the subject. On the other hand, the output feature vectors at the start and end points of a randomly selected subject in the paragraph are taken out, copied, and added to the output of the fully connected feature fusion layer; the result is input to a fully connected layer that produces a 10-channel feature vector, where the 10 channels correspond to the start and end points of the 5 types of predicate, and finally Sigmoid activation gives the predicted start-point and end-point probabilities of the predicate-object.
The errors between the predicted start-point and end-point probabilities of the subject and of the predicate-object and the annotated data are calculated separately using binary cross-entropy loss; the model parameters are then adjusted by back-propagating the errors, and this is iterated in a loop.
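To make the loss in step S31 concrete, the sketch below builds 0/1 start and end label vectors from annotated spans and computes a mean binary cross-entropy against predicted probabilities. It is a plain-Python illustration, not the training code: the helper names `span_labels` and `binary_cross_entropy` are hypothetical, and inclusive end indices are an assumption.

```python
import math
from typing import List, Sequence, Tuple


def span_labels(length: int, spans: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build 0/1 start and end label vectors (end index assumed inclusive)."""
    starts, ends = [0] * length, [0] * length
    for start, end in spans:
        starts[start] = 1
        ends[end] = 1
    return starts, ends


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

In this reading, the subject tagger contributes one such loss for its start channel and one for its end channel, and the predicate-object tagger contributes the same for each of its 10 channels.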
S32, constructing an ESIM model and training the short text matching model, as shown in FIG. 5. First, Chinese character vectors are trained with word2vec on the full Chinese Wikipedia corpus. The ESIM model is then pre-trained on the Chinese translation of the Quora Question Pairs dataset. On the basis of the pre-trained model, the ESIM parameters are fine-tuned on the open large-scale Chinese short text matching dataset LCQMC together with the collected artificial intelligence field short text matching dataset; in this process the artificial intelligence field dataset is resampled to a scale similar to that of the LCQMC dataset. During training, binary cross-entropy loss is computed between the ESIM model output and the label of 0 or 1.
S33, constructing a RoBERTa-QA model and training the machine reading comprehension model. The RoBERTa-QA model is shown in FIG. 6. The input of the model is the concatenation of a text paragraph and a question; after concatenation, the characters are encoded according to the RoBERTa vocabulary. Before encoding, the special character [CLS] is added at the start position, the special character [SEP] is added between the text paragraph and the question, [SEP] is also added at the end of the question, and [PAD] is used to pad the input to 512 characters. On the basis of RoBERTa, the RoBERTa-QA model adds, after the last Transformer layer, a fully connected layer with 756 input channels and 2 output channels followed by a Softmax layer, to predict the probabilities of the start and end positions of the answer in the text paragraph. Starting from the Chinese pre-trained RoBERTa model, the model is further pre-trained on the large-scale open dataset DuReader and then fine-tuned on the collected machine reading comprehension annotations in the artificial intelligence field; binary cross-entropy loss is used to calculate the error during training.
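The input layout described for the RoBERTa-QA model in step S33 can be illustrated with a small sketch that works on character tokens only. The function name `build_qa_input` is hypothetical, and raising an error on overlong input is an assumption, since the text does not say how inputs longer than 512 positions are handled.

```python
from typing import List

MAX_LEN = 512  # padded sequence length stated in step S33


def build_qa_input(paragraph: List[str], question: List[str], max_len: int = MAX_LEN) -> List[str]:
    """Lay out tokens as [CLS] paragraph [SEP] question [SEP], then pad with [PAD]."""
    tokens = ["[CLS]"] + paragraph + ["[SEP]"] + question + ["[SEP]"]
    if len(tokens) > max_len:
        # Assumption: reject overlong inputs instead of truncating them.
        raise ValueError(f"input has {len(tokens)} tokens, limit is {max_len}")
    return tokens + ["[PAD]"] * (max_len - len(tokens))
```

A real pipeline would then map each token to its vocabulary id with the tokenizer that ships with the pre-trained model; that step is omitted here.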
As shown in fig. 7, in this embodiment, the implementation process of the knowledge structure building module of the question-answering system is as follows:
S41, collecting a large number of unstructured knowledge texts in the artificial intelligence field: unstructured text paragraphs are collected from scientific publications, literature, online popular science material and the like related to the artificial intelligence field, and overlong paragraphs are split at periods; specifically, each paragraph is limited to no more than 480 characters in length, and each sentence in a paragraph is complete.
S42, performing triple extraction on the large number of unstructured knowledge text paragraphs in the artificial intelligence field using the information extraction model trained in step S31. The special symbol [CLS] is added at the beginning of each text paragraph, the special symbol [SEP] is added at the end, and [PAD] is used to pad the input to 512 characters. After the text is converted into the corresponding numbers through the RoBERTa vocabulary, it is mapped to vectors with a feature dimension of 756 by RoBERTa's Embedding layer, and the output feature vectors are obtained after the vectors pass through all of RoBERTa's Transformer layers. The output feature vectors are reduced to 2 channels by a fully connected layer with Sigmoid activation to obtain the predicted start-point and end-point probabilities of the subject; positions with a probability greater than 0.5 are taken as start points and end points, and each start point is paired with the nearest end point to form the position of a subject.
For each subject, the output feature vectors at its start and end points are taken out, copied, and added to the output of the fully connected feature fusion layer; the result is input to the fully connected object predictor, which produces a 10-channel feature vector, where the 10 channels correspond to the start and end points of the 5 types of predicate. Sigmoid activation then gives the predicted start-point and end-point probabilities of the predicate-object; positions with a probability greater than 0.5 are taken as start points and end points, and each start point is paired with the nearest end point to form the position of an object. The subject content and the corresponding predicate-object content are located in the text paragraph according to the start and end positions and combined into triples in subject-predicate-object form.
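The thresholding and pairing rule used in step S42 can be sketched as below. The function name `decode_spans` is hypothetical, and the sketch assumes that "nearest end point" means the nearest end position at or after the start position, which the text does not state explicitly.

```python
from typing import List, Sequence, Tuple


def decode_spans(
    start_probs: Sequence[float],
    end_probs: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Pair every start above the threshold with the nearest end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans: List[Tuple[int, int]] = []
    for start in starts:
        following = [end for end in ends if end >= start]
        if following:
            # ends is in ascending order, so the first candidate is the nearest.
            spans.append((start, following[0]))
    return spans
```

The same routine would apply to subject positions and, for each of the 5 predicate types, to the corresponding pair of object start and end channels.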
S43, on one hand, the entity names and alias contents in the triples are extracted separately to supplement the dictionary of the jieba word segmentation tool, which greatly improves word segmentation accuracy. On the other hand, the extracted triples are turned into question-answer key-value pairs according to fixed rules, for example: entity-description-content forms "What is {entity}": {content}; entity-application-content forms "What applications does {entity} have": {content}; entity-inclusion-content forms "What does {entity} include": {content}; entity-proposer-content forms "Who proposed {entity}": {content}; entity-alias-content forms "What is the alias of {entity}": {content}. Triple knowledge is then crawled from Baidu Baike and Wikipedia, and question-answer pairs built from {term name}, {attribute} and {content} are added to the knowledge base. If existing high-quality question-answer pairs in the field can be found, they are also added to the knowledge base as a supplement. Question-answer pairs in the knowledge base are stored as key-value pairs in which the question is the key and the answer is the value.
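The rules in step S43 amount to a small template table. The sketch below is illustrative only: the English templates are translations standing in for the Chinese question templates, the relation keys and function names are hypothetical, and joining several contents for the same question with "; " is an assumption.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (entity, relation, content)

# Hypothetical relation keys mapped to question templates.
QUESTION_TEMPLATES: Dict[str, str] = {
    "description": "What is {entity}?",
    "application": "What applications does {entity} have?",
    "inclusion": "What does {entity} include?",
    "proposer": "Who proposed {entity}?",
    "alias": "What is the alias of {entity}?",
}


def build_knowledge_base(triples: Iterable[Triple]) -> Dict[str, str]:
    """Turn triples into question -> answer key-value pairs."""
    kb: Dict[str, str] = {}
    for entity, relation, content in triples:
        template = QUESTION_TEMPLATES.get(relation)
        if template is None:
            continue  # skip relation types without a template
        question = template.format(entity=entity)
        # Assumption: merge multiple contents for the same question.
        kb[question] = kb[question] + "; " + content if question in kb else content
    return kb


def collect_dictionary_terms(triples: Iterable[Triple]) -> Set[str]:
    """Collect entity names and alias contents for the segmentation dictionary."""
    terms: Set[str] = set()
    for entity, relation, content in triples:
        terms.add(entity)
        if relation == "alias":
            terms.add(content)
    return terms
```

The collected terms could then be registered with the segmenter, for example one at a time through jieba's `add_word`, so that multi-character technical terms are not split during indexing and querying.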
S44, establishing an inverted index and a keyword index for all questions in the knowledge base, as shown in FIG. 8 and FIG. 9 respectively. When the inverted index is built, the jieba word segmentation tool is used to segment the question text of each question-answer pair in the knowledge base and remove stop words, giving a group of words; the union of the words obtained from all questions in the knowledge base processed in this way is the set of all words in the knowledge base. First, an index is built that takes every word in the knowledge base as a key and the frequency with which the word appears in each paragraph as the value. The index is then adjusted according to the synonym dictionary as follows: if word A and word B are synonyms, the value for the key of word A in the adjusted index is obtained by finding every paragraph in which word A or word B appears in the index before adjustment; if only one of word A and word B appears in the paragraph, the value is the frequency of that word in the paragraph; if both appear in the paragraph, the value is the sum of the frequencies of word A and word B in the paragraph. That is, the word frequency $f_i$ of a word $c_i$ in a given paragraph is:
$$f_i = \mathrm{freq}(c_i) + \sum_{p_i} \mathrm{freq}(p_i)$$
where $\mathrm{freq}$ counts the frequency of a word in the given paragraph, and the sum runs over the synonyms $p_i$ of $c_i$;
when the knowledge base keyword index is constructed, each question in the knowledge base is traversed and it is judged whether the words of the question intersect with the set of entity names and aliases extracted by the information extraction model; if so, the question is added to the value corresponding to that entity name or alias in the knowledge base keyword index.
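The two indexes of step S44 can be sketched in a few lines. The sketch assumes that questions have already been segmented and stripped of stop words upstream, that the synonym dictionary is given as a mapping from a word to its set of synonyms, and that questions are identified by their position in the list; all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set

Postings = Dict[int, int]  # paragraph id -> word frequency


def build_inverted_index(docs: List[List[str]]) -> Dict[str, Postings]:
    """Map each word to its frequency in every paragraph that contains it."""
    index: Dict[str, Postings] = defaultdict(dict)
    for doc_id, words in enumerate(docs):
        for word, count in Counter(words).items():
            index[word][doc_id] = count
    return dict(index)


def merge_synonyms(index: Dict[str, Postings], synonyms: Dict[str, Set[str]]) -> Dict[str, Postings]:
    """Apply f_i = freq(c_i) + sum of freq(p_i) over the synonyms p_i of c_i."""
    merged: Dict[str, Postings] = {}
    for word, postings in index.items():
        combined = Counter(postings)
        for synonym in synonyms.get(word, set()) - {word}:
            # Always read from the index before adjustment.
            combined.update(index.get(synonym, {}))
        merged[word] = dict(combined)
    return merged


def build_keyword_index(questions: List[List[str]], entity_terms: Set[str]) -> Dict[str, List[int]]:
    """Map each entity name or alias to the ids of the questions that contain it."""
    keyword_index: Dict[str, List[int]] = defaultdict(list)
    for question_id, words in enumerate(questions):
        for term in sorted(set(words) & entity_terms):
            keyword_index[term].append(question_id)
    return dict(keyword_index)
```

Because synonym frequencies are folded into the index at build time, a query-time lookup of one word already reflects its synonyms, which is consistent with the claim that a single search covers synonymous keywords with almost no extra inference time.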
S45, directly storing the large number of unstructured knowledge texts in the artificial intelligence field as a text library and establishing an inverted index for the text library, as shown in FIG. 10. The jieba word segmentation tool is used to segment each text paragraph in the text library and remove stop words, giving a group of words; the union of the words obtained from all text paragraphs in the text library processed in this way is the set of all words in the text library. First, an index is built that takes every word in the text library as a key and the frequency with which the word appears in each paragraph as the value. The index is then adjusted according to the synonym dictionary as follows: if word A and word B are synonyms, the value for the key of word A in the adjusted index is obtained by finding every paragraph in which word A or word B appears in the index before adjustment; if only one of word A and word B appears in the paragraph, the value is the frequency of that word in the paragraph; if both appear in the paragraph, the value is the sum of the frequencies of word A and word B in the paragraph. That is, the word frequency $f_i$ of a word $c_i$ in a given paragraph of the text library is:
$$f_i = \mathrm{freq}(c_i) + \sum_{p_i} \mathrm{freq}(p_i)$$
where $\mathrm{freq}$ counts the frequency of a word in the given paragraph of the text library, and the sum runs over the synonyms $p_i$ of $c_i$.
As shown in fig. 11, in this embodiment, the specific processing procedure of the question-answering module is as follows:
S51, the input preprocessing module removes stop words from the question input by the user, performs word segmentation and part-of-speech tagging using the jieba word segmentation tool together with the new words obtained by the information extraction model, and performs coreference resolution based on the segmentation and tagging results and the LTP language model of Harbin Institute of Technology (HIT). In coreference resolution, as shown in FIG. 12, the segmentation result, the parts of speech, and the dependency arc relations are obtained from the part-of-speech tagging of jieba and the syntactic analysis of the LTP model. The system first judges whether a subject exists. If not, the pronoun is located in the part-of-speech array, and if its dependency relation is neither a verb-object relation (VOB) nor an attributive relation (ATT), the pronoun is replaced with the most recently recorded subject. If a subject exists, the pronoun is located and the system judges whether its dependency arc relation is a coordinate relation (COO); if so, the pronoun is replaced with the subject of the input. If a subject exists and no pronoun is present, the subject of the input is recorded for later coreference resolution.
S52, the question-answering module based on the knowledge base is applied. After receiving the result from the preprocessing module, it first quickly searches the knowledge base for a question identical to the user's question; if one is found, the corresponding answer is obtained directly. This path is typically taken after a user accepts a question recommended by the system, and it greatly reduces the response time of the system and the computing resources consumed. When no exact match is found, the preprocessed user question is passed to the improved BM25 knowledge base coarse recall module, which performs a coarse search over all question texts in the knowledge base. The ESIM model then performs fine semantic matching between the preprocessed user question and the coarse search results to find the semantically closest result; if its matching score exceeds a set threshold, the corresponding answer is taken as the answer to be returned to the user. As shown in FIG. 14, the ESIM model judges whether two questions are synonymous: it takes a batch of question pairs as input and outputs, for each pair, the probability that the two questions are synonymous, i.e. the matching score. If the BM25 knowledge base coarse recall module and the ESIM model find no result whose matching score exceeds the threshold in this step, the preprocessed user question is input to the question-answering module based on the text base.
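The control flow of step S52 (exact match, then coarse recall, then fine matching against a threshold) can be sketched in Python. Everything named here is hypothetical: `coarse_recall` stands in for the improved BM25 module, `match_score` stands in for the ESIM model, and the `threshold` and `top_k` values are illustrative, since the source does not give them.

```python
def answer_from_knowledge_base(question, kb, coarse_recall, match_score,
                               threshold=0.5, top_k=10):
    """Return an answer from the knowledge base, or None to signal fallback.

    kb:            dict mapping question text -> answer text
    coarse_recall: callable(question, top_k) -> candidate KB questions
    match_score:   callable(q1, q2) -> synonymy probability in [0, 1]
    """
    # 1. Exact match: cheapest path, no model inference needed.
    if question in kb:
        return kb[question]
    # 2. Coarse recall followed by fine semantic matching.
    candidates = coarse_recall(question, top_k)
    if not candidates:
        return None
    best = max(candidates, key=lambda cand: match_score(question, cand))
    if match_score(question, best) > threshold:
        return kb[best]
    # 3. No confident match: the caller falls back to the text-base module.
    return None

# Toy stand-ins so the sketch runs without any trained model.
KB = {"what is bm25": "answer about bm25", "what is esim": "answer about esim"}

def toy_recall(question, top_k):
    return list(KB)[:top_k]

def toy_score(q1, q2):
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)  # word-overlap ratio, not a real ESIM score
```

Returning `None` rather than raising keeps the fallback to the text-base module explicit at the call site.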
Specifically, as shown in FIG. 13, the improved BM25 knowledge base coarse recall module first judges whether each word in the preprocessed user question exists in the set of all words in the knowledge base. If it does, the word is retained directly. If it does not and the word has synonyms, the module judges whether any of those synonyms exists in the set of all words in the knowledge base; if so, the word is replaced with that synonym and retained. If no synonym is found in the knowledge base, or the word has no synonyms, the word is removed. This yields the further processed user question. Finally, the relevance score between each retained word and each question in the knowledge base is calculated, and the per-word scores are summed to obtain the relevance score between the query and each question in the knowledge base.
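The word-filtering step can be sketched as a small Python function. The name `rewrite_query` and the sample vocabulary and synonym table are assumptions for illustration; when several synonyms are in the vocabulary, this sketch simply takes the first in sorted order, a choice the source does not specify.

```python
def rewrite_query(words, kb_vocab, synonyms):
    """Keep words known to the knowledge base, swap in a known synonym, else drop."""
    kept = []
    for word in words:
        if word in kb_vocab:
            kept.append(word)
            continue
        # sorted() only makes the choice deterministic for this sketch.
        replacement = next(
            (s for s in sorted(synonyms.get(word, ())) if s in kb_vocab), None)
        if replacement is not None:
            kept.append(replacement)
    return kept

kb_vocab = {"neural", "network", "training"}
synonyms = {"net": {"network"}, "learning": {"study"}}
result = rewrite_query(["neural", "net", "learning", "speed"], kb_vocab, synonyms)
```

Here "net" is replaced by "network", while "learning" (synonym not in the vocabulary) and "speed" (no synonyms) are removed.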
The relevance score between the further processed user question Q and the j-th question d_j in the knowledge base is:
$$\mathrm{Score}(Q, d_j) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot W_{ij}$$
where q_i is the i-th word in the further processed user question; W_ij is the weight of the j-th question in the knowledge base with respect to the i-th word in the user question; IDF(q_i) is the inverse document frequency of the i-th word in the further processed user question; n is the number of words in the user question; and i indexes the words of the user question.
Specifically, IDF(q_i) is computed from N and n(q_i), where N is the total number of question-answer pairs in the knowledge base and n(q_i) is the number of question-answer pairs in the knowledge base that contain the i-th word of the further processed user question.
Specifically, W_ij is computed with the following parameters: k_1 is the term frequency saturation parameter, here taken as 1.5; b is the length normalization parameter, here taken as 0.75; dl_j is the string length of the j-th question in the knowledge base; and avgdl is the average string length of all questions in the knowledge base.
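For reference, the symbols defined here match the standard Okapi BM25 formulation, which is written out below. This is an assumption rather than a statement of the patent's exact formulas: the term frequency symbol f_{ij} (the frequency of q_i in d_j) is introduced for illustration, and the improved variant may differ in detail.

```latex
\mathrm{Score}(Q, d_j) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\, W_{ij}

\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}

W_{ij} = \frac{f_{ij}\,(k_1 + 1)}{f_{ij} + k_1 \left(1 - b + b\,\dfrac{dl_j}{avgdl}\right)}
```

With k_1 = 1.5 and b = 0.75, a question of average length (dl_j = avgdl) that contains q_i once receives W_ij = 2.5 / 2.5 = 1.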
S53, if the question-answering module based on the knowledge base finds no answer to the user's question, the question-answering module based on the text base is applied. The improved BM25 knowledge base coarse recall module first searches the knowledge text paragraphs in the text library with the further processed user question and coarsely recalls candidates. The coarse recall results and the preprocessed user question are then input to the RoBERTa-QA model, which predicts in parallel, for each recalled paragraph, the start and end positions of the answer to the user's question and the corresponding probabilities. If the probability that any paragraph contains an answer exceeds a set threshold, the text between the most probable start position and the most probable end position in that paragraph is taken as the answer returned to the user.
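The answer-selection rule can be sketched in Python, taking the per-paragraph answer probability to be the product of the maximum start and end probabilities, as stated in claim 5. The function name, the `threshold` value, the choice of the highest-scoring paragraph when several pass the threshold, and the skipping of spans whose end precedes the start are all assumptions of this sketch; the probabilities below are made-up numbers, not model output.

```python
def select_answer(paragraphs, start_probs, end_probs, threshold=0.5):
    """Pick the best answer span across coarsely recalled paragraphs.

    paragraphs:  list of token lists, one per recalled paragraph
    start_probs: per-paragraph list of start-position probabilities
    end_probs:   per-paragraph list of end-position probabilities
    """
    best_score, best_span = 0.0, None
    for tokens, starts, ends in zip(paragraphs, start_probs, end_probs):
        s = max(range(len(starts)), key=starts.__getitem__)
        e = max(range(len(ends)), key=ends.__getitem__)
        if e < s:
            continue  # assumption: ignore a paragraph whose best end precedes its start
        score = starts[s] * ends[e]  # probability that the paragraph holds an answer
        if score > threshold and score > best_score:
            best_score, best_span = score, tokens[s:e + 1]
    return best_span

paragraphs = [["bm25", "is", "a", "ranking", "function"],
              ["esim", "is", "a", "model"]]
start_probs = [[0.05, 0.05, 0.8, 0.05, 0.05], [0.3, 0.3, 0.2, 0.2]]
end_probs = [[0.02, 0.02, 0.02, 0.04, 0.9], [0.2, 0.2, 0.2, 0.4]]
span = select_answer(paragraphs, start_probs, end_probs)
```

A return value of `None` means that no paragraph passed the threshold.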
As shown in FIG. 15, in this embodiment, the improved BM25 text library coarse recall module calculates the relevance score between each word of the further processed user question and every text paragraph in the text library, and sums the per-word scores to obtain the relevance score between the further processed user question and every text paragraph in the text library. That is, the relevance score between the further processed user question Q and the j-th paragraph d_j in the text library is:
$$\mathrm{Score}(Q, d_j) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot W_{ij}$$
where q_i is the i-th word in the further processed user question; W_ij is the weight of the j-th question in the knowledge base with respect to the i-th word in the user question; IDF(q_i) is the inverse document frequency of the i-th word in the further processed user question; n is the number of words in the user question; and i indexes the words of the user question.
Specifically, IDF(q_i) is computed from N and n(q_i), where N is the total number of question-answer pairs in the knowledge base and n(q_i) is the number of question-answer pairs in the knowledge base that contain the i-th word of the further processed user question.
Specifically, in W_ij, k_1 is the term frequency saturation parameter, here taken as 1.5; b is the length normalization parameter, here taken as 0.75; dl_j is the string length of the j-th question in the knowledge base; and avgdl is the average string length of all questions in the knowledge base.
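A self-contained Python sketch of the scoring step follows. It assumes the standard Okapi BM25 inverse document frequency and term weight with k_1 = 1.5 and b = 0.75; the exact form used by the improved module is not recoverable from the text, so the function `bm25_scores` and its details are illustrative only. Document length is measured in characters, following the "string length" wording.

```python
import math

def bm25_scores(query_words, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query_words`.

    docs: non-empty list of non-empty token lists (questions or paragraphs).
    """
    N = len(docs)
    lengths = [sum(len(w) for w in doc) for doc in docs]  # string length dl_j
    avgdl = sum(lengths) / N
    scores = [0.0] * N
    for q in query_words:
        n_q = sum(1 for doc in docs if q in doc)  # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        for j, doc in enumerate(docs):
            f = doc.count(q)  # raw term frequency of q in document j
            w = f * (k1 + 1) / (f + k1 * (1 - b + b * lengths[j] / avgdl))
            scores[j] += idf * w
    return scores

docs = [["bm25", "ranking"], ["esim", "matching"], ["roberta", "reading"]]
scores = bm25_scores(["bm25", "missing"], docs)
```

To follow the synonym-adjusted index of step S45, `doc.count(q)` would be replaced by the merged frequency f_i that includes the synonyms of q.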
S54, the question-answer recommendation module based on the knowledge base is applied, as shown in FIG. 16. Using the knowledge base keyword index established in step S44, the module first searches the knowledge base for questions that share an entity name or a unique name with the preprocessed user question. If such questions are found, one of them is randomly selected from the knowledge base for recommendation to the user; otherwise, one question is randomly selected from the entire knowledge base. After the selection, the module checks whether the selected question is identical to the preprocessed user question, or whether its answer is identical to the answer about to be returned to the user. If either is the case, the preceding steps are repeated until a question that differs on both counts is obtained. Finally, the answer to the user's question and the question recommended to the user are returned in sequence.
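The recommendation loop can be sketched in Python as follows. The function name, the data layout, and the `max_tries` bound are assumptions of this sketch; the bound is added only so the loop terminates when every candidate repeats the current exchange, a case the source does not address.

```python
import random

def recommend_question(user_question, answer, kb, keyword_index, names,
                       tokenize, rng=random, max_tries=100):
    """Pick a knowledge base question that differs from the current exchange.

    kb:            dict question -> answer
    keyword_index: dict name -> list of KB questions mentioning it
    names:         set of entity names and unique names
    """
    hits = set(tokenize(user_question)) & names
    related = sorted({q for name in hits for q in keyword_index.get(name, [])})
    # Fall back to the whole knowledge base when no related question exists.
    pool = related if related else sorted(kb)
    for _ in range(max_tries):
        candidate = rng.choice(pool)
        # Reject a recommendation that repeats the question or its answer.
        if candidate != user_question and kb[candidate] != answer:
            return candidate
    return None

kb = {"what is bm25": "A1", "where is bm25 used": "A2", "what is esim": "A3"}
keyword_index = {"bm25": ["what is bm25", "where is bm25 used"],
                 "esim": ["what is esim"]}
names = {"bm25", "esim"}
pick = recommend_question("what is bm25", "A1", kb, keyword_index, names,
                          str.split, rng=random.Random(0))
```

Passing a seeded `random.Random` instance, as in the example, makes the selection reproducible for testing.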
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. The efficient intelligent question-answering system for the knowledge in the artificial intelligence field is characterized by comprising a preparation module and a question-answering module; the preparation module comprises a data collection module, a model training module and a knowledge structure construction module of a question-answering system; the question-answering module comprises an input preprocessing module, a question-answering module based on a knowledge base, a question-answering module based on a text base and a question recommending module based on the knowledge base;
the preparation module annotates the collected unstructured knowledge text paragraphs in the artificial intelligence field through the data collection module in order to train the information extraction model and the machine reading comprehension model of the model training module, and collects or defines synonymous and non-synonymous questions in the artificial intelligence field to train the short text matching model; the knowledge structure construction module of the question-answering system uses the trained information extraction model to extract knowledge triples and form question-answer pairs, uses the extracted entity names and unique names to assist search, provides semantics to the search engine through an improved method of constructing the inverted indexes of the knowledge base and the text base, and constructs the knowledge base keyword index;
the question-answering module is used for preprocessing the questions input by the user through the input preprocessing module, searching answers by utilizing the question-answering module based on the knowledge base, if the answers exist, preparing the answers to return, otherwise, sending the preprocessed questions input by the user into the question-answering module based on the text base to search and prepare the answers to return, recommending the questions to the user by utilizing the question recommending module based on the knowledge base, and finally returning the answers and the recommended questions to the user;
the model training module is realized as follows:
S31, constructing an information extraction model using the HBT model, initializing the model parameters with the RoBERTa pre-trained model, and training with the annotated triple data;
S32, building an ESIM model to train the short text matching model: training Chinese character vectors with word2vec on the full Chinese Wikipedia corpus, pre-training the ESIM model on a Chinese translation of the Quora Question Pairs dataset, and, starting from the pre-trained model, fine-tuning the ESIM model parameters on the open large-scale Chinese short text matching dataset LCMC and the collected short text matching dataset in the artificial intelligence field;
S33, building a RoBERTa-QA model to train the machine reading comprehension model: further pre-training the Chinese pre-trained model RoBERTa on the open dataset DuReader, and then fine-tuning the parameters on the collected machine reading comprehension annotation data in the artificial intelligence field;
the implementation process of the knowledge structure building module of the question-answering system is as follows:
S41, collecting unstructured knowledge text paragraphs in the artificial intelligence field;
S42, extracting triples from the unstructured knowledge text paragraphs in the artificial intelligence field using the information extraction model trained in step S31;
S43, separately extracting the entity names and the contents of the entity-name triples, forming question-answer key-value pairs from the extracted triples, crawling triple knowledge from Baidu Baike and Wikipedia to form question-answer pairs, and placing the question-answer pairs in the knowledge base, where they are stored as key-value pairs with the question as the key and the answer as the value;
S44, establishing an inverted index and a keyword index for all questions in the knowledge base; when establishing the inverted index, segmenting the question text of each question-answer pair in the knowledge base and removing stop words using the jieba word segmentation tool to obtain a group of words, the union of the words obtained after all questions in the knowledge base have been processed in this way being the set of all words in the knowledge base; when constructing the knowledge base keyword index, traversing each question in the knowledge base and, if the words of the question intersect with the set of entity names and names extracted by the information extraction model, adding the question to the value corresponding to that entity name or name in the knowledge base keyword index;
S45, storing the unstructured knowledge text paragraphs in the artificial intelligence field directly as a text library, establishing an inverted index for the text library, and segmenting each text paragraph in the text library and removing stop words using the jieba word segmentation tool to obtain a group of words, the union of the words obtained after all text paragraphs in the text library have been processed in this way being the set of all words in the text library.
2. The high-efficiency intelligent question-answering system oriented to knowledge in artificial intelligence field according to claim 1, wherein the data collection module is realized as follows:
S21, collecting unstructured knowledge text paragraphs from scientific publications, literature and online popular science material in the artificial intelligence field, and splitting the text paragraphs at periods;
S22, defining the types of key information triples to be extracted by the information extraction model, using a common relation definition method, as follows: entity-description-content, entity-presenter-content, entity-inclusion-content, entity-application-content, and entity-name-content; then, for the defined triple types, using the branch text labeling tool on the collected unstructured knowledge text paragraphs in the artificial intelligence field to mark all subjects by selecting them, mark all objects by selecting them, and mark every correspondence between a subject and an object in the text paragraphs by connecting them with a line;
S23, for the machine reading comprehension model, collecting unstructured knowledge text paragraphs from scientific publications, literature and online popular science material in the artificial intelligence field; if a corresponding question already exists, marking the start and end positions of the answer to that question in the text paragraph; otherwise, defining diverse scientific knowledge questions in the artificial intelligence field directly from part of the content of the text paragraph, simulating real user scenarios, and marking the start and end positions of the corresponding answers in the text paragraph;
S24, for the short text matching model, directly collecting synonymous questions and non-synonymous questions, in which synonymous questions are paired two by two with a label of 1 and non-synonymous questions are paired two by two with a label of 0.
3. The high-efficiency intelligent question-answering system oriented to knowledge in the field of artificial intelligence according to claim 1, wherein the input preprocessing module is implemented as follows: removing stop words from the question input by the user, performing word segmentation and part-of-speech tagging using the jieba word segmentation tool together with the new words obtained by the information extraction model, and performing coreference resolution based on the segmentation and tagging results and the LTP language model of Harbin Institute of Technology (HIT).
4. The efficient and intelligent question-answering system for artificial intelligence domain knowledge according to claim 1, wherein the question-answering module based on the knowledge base is implemented as follows: after receiving the result from the input preprocessing module, searching the knowledge base for a question identical to the user's question and, if one is found, directly obtaining the corresponding answer; if no exact match is found, performing a coarse search of the preprocessed user question over all question texts of the knowledge base with the improved BM25 knowledge base coarse recall module, performing fine semantic matching between the preprocessed user question and the coarse search results with the ESIM model to find a result, and, if the matching score of the result exceeds a set threshold, taking the corresponding answer as the answer to be returned to the user; if the BM25 knowledge base coarse recall module and the ESIM model find no result whose matching score exceeds the threshold in this step, inputting the preprocessed user question to the question-answering module based on the text base.
5. The efficient and intelligent question-answering system for artificial intelligence domain knowledge according to claim 1, wherein the text library-based question-answering module is implemented as follows: first, searching the knowledge text paragraphs in the text library with the further processed user question and coarsely recalling candidates through the improved BM25 knowledge base coarse recall module; then inputting the coarse recall results and the preprocessed user question to the RoBERTa-QA model, which predicts in parallel, for each recalled paragraph, the start and end positions of the answer to the user's question and the corresponding probabilities, the product of the probabilities at the most probable start position and the most probable end position being the probability that the paragraph contains an answer; and, if this probability exceeds a set threshold for any paragraph, taking the text between the most probable start position and the most probable end position in that paragraph as the answer to be returned to the user.
6. The efficient and intelligent question-answering system for artificial intelligence domain knowledge according to claim 1, wherein the question-answering recommendation module based on the knowledge base is implemented as follows: using the knowledge base keyword index established in step S44, searching the knowledge base for questions that share an entity name or a name with the preprocessed user question; if such questions are found, randomly selecting one of them from the knowledge base for recommendation to the user, and otherwise randomly selecting one question from the entire knowledge base; after the selection, checking whether the selected question is identical to the preprocessed user question or whether its answer is identical to the answer to be returned to the user, and, if so, repeating the preceding step until a question that differs on both counts is obtained; and returning in sequence the answer to the user's question and the question recommended to the user for display to the user.
CN202110392744.2A 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field Active CN113157885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392744.2A CN113157885B (en) 2021-04-13 2021-04-13 Efficient intelligent question-answering system oriented to knowledge in artificial intelligence field

Publications (2)

Publication Number Publication Date
CN113157885A CN113157885A (en) 2021-07-23
CN113157885B true CN113157885B (en) 2023-07-18

Family

ID=76890104

Country Status (1)

Country Link
CN (1) CN113157885B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant