CN117609436A - College scientific research management question-answering system combining knowledge graph and large language model


Info

Publication number
CN117609436A
Authority
CN
China
Prior art keywords
model
tag
data
module
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311287960.6A
Other languages
Chinese (zh)
Inventor
王永
符永骥
王鹏程
陈芊如
潘宇欣
赵越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202311287960.6A
Publication of CN117609436A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/313: Selection or weighting of terms for indexing
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F16/367: Ontology
    • G06F16/9024: Graphs; Linked lists
    • G06F40/216: Parsing using statistical methods
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N3/08: Learning methods
    • G06N5/022: Knowledge engineering; Knowledge acquisition
    • G06N5/027: Frames
    • G06N5/041: Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This patent relates to a college scientific research management question-answering system that combines a knowledge graph with a large language model. The system comprises a data module, a question processing module, and an answer generation module. The data module collects data on schools, majors, papers, patents, projects, and the like via web crawlers and stores it in structured form. The question processing module applies an intent recognition model and an entity recognition model to classify user questions and extract key entities. The answer generation module selects a strategy according to the question type, including template matching, graph database querying, form generation, and large language model generation, to produce accurate and diverse answers. By jointly exploiting the knowledge graph and the large language model, the system compensates for the limitations of each and provides a more comprehensive and accurate natural language question-answering service.

Description

College scientific research management question-answering system combining knowledge graph and large language model
The invention belongs to the field of knowledge question answering, and in particular relates to a college scientific research management question-answering system that combines a knowledge graph with a large language model.
Background
In today's era of information explosion, universities face challenges in managing large volumes of information and knowledge. Students, teachers, and campus administrators need to obtain information about school resources, courses, policies, and the like in a timely and accurate manner. Traditional information retrieval and question-answering systems suffer from various limitations, including the constraints of keyword matching, the inability to answer complex questions, and outdated information.
A knowledge graph is a carefully constructed knowledge representation with the advantages of high accuracy and high specificity. Because a knowledge graph is built from expert knowledge and multiple data sources, it can provide high-quality, precise answers: users can retrieve information directly from the graph without relying on fuzzy matching over raw text. In addition, a knowledge graph is extensible; it can be continuously updated and expanded to incorporate new information and knowledge, allowing the system to adapt to an evolving problem domain.
However, knowledge graphs also have limitations. First, their coverage is bounded by the data sources and expert knowledge used to build them, so they may be unable to answer questions in certain domains or on certain topics. Second, knowledge graphs struggle with ambiguous, open-ended, or vague questions, because such questions lack an explicit answer. Finally, building a multilingual knowledge graph and a question-answering system that supports multiple languages is more complex, since semantic and cultural differences between languages must be considered.
Meanwhile, large language models (such as ChatGPT) have strong natural language understanding and generation capabilities and broad generality, making them applicable to questions across many domains and topics. These models are trained on large-scale text corpora, possess extensive background knowledge, and can answer many types of questions.
However, large language models also have limitations. First, they lack real-time knowledge: because training is extremely expensive, they cannot provide up-to-date information or answers about current events. Second, they often rely on surface-level text and pattern matching rather than deep semantic understanding. Moreover, generated answers are not always accurate; depending on the training data, they may contain erroneous or imprecise information. Finally, large language models can be abused to generate false information, promotions, fraud, and harmful content, and therefore require supervision and management to reduce risk. In highly specialized domains, they may perform poorly because the training data may not cover those domains sufficiently.
Disclosure of Invention
To address these problems, the present invention combines the respective strengths and weaknesses of knowledge graphs and large language models and proposes a scientific research management question-answering system that jointly exploits both, achieving a more efficient and accurate natural language question-answering service while overcoming the limitations of each. By combining the accuracy of the knowledge graph with the generality of the large language model, the system can handle a wide range of question domains and provide more comprehensive and accurate answers, offering users an excellent question-answering experience.
To this end, the present invention provides the following technical scheme:
a college scientific research management question-answering system combining a knowledge graph and a large language model comprises three parts:
1. data module (1): this part is responsible for the cleansing of the data, structuring the available data and storing it in a database in a different format. This part includes the following submodules:
1) Template library (1 a): for saving answer templates generated by manual designs and large language models.
2) Graph database (1 b): data is stored in entity-relationship-entity and entity-relationship-attribute fashion.
3) Structured document database (1 c): the method is used for structurally storing document data such as papers, patents, projects and the like.
2. Problem processing module (2): this part processes the user's questions to obtain question classification information. This part includes the following submodules:
1) Intent recognition model (2 a): the method is used for classifying the problems, improving generalization of the model and simplifying the subsequent processing flow.
2) Entity recognition model (2 b): for extracting key entities that will be used in subsequent query operations on the database.
3. Answer generation module (3): and according to the query result of the data module (1) and the classification information of the question processing module (2), answering the questions of the user by combining corresponding strategies. This part includes the following submodules:
1) Dialog control module (3 a): responsible for reference resolution, entity normalization processing, and performing different query operations according to different classification information, etc.
2) Form module (3 b): and generating a proper form according to the result of the intention recognition model (2 a), guiding a user to fill in form information, and adopting a certain strategy.
3) Large Language Model (LLM) module (3 c): combining the query results and knowledge of the model itself, a diversity and high quality answer is generated.
Further, for the college scientific research management question-answering system combining the knowledge graph and the large language model according to claim 1, the overall process proceeds as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers.
S5: Construct the intent recognition model (2a). First, a training dataset is built from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM. Next, the model parameters are initialized with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face). Then, a fully connected layer for text classification is added after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution. Finally, the model is trained and optimized with the cross-entropy loss function to improve the accuracy of intent recognition.
Assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

S6: Construct the entity recognition model (2b). First, a training dataset is built from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM. Next, the model parameters are initialized with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face). Then, a fully connected layer and a conditional random field (CRF) layer are added after the BERT model: the fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task, and the CRF layer models the dependencies between tags. Finally, the BERT part is optimized with the cross-entropy loss while the CRF part is optimized with the CRF loss. In addition, this patent uses the Viterbi algorithm to find the best tag sequence when predicting on new text, achieving efficient entity recognition.
For the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized.
The Viterbi algorithm is a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF); it finds the tag sequence with the highest conditional probability for a given observation sequence. Specifically:
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X.
The goal of the Viterbi algorithm is to find the tag sequence with the highest conditional probability P(Y|X). This can be expressed as:

$$Y^* = \arg\max_{Y} P(Y|X)$$

The core idea of the Viterbi algorithm is to use dynamic programming to compute the best tag $y_i$ at each position i and gradually construct the optimal tag sequence. The algorithm comprises the following steps:
1) Initialization: initialize a matrix (table) V, where V[i][j] is the maximum conditional probability of selecting tag j at position i, and a backtracking matrix B, where B[i][j] stores the previous best tag when tag j is selected at position i.
2) Recursion: traverse each position i of the observation sequence X from left to right. For each position i and each possible tag j, compute:

$$V[i][j] = \max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big), \qquad B[i][j] = \arg\max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big)$$

where k ranges over the possible tags at position i-1, $s(x_i, j)$ is the score of tag j at position i, and $T(k, j)$ is the transition score from tag k to tag j.
3) Termination: at the end of the observation sequence (position N), find the final tag with the highest conditional probability: $y_N^* = \arg\max_{j} V[N][j]$.
4) Backtracking: starting from $y_N^*$, follow the backtracking matrix B from position N back to position 1 to recover the remaining tags.
Finally, the Viterbi algorithm returns the tag sequence $Y^*$ with the highest conditional probability $P(Y^*|X)$; this sequence is constructed by selecting, at each position, the tag on the highest-scoring path.
S7: The dialogue management module (3a) first performs coreference resolution on the user's question to resolve referring expressions, and uses the resolved mentions as query entities. It then normalizes the entities extracted by the entity recognition model (2b) to ensure accurate query results. These steps improve the understanding of the user's question and the accuracy of information retrieval.
S8: The dialogue management module (3a) selects different strategies to generate answers according to the classification information from the question processing module (2), as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer.
Compared with the prior art, the invention has the following beneficial effects:
1) Full use of the model: the system makes full use of the large language model, effectively integrating it into every stage of question answering and exploiting its strong understanding, analysis, and reasoning capabilities. Users obtain deeper and more accurate answers, improving the quality and effectiveness of the question-answering service.
2) High controllability: by employing normalized question-answer templates, the system ensures controllability of answer generation. This helps prevent the model from fabricating information or providing inaccurate answers, offering users a more reliable service. Controllability also helps the system give accurate answers in specialized fields.
3) Broad question coverage: by jointly exploiting the knowledge graph and the large language model, the system covers a wide range of question domains, including both specialized fields and general knowledge. It provides comprehensive answers without domain restrictions, increasing the applicability of the question-answering system.
4) Highly accurate answers: by combining the high accuracy of the knowledge graph with the deep background knowledge of the large language model, the system provides highly accurate answers and meets users' demands for information accuracy, helping them better understand and solve problems.
5) Real-time capability: although large language models face real-time challenges, the system tracks and updates the latest information through the knowledge graph. It can therefore meet users' real-time needs and provide a wider range of services, which is especially valuable when timely information is required.
Drawings
FIG. 1 is a flow chart of the college scientific research management question-answering system of the present invention.
FIG. 2 shows the structure of the intent recognition model.
FIG. 3 shows the structure of the entity recognition model.
Detailed Description
The following describes the embodiments and working principles of the present invention in further detail with reference to the drawings.
As shown in FIG. 1, the invention comprises a data module (1), a question processing module (2), and an answer generation module (3). The data module (1) contains a template library (1a), a graph database (1b), and a structured document database (1c); the question processing module (2) contains an intent recognition model (2a) and an entity recognition model (2b); the answer generation module (3) contains a dialogue management module (3a), a form module (3b), and a large language model (LLM) module (3c).
In this embodiment, the data module (1) is responsible for cleansing the data, structuring the usable data, and storing it in databases of different formats. It includes the following submodules:
1) Template library (1a): stores answer templates produced by manual design and by the large language model.
2) Graph database (1b): stores data as entity-relationship-entity and entity-relationship-attribute triples.
3) Structured document database (1c): stores document data such as papers, patents, and projects in structured form.
The question processing module (2) processes the user's question to obtain its classification information. It includes the following submodules:
1) Intent recognition model (2a): classifies questions, improving the generalization of the model and simplifying the subsequent processing flow.
2) Entity recognition model (2b): extracts the key entities that will be used in subsequent query operations on the databases.
The answer generation module (3) answers the user's question by combining the query results of the data module (1) with the classification information of the question processing module (2) under the corresponding strategy. It includes the following submodules:
1) Dialogue management module (3a): responsible for coreference resolution, entity normalization, and executing different query operations according to the classification information.
2) Form module (3b): generates an appropriate form according to the result of the intent recognition model (2a) and guides the user to fill in the form information.
3) Large language model (LLM) module (3c): generates diverse, high-quality answers by combining the query results with the model's own knowledge.
Furthermore, the overall process of the college scientific research management question-answering system combining the knowledge graph and the large language model is as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers; a sketch of this multi-channel recall is given below.
S5: the construction of the intention recognition model (2 a) is shown in fig. 3. The method comprises the following specific steps:
1) Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
2) Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
3) Add a fully connected layer for text classification after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution.
4) Train and optimize the model with the cross-entropy loss function to improve the accuracy of intent recognition.
Assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
5) Use the trained model to perform intent recognition on new text; a minimal training sketch is given below.
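As a concrete illustration of steps 2) to 4), here is a minimal sketch of the BERT-plus-fully-connected-layer intent classifier trained with cross-entropy loss, assuming PyTorch and the Hugging Face transformers library. The three intent classes, the sample question, and the gold label are hypothetical placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class IntentClassifier(nn.Module):
    """BERT encoder followed by a fully connected classification layer."""

    def __init__(self, num_classes: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        # Output size equals the number of intent classes, as in step 3).
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Sentence-level representation -> per-class scores (logits) z.
        return self.fc(out.pooler_output)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = IntentClassifier(num_classes=3)  # e.g., single-entity / multi-entity / open-ended
loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood, i.e. L above

batch = tokenizer(["王老师发表了哪些论文？"], padding=True, return_tensors="pt")
labels = torch.tensor([0])               # hypothetical gold intent label
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()                          # an optimizer step would follow in training
```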
S6: the entity recognition model (2 b) is constructed, and the structure of the entity recognition model is shown in fig. 3. The method comprises the following specific steps:
1) Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
2) Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
3) Add a fully connected layer and a conditional random field (CRF) layer after the BERT model. The fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task; the CRF layer models the dependencies between tags.
4) Optimize the BERT part with the cross-entropy loss, and the CRF part with the CRF loss.
For the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized. A minimal implementation sketch of this BERT-CRF architecture is given below.
5) Use the trained model to perform entity recognition on new text. This patent uses the Viterbi algorithm, a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF) that finds the tag sequence with the highest conditional probability for a given observation sequence (other decoding methods could also be used to find the best tag sequence). For the Viterbi algorithm:
let an observation sequence X be assumed, where X comprises N markers, denoted x= (X) 1 ,x 2 ,…,x N ). Let a tag sequence Y be assumed, wherein Y comprises N tags, denoted y= (Y) 1 ,y 2 ,…,y N ). These tags correspond to each tag in the observation sequence X.
In the viterbi algorithm, the goal is to find the tag sequence Y with the highest conditional probability P (y|x). This can be expressed as:
the core idea of the viterbi algorithm is to calculate the best tag y at each position i using dynamic programming i And gradually constructing the optimal tag sequence. The algorithm comprises the following steps:
1) Initializing: a matrix (table) V is initialized, where V i j represents the maximum conditional probability of selecting tag j at location i. A backtracking matrix B is initialized, wherein B [ i ] [ j ] stores the previous best tag when tag j was selected at location i.
2) And (3) recursion step: each position i in the observation sequence X is traversed from left to right. For each position i, for each possible tag j, V [ i ] [ j ] and B [ i ] [ j ] are calculated:
where k represents the tag at position i-1.
3) And (3) terminating: at the end N of the observation sequence, the last tag y with the highest conditional probability is found N
4) And (3) backtracking: where k represents the tag at position i-1.
Eventually, the Viterbi algorithm returns the tag sequence Y with the highest conditional probability P (Y|X) * . This sequence is constructed by selecting the tag with the highest conditional probability at each location.
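The four steps above translate directly into code. The following is a self-contained NumPy sketch of Viterbi decoding over the score matrices defined above, where emissions[i, j] plays the role of $s(x_i, j)$ and transitions[k, j] plays the role of $T(k, j)$; the three tags and the random scores in the usage example are illustrative.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Find the highest-scoring tag sequence for one sentence.

    emissions:   (N, C) score of tag j at position i, i.e. s(x_i, j)
    transitions: (C, C) score of moving from tag k to tag j, i.e. T(k, j)
    """
    N, C = emissions.shape
    V = np.full((N, C), -np.inf)      # V[i][j]: best path score ending in tag j at i
    B = np.zeros((N, C), dtype=int)   # B[i][j]: previous best tag on that path

    V[0] = emissions[0]               # 1) initialization
    for i in range(1, N):             # 2) recursion, left to right
        for j in range(C):
            scores = V[i - 1] + transitions[:, j] + emissions[i, j]
            B[i, j] = int(np.argmax(scores))
            V[i, j] = scores[B[i, j]]

    path = [int(np.argmax(V[N - 1]))]  # 3) termination: best final tag
    for i in range(N - 1, 0, -1):      # 4) backtracking to position 1
        path.append(int(B[i, path[-1]]))
    return path[::-1]

# Toy usage: 4 tokens, 3 tags (e.g., O / B-ENT / I-ENT) with random scores.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```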
S7: the dialogue management module (3 a) firstly carries out reference resolution on the problems of the user so as to solve the reference words in the problems, and queries the results of the reference resolution as entities. And then, carrying out normalization processing on the entity extracted by the entity identification model (2 b) to ensure that an accurate query result is obtained. These steps help to improve understanding of user problems and accuracy of information retrieval.
S8: the dialogue management module (3 a) selects different strategies to generate answers according to the classification information of the question processing module (2 a), and the specific steps are as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer. A sketch of this strategy dispatch is given below.
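The following sketch illustrates the three-way strategy dispatch of S8, using the official neo4j Python driver for the Cypher queries. The graph schema (Teacher, PUBLISHED, Paper), the TEMPLATES table, and the helpers build_form, llm_generate, and multi_channel_recall are hypothetical placeholders standing in for the form module (3b), the LLM module (3c), and the retrieval described in S4; they are not named in the patent.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_cypher(query: str, **params) -> list[dict]:
    """Execute a Cypher query against the graph database (1b)."""
    with driver.session() as session:
        return [record.data() for record in session.run(query, **params)]

def answer(question: str, intent: str, entities: dict) -> str:
    if intent == "single_entity":
        # Strategy 1): template matching, then fill the Cypher result into the template.
        rows = run_cypher(
            "MATCH (t:Teacher {name: $name})-[:PUBLISHED]->(p:Paper) "
            "RETURN p.title AS title",
            name=entities["teacher"],
        )
        titles = "、".join(r["title"] for r in rows)
        return TEMPLATES[intent].format(name=entities["teacher"], titles=titles)
    if intent == "multi_entity":
        # Strategy 2): form-guided query, then LLM generation from the results.
        form = build_form(intent, entities)   # form module (3b); asks until complete
        rows = run_cypher(form.to_cypher())
        return llm_generate(f"查询结果：{rows}\n用户问题：{question}")
    # Strategy 3): open-ended, graph query plus multi-channel document recall.
    rows = run_cypher("MATCH (n) WHERE n.name IN $names RETURN n",
                      names=list(entities.values()))
    docs = multi_channel_recall(question)     # e.g., BM25 + vector retrieval (see S4)
    return llm_generate(f"图谱结果：{rows}\n文档结果：{docs}\n用户问题：{question}")
```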
It should be noted that the above embodiments do not limit the present invention, and the present invention is not limited to the above examples. Changes, adaptations, additions, modifications, or substitutions made by those skilled in the art within the scope of the claims also fall within the protection scope of the invention.

Claims (6)

1. A college scientific research management question-answering system combining a knowledge graph and a large language model, comprising three parts:
1) Data module (1): responsible for cleansing the data, structuring the usable data, and storing it in databases of different formats. It includes the following submodules:
a) Template library (1a): stores answer templates produced by manual design and by the large language model.
b) Graph database (1b): stores data as entity-relationship-entity and entity-relationship-attribute triples.
c) Structured document database (1c): stores document data such as papers, patents, and projects in structured form.
2) Question processing module (2): processes the user's question to obtain its classification information. It includes the following submodules:
a) Intent recognition model (2a): classifies questions, improving the generalization of the model and simplifying the subsequent processing flow.
b) Entity recognition model (2b): extracts the key entities that will be used in subsequent query operations on the databases.
3) Answer generation module (3): answers the user's question by combining the query results of the data module (1) with the classification information of the question processing module (2) under the corresponding strategy. It includes the following submodules:
a) Dialogue management module (3a): responsible for coreference resolution, entity normalization, and executing different query operations according to the classification information.
b) Form module (3b): generates an appropriate form according to the result of the intent recognition model (2a) and guides the user to fill in the form information.
c) Large language model (LLM) module (3c): generates diverse, high-quality answers by combining the query results with the model's own knowledge.
2. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the data module (1) operates as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers.
3. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the question processing module (2) operates as follows:
S1: Construct an intent recognition model (2a) and an entity recognition model (2b).
S2: At prediction time, feed the user's question into the intent recognition model (2a) and the entity recognition model (2b) respectively to obtain the two models' predictions, then combine these predictions into the classification information of the question.
4. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the construction of the intent recognition model (2a) comprises the following steps:
S1: Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
S2: Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
S3: Add a fully connected layer for text classification after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution.
S4: Train and optimize the model with the cross-entropy loss function to improve the accuracy of intent recognition. Specifically, assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

S5: Use the trained model to perform intent recognition on new text.
5. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the construction of the entity recognition model (2b) comprises the following steps:
S1: Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
S2: Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
S3: Add a fully connected layer and a conditional random field (CRF) layer after the BERT model. The fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task; the CRF layer models the dependencies between tags.
S4: Optimize the BERT part with the cross-entropy loss and the CRF part with the CRF loss. Specifically, for the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized.
S5: Use the trained model to perform entity recognition on new text. This patent uses the Viterbi algorithm, a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF) that finds the tag sequence with the highest conditional probability for a given observation sequence (other decoding methods could also be used to find the best tag sequence). For the Viterbi algorithm:
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X.
The goal of the Viterbi algorithm is to find the tag sequence with the highest conditional probability P(Y|X). This can be expressed as:

$$Y^* = \arg\max_{Y} P(Y|X)$$

The core idea of the Viterbi algorithm is to use dynamic programming to compute the best tag $y_i$ at each position i and gradually construct the optimal tag sequence. The algorithm comprises the following steps:
1) Initialization: initialize a matrix (table) V, where V[i][j] is the maximum conditional probability of selecting tag j at position i, and a backtracking matrix B, where B[i][j] stores the previous best tag when tag j is selected at position i.
2) Recursion: traverse each position i of the observation sequence X from left to right. For each position i and each possible tag j, compute:

$$V[i][j] = \max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big), \qquad B[i][j] = \arg\max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big)$$

where k ranges over the possible tags at position i-1, $s(x_i, j)$ is the score of tag j at position i, and $T(k, j)$ is the transition score from tag k to tag j.
3) Termination: at the end of the observation sequence (position N), find the final tag with the highest conditional probability: $y_N^* = \arg\max_{j} V[N][j]$.
4) Backtracking: starting from $y_N^*$, follow the backtracking matrix B from position N back to position 1 to recover the remaining tags.
Finally, the Viterbi algorithm returns the tag sequence $Y^*$ with the highest conditional probability $P(Y^*|X)$; this sequence is constructed by selecting, at each position, the tag on the highest-scoring path.
6. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the answer generation module (3) operates as follows:
S1: The dialogue management module (3a) first performs coreference resolution on the user's question to resolve referring expressions, and uses the resolved mentions as query entities. It then normalizes the entities extracted by the entity recognition model (2b) to ensure accurate query results. These steps improve the understanding of the user's question and the accuracy of information retrieval.
S2: The dialogue management module (3a) selects different strategies to generate answers according to the classification information from the question processing module (2), as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer.
CN202311287960.6A (priority date 2023-12-22, filing date 2023-12-22): College scientific research management question-answering system combining knowledge graph and large language model. Status: Pending. Published as CN117609436A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311287960.6A CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311287960.6A CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Publications (1)

Publication Number Publication Date
CN117609436A 2024-02-27

Family

ID=89943091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311287960.6A Pending CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Country Status (1)

Country Link
CN (1) CN117609436A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093842A (en) * 2024-04-25 2024-05-28 安徽省交通规划设计研究总院股份有限公司 Optimization method and device of knowledge graph question-answering system


Similar Documents

Publication Publication Date Title
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN114020862B (en) Search type intelligent question-answering system and method for coal mine safety regulations
CN117033608A (en) Knowledge graph generation type question-answering method and system based on large language model
CN112015868B (en) Question-answering method based on knowledge graph completion
CN111782769B (en) Intelligent knowledge graph question-answering method based on relation prediction
CN116166782A (en) Intelligent question-answering method based on deep learning
CN112766507B (en) Complex problem knowledge base question-answering method based on embedded and candidate sub-graph pruning
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN112559723B (en) FAQ search type question-answering construction method and system based on deep learning
CN117609436A (en) College scientific research management question-answering system combining knowledge graph and large language model
CN112559765B (en) Semantic integration method for multi-source heterogeneous database
CN117370580A (en) Knowledge-graph-based large language model enhanced dual-carbon field service method
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN118410175A (en) Intelligent manufacturing capacity diagnosis method and device based on large language model and knowledge graph
CN116821294A (en) Question-answer reasoning method and device based on implicit knowledge ruminant
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
Alwaneen et al. Stacked dynamic memory-coattention network for answering why-questions in Arabic
CN111581365A (en) Predicate extraction method
CN117786052A (en) Intelligent power grid question-answering system based on domain knowledge graph
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN113868389B (en) Data query method and device based on natural language text and computer equipment
CN111767388A (en) Candidate pool generation method
CN111259650A (en) Text automatic generation method based on class mark sequence generation type countermeasure model
CN113779211B (en) Intelligent question-answering reasoning method and system based on natural language entity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination