CN117609436A - College scientific research management question-answering system combining knowledge graph and large language model


Info

Publication number
CN117609436A
Authority
CN
China
Prior art keywords
model
tag
data
module
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311287960.6A
Other languages
Chinese (zh)
Inventor
王永
符永骥
王鹏程
陈芊如
潘宇欣
赵越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202311287960.6A
Publication of CN117609436A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/313: Selection or weighting of terms for indexing
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F16/367: Ontology
    • G06F16/9024: Graphs; Linked lists
    • G06F40/216: Parsing using statistical methods
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N3/08: Learning methods
    • G06N5/022: Knowledge engineering; Knowledge acquisition
    • G06N5/027: Frames
    • G06N5/041: Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This patent relates to a college scientific research management question-answering system that combines a knowledge graph with a large language model. The system comprises a data module, a question processing module, and an answer generation module. The data module collects data on schools, majors, papers, patents, projects, and the like via web crawlers and stores it in structured form. The question processing module applies an intent recognition model and an entity recognition model to classify user questions and extract key entities. The answer generation module selects a strategy according to the question type, including template matching, graph database querying, form generation, and large language model generation, to produce accurate and diverse answers. By jointly exploiting the knowledge graph and the large language model, the system compensates for the limitations of each and provides a more comprehensive and accurate natural language question-answering service.

Description

College scientific research management question-answering system combining knowledge graph and large language model
The invention belongs to the field of knowledge question answering, and in particular relates to a college scientific research management question-answering system that combines a knowledge graph with a large language model.
Background
In today's era of information explosion, universities face challenges in managing large volumes of information and knowledge. Students, teachers, and campus administrators need to obtain information about school resources, courses, policies, and the like in a timely and accurate manner. Traditional information retrieval and question-answering systems suffer from various limitations, including the constraints of keyword matching, the inability to answer complex questions, and outdated information.
A knowledge graph is a carefully constructed knowledge representation with the advantages of high accuracy and high specificity. Because a knowledge graph is built from expert knowledge and multiple data sources, it can provide high-quality, precise answers: users can retrieve information directly from the graph without relying on fuzzy matching over raw text. In addition, a knowledge graph is extensible; it can be continuously updated and expanded to incorporate new information and knowledge, allowing the system to adapt to an evolving problem domain.
However, knowledge graphs also have limitations. First, their coverage is bounded by the data sources and expert knowledge used to build them, so they may be unable to answer questions in certain domains or on certain topics. Second, knowledge graphs struggle with ambiguous, open-ended, or vague questions, because such questions lack an explicit answer. Finally, building a multilingual knowledge graph and a question-answering system that supports multiple languages is more complex, since semantic and cultural differences between languages must be considered.
Meanwhile, large language models (such as ChatGPT) have strong natural language understanding and generation capabilities and broad generality, making them applicable to questions across many domains and topics. These models are trained on large-scale text corpora, possess extensive background knowledge, and can answer many types of questions.
However, large language models also have limitations. First, they lack real-time knowledge: because training is extremely expensive, they cannot provide up-to-date information or answers about current events. Second, they often rely on surface-level text and pattern matching rather than deep semantic understanding. Moreover, generated answers are not always accurate; depending on the training data, they may contain erroneous or imprecise information. Finally, large language models can be abused to generate false information, promotions, fraud, and harmful content, and therefore require supervision and management to reduce risk. In highly specialized domains, they may perform poorly because the training data may not cover those domains sufficiently.
Disclosure of Invention
To address these problems, the present invention combines the respective strengths and weaknesses of knowledge graphs and large language models and proposes a scientific research management question-answering system that jointly exploits both, achieving a more efficient and accurate natural language question-answering service while overcoming the limitations of each. By combining the accuracy of the knowledge graph with the generality of the large language model, the system can handle a wide range of question domains and provide more comprehensive and accurate answers, offering users an excellent question-answering experience.
To this end, the present invention provides the following technical scheme:
a college scientific research management question-answering system combining a knowledge graph and a large language model comprises three parts:
1. data module (1): this part is responsible for the cleansing of the data, structuring the available data and storing it in a database in a different format. This part includes the following submodules:
1) Template library (1 a): for saving answer templates generated by manual designs and large language models.
2) Graph database (1 b): data is stored in entity-relationship-entity and entity-relationship-attribute fashion.
3) Structured document database (1 c): the method is used for structurally storing document data such as papers, patents, projects and the like.
2. Problem processing module (2): this part processes the user's questions to obtain question classification information. This part includes the following submodules:
1) Intent recognition model (2 a): the method is used for classifying the problems, improving generalization of the model and simplifying the subsequent processing flow.
2) Entity recognition model (2 b): for extracting key entities that will be used in subsequent query operations on the database.
3. Answer generation module (3): and according to the query result of the data module (1) and the classification information of the question processing module (2), answering the questions of the user by combining corresponding strategies. This part includes the following submodules:
1) Dialog control module (3 a): responsible for reference resolution, entity normalization processing, and performing different query operations according to different classification information, etc.
2) Form module (3 b): and generating a proper form according to the result of the intention recognition model (2 a), guiding a user to fill in form information, and adopting a certain strategy.
3) Large Language Model (LLM) module (3 c): combining the query results and knowledge of the model itself, a diversity and high quality answer is generated.
Further, for the college scientific research management question-answering system combining the knowledge graph and the large language model according to claim 1, the overall process proceeds as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers.
S5: Construct the intent recognition model (2a). First, a training dataset is built from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM. Next, the model parameters are initialized with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face). Then, a fully connected layer for text classification is added after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution. Finally, the model is trained and optimized with the cross-entropy loss function to improve the accuracy of intent recognition.
Assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

S6: Construct the entity recognition model (2b). First, a training dataset is built from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM. Next, the model parameters are initialized with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face). Then, a fully connected layer and a conditional random field (CRF) layer are added after the BERT model: the fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task, and the CRF layer models the dependencies between tags. Finally, the BERT part is optimized with the cross-entropy loss while the CRF part is optimized with the CRF loss. In addition, this patent uses the Viterbi algorithm to find the best tag sequence when predicting on new text, achieving efficient entity recognition.
For the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized.
The Viterbi algorithm is a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF); it finds the tag sequence with the highest conditional probability for a given observation sequence. Specifically:
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X.
The goal of the Viterbi algorithm is to find the tag sequence with the highest conditional probability P(Y|X). This can be expressed as:

$$Y^* = \arg\max_{Y} P(Y|X)$$

The core idea of the Viterbi algorithm is to use dynamic programming to compute the best tag $y_i$ at each position i and gradually construct the optimal tag sequence. The algorithm comprises the following steps:
1) Initialization: initialize a matrix (table) V, where V[i][j] is the maximum conditional probability of selecting tag j at position i, and a backtracking matrix B, where B[i][j] stores the previous best tag when tag j is selected at position i.
2) Recursion: traverse each position i of the observation sequence X from left to right. For each position i and each possible tag j, compute:

$$V[i][j] = \max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big), \qquad B[i][j] = \arg\max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big)$$

where k ranges over the possible tags at position i-1, $s(x_i, j)$ is the score of tag j at position i, and $T(k, j)$ is the transition score from tag k to tag j.
3) Termination: at the end of the observation sequence (position N), find the final tag with the highest conditional probability: $y_N^* = \arg\max_{j} V[N][j]$.
4) Backtracking: starting from $y_N^*$, follow the backtracking matrix B from position N back to position 1 to recover the remaining tags.
Finally, the Viterbi algorithm returns the tag sequence $Y^*$ with the highest conditional probability $P(Y^*|X)$; this sequence is constructed by selecting, at each position, the tag on the highest-scoring path.
S7: The dialogue management module (3a) first performs coreference resolution on the user's question to resolve referring expressions, and uses the resolved mentions as query entities. It then normalizes the entities extracted by the entity recognition model (2b) to ensure accurate query results. These steps improve the understanding of the user's question and the accuracy of information retrieval.
S8: The dialogue management module (3a) selects different strategies to generate answers according to the classification information from the question processing module (2), as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer.
Compared with the prior art, the invention has the following beneficial effects:
1) Full use of the model: the system makes full use of the large language model, effectively integrating it into every stage of question answering and exploiting its strong understanding, analysis, and reasoning capabilities. Users obtain deeper and more accurate answers, improving the quality and effectiveness of the question-answering service.
2) High controllability: by employing normalized question-answer templates, the system ensures controllability of answer generation. This helps prevent the model from fabricating information or providing inaccurate answers, offering users a more reliable service. Controllability also helps the system give accurate answers in specialized fields.
3) Broad question coverage: by jointly exploiting the knowledge graph and the large language model, the system covers a wide range of question domains, including both specialized fields and general knowledge. It provides comprehensive answers without domain restrictions, increasing the applicability of the question-answering system.
4) Highly accurate answers: by combining the high accuracy of the knowledge graph with the deep background knowledge of the large language model, the system provides highly accurate answers and meets users' demands for information accuracy, helping them better understand and solve problems.
5) Real-time capability: although large language models face real-time challenges, the system tracks and updates the latest information through the knowledge graph. It can therefore meet users' real-time needs and provide a wider range of services, which is especially valuable when timely information is required.
Drawings
FIG. 1 is a flow chart of the college scientific research management question-answering system of the present invention.
FIG. 2 shows the structure of the intent recognition model.
FIG. 3 shows the structure of the entity recognition model.
Detailed Description
The following describes the embodiments and working principles of the present invention in further detail with reference to the drawings.
As shown in FIG. 1, the invention comprises a data module (1), a question processing module (2), and an answer generation module (3). The data module (1) contains a template library (1a), a graph database (1b), and a structured document database (1c); the question processing module (2) contains an intent recognition model (2a) and an entity recognition model (2b); the answer generation module (3) contains a dialogue management module (3a), a form module (3b), and a large language model (LLM) module (3c).
In this embodiment, the data module (1) is responsible for cleansing the data, structuring the usable data, and storing it in databases of different formats. It includes the following submodules:
1) Template library (1a): stores answer templates produced by manual design and by the large language model.
2) Graph database (1b): stores data as entity-relationship-entity and entity-relationship-attribute triples.
3) Structured document database (1c): stores document data such as papers, patents, and projects in structured form.
The question processing module (2) processes the user's question to obtain its classification information. It includes the following submodules:
1) Intent recognition model (2a): classifies questions, improving the generalization of the model and simplifying the subsequent processing flow.
2) Entity recognition model (2b): extracts the key entities that will be used in subsequent query operations on the databases.
The answer generation module (3) answers the user's question by combining the query results of the data module (1) with the classification information of the question processing module (2) under the corresponding strategy. It includes the following submodules:
1) Dialogue management module (3a): responsible for coreference resolution, entity normalization, and executing different query operations according to the classification information.
2) Form module (3b): generates an appropriate form according to the result of the intent recognition model (2a) and guides the user to fill in the form information.
3) Large language model (LLM) module (3c): generates diverse, high-quality answers by combining the query results with the model's own knowledge.
Furthermore, the overall process of the college scientific research management question-answering system combining the knowledge graph and the large language model is as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers; a sketch of this multi-channel recall is given below.
S5: the construction of the intention recognition model (2 a) is shown in fig. 3. The method comprises the following specific steps:
1) Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
2) Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
3) Add a fully connected layer for text classification after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution.
4) Train and optimize the model with the cross-entropy loss function to improve the accuracy of intent recognition.
Assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
5) Use the trained model to perform intent recognition on new text; a minimal training sketch is given below.
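As a concrete illustration of steps 2) to 4), here is a minimal sketch of the BERT-plus-fully-connected-layer intent classifier trained with cross-entropy loss, assuming PyTorch and the Hugging Face transformers library. The three intent classes, the sample question, and the gold label are hypothetical placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class IntentClassifier(nn.Module):
    """BERT encoder followed by a fully connected classification layer."""

    def __init__(self, num_classes: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        # Output size equals the number of intent classes, as in step 3).
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Sentence-level representation -> per-class scores (logits) z.
        return self.fc(out.pooler_output)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = IntentClassifier(num_classes=3)  # e.g., single-entity / multi-entity / open-ended
loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood, i.e. L above

batch = tokenizer(["王老师发表了哪些论文？"], padding=True, return_tensors="pt")
labels = torch.tensor([0])               # hypothetical gold intent label
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()                          # an optimizer step would follow in training
```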
S6: the entity recognition model (2 b) is constructed, and the structure of the entity recognition model is shown in fig. 3. The method comprises the following specific steps:
1) Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
2) Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
3) Add a fully connected layer and a conditional random field (CRF) layer after the BERT model. The fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task; the CRF layer models the dependencies between tags.
4) Optimize the BERT part with the cross-entropy loss, and the CRF part with the CRF loss.
For the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized. A minimal implementation sketch of this BERT-CRF architecture is given below.
5) Use the trained model to perform entity recognition on new text. This patent uses the Viterbi algorithm, a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF) that finds the tag sequence with the highest conditional probability for a given observation sequence (other decoding methods could also be used to find the best tag sequence). For the Viterbi algorithm:
let an observation sequence X be assumed, where X comprises N markers, denoted x= (X) 1 ,x 2 ,…,x N ). Let a tag sequence Y be assumed, wherein Y comprises N tags, denoted y= (Y) 1 ,y 2 ,…,y N ). These tags correspond to each tag in the observation sequence X.
In the viterbi algorithm, the goal is to find the tag sequence Y with the highest conditional probability P (y|x). This can be expressed as:
the core idea of the viterbi algorithm is to calculate the best tag y at each position i using dynamic programming i And gradually constructing the optimal tag sequence. The algorithm comprises the following steps:
1) Initializing: a matrix (table) V is initialized, where V i j represents the maximum conditional probability of selecting tag j at location i. A backtracking matrix B is initialized, wherein B [ i ] [ j ] stores the previous best tag when tag j was selected at location i.
2) And (3) recursion step: each position i in the observation sequence X is traversed from left to right. For each position i, for each possible tag j, V [ i ] [ j ] and B [ i ] [ j ] are calculated:
where k represents the tag at position i-1.
3) And (3) terminating: at the end N of the observation sequence, the last tag y with the highest conditional probability is found N
4) And (3) backtracking: where k represents the tag at position i-1.
Eventually, the Viterbi algorithm returns the tag sequence Y with the highest conditional probability P (Y|X) * . This sequence is constructed by selecting the tag with the highest conditional probability at each location.
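The four steps above translate directly into code. The following is a self-contained NumPy sketch of Viterbi decoding over the score matrices defined above, where emissions[i, j] plays the role of $s(x_i, j)$ and transitions[k, j] plays the role of $T(k, j)$; the three tags and the random scores in the usage example are illustrative.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Find the highest-scoring tag sequence for one sentence.

    emissions:   (N, C) score of tag j at position i, i.e. s(x_i, j)
    transitions: (C, C) score of moving from tag k to tag j, i.e. T(k, j)
    """
    N, C = emissions.shape
    V = np.full((N, C), -np.inf)      # V[i][j]: best path score ending in tag j at i
    B = np.zeros((N, C), dtype=int)   # B[i][j]: previous best tag on that path

    V[0] = emissions[0]               # 1) initialization
    for i in range(1, N):             # 2) recursion, left to right
        for j in range(C):
            scores = V[i - 1] + transitions[:, j] + emissions[i, j]
            B[i, j] = int(np.argmax(scores))
            V[i, j] = scores[B[i, j]]

    path = [int(np.argmax(V[N - 1]))]  # 3) termination: best final tag
    for i in range(N - 1, 0, -1):      # 4) backtracking to position 1
        path.append(int(B[i, path[-1]]))
    return path[::-1]

# Toy usage: 4 tokens, 3 tags (e.g., O / B-ENT / I-ENT) with random scores.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```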
S7: the dialogue management module (3 a) firstly carries out reference resolution on the problems of the user so as to solve the reference words in the problems, and queries the results of the reference resolution as entities. And then, carrying out normalization processing on the entity extracted by the entity identification model (2 b) to ensure that an accurate query result is obtained. These steps help to improve understanding of user problems and accuracy of information retrieval.
S8: the dialogue management module (3 a) selects different strategies to generate answers according to the classification information of the question processing module (2 a), and the specific steps are as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer. A sketch of this strategy dispatch is given below.
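The following sketch illustrates the three-way strategy dispatch of S8, using the official neo4j Python driver for the Cypher queries. The graph schema (Teacher, PUBLISHED, Paper), the TEMPLATES table, and the helpers build_form, llm_generate, and multi_channel_recall are hypothetical placeholders standing in for the form module (3b), the LLM module (3c), and the retrieval described in S4; they are not named in the patent.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_cypher(query: str, **params) -> list[dict]:
    """Execute a Cypher query against the graph database (1b)."""
    with driver.session() as session:
        return [record.data() for record in session.run(query, **params)]

def answer(question: str, intent: str, entities: dict) -> str:
    if intent == "single_entity":
        # Strategy 1): template matching, then fill the Cypher result into the template.
        rows = run_cypher(
            "MATCH (t:Teacher {name: $name})-[:PUBLISHED]->(p:Paper) "
            "RETURN p.title AS title",
            name=entities["teacher"],
        )
        titles = "、".join(r["title"] for r in rows)
        return TEMPLATES[intent].format(name=entities["teacher"], titles=titles)
    if intent == "multi_entity":
        # Strategy 2): form-guided query, then LLM generation from the results.
        form = build_form(intent, entities)   # form module (3b); asks until complete
        rows = run_cypher(form.to_cypher())
        return llm_generate(f"查询结果：{rows}\n用户问题：{question}")
    # Strategy 3): open-ended, graph query plus multi-channel document recall.
    rows = run_cypher("MATCH (n) WHERE n.name IN $names RETURN n",
                      names=list(entities.values()))
    docs = multi_channel_recall(question)     # e.g., BM25 + vector retrieval (see S4)
    return llm_generate(f"图谱结果：{rows}\n文档结果：{docs}\n用户问题：{question}")
```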
It should be noted that the above embodiments do not limit the present invention, and the present invention is not limited to the above examples. Changes, adaptations, additions, modifications, or substitutions made by those skilled in the art within the scope of the claims also fall within the protection scope of the invention.

Claims (6)

1. A college scientific research management question-answering system combining a knowledge graph and a large language model, comprising three parts:
1) Data module (1): responsible for cleansing the data, structuring the usable data, and storing it in databases of different formats. It includes the following submodules:
a) Template library (1a): stores answer templates produced by manual design and by the large language model.
b) Graph database (1b): stores data as entity-relationship-entity and entity-relationship-attribute triples.
c) Structured document database (1c): stores document data such as papers, patents, and projects in structured form.
2) Question processing module (2): processes the user's question to obtain its classification information. It includes the following submodules:
a) Intent recognition model (2a): classifies questions, improving the generalization of the model and simplifying the subsequent processing flow.
b) Entity recognition model (2b): extracts the key entities that will be used in subsequent query operations on the databases.
3) Answer generation module (3): answers the user's question by combining the query results of the data module (1) with the classification information of the question processing module (2) under the corresponding strategy. It includes the following submodules:
a) Dialogue management module (3a): responsible for coreference resolution, entity normalization, and executing different query operations according to the classification information.
b) Form module (3b): generates an appropriate form according to the result of the intent recognition model (2a) and guides the user to fill in the form information.
c) Large language model (LLM) module (3c): generates diverse, high-quality answers by combining the query results with the model's own knowledge.
2. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the data module (1) operates as follows:
S1: The data module (1) uses web crawlers to collect HTML pages and PDF documents about schools, majors, papers, patents, and projects from various websites. Table extraction and document segmentation are then applied to these pages and documents to obtain structured data on schools, majors, teachers, etc., as well as TXT-format data on papers, patents, projects, etc.
S2: A template library (1a) of answer templates is constructed manually. These templates are used to generate answers covering different question types and answer structures, increasing the diversity and accuracy of the answers.
S3: A graph database (1b) of entity-relationship-entity and entity-relationship-attribute triples is constructed from the structured data.
S4: The TXT-format data is further cleaned to generate a structured document database (1c) stored in JSON format, which supports information retrieval with multi-channel recall techniques such as BM25 and vector retrieval. This document database contains information on papers, patents, projects, etc., so that the system can retrieve the relevant information and provide answers.
3. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the question processing module (2) operates as follows:
S1: Construct an intent recognition model (2a) and an entity recognition model (2b).
S2: At prediction time, feed the user's question into the intent recognition model (2a) and the entity recognition model (2b) respectively to obtain the two models' predictions, then combine these predictions into the classification information of the question.
4. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the construction of the intent recognition model (2a) comprises the following steps:
S1: Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
S2: Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
S3: Add a fully connected layer for text classification after the BERT model; its output size equals the number of classes, and it converts the BERT output into a classification probability distribution.
S4: Train and optimize the model with the cross-entropy loss function to improve the accuracy of intent recognition. Specifically, assume there are C classes; the model outputs a vector z of size C containing the score (logit) for each class, and the true class label y ranges from 1 to C. The cross-entropy loss is:

$$L = -\sum_{i=1}^{C} y_i \log\big(\mathrm{softmax}(z)_i\big)$$

where $y_i$ is the i-th element of the one-hot true label vector, in which the element for the true class is 1 and the rest are 0; z is the model's output vector containing a score (logit) for each class; and $\mathrm{softmax}(z)_i$ denotes the softmax function applied to the i-th element of the model output, converting scores into class probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

S5: Use the trained model to perform intent recognition on new text.
5. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the construction of the entity recognition model (2b) comprises the following steps:
S1: Construct a training dataset from a small amount of manually annotated data, template data generated by Chatito, and diversified data generated by an LLM.
S2: Initialize the model parameters with a pre-trained Chinese BERT model (e.g., bert-base-chinese provided by Hugging Face).
S3: Add a fully connected layer and a conditional random field (CRF) layer after the BERT model. The fully connected layer maps the BERT output to the tag space of the named entity recognition (NER) task; the CRF layer models the dependencies between tags.
S4: Optimize the BERT part with the cross-entropy loss and the CRF part with the CRF loss. Specifically, for the NER task, assume there are C possible tags; at each position the model outputs a C-dimensional probability distribution vector. Denote the true tag sequence by y and the model output by p. The cross-entropy loss is:

$$L_{\mathrm{CE}} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

where $y_i$ is the i-th element of the true tag sequence, a one-hot vector in which exactly one element is 1, indicating the true tag; and $p_i$ is the i-th element of the model's output, the predicted probability distribution over the tags at position i.
The loss function of the conditional random field (CRF) is probability-based; it models the probability distribution over tag sequences and is optimized so that the probability of the true tag sequence is maximized.
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X. The goal of the CRF is to compute the conditional probability P(Y|X) of the tag sequence Y given the observation sequence X. This can be expressed as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X) \right)$$

where Z(X) is a normalization factor ensuring that the probabilities of all tag sequences sum to 1; $\lambda_k$ are model parameters balancing the contributions of the different feature functions; and $f_k(y_i, y_{i-1}, X)$ are feature functions measuring the relationship between the tags $y_i$ and $y_{i-1}$ and the observation sequence X. The goal of the model is to learn suitable parameters $\lambda_k$ that maximize the conditional probability P(Y|X) of the true tag sequence. Training uses the negative log-likelihood as the CRF loss function:

$$L_{\mathrm{CRF}} = -\log P(Y|X)$$

The goal of this loss function is to minimize the negative log-likelihood, optimizing the model parameters so that the conditional probability of the true tag sequence is maximized.
S5: Use the trained model to perform entity recognition on new text. This patent uses the Viterbi algorithm, a dynamic programming algorithm for decoding the tag sequence of a conditional random field (CRF) that finds the tag sequence with the highest conditional probability for a given observation sequence (other decoding methods could also be used to find the best tag sequence). For the Viterbi algorithm:
Let X be an observation sequence of N tokens, $X = (x_1, x_2, \ldots, x_N)$, and let Y be a tag sequence of N tags, $Y = (y_1, y_2, \ldots, y_N)$, with one tag for each token of X.
The goal of the Viterbi algorithm is to find the tag sequence with the highest conditional probability P(Y|X). This can be expressed as:

$$Y^* = \arg\max_{Y} P(Y|X)$$

The core idea of the Viterbi algorithm is to use dynamic programming to compute the best tag $y_i$ at each position i and gradually construct the optimal tag sequence. The algorithm comprises the following steps:
1) Initialization: initialize a matrix (table) V, where V[i][j] is the maximum conditional probability of selecting tag j at position i, and a backtracking matrix B, where B[i][j] stores the previous best tag when tag j is selected at position i.
2) Recursion: traverse each position i of the observation sequence X from left to right. For each position i and each possible tag j, compute:

$$V[i][j] = \max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big), \qquad B[i][j] = \arg\max_{k}\big(V[i-1][k] + s(x_i, j) + T(k, j)\big)$$

where k ranges over the possible tags at position i-1, $s(x_i, j)$ is the score of tag j at position i, and $T(k, j)$ is the transition score from tag k to tag j.
3) Termination: at the end of the observation sequence (position N), find the final tag with the highest conditional probability: $y_N^* = \arg\max_{j} V[N][j]$.
4) Backtracking: starting from $y_N^*$, follow the backtracking matrix B from position N back to position 1 to recover the remaining tags.
Finally, the Viterbi algorithm returns the tag sequence $Y^*$ with the highest conditional probability $P(Y^*|X)$; this sequence is constructed by selecting, at each position, the tag on the highest-scoring path.
6. The college scientific research management question-answering system combining a knowledge graph and a large language model according to claim 1, wherein the answer generation module (3) operates as follows:
S1: The dialogue management module (3a) first performs coreference resolution on the user's question to resolve referring expressions, and uses the resolved mentions as query entities. It then normalizes the entities extracted by the entity recognition model (2b) to ensure accurate query results. These steps improve the understanding of the user's question and the accuracy of information retrieval.
S2: The dialogue management module (3a) selects different strategies to generate answers according to the classification information from the question processing module (2), as follows:
1) For single-entity query questions, the system first attempts template matching, then executes a Cypher query against the graph database and fills the query results into the matched template to generate the answer.
2) For multi-entity query questions, the system uses the form module (3b) to generate an appropriate form according to the result of the intent recognition model (2a), and fills in the entity information provided by the entity recognition model (2b). If the form is incomplete, the system initiates a form-filling dialogue until it is complete. The dialogue management module (3a) then executes a Cypher query against the graph database, concatenates the query results with the user's question as a prompt, and feeds it to the LLM module (3c) to generate the answer.
3) For open-ended questions, the system executes a Cypher query against the graph database and performs multi-channel recall retrieval (e.g., BM25, vector retrieval) over the structured document database (1c). It then concatenates the two sets of results with the user's question as a prompt and feeds them to the LLM module (3c) to generate the answer.
CN202311287960.6A (priority date 2023-12-22, filing date 2023-12-22): College scientific research management question-answering system combining knowledge graph and large language model. Status: Pending. Published as CN117609436A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311287960.6A CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311287960.6A CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Publications (1)

Publication Number Publication Date
CN117609436A 2024-02-27

Family

ID=89943091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311287960.6A Pending CN117609436A (en) 2023-12-22 2023-12-22 College scientific research management question-answering system combining knowledge graph and large language model

Country Status (1)

Country Link
CN (1) CN117609436A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093842A (en) * 2024-04-25 2024-05-28 安徽省交通规划设计研究总院股份有限公司 Optimization method and device of knowledge graph question-answering system


Similar Documents

Publication Publication Date Title
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN114020862B (en) Search type intelligent question-answering system and method for coal mine safety regulations
CN117033608A (en) Knowledge graph generation type question-answering method and system based on large language model
CN112015868B (en) Question-answering method based on knowledge graph completion
CN111782769B (en) Intelligent knowledge graph question-answering method based on relation prediction
CN116166782A (en) Intelligent question-answering method based on deep learning
CN112766507B (en) Complex problem knowledge base question-answering method based on embedded and candidate sub-graph pruning
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN112559723B (en) FAQ search type question-answering construction method and system based on deep learning
CN117609436A (en) College scientific research management question-answering system combining knowledge graph and large language model
CN112559765B (en) Semantic integration method for multi-source heterogeneous database
CN117370580A (en) Knowledge-graph-based large language model enhanced dual-carbon field service method
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN118410175A (en) Intelligent manufacturing capacity diagnosis method and device based on large language model and knowledge graph
CN116821294A (en) Question-answer reasoning method and device based on implicit knowledge ruminant
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
Alwaneen et al. Stacked dynamic memory-coattention network for answering why-questions in Arabic
CN111581365A (en) Predicate extraction method
CN117786052A (en) Intelligent power grid question-answering system based on domain knowledge graph
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN113868389B (en) Data query method and device based on natural language text and computer equipment
CN111767388A (en) Candidate pool generation method
CN111259650A (en) Text automatic generation method based on class mark sequence generation type countermeasure model
CN113779211B (en) Intelligent question-answering reasoning method and system based on natural language entity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination