CN117033571A - Knowledge question-answering system construction method and system

Info

Publication number: CN117033571A
Application number: CN202310765310.1A
Authority: CN
Inventors: 李志芸; 冯落落; 李晓瑜; 李沛; 张庆功; 尹青山
Original assignee: Shandong New Generation Information Industry Technology Research Institute Co Ltd
Current assignee: Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date: 2023-06-27
Filing date: 2023-06-27
Publication date: 2023-11-10

Abstract

The application discloses a method and a system for constructing a knowledge question-answering system, belongs to the technical field of big data processing, and aims to solve the technical problem of how to construct a knowledge question-answering system by combining a large model and a knowledge graph. The method comprises the following steps: collecting and organizing knowledge data related to the chemical field, extracting entities, relations and attributes, and constructing a knowledge graph; analyzing and understanding the question text input by the user, and extracting entities, relations and attributes; retrieving information from the knowledge graph according to the keywords and entities in the question text; integrating the question text input by the user with the retrieved entities, relations and attributes to obtain a prompt; taking the prompt as input, generating a corresponding answer based on an answer prediction model constructed with large model technology; and formatting and presenting the generated answer based on the user's interface requirements, including presentation as text and charts.

Description

Knowledge question-answering system construction method and system

Technical Field

The application relates to the technical field of big data processing, in particular to a knowledge question-answering system construction method and system.

Background

Large models can acquire a large amount of knowledge and information by training on large-scale text data. They can extract information from text in many fields, including science, history, literature and technology, and can answer many types of questions. Large language models have strong language understanding and generation capabilities: when faced with a user's question, they can generate fluent, accurate answers and adapt their responses to the user's input and context. They can also learn, continuously improving through interaction with users and thereby raising the accuracy and quality of their answers. This learning capability allows the models to keep improving and to better meet users' needs. Large models are therefore well suited to question-answering systems.

Despite these advantages, large models face challenges such as misleading answers, inconsistency across a conversation, and data bias. In professional vertical fields in particular, such as the water conservancy industry, answers must be based on existing data, their authenticity must be ensured, and fabrication is not acceptable.

A knowledge graph is a graph structure used to organize and represent knowledge. It is a knowledge base containing entities, attributes and the relationships between them. In a knowledge graph, entities represent specific objects or concepts of the real world, and attributes describe relationships between entities or features of the entities. A knowledge graph integrates domain knowledge into a unified structure so that a computer can understand and process it. It can extract, link and organize information from multiple sources to build a rich knowledge network.

The information in a knowledge graph usually comes from reliable data sources or expert annotations and is subject to strict verification and review. This gives knowledge graphs an advantage in the reliability and controllability of their data. In contrast, large models acquire knowledge through automatic training on large-scale text data, which makes it difficult to ensure the accuracy and reliability of that data.

An industry knowledge graph draws mainly on data from within a field or an enterprise. It generally needs to scale up quickly and build an industry barrier; its knowledge structure is more complex and usually includes ontology engineering and rule-type knowledge. The quality requirements for knowledge extraction are very high: extraction relies largely on joint extraction from the enterprise's structured, semi-structured and unstructured data, with manual checking to ensure quality. Fusing multiple sources within the field is often required and is an effective means of scaling up the data. The application forms are more comprehensive: besides search and question answering, they include decision analysis, business management and the like, with higher demands on reasoning and interpretability. The main fields are e-commerce, finance, agriculture, security, medical care and the like.

How to combine the large model and the knowledge graph to construct the knowledge question-answering system is a technical problem to be solved.

Disclosure of Invention

The technical task of the application is to provide a knowledge question-answering system construction method and a knowledge question-answering system construction system aiming at the defects, so as to solve the technical problem of how to construct the knowledge question-answering system by combining a large model and a knowledge graph.

In a first aspect, the application provides a knowledge question-answering system construction method, which is used for constructing a knowledge question-answering system in the chemical field based on knowledge graph, LangChain and large model technology, and comprises the following steps:

collecting and organizing knowledge data related to the chemical field, preprocessing the knowledge data through natural language processing technology, extracting entities, relations and attributes, and constructing a knowledge graph based on the entities, relations and attributes, wherein a relation is a semantic relation between entities, and an attribute is descriptive information about an entity, including its characteristics and properties;

analyzing and understanding the question text input by the user through natural language processing technology, and extracting entities, relations and attributes;

retrieving information from the knowledge graph according to the keywords and entities in the question text to obtain related entities, relations and attributes;

integrating the question text input by the user with the retrieved entities, relations and attributes to obtain a prompt;

taking the prompt as input, generating a corresponding answer based on an answer prediction model constructed with large model technology;

based on the interface requirements of the user, the generated answers are formatted and presented, including being presented in the form of texts and charts.

Preferably, the knowledge data includes structured data and unstructured data;

for the structured data, entities, relations and attributes are extracted by means of entity modeling, relation modeling and triple storage;

for the unstructured data, entities, relations and attributes are extracted by means of entity extraction and relation extraction.

Preferably, the entity is extracted by:

performing regular matching based on rules to identify named entities;

or, treating the named entity recognition as a sequence labeling problem based on a statistical model, wherein the statistical model comprises a hidden Markov model, a conditional Markov model and a conditional random field model;

or, taking word vectors of the question text as input and realizing end-to-end named entity recognition based on a neural network model;

the relation is extracted by a rule-based method or a machine learning-based method;

relation extraction by the rule-based method comprises: extracting semantic relations between entities by identifying grammatical structures and context information in the question text using predefined rules and pattern matching techniques;

relation extraction by the machine learning-based method comprises: training a relation extraction model using a supervised or unsupervised learning algorithm, and identifying and extracting semantic relations between entities from the question text based on the trained relation extraction model;

the attribute is extracted by the following steps:

performing feature extraction on the question text by a rule-based matching method, supervised learning, semi-supervised learning or a deep learning method;

identifying and extracting attributes from the extracted features through a preconfigured classification model or sequence labeling model;

the rule-based matching method comprises rule-based pattern matching and rule-based keyword matching;

when the deep learning method is used, feature extraction is performed on the question text through a trained BERT model.

Preferably, the constructed knowledge graph is stored by a graph database;

and, according to the keywords and entities in the question text, retrieval is performed on the knowledge graph through the query language of the graph database, and the entities, relations and attributes related to the question text are returned.

Preferably, the answer prediction model is a model constructed based on ChatGPT, ChatGLM or Wenxin Yiyan (ERNIE Bot).

In a second aspect, the present application provides a knowledge question-answering system construction system for constructing a knowledge question-answering system in a chemical field by the knowledge question-answering system construction method according to any one of the first aspects, the construction system comprising:

the knowledge graph construction module is used for collecting and organizing knowledge data related to the chemical field, preprocessing the knowledge data through natural language processing technology, extracting entities, relations and attributes, and constructing a knowledge graph based on the entities, relations and attributes, wherein a relation is a semantic relation between entities, and an attribute is descriptive information about an entity, including its characteristics and properties;

the extraction module is used for analyzing and understanding the question text input by the user through natural language processing technology and extracting entities, relations and attributes;

the retrieval matching module is used for retrieving information from the knowledge graph according to the keywords and entities in the question text to obtain related entities, relations and attributes;

the information integration module is used for integrating the question text input by the user with the retrieved entities, relations and attributes to obtain a prompt;

the answer generation module is used for taking the prompt as input and generating a corresponding answer based on an answer prediction model constructed with large model technology;

and the answer display module is used for carrying out formatted presentation on the generated answer based on the interface requirement of the user, and comprises display in the form of text and a chart.

Preferably, the knowledge data includes structured data and unstructured data;

for the structured data, the knowledge graph construction module is used for extracting entities, relations and attributes by means of entity modeling, relation modeling and triple storage;

for the unstructured data, the knowledge graph construction module is used for extracting entities, relations and attributes by means of entity extraction and relation extraction.

Preferably, the extraction module is configured to extract the entity by:

performing regular matching based on rules to identify named entities;

the extraction module is used for extracting the relation by a rule-based method or a machine learning-based method;

the extraction module is used for extracting the attributes through the following steps:

Preferably, the knowledge graph is stored in a graph database;

the retrieval matching module is used for executing the following step: according to the keywords and entities in the question text, performing retrieval on the knowledge graph through the query language of the graph database, and returning the entities, relations and attributes related to the question text.

The knowledge question-answering system construction method and system have the following advantages: a question-answering system based on knowledge graph technology can exploit the knowledge graph's structured knowledge representation, fusion of professional domain knowledge and flexible information querying, while the large model contributes strong context understanding, multi-domain knowledge coverage, reasoning and language generation capabilities.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

The application is further described below with reference to the accompanying drawings.

FIG. 1 is a general framework of the knowledge graph;

FIG. 2 is a flow chart of the knowledge question-answering system construction method of Example 1.

Detailed Description

The application will be further described with reference to the accompanying drawings and specific examples, so that those skilled in the art can better understand the application and implement it, but the examples are not meant to limit the application, and the technical features of the embodiments of the application and the examples can be combined with each other without conflict.

The embodiment of the application provides a knowledge question-answering system construction method and a knowledge question-answering system construction system, which are used for solving the technical problem of how to construct the knowledge question-answering system by combining a large model and a knowledge graph.

Example 1:

The application discloses a knowledge question-answering system construction method, which is used for constructing a knowledge question-answering system in the chemical field based on knowledge graph, LangChain and large model technology, and comprises the following steps:

S100, collecting and organizing knowledge data related to the chemical field, preprocessing the knowledge data through natural language processing technology, extracting entities, relations and attributes, and constructing a knowledge graph based on the entities, relations and attributes, wherein a relation is a semantic relation between entities, and an attribute is descriptive information about an entity, including its characteristics and properties;

S200, analyzing and understanding the question text input by the user through natural language processing technology, and extracting entities, relations and attributes;

S300, retrieving information from the knowledge graph according to the keywords and entities in the question text to obtain related entities, relations and attributes;

S400, integrating the question text input by the user with the retrieved entities, relations and attributes to obtain a prompt;

S500, taking the prompt as input, and generating a corresponding answer based on an answer prediction model constructed with large model technology;

S600, formatting and presenting the generated answer based on the user's interface requirements, including presentation as text and charts.
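The flow of steps S200 to S600 can be sketched as a small pipeline of interchangeable callables. This is only an illustrative skeleton, not the implementation of the application: the class name `QAPipeline`, the helper `demo_pipeline`, the toy graph and the stub "model" are all assumptions introduced here, and step S100 (graph construction) is assumed to have been done offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (head entity, relation or attribute name, tail entity or value)
Triple = Tuple[str, str, str]


@dataclass
class QAPipeline:
    """Hypothetical wiring of steps S200 to S600; each step is a plain callable."""
    extract: Callable[[str], List[str]]                # S200: question -> entities
    retrieve: Callable[[List[str]], List[Triple]]      # S300: entities -> graph facts
    build_prompt: Callable[[str, List[Triple]], str]   # S400: question + facts -> prompt
    generate: Callable[[str], str]                     # S500: prompt -> answer
    present: Callable[[str], str]                      # S600: answer -> formatted output

    def answer(self, question: str) -> str:
        entities = self.extract(question)
        facts = self.retrieve(entities)
        prompt = self.build_prompt(question, facts)
        return self.present(self.generate(prompt))


def demo_pipeline() -> QAPipeline:
    """Builds a pipeline over a two-triple toy graph with a stub in place of the large model."""
    graph: List[Triple] = [
        ("ethanol", "formula", "C2H5OH"),
        ("ethanol", "is_a", "alcohol"),
    ]
    known = {head for head, _, _ in graph}
    return QAPipeline(
        extract=lambda q: [w for w in q.lower().strip("?").split() if w in known],
        retrieve=lambda ents: [t for t in graph if t[0] in ents],
        build_prompt=lambda q, facts: f"Known: {facts}\nQuestion: {q}",
        # Stub: echoes the last prompt line instead of calling a real model.
        generate=lambda prompt: "stub answer for: " + prompt.splitlines()[-1],
        present=lambda ans: ans.strip(),
    )
```

Keeping each step behind a callable means the extraction method, the graph store or the large model can be swapped without touching the others.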

The knowledge data collected in step S100 of this embodiment includes structured data and unstructured data. For the structured data, entities, relations and attributes are extracted by means of entity modeling, relation modeling and triple storage; for the unstructured data, entities, relations and attributes are extracted by means of entity extraction and relation extraction.
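For the structured branch, one possible reading of "entity modeling, relation modeling and triple storage" is to treat each record as an entity and each remaining column as either a relation to another entity or an attribute value. The sketch below assumes this reading; the function name `rows_to_triples`, the column names and the sample record are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(
    rows: Iterable[Dict[str, Optional[str]]],
    key: str,
    relation_columns: Set[str],
) -> Tuple[List[Triple], List[Triple]]:
    """Turn tabular records into (relation triples, attribute triples).

    `key` names the column holding the entity; columns listed in
    `relation_columns` point to other entities, all others are attributes.
    """
    relations: List[Triple] = []
    attributes: List[Triple] = []
    for row in rows:
        head = row[key]
        for column, value in row.items():
            if column == key or value is None or value == "":
                continue  # skip the entity column itself and empty cells
            target = relations if column in relation_columns else attributes
            target.append((head, column, str(value)))
    return relations, attributes


# Toy record, for illustration only.
sample = [{"name": "ethanol", "formula": "C2H5OH", "produced_by": "fermentation"}]
rels, attrs = rows_to_triples(sample, key="name", relation_columns={"produced_by"})
```

The resulting triples can then be loaded into whichever graph store is used.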

The knowledge graph constructed in this embodiment is stored in a graph database; in practical applications, other storage structures may be used as needed. Entities are extracted by the following methods:

The entity extraction method adopted in step S200 is one of the following: rule-based regular-expression matching for named entity recognition; treating named entity recognition as a sequence labeling problem based on a statistical model, where the statistical models include the hidden Markov model, the conditional Markov model and the conditional random field model; or end-to-end named entity recognition based on a neural network model that takes the word vectors of the question text as input and no longer depends on manually defined features.
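The rule-based option can be illustrated with a few regular expressions. The following is a minimal sketch assuming English-language text; the formula pattern, the labels and the small lexicon are assumptions made for illustration, and a real system would need far richer rules (the pattern below would also match ordinary all-capital abbreviations).

```python
import re
from typing import List, Tuple

# Two or more element symbols with optional counts, e.g. H2SO4 or NaCl (toy rule).
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")

# Hypothetical domain lexicon mapping surface forms to entity labels.
LEXICON = {
    "ethanol": "SUBSTANCE",
    "distillation": "PROCESS",
    "sulfuric acid": "SUBSTANCE",
}


def recognize_entities(text: str) -> List[Tuple[str, str, int]]:
    """Return (mention, label, start offset) tuples sorted by position."""
    found = [(m.group(), "FORMULA", m.start()) for m in FORMULA.finditer(text)]
    lowered = text.lower()  # same length as `text` for ASCII input
    for term, label in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((text[m.start():m.end()], label, m.start()))
    return sorted(found, key=lambda item: item[2])


mentions = recognize_entities("How is H2SO4 separated from ethanol by distillation?")
```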

Relation extraction is the extraction of semantic relations between two or more entities from text. It is closely related to entity extraction: generally, after the entities in the text have been identified, the relations that may exist between them are extracted. Relation extraction by the rule-based method comprises: extracting semantic relations between entities by identifying grammatical structures and context information in the question text using predefined rules and pattern matching techniques. Relation extraction by the machine learning-based method comprises: training a relation extraction model using a supervised or unsupervised learning algorithm, and identifying and extracting semantic relations between entities from the question text based on the trained relation extraction model.
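A rule-based relation extractor of the kind described can be sketched with predefined surface patterns applied after entity recognition. The patterns, relation names and example sentences below are hypothetical and only cover single-word English entity mentions.

```python
import re
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Predefined surface patterns, each paired with the relation it signals (toy rules).
PATTERNS = [
    (re.compile(r"(?P<head>[\w-]+) reacts with (?P<tail>[\w-]+)", re.IGNORECASE), "reacts_with"),
    (re.compile(r"(?P<head>[\w-]+) is produced from (?P<tail>[\w-]+)", re.IGNORECASE), "produced_from"),
    (re.compile(r"(?P<head>[\w-]+) dissolves in (?P<tail>[\w-]+)", re.IGNORECASE), "dissolves_in"),
]


def extract_relations(text: str, entities: Iterable[str]) -> List[Triple]:
    """Keep a pattern match only when both ends are already-recognized entities."""
    known = {e.lower() for e in entities}
    triples: List[Triple] = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(text):
            head, tail = m.group("head").lower(), m.group("tail").lower()
            if head in known and tail in known:
                triples.append((head, relation, tail))
    return triples


triples = extract_relations(
    "Sodium reacts with water. Ethanol dissolves in water.",
    entities=["sodium", "water", "ethanol"],
)
```

Filtering on recognized entities reflects the order described above: entities are identified first, then relations between them.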

Attributes are typically features, properties, or other descriptive information describing an entity, such as coordinates of a location, etc. In this embodiment, attribute extraction is performed by:

(1) Performing feature extraction on the question text by a rule-based matching method, supervised learning, semi-supervised learning or a deep learning method;

(2) Identifying and extracting attributes from the extracted features through a preconfigured classification model or sequence labeling model.

The rule-based matching method comprises rule-based pattern matching and rule-based keyword matching. When the deep learning method is used, feature extraction is performed on the question text through a trained BERT model.
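The keyword-matching variant can be illustrated as follows. In this sketch a simple keyword lookup stands in for both the feature extraction of step (1) and the classifier of step (2); it does not use BERT, and the attribute names and keyword lists are assumptions.

```python
from typing import Dict, List

# Hypothetical mapping from attribute names to trigger keywords.
ATTRIBUTE_KEYWORDS: Dict[str, List[str]] = {
    "boiling_point": ["boiling point", "boils"],
    "melting_point": ["melting point", "melts"],
    "density": ["density", "how dense"],
    "molar_mass": ["molar mass", "molecular weight"],
}


def extract_attributes(question: str) -> List[str]:
    """Return the attributes whose keywords occur in the question (case-insensitive)."""
    text = question.lower()
    return [
        attribute
        for attribute, keywords in ATTRIBUTE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


asked = extract_attributes("What are the boiling point and density of ethanol?")
```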

In step S300, information is retrieved from the knowledge graph according to the keywords and entities in the question text to find the related entities, relations, attributes and the like. In this embodiment, the knowledge graph is stored in a graph database; retrieval can use the query language of the graph database (such as Cypher) or another search algorithm, and returns the information segments related to the question, where an information segment is understood as the entities, relations and attributes related to the keywords and entities in the question text.
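A retrieval step using Cypher might look like the sketch below, which only builds a parameterized query and its parameter map; executing it requires a graph database driver, which is left out here. The graph schema (a `name` property on every node) and the function name `build_neighborhood_query` are assumptions, not details given by the application.

```python
from typing import Dict, List, Tuple


def build_neighborhood_query(
    entities: List[str], limit: int = 25
) -> Tuple[str, Dict[str, object]]:
    """Build a Cypher query returning each entity's neighbors, relation types and properties.

    Entity names are passed as parameters rather than spliced into the query text,
    so user-supplied strings cannot alter the query structure.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    query = (
        "MATCH (e)-[r]-(n) "
        "WHERE e.name IN $names "
        "RETURN e.name AS entity, type(r) AS relation, "
        "n.name AS neighbor, properties(e) AS attributes "
        "LIMIT $limit"
    )
    return query, {"names": list(entities), "limit": limit}


query, params = build_neighborhood_query(["ethanol"], limit=5)
```

The returned pair can be handed to a driver call that accepts a query string and a parameter dictionary.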

Step S400 integrates the question text input by the user in step S200 with the information retrieved in step S300 to generate a prompt, and inputs the prompt into the large model. The prompt template may be designed according to the actual situation, for example: "Known information: {information retrieved in step S300}. Based on the known information above, answer the user's question concisely and professionally. If the answer cannot be obtained from it, say 'The question cannot be answered based on the known information' or 'Not enough relevant information was provided'. Adding fabricated content to the answer is not allowed, and the answer must be in Chinese. The question is: {the user's question from step S200}".
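The template in step S400 can be filled with plain string formatting. The sketch below paraphrases the example template in English; the way retrieved triples are serialized (one `head | relation | tail` line each) and the `(none)` placeholder for an empty retrieval are assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

PROMPT_TEMPLATE = (
    "Known information:\n{context}\n\n"
    "Based on the known information above, answer the user's question concisely "
    "and professionally. If the answer cannot be obtained from it, say "
    '"The question cannot be answered based on the known information" or '
    '"Not enough relevant information was provided". '
    "Do not add fabricated content to the answer. Answer in Chinese.\n\n"
    "Question: {question}"
)


def build_prompt(question: str, facts: List[Triple]) -> str:
    """Serialize retrieved triples and insert them, with the question, into the template."""
    context = "\n".join(f"- {h} | {r} | {t}" for h, r, t in facts) or "(none)"
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What is the formula of ethanol?",
    [("ethanol", "formula", "C2H5OH")],
)
```

The explicit refusal instruction is what ties the model's answer to the retrieved graph facts rather than to whatever it learned in training.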

Step S500 invokes the large model to generate an answer. It uses the inherent advantages of the large model, such as context understanding, multi-domain knowledge coverage, zero-shot learning and language generation, to generate an answer from the input prompt. The large model may be ChatGPT, ChatGLM, Wenxin Yiyan (ERNIE Bot) or the like.

Step S600 presents the generated answer in a format according to the interface requirements of the user, for example, presents the answer to the user in text, a chart or other forms.
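As one way to picture the text-plus-chart presentation of step S600, the sketch below appends a plain-text bar chart to the answer when numeric data accompanies it. The function names, the chart style and the sample values are illustrative assumptions; an actual interface would more likely hand the data to a charting component.

```python
from typing import Dict, Optional


def render_bar_chart(values: Dict[str, float], width: int = 20) -> str:
    """Render labelled values as horizontal bars scaled to the largest value."""
    if not values:
        return ""
    peak = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        bar = "#" * (round(width * value / peak) if peak > 0 else 0)
        lines.append(f"{label.ljust(label_width)} | {bar} {value:g}")
    return "\n".join(lines)


def present(answer: str, chart_data: Optional[Dict[str, float]] = None) -> str:
    """Return the answer text, followed by a chart when chart data is supplied."""
    parts = [answer.strip()]
    if chart_data:
        parts.append(render_bar_chart(chart_data))
    return "\n\n".join(parts)


# Placeholder values, not real measurements.
output = present("Sample answer.", {"a": 1, "bb": 2})
```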

Example 2:

The application discloses a knowledge question-answering system construction system, which comprises a knowledge graph construction module, an extraction module, a retrieval matching module, an information integration module, an answer generation module and an answer display module, and which constructs a knowledge question-answering system in the chemical field by the method disclosed in Example 1.

The knowledge graph construction module is used for collecting and organizing knowledge data related to the chemical field, preprocessing the knowledge data through natural language processing technology, extracting entities, relations and attributes, and constructing a knowledge graph based on the entities, relations and attributes, wherein a relation is a semantic relation between entities, and an attribute is descriptive information about an entity, including its characteristics and properties.

In this embodiment, the knowledge data includes structured data and unstructured data. For the structured data, the knowledge graph construction module is used for extracting entities, relations and attributes by means of entity modeling, relation modeling and triple storage; for the unstructured data, the knowledge graph construction module is used for extracting entities, relations and attributes by means of entity extraction and relation extraction.

The extraction module is used for analyzing and understanding the question text input by the user through natural language processing technology and extracting entities, relations and attributes.

The extraction module is used for extracting entities by one of the following methods: rule-based regular-expression matching for named entity recognition; or treating named entity recognition as a sequence labeling problem based on a statistical model, wherein the statistical models include the hidden Markov model, the conditional Markov model and the conditional random field model; or taking the word vectors of the question text as input and realizing end-to-end named entity recognition based on a neural network model.

The extraction module is used for extracting relations by a rule-based method or a machine learning-based method. Relation extraction by the rule-based method comprises: extracting semantic relations between entities by identifying grammatical structures and context information in the question text using predefined rules and pattern matching techniques. Relation extraction by the machine learning-based method comprises: training a relation extraction model using a supervised or unsupervised learning algorithm, and identifying and extracting semantic relations between entities from the question text based on the trained relation extraction model.

The extraction module is used for extracting attributes through the following steps:

The rule-based matching method comprises rule-based pattern matching and rule-based keyword matching; when the deep learning method is used, feature extraction is performed on the question text through a trained BERT model.

The retrieval matching module is used for retrieving information from the knowledge graph according to the keywords and entities in the question text to obtain related entities, relations and attributes.

In this embodiment, the knowledge graph is stored in a graph database, and the retrieval matching module is used for executing the following step: according to the keywords and entities in the question text, performing retrieval on the knowledge graph through the query language of the graph database, and returning the entities, relations and attributes related to the question text.

The information integration module is used for integrating the question text input by the user with the retrieved entities, relations and attributes to obtain the prompt.

The answer generation module is used for taking the prompt as input and generating a corresponding answer based on an answer prediction model constructed by a large model technology.

The answer display module is used for carrying out formatted presentation on the generated answer based on the interface requirement of the user, and the answer display module comprises display in the form of text and diagrams.

The inherent advantages of the large model, such as context understanding, multi-domain knowledge coverage, zero-shot learning and language generation, are used to generate an answer from the input prompt. The answer prediction model in this embodiment is a model constructed based on ChatGPT, ChatGLM or Wenxin Yiyan (ERNIE Bot).

While the application has been illustrated and described in detail in the drawings and in the preferred embodiments, the application is not limited to the disclosed embodiments, and it will be appreciated by those skilled in the art that the technical means of the various embodiments described above may be combined to produce further embodiments of the application, which are also within the scope of the application.

Claims

1. A knowledge question-answering system construction method, characterized in that a knowledge question-answering system in the chemical field is constructed based on knowledge graph, LangChain and large model technology, the method comprising the following steps:

2. The knowledge question-answering system construction method according to claim 1, wherein the knowledge data includes structured data and unstructured data;

3. The knowledge question-answering system construction method according to claim 1, wherein the entity is extracted by:

performing regular matching based on rules to identify named entities;

the attribute extraction is performed by the following steps:

4. The knowledge question-answering system construction method according to claim 1, wherein the constructed knowledge graph is stored through a graph database;

5. The knowledge question-answering system construction method according to claim 1, wherein the answer prediction model is a model constructed based on ChatGPT, ChatGLM or Wenxin Yiyan (ERNIE Bot).

6. A knowledge question-answering system construction system for constructing a knowledge question-answering system in a chemical field by the knowledge question-answering system construction method according to any one of claims 1 to 5, the construction system comprising:

7. The knowledge question-answering system construction system according to claim 6, wherein the knowledge data includes structured data and unstructured data;

8. The knowledge question-answering system construction system according to claim 6, wherein the extraction module is configured to extract the entity by:

performing regular matching based on rules to identify named entities;

9. The knowledge question-answering system construction system according to claim 6, wherein the knowledge graph is stored in a graph database;

10. The knowledge question-answering system construction system according to claim 6, wherein the answer prediction model is a model constructed based on ChatGPT, ChatGLM or Wenxin Yiyan (ERNIE Bot).