CN117216227B - Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model - Google Patents

Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model

Info

Publication number
CN117216227B
CN117216227B (application CN202311415557.7A)
Authority
CN
China
Prior art keywords
language model
entity
population
knowledge graph
gpt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311415557.7A
Other languages
Chinese (zh)
Other versions
CN117216227A (en)
Inventor
周泽寻
黄函
朱映辉
王沛涛
陈炫锐
黄伟
杨圣云
苗利明
邱树伟
陈得乐
蔡烨
黄东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Tobacco Chaozhou City Co ltd
Hanshan Normal University
Original Assignee
Guangdong Tobacco Chaozhou City Co ltd
Hanshan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Tobacco Chaozhou City Co ltd, Hanshan Normal University filed Critical Guangdong Tobacco Chaozhou City Co ltd
Priority to CN202311415557.7A
Publication of CN117216227A
Application granted
Publication of CN117216227B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model, comprising the following steps: constructing a knowledge graph, storing the knowledge graph, and answering user consultations with the knowledge graph. The invention employs the GPT language model, fully exploiting the effective semantic perception and strong feature-learning capability of generative language models to effectively extract entity-relations from process clauses. Key knowledge points are retrieved through the graph database's multi-hop traversal over vertices and edges, prompt words are generated automatically in combination with a genetic algorithm, and logically coherent, accurate natural-language replies are generated through the strong in-context learning capability of the GPT language model. Compared with other methods, this method requires no retraining or fine-tuning of the model, which greatly reduces computation cost, lowers development difficulty and cost, and promotes the industrial deployment and rapid iteration of artificial intelligence technology.

Description

Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model
Technical Field
The invention relates to the field of intelligent information processing for tobacco commercial enterprises, and in particular to a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model.
Background
In recent years, tobacco commercial enterprises have actively promoted technological innovation, taking self-reliant science and technology as a strategic support for development, promoting the deep fusion of information technology with the whole industrial chain, and accelerating the intelligent transformation of the industry. As an important part of the tobacco industry chain, tobacco commercial enterprises connect tobacco manufacturers, retail customers and consumers, and operate information systems with numerous businesses and huge volumes of enterprise data.
However, the current tobacco enterprise purchasing process has pain points that severely restrict transaction efficiency. First, the purchasing process involves many specification documents, and it is difficult for purchasing staff to find the required provision in a short time. Second, some provisions in the specification documents are not easy to read and must be interpreted by consulting professionals, which often takes a long time; with a limited number of professionals, it is difficult to meet the consulting needs of all buyers. Finally, the purchasing workflow is complicated: without a complete operation flow chart for guidance, steps are easily omitted; even for a purchase type handled before, a buyer may need consulting again at the next handling because of forgetting, personnel changes, policy changes, and the like.
In view of these problems, how to use artificial intelligence to integrate the data resources of a tobacco enterprise's many business systems and establish an intelligent system suited to purchasing services in the industry, thereby improving the work efficiency and quality of enterprise staff, has become a key research topic. According to the literature, although many question-answering systems have been proposed, the existing schemes are not designed for purchasing problems, or are realized only by simple means such as matching the consulted question against a pre-stored question-answer library; an intelligent information processing system specifically targeting the purchasing-process problems of tobacco enterprises is absent. To overcome these problems, a new intelligent information question-answering method needs to be developed.
Disclosure of Invention
The invention aims to provide a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model, which solves the problems in the prior art.
The aim of the invention is realized by the following technical scheme:
a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model comprises the following steps:
S1, constructing a knowledge graph: the GPT language model (Generative Pre-trained Transformer) is used as the backbone network; the weight of each entity is computed with an attention mechanism and used to predict the probability distribution of entity-relation triples, performing entity recognition and relation classification to realize joint entity-relation extraction, from which the knowledge graph is finally constructed;
step S2, storing the knowledge graph: the knowledge graph obtained in step S1 is stored in a distributed manner using the NebulaGraph graph database;
step S3, answering user consultations: using the knowledge graph stored in step S2, for the question information input by a user, knowledge computation and reasoning are first performed via the multi-hop information-retrieval capability of the graph database and an intermediate result is output; prompt words are then generated automatically by a method based on a genetic algorithm and the GPT language model; finally the natural-language reply is generated.
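For illustration only, a minimal Python sketch of the three-stage flow (S1-S3) follows; every function name in it (build_knowledge_graph, store_graph, answer) is an assumption for exposition, with the S1 extractor and the S3 model call reduced to stubs.

```python
# Illustrative end-to-end sketch of steps S1-S3; not the patented implementation.

def build_knowledge_graph(clauses):
    # S1 stand-in: a real system runs GPT-based joint entity-relation
    # extraction over procurement clauses; here one triple is hard-coded.
    return [("three-work committee", "determines", "centralized purchasing dept")]

def store_graph(triples):
    # S2 stand-in: an adjacency dict in place of a NebulaGraph space.
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, []).append((rel, tail))
    return graph

def answer(question, graph):
    # S3 stand-in: retrieve facts, assemble a prompt, and (in production)
    # send it to the GPT model; here the assembled prompt is returned.
    facts = [f"{h} {r} {t}" for h, edges in graph.items() for r, t in edges]
    return "Answer using only these facts:\n" + "\n".join(facts) + "\nQ: " + question

kg = store_graph(build_knowledge_graph(["..."]))
print(answer("Who determines the centralized purchasing department?", kg))
```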
Further, the step S1 includes:
step S101, using the GPT language model to tokenize and segment the input raw natural-language sequence data to obtain the token sequence S = {e_1, e_2, …, e_n, c}, where e_1 denotes the 1st byte, n is the total number of bytes, and c denotes the category of the token sequence;
and S102, performing entity-relation joint extraction by using a GPT language model.
Further, the step S102 includes:
step S1021, randomly extracting the i-th candidate sequence segment S_i = {e_i, e_{i+1}, …, e_{i+k}} from the token sequence S, where k denotes the segment width, and computing the correlation between the bytes of the segment: α_ij ∈ A = softmax(attn(S_i, S_{i+1})), where α_ij denotes the attention coefficient;
step S1022, reducing the segment by the max-pooling operation to obtain the feature vector f(S_i); constructing an embedding matrix optimized by the back-propagation algorithm and selecting from it the weight vector w_{k+1} of width k+1; concatenating the feature vector and the weight vector to obtain the entity vector e_s = [f(S_i); w_{k+1}], where s denotes a single entity index; concatenating the category c with the entity vector e_s to construct the context-embedded entity X_s = {e_s, c}; predicting the entity type with a softmax classifier to complete entity recognition, ŷ_s = softmax(W_s X_s + b_s), where W_s and b_s denote the weights and bias of the softmax layer, respectively;
step S1023, selecting two adjacent sequence segments s_1, s_2 and applying the max-pooling operation to obtain the context representation c(s_1, s_2); following the entity-vector construction of step S1022, obtaining the entity vectors e_{s1}, e_{s2} of the corresponding segments and concatenating them with the context representation c(s_1, s_2) to obtain the entity-relation pairs X_1 = {e_{s1}, c(s_1, s_2), e_{s2}} and X_2 = {e_{s2}, c(s_1, s_2), e_{s1}};
step S1024, computing an entity-relation score for each of the entity-relation pairs X_1, X_2 with a sigmoid function, y_i = sigmoid(W_r X_i + b_r), i = 1, 2, where W_r and b_r denote the weights and bias of the sigmoid layer; the larger of the two scores is chosen as the final entity-relation provided it is greater than or equal to a set threshold.
Further, in step S1021, the attention coefficient is computed as:
α_ij ∈ A = softmax(attn(S_i, S_{i+1}));
where the softmax function is defined as softmax(z_i) = exp(z_i) / Σ_j exp(z_j);
the attention function attn(S_i, S_{i+1}) takes two different sequence segments as input and is computed as attn(S_i, S_{i+1}) = S_i^T S_{i+1}, where S_i^T denotes the transpose of the sequence segment S_i.
Further, in step S1024, if both score values are smaller than the set threshold, the relation is determined to be unknown.
Further, the step S3 includes:
step S301, processing the question information input by the user in natural language with a word-segmentation tool to obtain a token sequence;
step S302, extracting entity-relation triples from the token sequence by the method of step S102, querying the knowledge graph stored in the NebulaGraph database with the triples as indices, completing knowledge computation and knowledge reasoning with the tools provided by NebulaGraph, and outputting an intermediate result;
step S303, based on the intermediate result, automatically optimizing via a genetic algorithm and the GPT language model to generate the optimal prompt word;
and step S304, performing context learning and analysis processing by using the GPT language model according to the optimal prompt word, and generating natural language reply information.
Further, the step S303 includes:
step S3031, initializing the prompt-word population P_0 = {p_1, …, p_N} from a prompt-word template, where p_1 denotes the 1st prompt word in the population and N denotes the number of prompt words in the population;
step S3032, using the prompt-word population, the prompt-word evaluation benchmark software, the evaluation dataset D and the evaluation function f_D(·) to compute the fitness of each individual in the prompt-word population, forming the initial fitness set S_0 = {s_i = f_D(p_i), i = 1, …, N};
step S3033, using the GPT language model to select any number of individuals from the current population P_t = {p_1, …, p_N} (initially t = 0) as the parent prompt-word population P_r^t = {p_r^1, …, p_r^k}, where t ∈ [1, T] denotes the current iteration count, r marks individuals selected by random sampling, and k < N;
step S3034, using the GPT language model to perform crossover on the individuals of the parent population P_r^t, generating a new prompt-word population P_c^t = {p_c^1, …, p_c^k}, where c marks individuals of the crossover population;
step S3035, using the GPT language model to perform mutation on the individuals of the crossover population P_c^t, generating a new prompt-word population P_m^t = {p_m^1, …, p_m^k}, where m marks individuals of the mutated population;
step S3036, computing the fitness s_i^t = f_D(p_m^i) of the current population with the prompt-word evaluation benchmark software, the dataset and the evaluation function, forming the fitness set of the current population S_t = {s_1^t, …, s_k^t};
step S3037, selecting from the fitness set S_t the individuals whose fitness exceeds a preset threshold to form the new population P_{t+1}, and setting the iteration count t = t + 1;
step S3038, if the current iteration count has reached its maximum, exiting the loop and selecting as the optimal prompt word the individual with the highest fitness in the current population; otherwise returning to step S3033.
Further, in step S3032, the evaluation function f_D(·) is expressed as:
f_D(·) = max Σ_{(x,y)∈D} L{LLM(p + ε, x), y},
where p ∈ P_0 denotes the input prompt word, x and y are respectively a sample and its reference value from the evaluation dataset D, LLM is the GPT language model, ε denotes random noise, and L is a likelihood function.
The intelligent information processing device based on the knowledge graph and the GPT language model comprises a processor and a memory; the memory stores programs or instructions which are loaded and executed by the processor to realize the intelligent information question-answering method of the tobacco enterprises based on the knowledge graph and the large language model.
A computer readable storage medium, wherein a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the method for inquiring and answering the intelligent information of the tobacco enterprises based on the knowledge graph and the large language model is realized.
Compared with the prior art, the invention has the beneficial effects that:
1. The tobacco enterprise intelligent information question-answering method fusing a knowledge graph and a large language model uses the GPT language model to encode the tokens in the language sequence together with their context, computes their weights with an attention mechanism, and uses these weights to compute the probability distribution of entity-relation triples so as to predict the relation of each entity. Although the objects processed are structured/unstructured data whose content consists of logically clear and strictly worded process clauses, the method fully exploits the effective semantic perception and strong feature-learning capability of the generative language model, and can effectively extract entity-relations from the process clauses.
2. The invention applies the idea of graph retrieval-augmented generation to normalize and unify data of different business systems and different forms, eliminating data silos and constructing a tobacco-industry knowledge graph. Key knowledge points are retrieved through the graph database's multi-hop traversal over vertices and edges, prompt words are generated automatically in combination with a genetic algorithm, and logically coherent, accurate natural-language replies are generated through the strong in-context learning capability of the GPT language model. Compared with other methods, this method requires no retraining or fine-tuning of the model, which greatly reduces computation cost, lowers development difficulty and cost, and promotes the industrial deployment and rapid iteration of artificial intelligence technology.
Drawings
Fig. 1 is an overall architecture diagram of a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model.
FIG. 2 shows the knowledge-graph construction of the present invention based on the generative pre-trained Transformer language model.
FIG. 3 shows the database storage of the knowledge graph of the present invention based on NebulaGraph.
FIG. 4 shows the automated prompt-word engineering of the present invention based on a genetic algorithm.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
Example 1
The invention discloses a tobacco enterprise intelligent information question-answering method based on a knowledge graph and a large language model, which comprises the following steps:
S1, constructing a knowledge graph: taking the GPT language model as the backbone network, computing the weight of each entity with an attention mechanism, predicting the probability distribution of entity-relation triples, performing entity recognition and relation classification to realize joint entity-relation extraction, and finally constructing the knowledge graph.
The entity, i.e. the knowledge point, is mainly embodied as a subject and/or object in the input text data, such as a purchasing office, a board of directors, a bidding document, etc.
Relation classification: for example, from the clause "the centralized purchasing department is determined by the three-work management committee", the triple (three-work committee, determines, centralized purchasing department) is the relation classification between the entity "three-work committee" and the entity "centralized purchasing department".
In the knowledge-graph field, practical experience is applied to understand entities and the relations among them, establishing semantically meaningful relations between entities that match the actual situation, i.e. an entity-relation dataset; this process is called knowledge-graph construction.
The specific process of step S1 is as follows:
step S101, first use the GPT language model to tokenize and segment the input raw natural-language sequence data, obtaining the token sequence S = {e_1, e_2, …, e_n, c}, where e_1 denotes the 1st byte, n is the total number of bytes, and c denotes the category of the token sequence.
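For illustration, a sketch of step S101 follows; tiktoken's byte-pair encoder stands in for the patent's GPT tokenizer, and appending the category label c to the id list is an assumed representation of S = {e_1, …, e_n, c}.

```python
# Sketch of step S101 with tiktoken as a stand-in tokenizer (an assumption).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # byte-level BPE used by GPT models
text = "The publicity period for winning candidates shall be no less than 3 days."
byte_ids = enc.encode(text)                  # e_1 ... e_n
c = "procurement_rule"                       # category of the sequence
S = byte_ids + [c]                           # S = {e_1, ..., e_n, c}
print(len(byte_ids), S[-3:])
```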
Step S102, performing entity-relation joint extraction by using a GPT language model, wherein the specific process is as follows:
step S1021, for the token sequence S obtained in step S101, randomly extract the i-th candidate sequence segment S_i = {e_i, e_{i+1}, …, e_{i+k}}, where k denotes the segment width, and compute the correlation between the bytes of the segment with the attention function, where α_ij denotes the attention coefficient:
α_ij ∈ A = softmax(attn(S_i, S_{i+1})) (2)
where the softmax function is defined as:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j) (3)
The attention function attn(S_i, S_{i+1}) takes two different sequence segments as input and is computed as:
attn(S_i, S_{i+1}) = S_i^T S_{i+1} (4)
where S_i^T denotes the transpose of the sequence segment S_i.
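For illustration, a minimal NumPy sketch of equations (2)-(4) follows; representing each segment as a (k+1) x d matrix of byte embeddings is an assumption, under which the product of one segment with the transpose of the other realizes the byte-to-byte dot-product scores behind attn(S_i, S_{i+1}).

```python
# Sketch of equations (2)-(4); dimensions and random embeddings are illustrative.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d, k = 8, 4                                   # embedding dim, segment width
rng = np.random.default_rng(0)
S_i  = rng.normal(size=(k + 1, d))            # segment S_i, one row per byte
S_i1 = rng.normal(size=(k + 1, d))            # adjacent segment S_{i+1}

scores = S_i @ S_i1.T                         # attn(S_i, S_{i+1}) scores
A = softmax(scores)                           # alpha_ij: each row sums to 1
print(A.shape, A.sum(axis=-1).round(3))
```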
Step S1022, for the sequence segment obtained in step S1021, reduce it by the max-pooling operation to obtain the feature vector f(S_i). Construct an embedding matrix optimized by the back-propagation algorithm, and select from it the weight vector w_{k+1} of width k+1. Concatenate the feature vector and the weight vector to obtain the entity vector:
e_s = [f(S_i); w_{k+1}] (5)
where s denotes a single entity index. Taking the context information into account, concatenate the category c with the entity vector e_s to construct the context-embedded entity:
X_s = {e_s, c} (6)
Predict the entity type with a softmax classifier to complete entity recognition:
ŷ_s = softmax(W_s X_s + b_s) (7)
where W_s and b_s denote the weights and bias of the softmax layer, respectively.
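For illustration, a sketch of equations (5)-(7) follows; all dimensions, and the random projections standing in for learned parameters, are assumptions.

```python
# Sketch of equations (5)-(7): max-pool, concatenate, classify with softmax.
import numpy as np

rng = np.random.default_rng(1)
d, k, n_types = 8, 4, 5                      # embed dim, segment width, entity types

seg = rng.normal(size=(k + 1, d))            # attended segment representation
f = seg.max(axis=0)                          # max-pooling -> feature vector f(S_i)
W_embed = rng.normal(size=(16, d))           # embedding matrix (learned in training)
w_k1 = W_embed[k + 1]                        # width-(k+1) weight vector
c_embed = rng.normal(size=d)                 # embedding of category c
e_s = np.concatenate([f, w_k1])              # entity vector, eq. (5)
X_s = np.concatenate([e_s, c_embed])         # context-embedded entity, eq. (6)

W_s = rng.normal(size=(n_types, X_s.size))   # softmax-layer weights
b_s = np.zeros(n_types)                      # softmax-layer bias
logits = W_s @ X_s + b_s                     # eq. (7)
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print("predicted entity type:", probs.argmax())
```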
Step S1023, select two adjacent sequence segments s_1, s_2 and apply the max-pooling operation to obtain the context representation c(s_1, s_2). Following the entity-vector construction of step S1022, obtain the entity vectors e_{s1}, e_{s2} of the corresponding segments and concatenate them with the context representation c(s_1, s_2) to obtain the entity-relation pair.
Note that the front-to-back order of the two entities is undetermined, so two candidate entity-relation pairs are spliced:
X_1 = {e_{s1}, c(s_1, s_2), e_{s2}}, X_2 = {e_{s2}, c(s_1, s_2), e_{s1}} (8)
step S1024, X obtained in step S1023 1 ,X 2 The entity-relationship score values are calculated by a sigmoid function, respectively, namely:
wherein the method comprises the steps ofRepresenting weights and biases in the sigmoid layer. Selecting the largest of the two score values and greater than or equal to a set threshold value as the final entity-relation; and if the two relationships are smaller than the threshold value, judging the relationship as None, namely unknown relationship.
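For illustration, a sketch of equations (8)-(9) and the thresholding rule follows; the vector sizes and random weights are assumptions.

```python
# Sketch of equations (8)-(9): score both orderings of an entity pair.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
e1, e2, ctx = (rng.normal(size=8) for _ in range(3))
X1 = np.concatenate([e1, ctx, e2])           # ordering (e_s1, c, e_s2)
X2 = np.concatenate([e2, ctx, e1])           # ordering (e_s2, c, e_s1)
W_r, b_r = rng.normal(size=X1.size), 0.0     # sigmoid-layer weights and bias

y1, y2 = sigmoid(W_r @ X1 + b_r), sigmoid(W_r @ X2 + b_r)
threshold = 0.5
best = max(y1, y2)
relation = "None (unknown)" if best < threshold else ("X1" if y1 >= y2 else "X2")
print(relation, round(best, 3))
```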
Step S2, storing the knowledge graph: the knowledge graph obtained in step S1 is stored in a distributed manner with the open-source distributed graph database NebulaGraph, and common data operations such as traversal, query and pattern matching are realized through the nGQL query language it provides.
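For illustration, a sketch of step S2 using the official nebula3-python client follows; the space, tag and edge names (tobacco_kg, entity, rel), the vertex ids and the credentials are assumptions, while the client calls themselves (ConnectionPool.init, get_session, execute) are the library's real API.

```python
# Sketch of step S2 with nebula3-python; schema and data are illustrative.
import time
from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

pool = ConnectionPool()
pool.init([("127.0.0.1", 9669)], Config())        # graphd address
session = pool.get_session("root", "nebula")      # default credentials

session.execute("CREATE SPACE IF NOT EXISTS tobacco_kg(vid_type=FIXED_STRING(64))")
time.sleep(10)   # a new space takes effect only after a heartbeat cycle
session.execute("USE tobacco_kg")
session.execute("CREATE TAG IF NOT EXISTS entity(name string)")
session.execute("CREATE EDGE IF NOT EXISTS rel(predicate string)")
time.sleep(10)   # schema changes also need a heartbeat cycle

# store one extracted triple: (three-work committee, determines, purchasing dept)
session.execute('INSERT VERTEX entity(name) VALUES "three_work_committee":("three-work committee")')
session.execute('INSERT VERTEX entity(name) VALUES "purch_dept":("centralized purchasing dept")')
session.execute('INSERT EDGE rel(predicate) VALUES "three_work_committee"->"purch_dept":("determines")')
# the session is left open; the retrieval sketch under step S302 reuses it
```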
Step S3, answering user consultations: for the question information input by a user, the knowledge-graph database built in step S1 is used to perform knowledge computation and reasoning via the multi-hop information-retrieval capability of the graph database, outputting an intermediate result; prompt words are then generated automatically by a method based on a genetic algorithm and the GPT language model; finally, context learning and analysis are performed with the GPT language model to generate the natural-language reply.
Further, the specific implementation process of step S3 is as follows:
step S301, process the question information input by the user in natural language with a word-segmentation tool to obtain a token sequence;
step S302, extract entity-relation triples from the token sequence by the method of step S102, query the knowledge graph stored in the NebulaGraph database with the triples as indices, complete knowledge computation and knowledge reasoning with the tools provided by NebulaGraph, and output an intermediate result.
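For illustration, a sketch of the multi-hop retrieval of step S302 follows, reusing the open session and pool from the step-S2 sketch above; the 1-to-3-step GO pattern and the starting vertex id are assumptions.

```python
# Sketch of step S302: multi-hop traversal from an extracted entity.
result = session.execute(
    'GO 1 TO 3 STEPS FROM "three_work_committee" OVER rel '
    "YIELD properties(edge).predicate AS p, id($$) AS dst"
)
assert result.is_succeeded(), result.error_msg()
facts = [
    (p.as_string(), dst.as_string())              # (predicate, target vertex)
    for p, dst in zip(result.column_values("p"), result.column_values("dst"))
]
print(facts)        # the intermediate result handed to prompt generation in S303

session.release()   # done with the graph
pool.close()
```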
Step S303, based on the intermediate result output in step S302, automatically optimize via a genetic algorithm and the GPT language model to generate the optimal prompt word; the specific flow is as follows (a sketch of the complete loop is given after step S3038):
step S3031, initialize the prompt-word population P_0 = {p_1, …, p_N} from existing prompt-word templates, where p_1 denotes the 1st prompt word in the population and N denotes the number of prompt words in the population.
Step S3032, prepare the prompt-word evaluation benchmark software PromptBench, the evaluation dataset D, and the evaluation function f_D(·) provided by the benchmark. Using the prompt-word population, the benchmark software, the evaluation dataset and f_D(·), compute the fitness of each individual in the prompt-word population to form the initial fitness set S_0 = {s_i = f_D(p_i), i = 1, …, N}.
PromptBench is a benchmark for measuring the robustness of large language models (LLMs) to adversarial prompts.
In this embodiment, the evaluation function is expressed as:
f_D(·) = max Σ_{(x,y)∈D} L{LLM(p + ε, x), y} (10)
where p ∈ P_0 denotes the input prompt word, x and y are respectively a sample and its reference value from the evaluation dataset D, LLM is the generative pre-trained Transformer language model, ε denotes random noise, and L is a likelihood function. The evaluation function means: an individual of the prompt-word population is fed, together with the content data x from the evaluation dataset D, to the GPT language model for text organization into a complete sentence, and the maximum-likelihood value with respect to y is taken as the individual's fitness.
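For illustration, a toy sketch of evaluation function (10) follows; the canned model, the word-insertion noise standing in for ε, and the 0/1 likelihood are all assumptions replacing the PromptBench harness.

```python
# Toy sketch of f_D: score a prompt against a tiny evaluation set.
import random

def llm(prompt: str, x: str) -> str:
    # placeholder GPT: returns a canned answer for the demo sample
    return "no less than 3 days" if "publicity period" in x else "unknown"

def add_noise(p: str) -> str:
    # epsilon: insert a filler token at a random position in the prompt
    words = p.split()
    words.insert(random.randrange(len(words) + 1), "please")
    return " ".join(words)

def fitness(p: str, D: list) -> float:
    # f_D(p) = sum over (x, y) in D of L{LLM(p + eps, x), y}, 0/1 likelihood
    return sum(1.0 for x, y in D if llm(add_noise(p), x) == y)

D = [("How long is the publicity period?", "no less than 3 days")]
print(fitness("You are a procurement assistant; answer per the rules:", D))
```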
Step S3033, using GPT language model to select current population P t ={p 1 ,…,p N Any number of individuals in t=0 is selected as parent prompt word population Wherein t is E [1, T]Representing the current iteration number, r represents the individual selected by random sampling, k < N.
Step S3034, parent population is subjected to GPT language modelThe individuals in the population are subjected to cross calculation to generate a new prompt word population +.>Where c represents the individuals of the cross-calculated population, in order to distinguish from the individuals r selected for random sampling.
Step S3035, obtaining the prompt word population by means of GPT language model through cross calculationThe individual carries out variation calculation to generate a new prompt word population +.> Where m represents the individuals of the variant calculated population in order to distinguish them from the individuals of the cross calculated population.
Step S3036, calculating population by using prompt word evaluation reference software, data set and evaluation function Form the fitness set of the current population +.>
Step S3037, slave fitness set S t Selecting individuals with fitness greater than a preset threshold value to form a new populationThe number of iterations t=t+1.
Step S3038, if the current iteration number reaches the maximum value, exiting the loop, and selecting the optimal prompt word as the individual with the highest adaptability in the current populationOtherwise, the process goes to step S3033 to continue.
Step S304, based on the prompt word generated in step S303, perform context learning and analysis with the GPT language model to generate the natural-language reply.
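For illustration, a sketch of step S304 follows; call_gpt is a placeholder for whichever chat-completion client a deployment actually uses, and the message format is an assumption.

```python
# Sketch of step S304: optimal prompt + retrieved facts -> final reply.
def call_gpt(messages: list) -> str:
    # placeholder model call; a real system would invoke a GPT chat endpoint
    return ("Per the rules, the centralized purchasing department "
            "is determined by the three-work committee.")

def reply(best_prompt: str, facts: list, question: str) -> str:
    messages = [
        {"role": "system", "content": best_prompt},
        {"role": "user", "content": "Known facts:\n" + "\n".join(facts)
                                    + "\nQuestion: " + question},
    ]
    return call_gpt(messages)

print(reply("Answer briefly per procurement rules:",
            ["three-work committee -> determines -> centralized purchasing dept"],
            "Who determines the centralized purchasing department?"))
```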
The embodiment provides an intelligent information processing device based on the knowledge graph and the GPT language model, comprising a processor and a memory; the memory stores programs or instructions which are loaded and executed by the processor to realize the above tobacco enterprise intelligent information question-answering method based on the knowledge graph and the large language model.
The embodiment provides a computer readable storage medium, wherein a program or an instruction is stored on the readable storage medium, and the program or the instruction realizes the intelligent information question-answering method of the tobacco enterprises based on the knowledge graph and the large language model when being executed by a processor.
The above description is for the purpose of illustrating the embodiments of the present invention and is not to be construed as limiting the invention, but is intended to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. The tobacco enterprise intelligent information question-answering method based on the knowledge graph and the large language model is characterized by comprising the following steps of:
S1, constructing a knowledge graph: the GPT language model is used as the backbone network; the weight of each entity is computed with an attention mechanism and used to predict the probability distribution of entity-relation triples, performing entity recognition and relation classification to realize joint entity-relation extraction, from which the knowledge graph is finally constructed;
step S2, storing the knowledge graph: the knowledge graph obtained in step S1 is stored in a distributed manner using the NebulaGraph graph database;
step S3, answering user consultations: using the knowledge graph stored in step S2, for the question information input by a user, knowledge computation and reasoning are first performed via the multi-hop information-retrieval capability of the graph database and an intermediate result is output; prompt words are then generated automatically by a method based on a genetic algorithm and the GPT language model; finally the natural-language reply is generated;
the step S1 includes:
step S101, using the GPT language model to tokenize and segment the input raw natural-language sequence data to obtain the token sequence S = {e_1, e_2, …, e_n, c}, where e_1 denotes the 1st byte, n is the total number of bytes, and c denotes the category of the token sequence;
step S102, performing entity-relation joint extraction by using a GPT language model;
the step S3 includes:
step S301, processing the question information input by the user in natural language with a word-segmentation tool to obtain a token sequence;
step S302, extracting entity-relation triples from the token sequence by the method of step S102, querying the knowledge graph stored in the NebulaGraph database with the triples as indices, completing knowledge computation and knowledge reasoning with the tools provided by NebulaGraph, and outputting an intermediate result;
step S303, based on the intermediate result, automatically optimizing via a genetic algorithm and the GPT language model to generate the optimal prompt word;
and step S304, performing context learning and analysis processing by using the GPT language model according to the optimal prompt word, and generating natural language reply information.
2. The method for tobacco enterprise intelligent information question-answering based on knowledge graph and large language model according to claim 1, wherein the step S102 comprises:
step S1021, randomly extracting the i-th candidate sequence segment S_i = {e_i, e_{i+1}, …, e_{i+k}} from the token sequence S, where k denotes the segment width, and computing the correlation between the bytes of the segment: α_ij ∈ A = softmax(attn(S_i, S_{i+1})), where α_ij denotes the attention coefficient;
step S1022, reducing the segment by the max-pooling operation to obtain the feature vector f(S_i); constructing an embedding matrix optimized by the back-propagation algorithm and selecting from it the weight vector w_{k+1} of width k+1; concatenating the feature vector and the weight vector to obtain the entity vector e_s = [f(S_i); w_{k+1}], where s denotes a single entity index; concatenating the category c with the entity vector e_s to construct the context-embedded entity X_s = {e_s, c}; predicting the entity type with a softmax classifier to complete entity recognition, ŷ_s = softmax(W_s X_s + b_s), where W_s and b_s denote the weights and bias of the softmax layer, respectively;
step S1023, selecting two adjacent sequence segments s_1, s_2 and applying the max-pooling operation to obtain the context representation c(s_1, s_2); following the entity-vector construction of step S1022, obtaining the entity vectors e_{s1}, e_{s2} of the corresponding segments and concatenating them with the context representation c(s_1, s_2) to obtain the entity-relation pairs X_1 = {e_{s1}, c(s_1, s_2), e_{s2}} and X_2 = {e_{s2}, c(s_1, s_2), e_{s1}};
step S1024, computing an entity-relation score for each of the entity-relation pairs X_1, X_2 with a sigmoid function, y_i = sigmoid(W_r X_i + b_r), i = 1, 2, where W_r and b_r denote the weights and bias of the sigmoid layer; the larger of the two scores is chosen as the final entity-relation provided it is greater than or equal to a set threshold.
3. The method for tobacco enterprise intelligent information question-answering based on knowledge graph and large language model according to claim 2, wherein the attention coefficient in step S1021 is computed as:
α_ij ∈ A = softmax(attn(S_i, S_{i+1}));
where the softmax function is defined as softmax(z_i) = exp(z_i) / Σ_j exp(z_j);
the attention function attn(S_i, S_{i+1}) takes two different sequence segments as input and is computed as attn(S_i, S_{i+1}) = S_i^T S_{i+1}, where S_i^T denotes the transpose of the sequence segment S_i.
4. The method for tobacco enterprise intelligent information question-answering based on knowledge graph and large language model according to claim 2, wherein in step S1024, if both score values are smaller than the set threshold, the relation is determined to be unknown.
5. The method for tobacco enterprise intelligent information question-answering based on knowledge graph and large language model according to claim 1, wherein the step S303 comprises:
step S3031, initializing the prompt-word population P_0 = {p_1, …, p_N} from a prompt-word template, where p_1 denotes the 1st prompt word in the population and N denotes the number of prompt words in the population;
step S3032, using the prompt-word population, the prompt-word evaluation benchmark software, the evaluation dataset D and the evaluation function f_D(·) to compute the fitness of each individual in the prompt-word population, forming the initial fitness set S_0 = {s_i = f_D(p_i), i = 1, …, N};
step S3033, using the GPT language model to select any number of individuals from the current population P_t = {p_1, …, p_N} (initially t = 0) as the parent prompt-word population P_r^t = {p_r^1, …, p_r^k}, where t ∈ [1, T] denotes the current iteration count, r marks individuals selected by random sampling, and k < N;
step S3034, using the GPT language model to perform crossover on the individuals of the parent population P_r^t, generating a new prompt-word population P_c^t = {p_c^1, …, p_c^k}, where c marks individuals of the crossover population;
step S3035, using the GPT language model to perform mutation on the individuals of the crossover population P_c^t, generating a new prompt-word population P_m^t = {p_m^1, …, p_m^k}, where m marks individuals of the mutated population;
step S3036, computing the fitness s_i^t = f_D(p_m^i) of the current population with the prompt-word evaluation benchmark software, the dataset and the evaluation function, forming the fitness set of the current population S_t = {s_1^t, …, s_k^t};
step S3037, selecting from the fitness set S_t the individuals whose fitness exceeds a preset threshold to form the new population P_{t+1}, and setting the iteration count t = t + 1;
step S3038, if the current iteration count has reached its maximum, exiting the loop and selecting as the optimal prompt word the individual with the highest fitness in the current population; otherwise returning to step S3033.
6. The method for tobacco enterprise intelligent information question-answering based on knowledge graph and large language model according to claim 5, wherein the evaluation function f_D(·) in step S3032 is expressed as:
f_D(·) = max Σ_{(x,y)∈D} L{LLM(p + ε, x), y},
where p ∈ P_0 denotes the input prompt word, x and y are respectively a sample and its reference value from the evaluation dataset D, LLM is the GPT language model, ε denotes random noise, and L is a likelihood function.
7. The intelligent information processing device based on the knowledge graph and the GPT language model is characterized by comprising a processor and a memory; the memory stores a program or instructions that are loaded and executed by the processor to implement the knowledge graph and large language model-based tobacco enterprise intelligent information question-answering method according to any one of claims 1 to 6.
8. A computer readable storage medium having a program or instructions stored thereon, wherein the program or instructions, when executed by a processor, implement the knowledge-graph and large language model-based tobacco enterprise intelligent information question-answering method according to any one of claims 1 to 6.
CN202311415557.7A 2023-10-30 2023-10-30 Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model Active CN117216227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311415557.7A CN117216227B (en) 2023-10-30 2023-10-30 Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311415557.7A CN117216227B (en) 2023-10-30 2023-10-30 Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model

Publications (2)

Publication Number Publication Date
CN117216227A CN117216227A (en) 2023-12-12
CN117216227B 2024-04-16

Family

ID=89048436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311415557.7A Active CN117216227B (en) 2023-10-30 2023-10-30 Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model

Country Status (1)

Country Link
CN (1) CN117216227B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150178273A1 (en) * 2013-12-20 2015-06-25 Microsoft Corporation Unsupervised Relation Detection Model Training
CN113987209A (en) * 2021-11-04 2022-01-28 浙江大学 Natural language processing method and device based on knowledge-guided prefix fine tuning, computing equipment and storage medium
CN116860949A (en) * 2023-08-21 2023-10-10 人民网股份有限公司 Question-answering processing method, device, system, computing equipment and computer storage medium


Also Published As

Publication number Publication date
CN117216227A (en) 2023-12-12


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant