CN117556010A - Knowledge base and large model-based document generation system, method, equipment and medium - Google Patents
- Publication number: CN117556010A (application CN202311521400.2A)
- Authority: CN (China)
- Prior art keywords: document, knowledge base, text, large model, vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F16/00—Information retrieval; G06F16/30—Information retrieval of unstructured textual data
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344—Query execution using natural language analysis
- G06F16/35—Clustering; Classification
- G06F40/186—Templates (under G06F40/10 Text processing; G06F40/166 Editing)
- G06N5/02—Knowledge representation; Symbolic representation
Abstract
The invention relates to the field of artificial intelligence and discloses a document generation method, system, device, and medium based on a knowledge base and a large model. The method comprises: obtaining configuration information carrying personalized requirements, configuring a knowledge base, a large model, a prompt word template, and a security policy according to the configuration information, and constructing a vertical-domain knowledge base; extracting features from the text corresponding to the question information to determine a question feature vector; checking the question feature vector against the security policy and, after the check passes, matching it against the vertical-domain knowledge base, outputting any document vector whose association with the question feature vector reaches a preset threshold; filling the document vector and the question feature vector into the prompt word template to generate a prompt; and inputting the prompt into the large model for inference to generate an answer document, which is checked again and output once the check passes. The method improves document-writing efficiency, standardization, professionalism, and document quality.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a document generation system, method, device, and medium based on a knowledge base and a large model.
Background
In the related art, daily enterprise operations require a large volume of delivery documents. Traditional document-processing methods are inefficient and involve a large amount of repetitive, low-value work, making it difficult to meet enterprises' goals of reducing cost and increasing efficiency.
However, when a general-purpose large model is used to produce the delivery documents required by enterprise operations, its training corpus belongs to the general domain, so its handling of vertical-domain document content is insufficiently professional. As a result, the output delivery documents match the question information poorly and cannot meet user requirements.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended to neither identify key/critical elements nor delineate the scope of such embodiments, but is intended as a prelude to the more detailed description that follows.
In view of the shortcomings of the prior art, the invention discloses a document generation system, method, device, and medium based on a knowledge base and a large model, to solve the problem that delivered documents match the question information poorly and cannot meet user requirements.
In a first aspect, the present invention provides a document generation method based on a knowledge base and a large model, comprising: obtaining configuration information carrying personalized requirements, configuring the knowledge base, the large model, the prompt word template, and the security policy according to the configuration information, constructing a vertical-domain knowledge base, and deploying the large model together with the vertical-domain knowledge base, the prompt word template, and the security policy; obtaining input question information, extracting features from the text corresponding to the question information, and determining a question feature vector; performing a first check on the question feature vector according to the security policy and, after the check passes, matching the question feature vector against the vertical-domain knowledge base; if a document vector whose association with the question feature vector reaches a preset threshold is matched in the vertical-domain knowledge base, outputting that document vector; filling the document vector and the question feature vector into the prompt word template to generate a prompt; and inputting the prompt into the large model for inference to generate an answer document, performing a second check on the answer document according to the security policy, and outputting it after the check passes.
In an embodiment of the first aspect, obtaining configuration information carrying personalized requirements and configuring the knowledge base, the large model, the prompt word template, and the security policy accordingly comprises: obtaining configuration information that includes first, second, third, and fourth configuration information directed, respectively, at the knowledge base, the security policy, the prompt word template, and the large model; configuring the knowledge base according to the first configuration information to construct the vertical-domain knowledge base, the first configuration information comprising at least one of: knowledge base classification, name, local file path, external links, and text segmentation strategy; configuring the security policy according to the second configuration information to generate a security policy with question checking and content checking, the second configuration information comprising at least one of: security policy switching, sensitive-word filtering, and encryption; configuring the prompt word template according to the third configuration information to generate a prompt word template containing a question prompt statement for the question information, the third configuration information comprising at least one of: prompt word template classification, name, and the template itself; and configuring the large model according to the fourth configuration information so that the large model works with the knowledge base, the security policy, and the prompt word template, the fourth configuration information comprising at least one of: large model classification, name, link information, queries-per-second limit, and data size.
In an embodiment of the first aspect, constructing the vertical-domain knowledge base comprises: determining a local knowledge base based on the knowledge base classification, name, and local file path, or through an external link; converting the document data in the local knowledge base into text and segmenting the text according to a text segmentation strategy to obtain text in a preset format, the text segmentation strategy comprising at least one of: paragraph, line, and word; and converting the preset-format text into document vectors and storing them in a vector database that supports a vector search engine, thereby forming the vertical-domain knowledge base.
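The construction pipeline above (split text per the configured strategy, vectorize each chunk, index the vectors) can be sketched as follows. This is a minimal illustration: the hash-based `embed` stands in for a real text-embedding model, and the in-memory list stands in for a vector database with a vector search engine.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy deterministic bag-of-words embedding; a real system would call
    # an embedding model here instead.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_knowledge_base(documents: list[str], strategy: str = "paragraph"):
    """Segment each document per the configured strategy and index the vectors."""
    index = []  # stand-in for a vector database
    for doc in documents:
        if strategy == "paragraph":
            chunks = [p.strip() for p in doc.split("\n\n") if p.strip()]
        elif strategy == "line":
            chunks = [ln.strip() for ln in doc.splitlines() if ln.strip()]
        else:  # word-level fallback
            chunks = doc.split()
        for chunk in chunks:
            index.append({"text": chunk, "vector": embed(chunk)})
    return index
```

The unit-normalized vectors make later similarity matching a plain dot product.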
In an embodiment of the first aspect, checking with the security policy further comprises: when a question feature vector corresponding to the question information is received, performing a first check against a preset sensitive-word lexicon to determine whether it involves sensitive-word information; if it does not, the check passes, and if it does, the check fails. When an answer document is received, performing a second check to determine whether the answer document involves sensitive-word information; if it does not, the check passes and the large model supplements the current answer document for completeness until the final answer document is output, and if it does, the check fails.
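Both the first check (on the question) and the second check (on the answer document) reduce to the same lexicon lookup, so one routine can serve both. A minimal sketch, with a hypothetical two-entry lexicon standing in for the preset sensitive-word lexicon:

```python
# Hypothetical lexicon; a real deployment would load the configured
# sensitive-word lexicon from the security policy.
SENSITIVE_WORDS = {"secret_project", "internal_password"}

def sensitive_check(text: str, lexicon=SENSITIVE_WORDS):
    """Return (passed, hits): passed is True only if no sensitive word appears."""
    hits = [w for w in lexicon if w in text.lower()]
    return (len(hits) == 0, hits)
```

A failed first check short-circuits retrieval; a failed second check triggers desensitization or regeneration of the answer document.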
In an embodiment of the first aspect, outputting the final answer document further comprises: generating a public/private key pair with a preset asymmetric encryption algorithm, configuring the public key on the large model side and the private key on the target terminal; signing the document to be supplemented with the private key and uploading the resulting first encrypted message to the large model; the large model verifying the signature, responding to the first encrypted message, generating the final answer document, and feeding a second encrypted message back to the target terminal; and extracting, arranging, and packaging the second encrypted message to obtain and output the final answer document.
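The sign-then-verify exchange can be illustrated with textbook RSA. This sketch uses deliberately tiny primes so it runs without any external library; it shows only the shape of the scheme (terminal signs with the private key, model verifies with the public key) and must never be used for real security.

```python
# Textbook RSA with tiny primes -- illustration only, not secure.
import hashlib

P, Q = 61, 53
N = P * Q                      # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                         # public exponent (configured on the large model)
D = pow(E, -1, PHI)            # private exponent (configured on the terminal)

def sign(message: bytes) -> int:
    """Target terminal: sign the document digest with the private key."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % N
    return pow(digest, D, N)

def verify(message: bytes, signature: int) -> bool:
    """Large model side: check the signature with the public key."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % N
    return pow(signature, E, N) == digest
```

In practice the same pattern would use a real library (for example RSA or ECDSA from the `cryptography` package) with full-size keys.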
In an embodiment of the first aspect, determining whether sensitive-word information is involved further comprises: if the question feature vector contains a sensitive word, identifying the type of that sensitive word, determining a preset sensitive-response text according to that type, and responding with that text; if the question feature vector contains no sensitive word, triggering matching of the question feature vector against the vertical-domain knowledge base; if the answer document contains a sensitive word, desensitizing the answer document to filter it out; and if the answer document contains no sensitive word, triggering output of the answer document.
In an embodiment of the first aspect, the knowledge base and the large model are deployed with a containerized microservice architecture, and the services are split, following the single-responsibility principle, into a data collector, a management back-end server, and a proxy server, wherein the large models externally accessed by the proxy server comprise at least one of: ChatGPT, Wenxin Yiyan, and Tongyi Qianwen. The large model creates a thread pool to buffer requests containing question information, and the requests are processed through an asynchronous queue. When the text of a request exceeds the large model's preset size threshold, the text is split by paragraph so that one question results in multiple request calls; the partial answer documents returned are cached and then spliced together into a single answer.
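The split-call-splice flow for oversized requests can be sketched as below. The size limit and `fake_llm` are placeholders: the limit would come from the configured large model, and `fake_llm` stands in for the remote model call made through the proxy server.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 200  # hypothetical per-request size limit of the large model

def fake_llm(prompt: str) -> str:
    # Stand-in for a remote large-model call through the proxy server.
    return f"[answer to {len(prompt)} chars]"

def ask(question_text: str) -> str:
    """One question; possibly many model calls, spliced into one answer."""
    if len(question_text) <= MAX_CHARS:
        return fake_llm(question_text)
    # Split by paragraph, issue one buffered call per part, then splice.
    parts = [p for p in question_text.split("\n\n") if p]
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_answers = list(pool.map(fake_llm, parts))
    return "\n".join(partial_answers)
```

`ThreadPoolExecutor` plays the role of the thread pool that buffers requests; a production service would use a persistent asynchronous queue instead of an inline pool.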
In an embodiment of the first aspect, constructing the vertical-domain knowledge base comprises: obtaining document data from the local knowledge base in advance, splitting it into a plurality of text fragments, and generating a text vector for each fragment; and establishing the vertical-domain knowledge base from the correspondence between text fragments and text vectors. The splitting proceeds by dividing the document data into text blocks according to a preset granularity, determining split positions within each block according to preset split characters, and splitting the document data into text fragments at those positions.
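The two-stage splitting above (fixed-size blocks first, then backing off to a split character) can be sketched as follows; the block size and separator set are illustrative stand-ins for the configured granularity and split characters.

```python
def split_document(text: str, block_size: int = 100,
                   separators: tuple[str, ...] = ("。", ".", "\n")) -> list[str]:
    """Cut text into blocks of at most block_size, ending each fragment
    at the nearest preset split character when one exists."""
    fragments, start = [], 0
    while start < len(text):
        end = min(start + block_size, len(text))
        if end < len(text):
            # Back off to the last separator inside the block so the
            # fragment ends on a sentence or line boundary.
            cut = max(text.rfind(s, start, end) for s in separators)
            if cut > start:
                end = cut + 1
        fragments.append(text[start:end])
        start = end
    return fragments
```

Fragments produced this way never split a sentence mid-way unless a block contains no separator at all.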
The invention provides a document generation system based on a knowledge base and a large model, comprising: a configuration module for obtaining configuration information carrying personalized requirements, configuring the knowledge base, the large model, the prompt word template, and the security policy according to the configuration information, constructing a vertical-domain knowledge base, and deploying the large model together with the vertical-domain knowledge base, the prompt word template, and the security policy; a question determining module for obtaining input question information, extracting features from the corresponding text, and determining a question feature vector; a document vector determining module for performing a first check on the question feature vector according to the security policy, matching it against the vertical-domain knowledge base after the check passes, and outputting a document vector if one whose association with the question feature vector reaches a preset threshold is matched; a prompt word determining module for filling the document vector and the question feature vector into the prompt word template to generate a prompt; and a document generation module for inputting the prompt into the large model for inference, generating an answer document, performing a second check on the answer document according to the security policy, and outputting it after the check passes.
The invention provides an electronic device, comprising a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so as to cause the electronic device to perform the above method.
The present invention provides a computer-readable medium having stored thereon a computer program which causes a computer to perform the above method.
The invention has the beneficial effects that:
By pre-configuring the knowledge base to construct the vertical-domain knowledge base, personalizing the large model and the knowledge base according to the user's business needs, and combining the two, a user can create a project document with one-click operation and have the required content generated automatically, which greatly improves the efficiency and quality of automatic document generation and satisfies user requirements. Meanwhile, by configuring the prompt word template and the security policy and driving the large model with the vertical-domain knowledge base, the prompt word template, and the security policy together, knowledge-base data processing and document generation from user question information are automated, the whole process is standardized, and the required project files can be obtained without excessive user involvement, effectively improving document-writing efficiency, standardization, professionalism, and document quality.
Drawings
FIG. 1 is a flow diagram illustrating a knowledge base and large model based document generation method in accordance with an illustrative embodiment of the present invention;
FIG. 2 is a schematic diagram of a personalized configuration flow shown in an exemplary embodiment of the invention;
FIG. 3 is a flow chart of a method for automatically generating documents according to an exemplary embodiment of the present invention;
FIG. 4 is a schematic diagram of a prompt configuration flow shown in an exemplary embodiment of the invention;
FIG. 5 is a schematic diagram illustrating a structure for verification using security policies in accordance with an exemplary embodiment of the present invention;
FIG. 6 is a schematic diagram of an architecture of a knowledge base and large model based document generation system, as illustrated in an exemplary embodiment of the invention;
FIG. 7 is a schematic diagram of a knowledge base and large model based document generation system in accordance with an illustrative embodiment of the invention;
fig. 8 is a schematic diagram of a computer system suitable for use in implementing the electronic device of the present invention, as shown in an exemplary embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block-diagram form rather than in detail, to avoid obscuring the embodiments of the invention.
The terms "first", "second", and the like in the description, the claims, and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate when describing embodiments of the present disclosure. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. The term "plurality" means two or more, unless otherwise indicated. In the embodiments of the present disclosure, the character "/" indicates that the objects before and after it are in an "or" relationship; for example, A/B represents A or B. The term "and/or" describes an associative relationship covering three cases; for example, "A and/or B" represents: A alone, B alone, or both A and B.
Referring to fig. 1, a flowchart of a document generating method based on a knowledge base and a large model according to an exemplary embodiment of the invention is shown. Referring to fig. 1, in an exemplary embodiment, the method for generating a document based on a knowledge base and a large model at least includes steps S101 to S105, which are described in detail as follows:
step S101, configuration information carrying personalized requirements is obtained, the knowledge base, the large model, a prompt word template and a security policy are respectively configured according to the configuration information, a vertical domain knowledge base is constructed, and the large model is output based on the vertical domain knowledge base, the prompt word template and the security policy;
the configuration information with personalized requirements is obtained in advance to meet the service requirements of users, for example, a local knowledge base is configured according to the requirements of the users to form a vertical field knowledge base, meanwhile, a prompt word template and a security policy are configured, so that the large model adopts the security policy to carry out security verification before the question information is input, the question information and a document vector which is matched with the question information in the vertical field knowledge base are filled into the prompt word template, prompt words are generated and input into the large model to carry out reasoning, answer documents are determined, and the security verification is carried out on the answer documents by adopting the security policy, and the answer documents can be output after verification is passed.
Specifically, the knowledge base, the large model, the prompt word template, and the security policy can each be configured using AIGC, ensuring flexible personalized interface configuration, full-process automation, templated prompt words, a security guarantee mechanism, and high service availability and scalability based on a containerized microservice architecture.
AIGC refers to AI-generated content, an emerging direction in artificial intelligence in recent years. By learning from large amounts of data and knowledge, it can automatically generate new and creative content, and it is applied in fields such as natural language processing, image generation, and audio synthesis.
Step S102, obtaining input question information, extracting features from the text corresponding to the question information, and determining a question feature vector;
Specifically, the question information may be text, voice, image, or video. For example, a target terminal collects the user's voice, where the target terminal comprises at least one of: a smartphone, a tablet, a laptop, or a desktop computer.
For example, after the voice information is converted into text, features are extracted from the text to obtain the question feature vector. Voice input is captured through a recording or voice-input device and converted into text with a speech-recognition engine. The converted text is preprocessed with operations such as word segmentation, stop-word removal, and removal of special symbols, in preparation for feature extraction. Features such as term frequency, word length, part of speech, and named entities are then extracted from the preprocessed text and may be weighted with a method such as TF-IDF (Term Frequency-Inverse Document Frequency); finally, the extracted features are converted into numerical form, producing the question feature vector.
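The TF-IDF weighting step can be sketched in a few lines. This is a minimal pure-Python version (a real pipeline would more likely use something like scikit-learn's `TfidfVectorizer`), operating on already-tokenized questions:

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """docs: tokenized texts; returns one sparse TF-IDF vector per text."""
    n = len(docs)
    # Document frequency: in how many texts does each term appear?
    df = Counter(t for doc in docs for t in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF weight = (term frequency) * log(n / document frequency)
        out.append({t: (c / len(doc)) * math.log(n / df[t])
                    for t, c in tf.items()})
    return out
```

Terms appearing in every text get weight zero, so only discriminative terms contribute to the question feature vector.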
Step S103, performing a first check on the question feature vector according to the security policy; after the check passes, matching the question feature vector against the vertical-domain knowledge base; and if a document vector whose association with the question feature vector reaches a preset threshold is matched in the vertical-domain knowledge base, outputting that document vector;
specifically, the word expressed by the problem feature vector is matched by using a preset sensitive word bank, if the problem feature vector is not matched with any sensitive word in the preset sensitive word bank, the verification is passed, otherwise, the verification is not passed.
Specifically, the similarity between the question feature vector and each document vector is calculated to obtain a vector similarity score for each document vector; document vectors associated with the question feature vector are then determined from the preset document-vector library according to those scores, and any document vector whose association reaches the preset threshold is taken as the document vector to be output.
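The matching step can be sketched with cosine similarity, a common choice for this kind of vector comparison (the source does not name a specific similarity measure, so cosine here is an assumption), plus the preset threshold:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match(question_vec, doc_vectors, threshold=0.8):
    """Indices of document vectors whose similarity reaches the threshold,
    best match first."""
    scored = [(cosine(question_vec, d), i) for i, d in enumerate(doc_vectors)]
    return [i for s, i in sorted(scored, reverse=True) if s >= threshold]
```

A real deployment would delegate this scan to the vector database's search engine rather than compare vectors in Python.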
Optionally, if the verification is not passed, a preset template is called to directly respond to the problem feature vector.
Step S104, filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
specifically, taking a document vector and a problem feature vector as inputs, and determining a word or phrase to be filled according to the positions of the vectors in a prompt word template; if the prompt word template contains placeholders or variables, the corresponding vocabulary or phrase is used for replacement.
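Placeholder substitution of this kind reduces to ordinary string formatting; a sketch with an abbreviated, assumed template:

```python
PROMPT_TEMPLATE = (
    "Known information: {context}. "
    "Briefly and professionally answer the user's question "
    "according to the above known information. The question is: {query}"
)

def build_prompt(context_text, query_text, template=PROMPT_TEMPLATE):
    """Fill the {context} and {query} placeholders of the prompt word template."""
    return template.format(context=context_text, query=query_text)

prompt = build_prompt("ERP system V1.0, online on 2023-10-16",
                      "Write an ERP system online scheme")
```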
It should be noted that when filling the hint word template, the name, type, style and content of the document should be considered to ensure that the generated hint word remains consistent with the original document; in addition, other techniques may be used to optimize the generation of the hint words, such as preprocessing the text using natural language processing techniques, analyzing the text using rules or machine learning models, and so forth. For example, user intent, target task, etc. are fed back in the prompt.
In this manner, a cue word having a particular content and format may be generated for use in subsequent text generation or other tasks.
Step S105, inputting the prompt words into the large model for reasoning, generating an answer document, performing secondary verification on the answer document according to the security policy, and outputting after the verification is passed.
Specifically, the large model processes, for example, encodes or converts, the sequence corresponding to the input prompt word to generate a corresponding output sequence, and the output sequence generates a final answer document after decoding or converting. In addition, the generated answer document is subjected to secondary verification by using a security policy, wherein the security policy can comprise aspects of content filtering, content integrity verification, language style inspection, grammar correctness verification and the like, and if the answer document does not meet the requirement of the security policy, the answer document needs to be correspondingly corrected or regenerated until the answer document passes the verification, and the answer document is output.
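The verify-or-regenerate loop described above can be sketched as follows, with a stub standing in for the large model and a toy lexicon for the content filter (both are assumptions for illustration):

```python
def secondary_check(answer, lexicon):
    """Content-filtering side of the security policy: pass only if no sensitive word appears."""
    return not any(word in answer.lower() for word in lexicon)

def generate_with_verification(generate, lexicon, max_retries=3):
    """Call the model and regenerate until the answer passes the secondary verification."""
    for _ in range(max_retries):
        answer = generate()
        if secondary_check(answer, lexicon):
            return answer
    return None  # every attempt failed the security policy

# stub model: leaks a sensitive word once, then produces a clean answer
answers = iter(["the secret rollout plan", "a clean rollout plan"])
result = generate_with_verification(lambda: next(answers), {"secret"})
```

A full security policy would add the content-integrity, language-style and grammar checks named above as further predicates in `secondary_check`.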
By the above method, the vertical domain knowledge base is built through the pre-configured knowledge base, the large model and the knowledge base are configured individually according to the service processing requirements of users, and the vertical domain knowledge base and the large model are combined, so that a user can quickly create a project document with one-key operation and the required document content is automatically generated, which greatly improves the efficiency and quality of automatic document generation and meets user requirements; meanwhile, through configuration of the prompt word template and the security policy, the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy, so that automatic knowledge base data processing and automatic document generation according to user question information are realized, the whole process is programmatically standardized, the required project files can be obtained without excessive user participation, and document writing efficiency, standardization, professionalism and document quality are effectively improved.
Referring to fig. 2, a personalized configuration flow chart is shown in an exemplary embodiment of the present invention, and is described in detail as follows: the method for obtaining configuration information carrying personalized requirements, and respectively configuring the knowledge base, the big model, the prompt word template and the security policy according to the configuration information comprises the following steps:
acquiring configuration information carrying personalized requirements, wherein the configuration information comprises first configuration information, second configuration information, third configuration information and fourth configuration information which are sequentially aimed at the knowledge base, the security policy, the prompt word template and the large model;
for example, in fig. 2, the configuration is performed in the order of knowledge base, large model (i.e., ChatGPT), prompt word template and security policy; it should be noted that no particular configuration order is required when configuring the knowledge base, the security policy, the prompt word template and the large model.
Configuring the knowledge base according to the first configuration information to construct a vertical domain knowledge base, wherein the first configuration information comprises at least one of the following: knowledge base classification, name, local file path, external links, text segmentation strategy;
Configuring the security policy according to the second configuration information to generate the security policy with problem security check and content security check; the second configuration information includes at least one of: security policy switching, sensitive word filtering and encryption;
configuring the prompt word template according to the third configuration information, and generating the prompt word template containing the question prompt word statement aiming at the question information, wherein the third configuration information comprises at least one of the following components: the prompting word template is classified, named and prompting word template;
For example, a prompt word template example: "Known information: {context}. Briefly and professionally answer the user's question according to the above known information. If no answer can be obtained from it, please answer: 'The question cannot be answered based on known information' or 'Sufficient relevant information is not provided'. Adding fictional content to the answer is not allowed, and the answer must be in Chinese. The question is: {query}". The {context} part is filled with the text retrieved from the knowledge base, and {query} is filled with the question queried by the user.
Configuring the large model according to the fourth configuration information so as to enable the large model to be combined with the knowledge base, the security policy and the prompt word template, wherein the fourth configuration information comprises at least one of the following: large model classification, name, link information, query rate limit per second, data size.
In this embodiment, the knowledge base, the large model, the prompt word template and the security policy are configured according to the personalized requirement configuration information: the knowledge base is configured according to the first configuration information to construct the vertical domain knowledge base; the security policy is configured according to the second configuration information to generate a security policy with question security verification and content security verification; the prompt word template is configured according to the third configuration information to generate a prompt word template containing question prompt word sentences for the question information; and the large model is configured according to the fourth configuration information so that it can be combined with the knowledge base, the security policy and the prompt word template to generate answers. In general, the corresponding configuration can be performed according to the personalized requirements of the user, so as to provide more accurate and safe answers to questions.
Referring to fig. 3, a flowchart of a method for automatically generating a document according to an exemplary embodiment of the present invention is shown, including:
A. Knowledge base data preprocessing
1) Loading and reading documents
The knowledge base documents stored locally are read by means of batch processing and converted into text format.
2) Text segmentation
Segmentation is performed according to the segmentation strategy configured above, for example: the text is divided by classification, paragraph, text line, etc., to obtain each part of the text.
B. Text vectorization storage
1) Text vectorization
The segmented text is converted into a numerical vector so as to facilitate the subsequent text similarity calculation, thereby retrieving the text related to the problem.
2) Vector storage
The text vectorized data is stored in an Elasticsearch vector database.
Examples: sentence: "write one ERP System Online scheme"
Word segmentation: word segmentation result of sentences: [ "write", "one copy", "ERP System", "on line", "scheme" ] the system is written to the user
Constructing a dictionary: { "write": "0", "one": "0", "ERP system": "3", "online": "2", "scheme": "0" }
Vector representation: the vectors of the sentence are expressed as: [0 0 3 2 0];
vector storage: these vector data are stored in an elastic search vector database.
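The toy example above maps each segmented word to a fixed dictionary weight; a literal sketch of that lookup (the weights are the ones shown, stored here as integers so arithmetic works):

```python
# dictionary weights mirroring the example above (integers rather than strings)
WEIGHTS = {"write": 0, "one copy": 0, "ERP system": 3, "online": 2, "scheme": 0}

def sentence_to_vector(tokens, weights):
    """Map each token of the segmented sentence to its dictionary weight; unknown tokens get 0."""
    return [weights.get(tok, 0) for tok in tokens]

vec = sentence_to_vector(["write", "one copy", "ERP system", "online", "scheme"], WEIGHTS)
```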
C. Safe filtering and problem vectorization
And carrying out safety verification aiming at the problem input by the user, and judging whether sensitive information is related or not according to the safety strategy configuration information. If not, the query question is converted into a semantic vector by using the vectorization processing mode which is the same as that of the knowledge base text, and the semantic vector is used for calculating the similarity between the question and the knowledge base text.
D. Text similarity matching
Referring to fig. 4 in detail, the text vectors closest to the problem vector are found by calculation methods such as cosine similarity and Euclidean distance, so as to find the top k texts most relevant to the problem in the vertical domain knowledge base.
E. Generating a Prompt word statement
And selecting relevant contents from texts most relevant to the questions, combining the relevant contents with the questions, and filling the contents into a Prompt word template to form the Prompt words in a Prompt sentence format.
Prompt word sentence example: "Known information: {Project name: ERP system; Version number: V1.0; Mirror version: erp-server-V1.0; Online functional modules include: financial accounting, project management, finance management, etc.; Team members: Zhang San, Li Si, Wang Wu; Online time: 2023-10-16, 20:00}.
Based on the above known information, briefly and professionally answer the user's question. If no answer can be obtained from it, please answer: 'The question cannot be answered based on known information' or 'Sufficient relevant information is not provided'. Adding fictional content to the answer is not allowed, and the answer must be in Chinese. The question is: {Write one copy of an ERP system online scheme}".
F. Automatically generating document content
And acquiring the ChatGPT link information from the configuration information, carrying out dialogue by using an OpenAI interface provided by the ChatGPT, loading the Prompt word, transmitting the Prompt word to the ChatGPT, and returning the generated answer.
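Without making a live call, the request that would be sent over such an interface can be sketched as a plain JSON body; the model name and parameters here are assumptions, and the actual link information comes from the configuration:

```python
import json

def build_chat_request(prompt, model="gpt-3.5-turbo", temperature=0.2):
    """Assemble a chat-completion style request body carrying the Prompt word."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("Known information: ... The question is: "
                             "write an ERP system online scheme")
body = json.dumps(payload)  # this body would be POSTed to the configured ChatGPT link
```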
G. Content security filtering and integrity verification
The generated answer content is checked according to the configured sensitive word service to judge whether it contains sensitive information; if so, desensitization processing is performed; if not, subsequent processing is carried out directly. The related content is also returned to other configured ChatGPT channels for supplementary confirmation, so as to ensure the integrity of the generated answer.
H. Returning an answer
And encrypting the answer document and transmitting the encrypted answer document to the target terminal, and returning the complete document content to the user.
For example, the answer document generated in the above manner is as follows:
ERP system online scheme:
1. project introduction
Project name: ERP system
Version number: v1.0
Mirror version: erp-server-V1.0
Covered functional modules: accounting, project management, finance management, etc
Project team: zhang San, lifour, wang Wu
2. System testing
Before formally online, the system is subjected to comprehensive testing, including but not limited to functional testing of modules such as financial accounting, project management, financial management and the like, so that the stability and the accuracy of the system are ensured.
3. Staff training
The training method for the ERP system is used for organizing staff, and mainly comprises the contents of basic operation, business flow, data management and the like of a new system, wherein the training aims to ensure the familiarity and effective use of the new system by the staff.
4. Data migration
The data of the old system is migrated to the new ERP system, and the backup and verification of the data are performed to ensure the integrity and accuracy of the data.
5. System switching
After the system test is completed, the data migration is completed and the staff training is completed, the system switching is formally performed, namely, the service is switched from the old system to the new ERP system.
6. On-line support
After online, providing technical support for a period of time to solve the problems possibly encountered by a new system in operation; meanwhile, feedback of users is collected, and the system is continuously optimized and upgraded.
7. Project assessment
After the system is online, the project is evaluated, including the stability of the system, the accuracy of data, the use condition of staff and the like, so as to know the actual effect and influence of the system.
8. Risk management
Throughout the project, risk management, including risk identification, risk assessment and risk management, will continue to be performed to ensure successful performance of the project.
9. Time to line
The specific online time will be determined based on the business and team readiness, with the goal of going online when the business impact is minimal.
To sum up, in this embodiment, the business operation efficiency of the company is improved and the smooth proceeding of the project is ensured by the online of the ERP (enterprise resource planning) system.
In some embodiments, building the vertical domain knowledge base includes:
determining a local knowledge base based on the knowledge base classification, the knowledge base name and the local file path, or determining the local knowledge base through an external link;
converting the document data in the local knowledge base into text information, and segmenting the text information according to a text segmentation strategy to obtain a preset format text, wherein the text segmentation strategy comprises at least one of the following steps: paragraph, text line, word;
and converting the text in the preset format into document vectors, and storing the document vectors into a vector database that supports vector search, to form the vertical domain knowledge base.
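A minimal sketch of the text segmentation strategies named above (paragraph, text line, word), using only the standard library; the sample document is an illustration:

```python
def split_text(text, strategy="paragraph"):
    """Split raw document text according to the configured segmentation strategy."""
    if strategy == "paragraph":
        parts = text.split("\n\n")
    elif strategy == "line":
        parts = text.splitlines()
    else:  # word-level segmentation
        parts = text.split()
    return [p.strip() for p in parts if p.strip()]

doc = "Chapter 1 intro.\n\nChapter 2 details.\nMore details."
paragraphs = split_text(doc, "paragraph")
lines = split_text(doc, "line")
```

Each resulting fragment would then be vectorized and written to the vector database as described.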
In this embodiment, the preset-format text is converted into document vectors and stored into a vector database that supports a vector search engine (e.g., Elasticsearch) to form the vertical domain knowledge base. In this way, document data can be effectively extracted from the determined local knowledge base, converted into preset-format text and document vectors, and then stored into the vector database to construct the vertical domain knowledge base, thereby providing accurate and comprehensive information for subsequent question answering or other tasks.
In some embodiments, the verifying using the security policy further comprises:
if the problem feature vector corresponding to the question information is received, checking the problem feature vector once according to a preset sensitive word stock, determining whether the problem feature vector relates to sensitive word information, if the problem feature vector does not relate to the sensitive word information, checking to pass, and if the problem feature vector relates to the sensitive word information, checking to fail;
and if the answer document is received, carrying out secondary verification on the answer document, determining whether the answer document relates to sensitive word information, if the answer document does not relate to the sensitive word information, passing the verification, carrying out integrity supplement on the current answer document by utilizing the large model until the final answer document is output, and if the answer document relates to the sensitive word information, not passing the verification.
In some embodiments, until the final answer document is output, further comprising:
generating a public key and private key pair based on a preset asymmetric encryption algorithm, configuring a public key in the public key and private key pair in the large model, and configuring a private key in the public key and private key pair in a target terminal;
Signing the document to be supplemented with the answer based on the private key, and generating a first encrypted message to upload to the large model;
the large model carries out signature verification and response on the first encrypted message, generates a final answer document, generates a second encrypted message and feeds the second encrypted message back to the target terminal; and extracting, arranging and packaging the second encrypted message to obtain a final answer document, and outputting the final answer document.
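The sign-then-verify flow can be illustrated with a textbook RSA pair over tiny primes (for exposition only; a real deployment would use a vetted cryptography library and full-size keys, and the message text is an assumption):

```python
import hashlib

# toy RSA key pair with tiny primes -- illustration only, never real security
p, q = 61, 53
n = p * q          # modulus 3233
e = 17             # public exponent, configured on the large-model side
d = 2753           # private exponent held by the target terminal (17 * 2753 % 3120 == 1)

def sign(message: bytes) -> int:
    """Target terminal signs the to-be-supplemented answer document with the private key."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(digest, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Large-model side checks the signature with the public key before responding."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == digest

sig = sign(b"answer document to supplement")
ok = verify(b"answer document to supplement", sig)  # True for the untampered message
```

Any tampering with the forwarded message changes its digest, so verification fails at whichever link the data was altered, which is the tamper-evidence property claimed above.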
Through the mode, the encryption and decryption processes of the data are realized, and meanwhile, the private key is used for signing the first encrypted message, so that the source and the integrity of the data can be ensured; the method and the device solve the problem that the forwarding data is tampered, and can be checked no matter in which link the data is tampered, so that the authenticity and the integrity of the final answer document are ensured, and the safety of safety check is improved.
In some embodiments, determining whether sensitive word information is involved further comprises:
if the problem feature vector contains sensitive words, identifying the types of the sensitive words in the problem feature vector, determining preset sensitive texts according to the types of the sensitive words, and responding according to the preset sensitive texts; if the problem feature vector does not contain the sensitive word, triggering the problem feature vector to be matched in a vertical domain knowledge base;
If the answer document contains sensitive words, desensitizing the answer document to filter the sensitive words; and if the answer document does not contain sensitive words, triggering output of the answer document.
Specifically, the preset sensitive word library is a preset word library, wherein various possible sensitive words or phrases are contained in the preset word library, and the sensitive word library is used for checking whether the input text feature vector contains the sensitive words or not.
If the problem feature vector contains sensitive words, identifying the types of the sensitive words in the problem feature vector, determining preset sensitive texts according to the types of the sensitive words, and responding according to the preset sensitive texts;
specifically, if the question feature vector contains a sensitive word, identifying a sensitive word category in the text feature vector, and determining a preset sensitive text according to the identified sensitive word category; then, responding according to the preset sensitive text.
And if the problem feature vector does not contain the sensitive word, triggering the problem feature vector to be matched in a preset document vector library.
Specifically, if the problem feature vector does not contain the sensitive word, the problem feature vector is triggered to be matched in a preset document vector library, a preset document which is matched with the input text best is found, and then response is carried out according to the preset document.
For example, the answer document is checked on the same principle as the question feature vector, which will not be described again here.
By the method, whether the sensitive words are contained or not is automatically detected, and corresponding response texts are automatically generated according to the types of the sensitive words or other matching conditions, so that interaction with a user is more natural and accurate.
Referring to fig. 5, a schematic diagram of a structure for verification using a security policy according to an exemplary embodiment of the present invention is shown in detail as follows:
carrying out safety verification on problem information input by a user, judging whether sensitive information is related according to safety strategy configuration information, and if the sensitive information is not related to the problem information, converting the query problem into a semantic vector for calculating the similarity between the problem and the knowledge base text by using a vectorization processing mode which is the same as that of the vertical field knowledge base text;
the generated answer document needs to be checked according to the configured sensitive word service to judge whether the generated content contains sensitive information and, if so, to perform desensitization processing; the related content is also returned to other configured ChatGPT channels for supplementary confirmation, so as to ensure the integrity of the generated answer.
In this embodiment, the content is encrypted and transmitted to the target terminal, and the complete archive document is returned and sent to the user.
In some embodiments, the knowledge base and the large model are deployed using a micro-service containerized architecture design, and services are split according to the single-responsibility principle into a data collector, a management background server and a proxy server, wherein the large model externally accessed by the proxy server comprises at least one of the following: ChatGPT, Wenxin Yiyan (ERNIE Bot), Tongyi Qianwen;
the large model creates a thread pool to buffer requests containing questioning information; and processing the request in an asynchronous queue mode, splitting the text corresponding to the request according to paragraphs when the data text of the request exceeds the preset threshold size of the large model, realizing multiple request calls of one question, caching the pre-returned answer document, and uniformly splicing the pre-returned answer document.
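The split-call-splice flow for over-long requests might look like this sketch; the character limit and the uppercase stub standing in for the large model are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def split_request(text, max_chars):
    """Split an over-long request text into paragraph chunks no longer than the model limit."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # a single paragraph longer than max_chars still forms its own chunk
            current = para
    if current:
        chunks.append(current)
    return chunks

def answer_long_request(text, max_chars, ask):
    """One question becomes multiple model calls; partial answers are cached and spliced uniformly."""
    chunks = split_request(text, max_chars)
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(ask, chunks))  # map preserves chunk order
    return "\n".join(partials)

combined = answer_long_request("p1\n\np2\n\np3", 4, lambda chunk: chunk.upper())
```

The thread pool here plays the role of the request cache described above: calls are issued concurrently while the pre-returned answers are collected and spliced in order.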
Referring to fig. 6, a schematic architecture diagram of a knowledge base and large model based document generation system according to an exemplary embodiment of the present invention is shown, and is described in detail as follows:
the system adopts a micro-service containerized architecture design and splits services according to the single-responsibility principle into a data collector, a management background service (API) and a proxy service (external access); each service node can automatically scale out and in according to performance requirements, which improves the high availability and stability of the system.
The system also adopts an adapter pattern design and supports integration of multiple interfaces, including multiple knowledge base data sources and multiple ChatGPT channels, which improves the expandability of the system.
The system pre-configures, through a configuration center, a local knowledge base corresponding to the vertical domain knowledge base and a vector database, and stores objects through OSS (Object Storage Service), a storage service mainly used for unstructured data such as pictures, videos and log files.
In addition, the tools for software deployment and management employed in FIG. 8 are Kubernetes (k8s) and Docker. Docker is an open-source containerization platform for building, deploying and running applications; by using container technology, it packages applications and their dependencies into separate, portable containers, making the deployment and expansion of applications faster and more efficient while providing cross-platform, portable features.
k8s is an open-source container orchestration system for automated deployment, extension and management of application containers. It coordinates the execution of Docker containers (or other container runtimes) in a cluster, handling tasks such as load balancing, service discovery, expansion and rollback. k8s provides an abstraction layer so that a user can ignore the specific implementation details of the underlying Docker containers, and also provides functions such as automatic disaster recovery, automatic scaling and automatic log collection. Docker is used for packaging and deployment of applications and their dependencies, while k8s is used for orchestration and management of application containers.
ChatGPT request processing:
for the situation where concurrent requests far exceed the ChatGPT processing capacity, the system adopts a request-pooling design with asynchronous queue processing, thereby improving the throughput and utilization rate of the system;
creating a thread pool with a matched size according to the ChatGPT processing capacity, and caching the request of the ChatGPT;
processing the request in an asynchronous queue mode;
when the requested data text exceeds the threshold value set by ChatGPT, automatically splitting the requested text according to paragraphs, realizing multiple request calls of a question, caching returned answers, and finally, uniformly splicing and returning the answers.
By the method, the high availability and the expandability of the AIGC service are realized based on the micro-service architecture containerized architecture technology.
In some embodiments, building the vertical domain knowledge base includes:
acquiring document data in a local knowledge base in advance, splitting the document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments; establishing a vertical domain knowledge base according to the corresponding relation between the text segment and the text vector; dividing the document data into a plurality of text data blocks according to a preset document dividing granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the document data into a plurality of text fragments according to the text splitting position.
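The "granularity block, then back up to a segmentation character" rule above can be sketched as follows; the segmentation character set and granularity value are assumptions for illustration:

```python
SPLIT_CHARS = ".!?;\n"  # preset segmentation characters (illustrative)

def split_into_fragments(text, granularity, split_chars=SPLIT_CHARS):
    """Walk the text in blocks of roughly `granularity` characters, then back up to the
    nearest segmentation character so every fragment ends on a natural boundary."""
    fragments, start = [], 0
    while start < len(text):
        end = min(start + granularity, len(text))
        if end < len(text):
            # look backwards inside the block for the last segmentation character
            cut = max(text.rfind(c, start, end) for c in split_chars)
            if cut > start:
                end = cut + 1
        fragments.append(text[start:end])
        start = end
    return fragments

frags = split_into_fragments("First sentence. Second sentence. Third.", 20)
```

Because each fragment ends at a segmentation character rather than at an arbitrary character offset, no sentence is cut in half, which is the integrity property argued for below.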
Specifically, if the text data corresponding to the document data is monitored to exceed the preset length, splitting the text data corresponding to the document data to obtain a plurality of text fragments, and generating text vectors, fragment identifications and text keywords corresponding to the text fragments; establishing a vertical domain knowledge base according to the corresponding relation among the text fragments, the text vectors, the fragment identifications and the text keywords; for example, acquiring question information; generating a question vector corresponding to the question information; determining a target segment corresponding to the target problem from the text segments of the vertical domain knowledge base according to the text vector corresponding to the problem vector; and generating a question prompt word corresponding to the text information according to at least one part of the target fragment.
By the method, compared with the method of directly dividing through document segmentation granularity, the method has the advantages that the text segmentation position is determined based on segmentation characters, and the integrity of each text segment is guaranteed, so that the prompt information of a target question (question information) is more accurate, and the accuracy of a system in outputting an answer document is improved;
because the large model has input limit, text keywords are extracted from the text fragments, and question prompt words corresponding to target questions are generated according to the text keywords, compared with the whole text fragments, the data size of the text keywords is smaller, so that the input limit of the large model is met, meanwhile, the pertinence is stronger, and the question-answering efficiency of the large model is improved; the method and the system not only generate the question answers through the current target questions, but also generate the question recommendation information according to the history records associated with the target questions, so that more information is provided for users, and the comprehensiveness of the output of the question-answering large model is improved.
In this embodiment, the document generation method based on the knowledge base and the large model can be applied to the daily work of enterprises, schools or companies, and has the following technical effects:
First, the invention realizes knowledge base construction, prompt word template creation, ChatGPT integration and security policy creation based on AIGC interface visual configuration technology; the whole process is flexibly and individually configured according to user requirements, reducing the workload of repeated development and hard coding. Second, based on AIGC process automation technology, the invention realizes automatic processing of knowledge base data and automatic document generation according to the user's question; the whole process is programmatically standardized and requires no excessive user participation, improving the efficiency and quality of document generation. Third, based on AIGC Prompt templating technology, the invention prefabricates Prompt templates by scene classification and dynamically generates large language model prompt word sentences according to the user's input and the data of the local knowledge base. Fourth, based on AIGC security guarantee technology, the invention realizes security guarantees such as input information security checking, generated content security filtering, and user privacy information encryption. Fifth, the invention realizes high availability and expandability of the AIGC service based on micro-service containerized architecture technology.
In the embodiment, through combining the vertical domain knowledge base with the ChatGPT model, a user can quickly create a project document by one-key operation, and required document content is automatically generated. The interface personalized visual configuration, the complete automation of the process and the safety are guaranteed, and a user can obtain required project files almost without excessive adjustment, so that the document writing efficiency, the standardization, the specialization and the document quality are effectively improved.
Referring to FIG. 7, a schematic diagram of a knowledge base and large model based document generation system is shown in accordance with an exemplary embodiment of the invention. As shown in connection with FIG. 7, the exemplary knowledge base and large model based document generation system includes: a configuration module 701, a question determination module 702, a document vector determination module 703, a hint word determination module 704, and a document generation module 705, wherein:
the configuration module 701 is configured to obtain configuration information carrying personalized requirements, configure the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, construct a vertical domain knowledge base, and cause the large model to produce output based on the vertical domain knowledge base, the prompt word template and the security policy;
The problem determining module 702 is configured to obtain input question information, perform feature extraction on text information corresponding to the question information, and determine a problem feature vector;
the document vector determining module 703 is configured to perform a first verification on the problem feature vector according to the security policy, input the problem feature vector into the vertical domain knowledge base for matching after the verification is passed, and output the document vector if a document vector whose degree of association with the problem feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
the prompt word determining module 704 is configured to fill the document vector and the problem feature vector into the prompt word template to generate a prompt word;
the document generation module 705 is configured to input the prompt word into the large model for reasoning, generate an answer document, perform a second verification on the answer document according to the security policy, and output the answer document after the verification is passed.
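The module flow above — embed the question, match it against stored document vectors by association degree, fill the matches into a prompt template, and hand the prompt to the model — can be sketched as follows. This is a minimal illustration only, not the patented implementation: the character-frequency embedding, the 0.8 threshold and the template text are assumptions made for the example.

```python
import math

def embed(text):
    # Toy embedding: a character-frequency vector. A real system would use a
    # trained text-embedding model; this stand-in is an assumption.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def match_knowledge_base(question_vec, kb, threshold=0.8):
    # Return the stored document whose vector's association degree
    # (cosine similarity here) with the question reaches the threshold.
    best_doc, best_score = None, 0.0
    for doc_text, doc_vec in kb:
        score = cosine(question_vec, doc_vec)
        if score >= threshold and score > best_score:
            best_doc, best_score = doc_text, score
    return best_doc

def build_prompt(template, document, question):
    # Fill the matched document and the user question into the prompt template.
    return template.format(document=document, question=question)

kb = [(t, embed(t)) for t in ["project plan for payment gateway",
                              "test report template for mobile app"]]
question = "please draft a project plan for a payment gateway"
doc = match_knowledge_base(embed(question), kb)
prompt = build_prompt("Context: {document}\nQuestion: {question}\nAnswer:",
                      doc, question)
```

The prompt string would then be sent to the configured large model for reasoning, with the security checks of the security policy applied before and after the call.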
It should be noted that the knowledge base and large model-based document generation system provided in the above embodiment and the knowledge base and large model-based document generation method provided in the earlier embodiments belong to the same concept; the specific manner in which each module performs its operation has been described in detail in the method embodiments and is not repeated here.
With the knowledge base and large model-based document generation system provided by this embodiment of the disclosure, the knowledge base is pre-configured to construct a vertical domain knowledge base, and the large model and the knowledge base are configured individually according to the user's service requirements. By combining the vertical domain knowledge base with the large model, a user can create a project document quickly with a one-key operation and the required document content is generated automatically, which greatly improves the efficiency and quality of automatic document generation and satisfies user requirements. Meanwhile, through configuration of the prompt word template and the security policy, the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy, realizing automatic processing of knowledge base data and automatic generation of documents from the user's question information. The whole process is standardized as a program, the required project files can be obtained without excessive user participation, and document writing efficiency, standardization, professionalism and document quality are effectively improved.
Referring to FIG. 8, a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention is shown. It should be noted that, the computer system 800 of the electronic device shown in fig. 8 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present invention.
As shown in FIG. 8, the computer system 800 includes a central processing unit (Central Processing Unit, CPU) 801 that can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a read-only memory (Read-Only Memory, ROM) 802 or a program loaded from a storage section 808 into a random access memory (Random Access Memory, RAM) 803. In the RAM 803, various programs and data required for system operation are also stored. The CPU 801, the ROM 802 and the RAM 803 are connected to each other by a bus 804. An input/output (Input/Output, I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse and the like; an output portion 807 including a cathode ray tube (Cathode Ray Tube, CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 810 as needed, so that a computer program read out therefrom is installed into the storage section 808 as needed.
In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. When executed by the central processing unit (CPU) 801, the computer program performs the various functions defined in the system of the present invention.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements the knowledge base and large model-based document generation method of at least one of the embodiments described above, such as the embodiment described with reference to FIG. 1.
If implemented in the form of software functional units and sold or used as a stand-alone product, the functions may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described by the computer program of the methods of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed in the present invention may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Those skilled in the art may make modifications and variations to the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.
Claims (11)
1. A knowledge base and large model based document generation method, comprising:
configuration information carrying personalized requirements is obtained, the knowledge base, the large model, the prompt word template and the security policy are respectively configured according to the configuration information, a vertical domain knowledge base is constructed, and the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy;
acquiring input questioning information, extracting characteristics of text information corresponding to the questioning information, and determining a question characteristic vector;
performing a first verification on the problem feature vector according to the security policy, inputting the problem feature vector into the vertical domain knowledge base for matching after the verification is passed, and outputting the document vector if a document vector whose degree of association with the problem feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
Filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
and inputting the prompt word into the large model for reasoning to generate an answer document, performing a second verification on the answer document according to the security policy, and outputting the answer document after the verification is passed.
2. The knowledge base and large model based document generation method according to claim 1, wherein obtaining configuration information carrying personalized requirements, and configuring the knowledge base, the large model, a prompt word template and a security policy according to the configuration information, respectively, comprises:
acquiring configuration information carrying personalized requirements, wherein the configuration information comprises first configuration information, second configuration information, third configuration information and fourth configuration information directed respectively to the knowledge base, the security policy, the prompt word template and the large model;
configuring the knowledge base according to the first configuration information to construct a vertical domain knowledge base, wherein the first configuration information comprises at least one of the following: knowledge base classification, name, local file path, external links, text segmentation strategy;
configuring the security policy according to the second configuration information to generate a security policy with question security checking and content security checking, wherein the second configuration information comprises at least one of the following: security policy switching, sensitive word filtering and encryption;
configuring the prompt word template according to the third configuration information to generate a prompt word template containing a question prompt statement for the question information, wherein the third configuration information comprises at least one of the following: prompt word template classification, name, and the prompt word template itself;
configuring the large model according to the fourth configuration information so as to enable the large model to be combined with the knowledge base, the security policy and the prompt word template, wherein the fourth configuration information comprises at least one of the following: large model classification, name, link information, query rate limit per second, data size.
3. The knowledge base and large model based document generation method of claim 1, wherein constructing the vertical domain knowledge base comprises:
determining a local knowledge base based on the knowledge base classification, the knowledge base name and the local file path, or determining the local knowledge base through an external link;
converting the document data in the local knowledge base into text information, and segmenting the text information according to a text segmentation strategy to obtain text in a preset format, wherein the text segmentation strategy comprises at least one of the following granularities: paragraph, text line, word;
and converting the text in the preset format into document vectors, and storing the document vectors into a vector database that supports a vector search engine to form the vertical domain knowledge base.
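The construction steps of claim 3 can be sketched as follows. This is an illustrative example, not part of the claims: the segmentation strategies mirror the paragraph/line/word granularities named above, while the length-based embedding and the in-memory list standing in for a vector database are assumptions made only to keep the sketch self-contained.

```python
def segment(text, strategy="paragraph"):
    # Split the converted text according to the text segmentation strategy:
    # by paragraph (blank line), by text line, or by word.
    if strategy == "paragraph":
        parts = [p.strip() for p in text.split("\n\n")]
    elif strategy == "line":
        parts = [p.strip() for p in text.splitlines()]
    elif strategy == "word":
        parts = text.split()
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [p for p in parts if p]

def build_vertical_kb(documents, embed, strategy="paragraph"):
    # Convert each document into preset-format text fragments and store the
    # fragment -> vector mapping in a (here: in-memory) vector store.
    kb = []
    for doc in documents:
        for fragment in segment(doc, strategy):
            kb.append({"text": fragment, "vector": embed(fragment)})
    return kb

docs = ["Chapter 1.\n\nRequirements overview.", "Design notes."]
kb = build_vertical_kb(docs, embed=lambda t: [float(len(t))])
```

A production system would persist the fragment/vector pairs in a database with vector-search support rather than a Python list.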
4. The knowledge base and large model based document generation method of claim 1, wherein verification is performed using the security policy, further comprising:
if the problem feature vector corresponding to the question information is received, performing a first verification on the problem feature vector according to a preset sensitive word stock to determine whether the problem feature vector relates to sensitive word information; if it does not relate to sensitive word information, the verification is passed, and if it does, the verification fails;
and if the answer document is received, performing a second verification on the answer document to determine whether the answer document relates to sensitive word information; if it does not relate to sensitive word information, the verification is passed and the large model is used to supplement the integrity of the current answer document until the final answer document is output, and if it does, the verification fails.
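The first and second verifications of claim 4 amount to the same lexicon check applied at two points of the pipeline. A minimal sketch, not part of the claims; the two-word sensitive lexicon is a toy assumption standing in for the preset sensitive word stock.

```python
SENSITIVE_WORDS = {"secret", "leak"}   # assumed toy sensitive-word store

def involves_sensitive(text, lexicon=SENSITIVE_WORDS):
    # True if any word from the sensitive word stock occurs in the text.
    low = text.lower()
    return any(word in low for word in lexicon)

def first_verification(question_text):
    # First verification: check the question against the preset sensitive
    # word stock; pass only if no sensitive word is involved.
    return not involves_sensitive(question_text)

def second_verification(answer_text):
    # Second verification: the generated answer document is checked the
    # same way before it may be output.
    return not involves_sensitive(answer_text)
```

In practice the lexicon check would be complemented by the type-specific handling described in claim 6.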
5. The knowledge base and large model based document generation method of claim 4, wherein, before outputting the final answer document, the method further comprises:
generating a public key and private key pair based on a preset asymmetric encryption algorithm, configuring a public key in the public key and private key pair in the large model, and configuring a private key in the public key and private key pair in a target terminal;
signing the answer document to be supplemented based on the private key, generating a first encrypted message, and uploading it to the large model;
the large model performs signature verification on the first encrypted message and responds, generates the final answer document, generates a second encrypted message and feeds it back to the target terminal; and the second encrypted message is extracted, arranged and packaged to obtain the final answer document, which is then output.
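The sign-and-verify handshake of claim 5 can be illustrated with textbook RSA on toy parameters. This is a demonstration only, not part of the claims: the key values below are classic teaching numbers, and a real deployment would use a vetted cryptography library with keys of at least 2048 bits.

```python
import hashlib

# Toy RSA key pair derived from p=61, q=53 (textbook demonstration values):
# modulus n = 3233, public exponent e = 17, private exponent d = 2753.
N, E, D = 3233, 17, 2753

def digest(message):
    # Reduce a SHA-256 digest modulo n so it fits the toy key size.
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message, d=D, n=N):
    # Target terminal: sign the answer document to be supplemented
    # with the configured private key.
    return pow(digest(message), d, n)

def verify(message, signature, e=E, n=N):
    # Large-model side: verify the signature with the configured public key.
    return pow(signature, e, n) == digest(message)

msg = b"draft answer document"
sig = sign(msg)
```

The second encrypted message fed back to the target terminal would be produced and checked symmetrically with the terminal's key material.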
6. The knowledge base and large model based document generation method of claim 4, wherein determining whether sensitive word information is involved further comprises:
if the problem feature vector contains sensitive words, identifying the types of the sensitive words in the problem feature vector, determining preset sensitive texts according to the types of the sensitive words, and responding according to the preset sensitive texts; if the problem feature vector does not contain the sensitive word, triggering the problem feature vector to be matched in a vertical domain knowledge base;
if the answer document contains a sensitive word, desensitizing the answer document to filter the sensitive word; and if the answer document does not contain a sensitive word, triggering output of the answer document.
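The two branches of claim 6 — preset responses by sensitive-word type on the question path, desensitization on the answer path — can be sketched as below. Illustrative only, not part of the claims; the lexicon, the category names and the preset texts are assumptions invented for the example.

```python
# Assumed toy data: sensitive word -> category, and preset responses per category.
SENSITIVE_LEXICON = {"password": "credential", "ssn": "privacy"}
PRESET_RESPONSES = {
    "credential": "This question involves credential information and cannot be answered.",
    "privacy": "This question involves private information and cannot be answered.",
}

def classify_sensitive(text):
    # Identify the type of the first sensitive word found, or None.
    low = text.lower()
    for word, category in SENSITIVE_LEXICON.items():
        if word in low:
            return category
    return None

def respond_to_question(question):
    # Question path: answer with the preset sensitive text for the detected
    # type; otherwise signal that knowledge-base matching should be triggered.
    category = classify_sensitive(question)
    if category is not None:
        return PRESET_RESPONSES[category]
    return "MATCH_KNOWLEDGE_BASE"

def desensitize(answer):
    # Answer path: filter sensitive words out of the generated answer.
    out = answer
    for word in SENSITIVE_LEXICON:
        out = out.replace(word, "***")
    return out
```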
7. The knowledge base and large model based document generation method according to any one of claims 1 to 6, wherein the knowledge base and the large model are arranged by adopting a microservice containerization architecture design, and services are split into a data collector, a management background server and a proxy server according to the single-responsibility principle, wherein the large model externally accessed by the proxy server comprises at least one of the following: ChatGPT, Wenxin Yiyan (ERNIE Bot), Tongyi Qianwen;
the large model service creates a thread pool to buffer requests containing question information, and processes the requests in an asynchronous queue; when the data text of a request exceeds the preset threshold size of the large model, the text corresponding to the request is split by paragraph, so that one question is served by multiple request calls, and the pre-returned answer documents are cached and spliced together uniformly.
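The split-and-splice behaviour of claim 7 — one oversized question becoming several model calls whose partial answers are cached and joined — can be sketched as follows. Not part of the claims; the character-based size threshold and the `echo_model` stub are assumptions for illustration.

```python
def split_request(text, max_chars, sep="\n\n"):
    # When the request text exceeds the model's size threshold, split it by
    # paragraph into sub-requests, packing paragraphs greedily up to the limit.
    paragraphs = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = (current + sep + p) if current else p
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks

def answer_long_request(text, call_model, max_chars=80):
    # One question -> multiple model calls; cache the pre-returned partial
    # answers and splice them together uniformly.
    partial_answers = [call_model(chunk) for chunk in split_request(text, max_chars)]
    return "\n".join(partial_answers)

# Hypothetical stand-in for the proxied large-model call.
echo_model = lambda chunk: "ANSWER(%d chars)" % len(chunk)
```

A production proxy would dispatch the sub-requests through the thread pool and asynchronous queue described above rather than sequentially.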
8. The knowledge base and large model based document generation method of claim 1, wherein constructing the vertical domain knowledge base comprises:
Acquiring document data in a local knowledge base in advance, splitting the document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments;
establishing the vertical domain knowledge base according to the correspondence between the text fragments and the text vectors; dividing the document data into a plurality of text data blocks according to a preset document division granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the document data into a plurality of text fragments according to the text segmentation positions.
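The two-stage splitting of claim 8 — cut the document into blocks of a preset granularity, then move each cut back to a preset segmentation character — can be sketched as below. Illustrative only, not part of the claims; the sentence-ending characters chosen as segmentation characters are an assumption.

```python
def split_document(text, block_size, split_chars=".!?"):
    # Cut the document into blocks of at most block_size characters, then
    # search backwards inside each block for the last preset segmentation
    # character so fragments end at natural boundaries.
    fragments, start = [], 0
    while start < len(text):
        end = min(start + block_size, len(text))
        if end < len(text):
            cut = max(text.rfind(c, start, end) for c in split_chars)
            if cut > start:
                end = cut + 1  # keep the segmentation character in the fragment
        fragments.append(text[start:end].strip())
        start = end
    return [f for f in fragments if f]
```

Each resulting fragment would then be vectorized and stored with its text, as in claim 3.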
9. A knowledge base and large model based document generation system, comprising:
the configuration module is used for acquiring configuration information carrying personalized requirements, respectively configuring the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, constructing a vertical domain knowledge base, and outputting the large model based on the vertical domain knowledge base, the prompt word template and the security policy;
the problem determining module is used for acquiring input questioning information, extracting characteristics of text information corresponding to the questioning information and determining a problem characteristic vector;
The document vector determining module is used for checking the problem feature vector once according to the security policy, inputting the problem feature vector into the vertical domain knowledge base for matching after the problem feature vector passes the checking, and outputting the document vector if the document vector which has the association degree with the problem feature vector reaching a preset threshold is matched in the vertical domain knowledge base;
the prompt word determining module is used for filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
and the document generation module is used for inputting the prompt words into the large model for reasoning, generating an answer document, carrying out secondary verification on the answer document according to the security policy, and outputting the answer document after the verification is passed.
10. An electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to cause the electronic device to perform the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311521400.2A | 2023-11-13 | 2023-11-13 | Knowledge base and large model-based document generation system, method, equipment and medium
Publications (1)
Publication Number | Publication Date |
---|---
CN117556010A | 2024-02-13
Family
ID=89814148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN202311521400.2A (Pending) | Knowledge base and large model-based document generation system, method, equipment and medium | 2023-11-13 | 2023-11-13
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117556010A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117743315A (en) * | 2024-02-20 | 2024-03-22 | 浪潮软件科技有限公司 | Method for providing high-quality data for multi-mode large model system |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |