CN117556010A - Knowledge base and large model-based document generation system, method, equipment and medium - Google Patents


Info

Publication number
CN117556010A
Authority
CN
China
Prior art keywords
document
knowledge base
text
large model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311521400.2A
Other languages
Chinese (zh)
Inventor
王晓虎
陈善艺
陈明友
Current Assignee
Zhejiang Geely Holding Group Co Ltd
Guangyu Mingdao Digital Technology Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Guangyu Mingdao Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd, Guangyu Mingdao Digital Technology Co Ltd filed Critical Zhejiang Geely Holding Group Co Ltd
Priority to CN202311521400.2A priority Critical patent/CN117556010A/en
Publication of CN117556010A publication Critical patent/CN117556010A/en


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 Information retrieval of unstructured textual data
                        • G06F16/33 Querying
                            • G06F16/332 Query formulation
                                • G06F16/3329 Natural language query formulation or dialogue systems
                            • G06F16/3331 Query processing
                                • G06F16/3332 Query translation
                                    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
                                • G06F16/334 Query execution
                                    • G06F16/3344 Query execution using natural language analysis
                        • G06F16/35 Clustering; Classification
                • G06F40/00 Handling natural language data
                    • G06F40/10 Text processing
                        • G06F40/166 Editing, e.g. inserting or deleting
                            • G06F40/186 Templates
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N5/00 Computing arrangements using knowledge-based models
                    • G06N5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a document generation method, system, device and medium based on a knowledge base and a large model. The method includes: obtaining configuration information carrying personalized requirements, configuring a knowledge base, a large model, a prompt word template and a security policy according to the configuration information, and constructing a vertical domain knowledge base; extracting features from the text information corresponding to the question information to determine a question feature vector; verifying the question feature vector against the security policy and, after the check passes, inputting it into the vertical domain knowledge base; if a document vector associated with the question feature vector is matched, outputting the document vector; filling the document vector and the question feature vector into the prompt word template to generate a prompt word; and inputting the prompt word into the large model for inference, generating an answer document, verifying the answer document, and outputting it after the check passes. The method and device improve document writing efficiency, standardization, professionalism and document quality.

Description

Knowledge base and large model-based document generation system, method, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a document generation system, method, device and medium based on a knowledge base and a large model.
Background
In the related art, enterprises must produce a large volume of delivery documents in daily operation. Traditional document processing is inefficient and involves a large amount of repetitive, low-value work, making it difficult to meet enterprise goals of cost reduction and efficiency improvement.
However, when a general-purpose large model is used to produce the delivery documents required by enterprise operations, the material the large model was trained on belongs to the general domain, so its handling of vertical-domain document content is not professional enough; the output delivery documents match the question information poorly and cannot meet user requirements.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended to neither identify key/critical elements nor delineate the scope of such embodiments, but is intended as a prelude to the more detailed description that follows.
In view of the shortcomings of the prior art, the invention discloses a document generation system, method, device and medium based on a knowledge base and a large model, to solve the problem that the delivered document matches the question information poorly and cannot meet user requirements.
In a first aspect, the present invention provides a document generation method based on a knowledge base and a large model, including: obtaining configuration information carrying personalized requirements, configuring the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, constructing a vertical domain knowledge base, and having the large model produce output based on the vertical domain knowledge base, the prompt word template and the security policy; obtaining input question information, extracting features from the text information corresponding to the question information, and determining a question feature vector; performing a first verification of the question feature vector according to the security policy and, after the verification passes, inputting the question feature vector into the vertical domain knowledge base for matching; if a document vector whose degree of association with the question feature vector reaches a preset threshold is matched in the vertical domain knowledge base, outputting the document vector; filling the document vector and the question feature vector into the prompt word template to generate a prompt word; and inputting the prompt word into the large model for inference, generating an answer document, performing a second verification of the answer document according to the security policy, and outputting it after the verification passes.
In an embodiment of the first aspect, obtaining configuration information carrying personalized requirements and configuring the knowledge base, the large model, the prompt word template and the security policy according to the configuration information includes: obtaining configuration information carrying personalized requirements, where the configuration information includes first, second, third and fourth configuration information targeting, in order, the knowledge base, the security policy, the prompt word template and the large model; configuring the knowledge base according to the first configuration information to construct a vertical domain knowledge base, where the first configuration information includes at least one of: knowledge base classification, name, local file path, external links, and text segmentation strategy; configuring the security policy according to the second configuration information to generate a security policy with question security checking and content security checking, where the second configuration information includes at least one of: security policy switches, sensitive word filtering, and encryption; configuring the prompt word template according to the third configuration information to generate a prompt word template containing a question prompt statement for the question information, where the third configuration information includes at least one of: prompt word template classification, name, and template content; and configuring the large model according to the fourth configuration information so that the large model works with the knowledge base, the security policy and the prompt word template, where the fourth configuration information includes at least one of: large model classification, name, link information, query-rate limit per second, and data size.
In an embodiment of the first aspect, constructing the vertical domain knowledge base includes: determining a local knowledge base based on the knowledge base classification, knowledge base name and local file path, or determining the local knowledge base through an external link; converting the document data in the local knowledge base into text information, and segmenting the text information according to a text segmentation strategy to obtain text in a preset format, where the text segmentation strategy includes at least one of: paragraph, text line, and word; and converting the text in the preset format into document vectors and storing the document vectors in a vector database that supports a vector search engine, to form the vertical domain knowledge base.
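The segmentation-vectorization-storage pipeline above can be sketched as follows. This is illustrative only: `toy_embed` is a hash-based stand-in for a real embedding model, and a plain Python list stands in for a vector database with a vector search engine.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Stand-in for a real embedding model: hash character bigrams into a
    fixed-size vector and L2-normalize it."""
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        bucket = int(hashlib.md5(text[i:i + 2].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def segment(text: str, strategy: str = "paragraph") -> list[str]:
    """Apply the configured text segmentation strategy (paragraph/line/word)."""
    if strategy == "paragraph":
        parts = [p.strip() for p in text.split("\n\n")]
    elif strategy == "line":
        parts = [p.strip() for p in text.splitlines()]
    else:  # "word"
        parts = text.split()
    return [p for p in parts if p]

def build_knowledge_base(documents: list[str], strategy: str = "paragraph"):
    """Segment each document and store (fragment, vector) pairs."""
    kb = []
    for doc in documents:
        for fragment in segment(doc, strategy):
            kb.append({"text": fragment, "vector": toy_embed(fragment)})
    return kb

kb = build_knowledge_base(["Intro paragraph.\n\nRequirements section."])
```

In a real deployment the vectors would be written to a vector database rather than kept in memory; the sketch only shows the data flow from document text to stored document vectors.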
In an embodiment of the first aspect, the verification using the security policy further includes: when the question feature vector corresponding to the question information is received, performing a first verification of the question feature vector against a preset sensitive word lexicon to determine whether the question feature vector involves sensitive word information; if it does not, the verification passes, and if it does, the verification fails; and when the answer document is received, performing a second verification of the answer document to determine whether it involves sensitive word information; if it does not, the verification passes and the large model supplements the current answer document for completeness until the final answer document is output, and if it does, the verification fails.
In an embodiment of the first aspect, outputting the final answer document further includes: generating a public/private key pair based on a preset asymmetric encryption algorithm, configuring the public key of the pair on the large model side and the private key of the pair on a target terminal; signing the answer document to be supplemented with the private key and generating a first encrypted message to upload to the large model; having the large model verify the signature of the first encrypted message and respond, generating the final answer document and feeding a second encrypted message back to the target terminal; and extracting, arranging and packaging the second encrypted message to obtain the final answer document and output it.
In an embodiment of the first aspect, determining whether sensitive word information is involved further includes: if the question feature vector contains a sensitive word, identifying the type of the sensitive word in the question feature vector, determining a preset sensitive-word response text according to that type, and responding with the preset text; if the question feature vector does not contain a sensitive word, triggering matching of the question feature vector in the vertical domain knowledge base; if the answer document contains a sensitive word, desensitizing the answer document to filter the sensitive word out; and if the answer document does not contain a sensitive word, triggering output of the answer document.
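The desensitization step described above can be sketched as a simple replacement filter. The lexicon below is hypothetical; a real system would use the sensitive word lexicon configured through the security policy.

```python
def desensitize(answer: str, lexicon: set[str]) -> str:
    """Replace each sensitive word in the answer document with mask chars."""
    for word in lexicon:
        answer = answer.replace(word, "*" * len(word))
    return answer

# "secret_project" is a made-up sensitive word for illustration
clean = desensitize("Budget for secret_project is pending.", {"secret_project"})
```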
In an embodiment of the first aspect, the knowledge base and the large model are deployed using a micro-service containerized architecture, and services are split, following the single-responsibility principle, into a data collector, a management back-end server and a proxy server, where the large model accessed externally through the proxy server includes at least one of: ChatGPT, ERNIE Bot (Wenxin Yiyan), and Tongyi Qianwen; the large model side creates a thread pool to buffer requests containing question information; the requests are processed through an asynchronous queue, and when the text of a request exceeds the large model's preset size threshold, the text corresponding to the request is split by paragraph, so that one question results in multiple request calls; the pre-returned answer documents are cached and then spliced together uniformly.
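The split-call-splice behavior for oversized requests might look like the following sketch. `call_model` stands in for the proxied large model call, and the character limit is an arbitrary stand-in for the model's preset size threshold.

```python
def split_by_paragraph(text: str, max_len: int) -> list[str]:
    """Greedily pack paragraphs into chunks no longer than max_len."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

def ask_large_model(question: str, call_model, max_len: int = 100) -> str:
    """One question may become several model calls; the pre-returned
    answers are cached in order and spliced into one document."""
    if len(question) <= max_len:
        return call_model(question)
    cached = [call_model(chunk) for chunk in split_by_paragraph(question, max_len)]
    return "\n".join(cached)

# fake model that just reports the chunk length, for illustration
answer = ask_large_model("a" * 60 + "\n\n" + "b" * 60, lambda q: f"[{len(q)}]")
```

In the described architecture the calls would go through the asynchronous request queue rather than a synchronous loop; the sketch only shows the splitting and splicing logic.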
In an embodiment of the first aspect, constructing the vertical domain knowledge base includes: obtaining the document data in the local knowledge base in advance, splitting the document data into a plurality of text fragments, and generating a text vector for each fragment; building the vertical domain knowledge base from the correspondence between text fragments and text vectors; dividing the document data into a plurality of text data blocks according to a preset document splitting granularity; determining text split positions within the text data blocks according to preset split characters; and splitting the document data into the plurality of text fragments at the text split positions.
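The granularity-then-delimiter splitting above can be sketched as follows. The block size and the delimiter set are hypothetical values; an implementation would take both from the configured text segmentation strategy.

```python
def split_document(text: str, block_size: int = 200,
                   delimiters: str = "\u3002.!?\n") -> list[str]:
    """Walk the text in blocks of the preset granularity, then extend each
    block boundary to the next preset split character so fragments end on
    sentence boundaries."""
    fragments, start = [], 0
    while start < len(text):
        end = min(start + block_size, len(text))
        while end < len(text) and text[end - 1] not in delimiters:
            end += 1
        fragments.append(text[start:end].strip())
        start = end
    return [f for f in fragments if f]

frags = split_document("First sentence. Second sentence. Third.", block_size=10)
```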
In a second aspect, the invention provides a document generation system based on a knowledge base and a large model, including: a configuration module for obtaining configuration information carrying personalized requirements, configuring the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, constructing a vertical domain knowledge base, and having the large model produce output based on the vertical domain knowledge base, the prompt word template and the security policy; a question determination module for obtaining input question information, extracting features from the text information corresponding to the question information, and determining a question feature vector; a document vector determination module for performing a first verification of the question feature vector according to the security policy, inputting the question feature vector into the vertical domain knowledge base for matching after the verification passes, and outputting a document vector if a document vector whose degree of association with the question feature vector reaches a preset threshold is matched in the vertical domain knowledge base; a prompt word determination module for filling the document vector and the question feature vector into the prompt word template to generate a prompt word; and a document generation module for inputting the prompt word into the large model for inference, generating an answer document, performing a second verification of the answer document according to the security policy, and outputting the answer document after the verification passes.
The invention provides an electronic device, including a processor and a memory, where the memory is configured to store a computer program and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the above method.
The present invention provides a computer-readable medium on which a computer program is stored, the computer program causing a computer to perform the above method.
The invention has the beneficial effects that:
The vertical domain knowledge base is constructed by pre-configuring the knowledge base; the large model and the knowledge base are configured individually according to the user's business processing requirements, and the vertical domain knowledge base is combined with the large model, so that a user can create a project document with a one-click operation and the required document content is generated automatically, greatly improving the efficiency and quality of automatic document generation and meeting user requirements. Meanwhile, by configuring the prompt word template and the security policy, the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy, so that knowledge base data is processed automatically and documents are generated automatically from the user's question information; the whole process is standardized, the required project files can be obtained without excessive user involvement, and document writing efficiency, standardization, professionalism and document quality are effectively improved.
Drawings
FIG. 1 is a flow diagram illustrating a knowledge base and large model based document generation method in accordance with an illustrative embodiment of the present invention;
FIG. 2 is a schematic diagram of a personalized configuration flow shown in an exemplary embodiment of the invention;
FIG. 3 is a flow chart of a method for automatically generating documents according to an exemplary embodiment of the present invention;
FIG. 4 is a schematic diagram of a prompt configuration flow shown in an exemplary embodiment of the invention;
FIG. 5 is a schematic diagram illustrating a structure for verification using security policies in accordance with an exemplary embodiment of the present invention;
FIG. 6 is a schematic diagram of an architecture of a knowledge base and large model based document generation system, as illustrated in an exemplary embodiment of the invention;
FIG. 7 is a schematic diagram of a knowledge base and large model based document generation system in accordance with an illustrative embodiment of the invention;
fig. 8 is a schematic diagram of a computer system suitable for use in implementing the electronic device of the present invention, as shown in an exemplary embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of this description may be modified or varied in various respects without departing from the spirit and scope of the invention. It should be noted that, where no conflict arises, the following embodiments and the features within them may be combined with one another.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the invention schematically; the drawings show only the components related to the invention and are not drawn according to the number, shape and size of components in an actual implementation, where the form, quantity, proportion and layout of the components may be changed arbitrarily and may be more complex.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form rather than in detail, to avoid obscuring the embodiments of the invention.
The terms first, second and the like in the description and in the claims of the embodiments of the disclosure and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe embodiments of the present disclosure. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. The term "plurality" means two or more, unless otherwise indicated. In the embodiment of the present disclosure, the character "/" indicates that the front and rear objects are an or relationship. For example, A/B represents: a or B. The term "and/or" is an associative relationship that describes an object, meaning that there may be three relationships. For example, a and/or B, represent: a or B, or, A and B.
Referring to fig. 1, a flowchart of a document generating method based on a knowledge base and a large model according to an exemplary embodiment of the invention is shown. Referring to fig. 1, in an exemplary embodiment, the method for generating a document based on a knowledge base and a large model at least includes steps S101 to S105, which are described in detail as follows:
step S101, configuration information carrying personalized requirements is obtained, the knowledge base, the large model, a prompt word template and a security policy are respectively configured according to the configuration information, a vertical domain knowledge base is constructed, and the large model is output based on the vertical domain knowledge base, the prompt word template and the security policy;
the configuration information with personalized requirements is obtained in advance to meet the service requirements of users, for example, a local knowledge base is configured according to the requirements of the users to form a vertical field knowledge base, meanwhile, a prompt word template and a security policy are configured, so that the large model adopts the security policy to carry out security verification before the question information is input, the question information and a document vector which is matched with the question information in the vertical field knowledge base are filled into the prompt word template, prompt words are generated and input into the large model to carry out reasoning, answer documents are determined, and the security verification is carried out on the answer documents by adopting the security policy, and the answer documents can be output after verification is passed.
Specifically, AIGC can be used to configure the knowledge base, the large model, the prompt word template and the security policy respectively, ensuring flexible personalized interface configuration, full-process automation, templated prompt words, a security guarantee mechanism, and high service availability and scalability based on a containerized micro-service architecture.
For example, AIGC (AI-generated content) refers to generative artificial intelligence, an emerging technology in the artificial intelligence field in recent years. By learning from large amounts of data and knowledge, it can automatically generate new and creative content, and it is applied in fields such as natural language processing, image generation and audio synthesis.
Step S102, obtaining input question information, extracting features from the text information corresponding to the question information, and determining a question feature vector;
Specifically, the question information may be any of text, voice, image or video information. For example, a target terminal is used to collect the user's voice information, where the target terminal includes at least one of: a smartphone, a tablet, a notebook computer, or a desktop computer.
For example, after the voice information is converted into text information, features are extracted from the text to obtain the question feature vector. Voice information input by the user is captured through a recording or voice-input device using speech recognition technology, and converted into text information with a speech recognition engine; the converted text is preprocessed, including word segmentation, stopword removal, removal of special symbols and similar operations, in preparation for feature extraction; features are then extracted from the preprocessed text, which may include text features such as term frequency, word length, part of speech and named entities, and may be weighted with methods such as TF-IDF (Term Frequency-Inverse Document Frequency), so that the extracted features are converted into a numerical vector, which forms the question feature vector.
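The TF-IDF weighting mentioned above can be sketched as follows for pre-tokenized text (word segmentation and stopword removal assumed already done); the tiny corpus is invented for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute TF-IDF weights for each pre-tokenized document:
    tf = term count / document length, idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

corpus = [["document", "generation", "knowledge"],
          ["document", "template", "prompt"]]
vecs = tf_idf_vectors(corpus)
```

Terms appearing in every document get weight zero, while terms unique to one document are weighted up, which is the behavior the feature-extraction step relies on.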
Step S103, performing a first verification of the question feature vector according to the security policy, inputting the question feature vector into the vertical domain knowledge base for matching after the verification passes, and outputting the document vector if a document vector whose degree of association with the question feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
Specifically, the words expressed by the question feature vector are matched against a preset sensitive word lexicon; if the question feature vector matches no sensitive word in the preset lexicon, the verification passes, otherwise it fails.
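The first-pass check can be sketched as a lexicon lookup over the question text; the lexicon entries below are invented placeholders, not words from the patent.

```python
SENSITIVE_WORDS = {"secret_project", "internal_only"}  # hypothetical lexicon

def first_pass_check(question_text: str, lexicon=SENSITIVE_WORDS):
    """Return (passed, hits): pass only if no sensitive word occurs."""
    hits = [w for w in lexicon if w in question_text]
    return (len(hits) == 0, hits)

ok, hits = first_pass_check("please summarize the secret_project design")
```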
Specifically, similarity is computed between the question feature vector and each document vector to obtain the vector similarity between them; document vectors associated with the question feature vector are then determined from the preset document vector library according to the vector similarity, and the document vectors whose degree of association reaches the preset threshold are taken as the document vectors to output.
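Using cosine similarity as the association measure (the patent does not fix a specific measure, so this is an assumption, as is the 0.8 threshold), the matching step might look like:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_documents(question_vec, kb, threshold=0.8):
    """Return document entries whose similarity to the question vector
    reaches the preset threshold, best match first."""
    scored = [(cosine(question_vec, d["vector"]), d) for d in kb]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for score, d in scored if score >= threshold]

kb = [{"id": 1, "vector": [1.0, 0.0]}, {"id": 2, "vector": [0.0, 1.0]}]
matches = match_documents([1.0, 0.1], kb, threshold=0.8)
```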
Optionally, if the verification fails, a preset template is invoked to respond directly to the question feature vector.
Step S104, filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
Specifically, the document vector and the question feature vector are taken as inputs, and the words or phrases to fill in are determined from the positions of the vectors in the prompt word template; if the prompt word template contains placeholders or variables, they are replaced with the corresponding words or phrases.
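A placeholder-based template fill can be sketched as below. The template text, placeholder names and domain value are all invented for illustration; the actual prompt word template comes from the third configuration information.

```python
PROMPT_TEMPLATE = (
    "You are a documentation assistant for the {domain} domain.\n"
    "Reference material:\n{context}\n"
    "Question: {question}\n"
    "Answer in the style of a formal delivery document."
)  # hypothetical template text

def build_prompt(template: str, context: str, question: str,
                 domain: str = "automotive") -> str:
    """Fill the template placeholders with the matched document text
    (context) and the user's question."""
    return template.format(domain=domain, context=context, question=question)

prompt = build_prompt(PROMPT_TEMPLATE,
                      "Chapter 3: test plan ...",
                      "Draft a test plan summary")
```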
It should be noted that when filling the prompt word template, the name, type, style and content of the document should be considered, to ensure that the generated prompt word remains consistent with the original document. Other techniques may also be used to optimize prompt word generation, such as preprocessing the text with natural language processing techniques or analyzing it with rules or machine learning models, for example feeding back the user intent and the target task in the prompt word.
In this way, a prompt word with specific content and format can be generated for subsequent text generation or other tasks.
Step S105, inputting the prompt word into the large model for inference, generating an answer document, performing a second verification of the answer document according to the security policy, and outputting it after the verification passes.
Specifically, the large model processes (for example, encodes or transforms) the sequence corresponding to the input prompt word to generate an output sequence, which is decoded or transformed into the final answer document. In addition, the generated answer document undergoes a second verification using the security policy, which may cover content filtering, content integrity checking, language style checking and grammatical correctness; if the answer document does not satisfy the security policy, it is corrected or regenerated accordingly until it passes the verification, and the answer document is then output.
In this way, the vertical domain knowledge base is constructed by pre-configuring the knowledge base; the large model and the knowledge base are configured individually according to the user's business processing requirements, and the vertical domain knowledge base is combined with the large model, so that a user can create a project document with a one-click operation and the required document content is generated automatically, greatly improving the efficiency and quality of automatic document generation and meeting user requirements. Meanwhile, by configuring the prompt word template and the security policy, the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy, so that knowledge base data is processed automatically and documents are generated automatically from the user's question information; the whole process is standardized, the required project files can be obtained without excessive user involvement, and document writing efficiency, standardization, professionalism and document quality are effectively improved.
Referring to fig. 2, a personalized configuration flow chart according to an exemplary embodiment of the present invention is shown and described in detail as follows. Obtaining configuration information carrying personalized requirements, and respectively configuring the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, comprises the following steps:
acquiring configuration information carrying personalized requirements, wherein the configuration information comprises first, second, third and fourth configuration information, which respectively target the knowledge base, the security policy, the prompt word template and the large model;
for example, in fig. 2, the knowledge base, the large model (e.g., ChatGPT), the prompt word template and the security policy are configured in sequence; it should be noted, however, that no particular configuration order is required when configuring the knowledge base, the security policy, the prompt word template and the large model.
Configuring the knowledge base according to the first configuration information to construct a vertical domain knowledge base, wherein the first configuration information comprises at least one of the following: knowledge base classification, name, local file path, external links, text segmentation strategy;
Configuring the security policy according to the second configuration information to generate the security policy with problem security check and content security check; the second configuration information includes at least one of: security policy switching, sensitive word filtering and encryption;
configuring the prompt word template according to the third configuration information, and generating a prompt word template containing question prompt word sentences for the question information, wherein the third configuration information comprises at least one of the following: prompt word template classification, name, and prompt word template content;
for example, a prompt word template example: """Known information: {context}. Briefly and professionally answer the user's question according to the above known information. If no answer can be obtained from it, please answer: "The question cannot be answered based on the known information" or "Sufficient relevant information is not provided". Adding fictional content to the answer is not allowed, and the answer must be in Chinese. The question is: {query}""". The context part is filled with the text retrieved from the knowledge base, and the query part is filled with the user's query question.
Configuring the large model according to the fourth configuration information so as to enable the large model to be combined with the knowledge base, the security policy and the prompt word template, wherein the fourth configuration information comprises at least one of the following: large model classification, name, link information, query rate limit per second, data size.
In this embodiment, the knowledge base, the large model, the prompt word template and the security policy are configured according to the personalized requirement configuration information: the knowledge base is configured according to the first configuration information to construct a vertical domain knowledge base; the security policy is configured according to the second configuration information to generate a security policy with question security check and content security check; the prompt word template is configured according to the third configuration information to generate a prompt word template containing question prompt word sentences for the question information; and the large model is configured according to the fourth configuration information so that it can be combined with the knowledge base, the security policy and the prompt word template to generate an answer. In general, the corresponding configuration can be performed according to the personalized requirements of the user, thereby providing more accurate and safer answers to questions.
Referring to fig. 3, a flowchart of a method for automatically generating a document according to an exemplary embodiment of the present invention is shown, including:
A. Knowledge base data preprocessing
1) Loading and reading documents
The knowledge base documents stored locally are read by means of batch processing and converted into text format.
2) Text segmentation
Segmentation is performed according to the segmentation strategy configured above, for example: the text is divided by classification, paragraph, text line and the like to obtain each part of the text.
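The segmentation step can be sketched as a small Python helper; the strategy names and the paragraph/line conventions here are illustrative assumptions, not the patented implementation:

```python
def segment_text(text: str, strategy: str = "paragraph") -> list[str]:
    """Split knowledge-base text according to a configured segmentation strategy."""
    if strategy == "paragraph":
        parts = text.split("\n\n")      # a blank line separates paragraphs
    elif strategy == "line":
        parts = text.splitlines()
    else:                               # fallback: treat the whole text as one part
        parts = [text]
    return [p.strip() for p in parts if p.strip()]

doc = "ERP system overview.\n\nModules: finance, projects.\nTeam: three members."
print(segment_text(doc, "paragraph"))
```

Each returned fragment is then passed on to the vectorization step.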
B. Text vectorization storage
1) Text vectorization
The segmented text is converted into numerical vectors to facilitate subsequent text similarity calculation, so that the text related to the question can be retrieved.
2) Vector storage
The text vectorized data is stored in an Elasticsearch vector database.
Examples: sentence: "write one ERP System Online scheme"
Word segmentation: word segmentation result of sentences: [ "write", "one copy", "ERP System", "on line", "scheme" ] the system is written to the user
Constructing a dictionary: { "write": "0", "one": "0", "ERP system": "3", "online": "2", "scheme": "0" }
Vector representation: the vectors of the sentence are expressed as: [0 0 3 2 0];
vector storage: these vector data are stored in an elastic search vector database.
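The dictionary-and-vector idea above can be mirrored in Python; for clarity this sketch assigns each distinct token a unique integer ID in order of first appearance (a stand-in for a real embedding model, which would produce dense semantic vectors):

```python
def build_dictionary(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    ids: dict[str, int] = {}
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(ids)
    return ids

def vectorize(tokens: list[str], ids: dict[str, int]) -> list[int]:
    """Represent the token sequence as its sequence of dictionary IDs."""
    return [ids[tok] for tok in tokens]

tokens = ["write", "an", "ERP system", "online", "scheme"]
ids = build_dictionary(tokens)
print(vectorize(tokens, ids))  # → [0, 1, 2, 3, 4]
```

These integer vectors would then be written to the vector database alongside the original fragments.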
C. Safe filtering and problem vectorization
Security verification is performed on the question input by the user, judging whether sensitive information is involved according to the security policy configuration information. If not, the query question is converted into a semantic vector using the same vectorization processing as for the knowledge base text; this semantic vector is used to calculate the similarity between the question and the knowledge base texts.
D. Text similarity matching
Referring to fig. 4 for details, the text vectors closest to the question vector are found by calculation methods such as cosine similarity and Euclidean distance, so as to find the top-k texts most relevant to the question in the vertical domain knowledge base.
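A minimal sketch of the top-k matching step using cosine similarity; in the described system this search would be served by the vector database rather than computed in application code, and the document IDs and vectors below are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored texts whose vectors are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {"erp_plan": [1.0, 0.0, 1.0],
        "hr_faq": [0.0, 1.0, 0.0],
        "deploy_guide": [1.0, 0.2, 0.9]}
print(top_k([1.0, 0.1, 1.0], docs, k=2))  # → ['erp_plan', 'deploy_guide']
```

The top-k fragments are then handed to the prompt word generation step.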
E. Generating a Prompt word statement
Relevant content is selected from the texts most relevant to the question, combined with the question, and filled into the Prompt word template to form prompt words in the Prompt sentence format.
Prompt word sentence example: """Known information: {project name: ERP system; version number: V1.0; mirror version: erp-server-V1.0; functional modules going online: financial accounting, project management, finance management, etc.; team members: Zhang San, Li Si, Wang Wu; online time: 2023-10-16, 20:00}.
Based on the above known information, briefly and professionally answer the user's question. If no answer can be obtained from it, please answer: "The question cannot be answered based on the known information" or "Sufficient relevant information is not provided". Adding fictional content to the answer is not allowed, and the answer must be in Chinese. The question is: {Write an ERP system online scheme}""".
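Filling the retrieved context and the user question into the template (step E) amounts to string substitution; a sketch, with the template text abbreviated from the example above:

```python
TEMPLATE = (
    "Known information: {context}. Briefly and professionally answer the user's "
    "question according to the above known information. The question is: {query}"
)

def build_prompt(context: str, query: str) -> str:
    """Fill the Prompt word template with retrieved knowledge-base text and the question."""
    return TEMPLATE.format(context=context, query=query)

prompt = build_prompt("project name: ERP system; version: V1.0",
                      "Write an ERP system online scheme")
print(prompt)
```

The resulting prompt word sentence is what gets sent to the large model in step F.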
F. Automatically generating document content
The ChatGPT link information is obtained from the configuration information, a dialogue is carried out using the OpenAI interface provided by ChatGPT, the Prompt words are loaded and transmitted to ChatGPT, and the generated answer is returned.
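A hedged sketch of step F: the endpoint and model name below are illustrative assumptions taken from typical configurations, and only the construction of the request body is shown, leaving the actual HTTP call to the configured client:

```python
import json

# Assumed endpoint from the configured link information (illustrative).
CHAT_ENDPOINT = "https://api.openai.com/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the JSON body sent to the ChatGPT-style dialogue interface."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature for factual document generation
    }

body = build_chat_request("Write an ERP system online scheme")
print(json.dumps(body)[:60])
```

In the described system the returned answer would then flow into the content security filtering of step G.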
G. Content security filtering and integrity verification
The generated answer content is checked against the configured sensitive word service to judge whether it contains sensitive information; if so, desensitization processing is performed, and if not, subsequent processing proceeds directly. The related content is also returned to the other configured ChatGPT instances for supplementary confirmation, ensuring the integrity of the generated answer.
H. Returning an answer
And encrypting the answer document and transmitting the encrypted answer document to the target terminal, and returning the complete document content to the user.
For example, the answer document generated in the above manner is as follows:
ERP system online scheme:
1. project introduction
Project name: ERP system
Version number: v1.0
Mirror version: erp-server-V1.0
Covered functional modules: financial accounting, project management, finance management, etc.
Project team: Zhang San, Li Si, Wang Wu
2. System testing
Before formally online, the system is subjected to comprehensive testing, including but not limited to functional testing of modules such as financial accounting, project management, financial management and the like, so that the stability and the accuracy of the system are ensured.
3. Staff training
Staff training on the ERP system is organized, mainly covering basic operation, business processes and data management of the new system; the training aims to ensure that staff are familiar with and can effectively use the new system.
4. Data migration
The data of the old system is migrated to the new ERP system, and the backup and verification of the data are performed to ensure the integrity and accuracy of the data.
5. System switching
After the system test is completed, the data migration is completed and the staff training is completed, the system switching is formally performed, namely, the service is switched from the old system to the new ERP system.
6. On-line support
After online, providing technical support for a period of time to solve the problems possibly encountered by a new system in operation; meanwhile, feedback of users is collected, and the system is continuously optimized and upgraded.
7. Project assessment
After the system is online, the project is evaluated, including the stability of the system, the accuracy of data, the use condition of staff and the like, so as to know the actual effect and influence of the system.
8. Risk management
Throughout the project, risk management, including risk identification, risk assessment and risk management, will continue to be performed to ensure successful performance of the project.
9. Time to line
The specific time to line will be determined based on the traffic and team readiness, with the goal of line to occur when traffic impact is minimal.
To sum up, in this embodiment, the online launch of the ERP (enterprise resource planning) system improves the business operation efficiency of the company and ensures the smooth progress of the project.
In some embodiments, building the vertical domain knowledge base includes:
determining a local knowledge base based on the knowledge base classification, the knowledge base name and the local file path, or determining the local knowledge base through an external link;
converting the document data in the local knowledge base into text information, and segmenting the text information according to a text segmentation strategy to obtain a preset format text, wherein the text segmentation strategy comprises at least one of the following steps: paragraph, text line, word;
and converting the preset-format text into document vectors, and storing the document vectors into a vector database that supports vector search, so as to form the vertical domain knowledge base.
In this embodiment, the preset-format text is converted into document vectors and stored into a vector database that supports vector search (e.g., Elasticsearch) to form the vertical domain knowledge base. In this way, document data can be effectively extracted from the determined local knowledge base, converted into preset-format text and document vectors, and then stored in the vector database to construct the vertical domain knowledge base, thereby providing accurate and comprehensive information for subsequent question answering or other tasks.
In some embodiments, the verifying using the security policy further comprises:
if a question feature vector corresponding to the question information is received, primary verification is performed on the question feature vector against a preset sensitive word library to determine whether it involves sensitive word information; if it does not involve sensitive word information, the verification passes, and if it does, the verification fails;
and if the answer document is received, secondary verification is performed on the answer document to determine whether it involves sensitive word information; if it does not, the verification passes and the large model performs integrity supplementation on the current answer document until the final answer document is output; if it does, the verification fails.
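Both verification passes can be sketched as one helper applied to the question text and to the generated answer; the word list here is a placeholder for the configured sensitive word library:

```python
SENSITIVE_WORDS = {"password", "id_number"}  # placeholder for the configured library

def involves_sensitive(text: str) -> bool:
    """Primary/secondary check: does the text mention any configured sensitive word?"""
    lowered = text.lower()
    return any(word in lowered for word in SENSITIVE_WORDS)

def verify(text: str) -> str:
    """Return 'pass' when no sensitive word is involved, 'fail' otherwise."""
    return "fail" if involves_sensitive(text) else "pass"

print(verify("Write an ERP system online scheme"))  # → pass
print(verify("List every user password"))           # → fail
```

In the described system a real library would also distinguish sensitive word categories to select the preset sensitive response text.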
In some embodiments, until the final answer document is output, further comprising:
generating a public key and private key pair based on a preset asymmetric encryption algorithm, configuring a public key in the public key and private key pair in the large model, and configuring a private key in the public key and private key pair in a target terminal;
signing the answer document to be supplemented based on the private key, and generating a first encrypted message to upload to the large model;
the large model carries out signature verification and response on the first encrypted message, generates a final answer document, generates a second encrypted message and feeds the second encrypted message back to the target terminal; and extracting, arranging and packaging the second encrypted message to obtain a final answer document, and outputting the final answer document.
In this way, the encryption and decryption of the data are realized; meanwhile, signing the first encrypted message with the private key guarantees the source and integrity of the data. This solves the problem of forwarded data being tampered with: no matter in which link the data is tampered, the tampering can be detected, ensuring the authenticity and integrity of the final answer document and improving the safety of the security check.
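To illustrate the signing flow above, the following toy sketch uses deliberately tiny textbook-RSA numbers; a real deployment would generate the key pair with a vetted cryptography library and 2048-bit keys:

```python
import hashlib

# Toy textbook-RSA parameters (illustrative only): n = 61 * 53 = 3233,
# public exponent e = 17, private exponent d = 2753 (inverse of e mod 3120).
N, E, D = 3233, 17, 2753

def digest(message: bytes) -> int:
    """Hash the message and reduce it into the modulus range."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Target-terminal side: sign the message digest with the private exponent d."""
    return pow(digest(message), D, N)

def verify_sig(message: bytes, signature: int) -> bool:
    """Large-model side: verify the signature with the public exponent e."""
    return pow(signature, E, N) == digest(message)

msg = b"answer document to be supplemented"
signature = sign(msg)
print(verify_sig(msg, signature))  # → True
```

The second encrypted message fed back to the target terminal would be verified symmetrically with the roles of the keys reversed.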
In some embodiments, determining whether sensitive word information is involved further comprises:
if the question feature vector contains sensitive words, the categories of the sensitive words in the question feature vector are identified, preset sensitive texts are determined according to the sensitive word categories, and a response is made according to the preset sensitive texts; if the question feature vector does not contain sensitive words, matching of the question feature vector in the vertical domain knowledge base is triggered;
if the answer document contains sensitive words, desensitization is performed on the answer document to filter out the sensitive words; and if the answer document does not contain sensitive words, output of the answer document is triggered.
Specifically, the preset sensitive word library is a preconfigured word library containing various possible sensitive words or phrases; it is used to check whether the input text feature vector contains sensitive words.
If the problem feature vector contains sensitive words, identifying the types of the sensitive words in the problem feature vector, determining preset sensitive texts according to the types of the sensitive words, and responding according to the preset sensitive texts;
specifically, if the question feature vector contains a sensitive word, identifying a sensitive word category in the text feature vector, and determining a preset sensitive text according to the identified sensitive word category; then, responding according to the preset sensitive text.
And if the problem feature vector does not contain the sensitive word, triggering the problem feature vector to be matched in a preset document vector library.
Specifically, if the problem feature vector does not contain the sensitive word, the problem feature vector is triggered to be matched in a preset document vector library, a preset document which is matched with the input text best is found, and then response is carried out according to the preset document.
For example, the answer document is handled on the same principle as the question feature vector, which is not described again here.
By the method, whether the sensitive words are contained or not is automatically detected, and corresponding response texts are automatically generated according to the types of the sensitive words or other matching conditions, so that interaction with a user is more natural and accurate.
Referring to fig. 5, a schematic diagram of a structure for verification using a security policy according to an exemplary embodiment of the present invention is shown in detail as follows:
security verification is carried out on the question information input by the user, and whether sensitive information is involved is judged according to the security policy configuration information; if the question information does not involve sensitive information, the query question is converted into a semantic vector, using the same vectorization processing as for the vertical domain knowledge base text, for calculating the similarity between the question and the knowledge base text;
the generated answer document is checked according to the configured sensitive word service to judge whether the generated content contains sensitive information requiring desensitization processing, and the related content is returned to the other configured ChatGPT instances for supplementary confirmation, ensuring the integrity of the generated answer.
In this embodiment, the content is encrypted and transmitted to the target terminal, and the complete answer document is returned to the user.
In some embodiments, the knowledge base and the large model are deployed using a micro-service containerized architecture design, and services are split according to the single-responsibility principle into a data collector, a management background server and a proxy server, wherein the large models externally accessed by the proxy server comprise at least one of the following: ChatGPT, Wenxin Yiyan, Tongyi Qianwen;
the large model side creates a thread pool to buffer requests containing question information; requests are processed in an asynchronous queue; when the requested data text exceeds the preset threshold size of the large model, the text corresponding to the request is split by paragraph, realizing multiple request calls for one question, and the pre-returned answer documents are cached and uniformly spliced together.
Referring to fig. 6, a schematic architecture diagram of a knowledge base and large model based document generation system according to an exemplary embodiment of the present invention is shown, and is described in detail as follows:
The system adopts a micro-service containerized architecture design and splits services according to the single-responsibility principle into a data collector, a management background service (API) and a proxy service (external access); each service node can automatically scale out and in according to performance requirements, improving the high availability and stability of the system.
The system also adopts an adapter pattern design and supports integration of multiple interfaces, including multiple knowledge base data sources and multiple ChatGPT channels, improving the extensibility of the system.
The system pre-configures, through a configuration center, a local knowledge base corresponding to the vertical domain knowledge base and a vector database, and stores objects through OSS (Object Storage Service), a storage structure mainly used for storing unstructured data such as pictures, videos and log files.
In addition, the tools for software deployment and management employed in FIG. 8 are Kubernetes (k8s) and Docker. Docker is an open-source containerization platform for building, deploying and running applications; by using container technology, it packages an application and its dependencies into a separate, portable container, making the deployment and expansion of applications faster and more efficient while providing cross-platform portability.
k8s is an open-source container orchestration system for automated deployment, scaling and management of application containers. It coordinates the execution of Docker containers (or other container runtimes) in a cluster, handling tasks such as load balancing, service discovery, scaling and rollback. k8s provides an abstraction layer so that users can ignore the specific implementation details of the underlying Docker containers, and also provides functions such as automatic disaster recovery, automatic scaling and automatic log collection. Docker is used for packaging and deploying applications and their dependencies, while k8s is used for orchestrating and managing application containers.
ChatGPT request processing:
for the situation where concurrent requests greatly exceed the ChatGPT processing capacity, the system adopts a pooled request design with asynchronous queued processing, improving the throughput and utilization of the system;
creating a thread pool with a matched size according to the ChatGPT processing capacity, and caching the request of the ChatGPT;
processing the request in an asynchronous queue mode;
when the requested data text exceeds the threshold value set by ChatGPT, automatically splitting the requested text according to paragraphs, realizing multiple request calls of a question, caching returned answers, and finally, uniformly splicing and returning the answers.
By the method, the high availability and the expandability of the AIGC service are realized based on the micro-service architecture containerized architecture technology.
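The pooled-request pattern with per-paragraph splitting and answer splicing can be sketched as follows; `call_model` is a stand-in for the real ChatGPT call, and `MAX_CHARS` an assumed per-request threshold:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 40  # assumed per-request size threshold of the model

def call_model(chunk: str) -> str:
    """Stand-in for the real ChatGPT request; echoes an 'answer' per chunk."""
    return f"[answer for {len(chunk)} chars]"

def split_request(text: str) -> list[str]:
    """Split an oversized request text by paragraph; small texts pass through whole."""
    if len(text) <= MAX_CHARS:
        return [text]
    return [p for p in text.split("\n\n") if p]

def answer(text: str) -> str:
    chunks = split_request(text)
    with ThreadPoolExecutor(max_workers=4) as pool:  # pooled, concurrent calls
        parts = list(pool.map(call_model, chunks))   # cached pre-returned answers, in order
    return "".join(parts)                            # uniform splicing

long_text = ("p1 " * 20).strip() + "\n\n" + ("p2 " * 20).strip()
print(answer(long_text))
```

`ThreadPoolExecutor.map` preserves input order, which is what makes the final splice deterministic.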
In some embodiments, building the vertical domain knowledge base includes:
acquiring document data in a local knowledge base in advance, splitting the document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments; establishing a vertical domain knowledge base according to the corresponding relation between the text segment and the text vector; dividing the document data into a plurality of text data blocks according to a preset document dividing granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the document data into a plurality of text fragments according to the text splitting position.
Specifically, if the text data corresponding to the document data is monitored to exceed the preset length, splitting the text data corresponding to the document data to obtain a plurality of text fragments, and generating text vectors, fragment identifications and text keywords corresponding to the text fragments; establishing a vertical domain knowledge base according to the corresponding relation among the text fragments, the text vectors, the fragment identifications and the text keywords; for example, acquiring question information; generating a question vector corresponding to the question information; determining a target segment corresponding to the target problem from the text segments of the vertical domain knowledge base according to the text vector corresponding to the problem vector; and generating a question prompt word corresponding to the text information according to at least one part of the target fragment.
By the method, compared with the method of directly dividing through document segmentation granularity, the method has the advantages that the text segmentation position is determined based on segmentation characters, and the integrity of each text segment is guaranteed, so that the prompt information of a target question (question information) is more accurate, and the accuracy of a system in outputting an answer document is improved;
because the large model has input limit, text keywords are extracted from the text fragments, and question prompt words corresponding to target questions are generated according to the text keywords, compared with the whole text fragments, the data size of the text keywords is smaller, so that the input limit of the large model is met, meanwhile, the pertinence is stronger, and the question-answering efficiency of the large model is improved; the method and the system not only generate the question answers through the current target questions, but also generate the question recommendation information according to the history records associated with the target questions, so that more information is provided for users, and the comprehensiveness of the output of the question-answering large model is improved.
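The granularity-then-separator splitting described above can be sketched as follows: the document is cut into fixed-size blocks, and each cut point is moved back to the nearest separator character so that fragments end on a natural boundary (the block size and separator set are illustrative assumptions):

```python
SEPARATORS = "。.!?\n"  # characters treated as valid segmentation positions

def split_fragments(text: str, block: int = 30) -> list[str]:
    """Cut text into fragments of at most `block` chars, ending at a separator when possible."""
    fragments, start = [], 0
    while start < len(text):
        end = min(start + block, len(text))
        if end < len(text):
            # walk back to the last separator inside the block, if any
            cut = max((text.rfind(s, start, end) for s in SEPARATORS), default=-1)
            if cut > start:
                end = cut + 1  # keep the separator with the fragment
        fragments.append(text[start:end])
        start = end
    return fragments

doc = "The ERP system goes online. Data is migrated first. Staff are trained next."
print(split_fragments(doc))
```

Because no character is dropped, the fragments concatenate back to the original document, which preserves the fragment integrity the paragraph above emphasizes.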
In this embodiment, the document generation method based on the knowledge base and the large model can be applied in daily enterprise, school or company scenarios, and has the following technical effects:
firstly, the invention realizes knowledge base construction, prompt word template creation, chatGPT integration and security policy creation based on AIGC interface visual configuration technology, and the whole process carries out personalized flexible configuration according to the user requirement, thereby reducing the workload of repeated development and hard coding. Secondly, the invention realizes the automatic processing of knowledge base data and the automatic generation of documents according to the user's question based on AIGC flow automation technology, the program standardization of the whole process, the excessive participation of users is not needed, and the efficiency and quality of document generation are improved. Thirdly, based on AIGC Prompt templating technology, the invention realizes the prefabrication of the Prompt template according to scene classification, and dynamically generates a language big model Prompt word statement according to the input of a user and the data of a local knowledge base. Fourth, the invention is based on AIGC security guarantee technology, which realizes security guarantee of input information security check, generated content security filtering, user privacy information encryption and the like. Fifth, the invention realizes high availability and expandability of AIGC service based on micro-service architecture containerization architecture technology.
In the embodiment, through combining the vertical domain knowledge base with the ChatGPT model, a user can quickly create a project document by one-key operation, and required document content is automatically generated. The interface personalized visual configuration, the complete automation of the process and the safety are guaranteed, and a user can obtain required project files almost without excessive adjustment, so that the document writing efficiency, the standardization, the specialization and the document quality are effectively improved.
Referring to FIG. 7, a schematic diagram of a knowledge base and large model based document generation system is shown in accordance with an exemplary embodiment of the invention. As shown in connection with FIG. 7, the exemplary knowledge base and large model based document generation system includes: a configuration module 701, a question determination module 702, a document vector determination module 703, a hint word determination module 704, and a document generation module 705, wherein:
the configuration module 701 is configured to obtain configuration information carrying personalized requirements, configure the knowledge base, the large model, the prompt word template and the security policy according to the configuration information, construct a vertical domain knowledge base, and have the large model produce output based on the vertical domain knowledge base, the prompt word template and the security policy;
The problem determining module 702 is configured to obtain input question information, perform feature extraction on text information corresponding to the question information, and determine a problem feature vector;
the document vector determining module 703 is configured to perform primary verification on the question feature vector according to the security policy, input the question feature vector into the vertical domain knowledge base for matching after the verification passes, and output the document vector if a document vector whose degree of association with the question feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
a prompt word determining module 704, configured to populate the document vector and the problem feature vector with the prompt word template, and generate a prompt word;
the document generation module 705 is configured to input the prompt word into the large model for reasoning, generate an answer document, perform a second verification on the answer document according to the security policy, and output the answer document after the verification is passed.
It should be noted that, the knowledge base and large model-based document generating system provided in the above embodiment and the knowledge base and large model-based document generating method provided in the above embodiment belong to the same concept, and the specific manner in which each step performs the operation has been described in detail in the system embodiment, which is not repeated here.
By adopting the document generation system based on the knowledge base and the large model provided by the embodiment of the disclosure, the knowledge base is pre-configured to construct a vertical domain knowledge base, the large model and the knowledge base are configured individually according to the user's service requirements, and the vertical domain knowledge base is combined with the large model, so that a user can quickly create a project document with a one-click operation and the required document content is generated automatically, greatly improving the efficiency and quality of automatic document generation and meeting user requirements. Meanwhile, through the configuration of the prompt word template and the security policy, the large model produces output based on the vertical domain knowledge base, the prompt word template and the security policy, realizing automatic knowledge base data processing and automatic document generation from the user's question information; the whole process is procedurally standardized, the required project files can be obtained without excessive user participation, and document writing efficiency, standardization, professionalism and document quality are effectively improved.
Referring to FIG. 8, a schematic diagram of a computer system suitable for implementing an embodiment of the present invention is shown. It should be noted that the computer system 800 of the electronic device shown in FIG. 8 is only an example and should not impose any limitation on the functions or the application scope of the embodiments of the present invention.
As shown in FIG. 8, the computer system 800 includes a central processing unit (CPU) 801 that can perform various appropriate actions and processes, such as the methods in the above-described embodiments, according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for system operation. The CPU 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a local area network (LAN) card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808.
In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809 and/or installed from the removable medium 811. When executed by the central processing unit (CPU) 801, the computer program performs the various functions defined in the system of the present invention.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements at least one of the embodiments of the knowledge base and large model-based document generation method described above, such as the embodiment shown in FIG. 1.
If implemented in the form of software functional units and sold or used as stand-alone products, the functions may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described by the computer program of the methods of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed in the present invention may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (11)

1. A knowledge base and large model based document generation method, comprising:
acquiring configuration information carrying personalized requirements, configuring the knowledge base, the large model, a prompt word template and a security policy respectively according to the configuration information, constructing a vertical domain knowledge base, and causing the large model to produce its output based on the vertical domain knowledge base, the prompt word template and the security policy;
acquiring input question information, extracting features of text information corresponding to the question information, and determining a problem feature vector;
performing a primary verification on the problem feature vector according to the security policy, inputting the problem feature vector into the vertical domain knowledge base for matching after the verification is passed, and outputting a document vector if a document vector whose degree of association with the problem feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
and inputting the prompt word into the large model for reasoning to generate an answer document, performing a secondary verification on the answer document according to the security policy, and outputting the answer document after the verification is passed.
2. The knowledge base and large model-based document generation method according to claim 1, wherein obtaining the configuration information carrying personalized requirements, and configuring the knowledge base, the large model, the prompt word template and the security policy respectively according to the configuration information, comprises:
acquiring configuration information carrying personalized requirements, wherein the configuration information comprises first configuration information, second configuration information, third configuration information and fourth configuration information directed respectively to the knowledge base, the security policy, the prompt word template and the large model;
configuring the knowledge base according to the first configuration information to construct a vertical domain knowledge base, wherein the first configuration information comprises at least one of the following: knowledge base classification, name, local file path, external links, text segmentation strategy;
configuring the security policy according to the second configuration information to generate a security policy with problem security check and content security check, wherein the second configuration information comprises at least one of the following: security policy switch, sensitive word filtering, and encryption;
configuring the prompt word template according to the third configuration information to generate a prompt word template containing a question prompt word statement for the question information, wherein the third configuration information comprises at least one of the following: prompt word template classification, name, and the prompt word template itself;
configuring the large model according to the fourth configuration information so as to enable the large model to be combined with the knowledge base, the security policy and the prompt word template, wherein the fourth configuration information comprises at least one of the following: large model classification, name, link information, query rate limit per second, data size.
3. The knowledge base and large model based document generation method of claim 1, wherein constructing the vertical domain knowledge base comprises:
determining a local knowledge base based on the knowledge base classification, the knowledge base name and the local file path, or determining the local knowledge base through an external link;
converting the document data in the local knowledge base into text information, and segmenting the text information according to a text segmentation strategy to obtain text in a preset format, wherein the text segmentation strategy comprises at least one of the following: paragraph, text line, and word;
and converting the text in the preset format into document vectors, and storing the document vectors in a vector database that supports a vector search engine, so as to form the vertical domain knowledge base.
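A minimal sketch of the construction described in claim 3: split a document into paragraph fragments, convert each fragment to a vector, and store the vectors for similarity search. The hash-based embedding and cosine search below are stand-ins for a real text-embedding model and vector search engine, chosen only so the example is self-contained.

```python
import math

# Toy illustration of claim 3's pipeline. The embedding is a hashed
# bag-of-words, not a real model; the store does brute-force cosine search.

def split_paragraphs(text):
    """Segment text by paragraph (one option of the text segmentation strategy)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(text, dim=64):
    """Stand-in embedding: hash each token into a fixed-size count vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Minimal vector database: returns the best match above a threshold."""
    def __init__(self):
        self.entries = []  # (vector, fragment) pairs

    def add(self, fragment):
        self.entries.append((embed(fragment), fragment))

    def search(self, query, threshold=0.1):
        qv = embed(query)
        scored = [(cosine(qv, v), frag) for v, frag in self.entries]
        best = max(scored, default=(0.0, None))
        return best[1] if best[0] >= threshold else None
```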
4. The knowledge base and large model-based document generation method of claim 1, wherein performing verification using the security policy further comprises:
if the problem feature vector corresponding to the question information is received, performing a primary verification on the problem feature vector according to a preset sensitive word library to determine whether the problem feature vector involves sensitive word information; if the problem feature vector does not involve sensitive word information, the verification passes, and if it does, the verification fails;
and if the answer document is received, performing a secondary verification on the answer document to determine whether the answer document involves sensitive word information; if the answer document does not involve sensitive word information, the verification passes, and the large model is used to supplement the completeness of the current answer document until the final answer document is output; if the answer document involves sensitive word information, the verification fails.
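The two-stage check in claim 4 can be illustrated as follows; the sensitive word list, category mapping, and canned replies are invented placeholders, not part of the patent.

```python
# Hypothetical sensitive-word data; a real deployment would load a
# configured sensitive word library (claim 2's second configuration).
SENSITIVE_WORDS = {"password": "credentials", "salary": "personal"}
CANNED_REPLIES = {
    "credentials": "Credential-related questions cannot be answered.",
    "personal": "Personal data questions cannot be answered.",
}

def check_question(question):
    """Primary check: return a canned reply for sensitive questions, else None."""
    for word, category in SENSITIVE_WORDS.items():
        if word in question:
            return CANNED_REPLIES[category]
    return None  # verification passes; proceed to knowledge-base matching

def check_answer(answer):
    """Secondary check: desensitize the generated answer by masking sensitive words."""
    for word in SENSITIVE_WORDS:
        answer = answer.replace(word, "*" * len(word))
    return answer
```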
5. The knowledge base and large model-based document generation method of claim 4, further comprising, before the final answer document is output:
generating a public-private key pair based on a preset asymmetric encryption algorithm, configuring the public key of the pair in the large model, and configuring the private key of the pair in a target terminal;
signing, based on the private key, the answer document to be supplemented, and generating a first encrypted message to upload to the large model;
performing, by the large model, signature verification on and a response to the first encrypted message, generating the final answer document and a second encrypted message, and feeding the second encrypted message back to the target terminal; and extracting, arranging and packaging the second encrypted message to obtain the final answer document, and outputting the final answer document.
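A schematic of the signed exchange in claim 5. Python's standard library has no asymmetric cryptography, so an HMAC stands in here for the asymmetric sign-and-verify pair the claim describes; in a real system the private key would stay at the terminal and the public key at the model service, for example via a library such as `cryptography`. All names are illustrative.

```python
import hashlib
import hmac

# NOTE: symmetric stand-in for the claim's asymmetric key pair.
KEY = b"demo-shared-secret"

def sign(message: bytes) -> bytes:
    return hmac.new(KEY, message, hashlib.sha256).digest()

def terminal_send(document: bytes):
    """Terminal side: sign the document to be supplemented and send both parts."""
    return document, sign(document)

def model_respond(document: bytes, signature: bytes) -> bytes:
    """Model side: verify the signature, then produce the supplemented answer."""
    if not hmac.compare_digest(signature, sign(document)):
        raise ValueError("signature verification failed")
    # In the claim this reply is itself encrypted and signed before return.
    return document + b" [completed by model]"
```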
6. The knowledge base and large model-based document generation method of claim 4, wherein determining whether sensitive word information is involved further comprises:
if the problem feature vector contains a sensitive word, identifying the type of the sensitive word in the problem feature vector, determining a preset sensitive text according to the type of the sensitive word, and responding according to the preset sensitive text; if the problem feature vector does not contain a sensitive word, triggering matching of the problem feature vector in the vertical domain knowledge base;
and if the answer document contains a sensitive word, desensitizing the answer document to filter the sensitive word; if the answer document does not contain a sensitive word, triggering output of the answer document.
7. The knowledge base and large model-based document generation method according to any one of claims 1 to 6, wherein the knowledge base and the large model are deployed using a microservice containerization architecture design, and services are split into a data collector, a management background server and a proxy server according to the single responsibility principle, wherein the large models externally accessed by the proxy server comprise at least one of the following: ChatGPT, Wenxin Yiyan, and Tongyi Qianwen;
the large model creates a thread pool to buffer requests containing question information and processes the requests in an asynchronous queue; when the data text of a request exceeds the preset threshold size of the large model, the text corresponding to the request is split by paragraph so that one question results in multiple request calls, and the pre-returned answer documents are cached and spliced together.
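The buffering and splitting behaviour of claim 7 can be sketched as follows; the size limit, paragraph-splitting rule, and placeholder model call are illustrative assumptions rather than the patented implementation.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 40  # stand-in for the model's preset threshold size

def call_model(chunk: str) -> str:
    """Placeholder for one inference request to the large model."""
    return f"<answer:{len(chunk)} chars>"

def handle_request(text: str) -> str:
    """One question may become multiple calls whose answers are spliced."""
    if len(text) <= MAX_CHARS:
        return call_model(text)
    chunks = [p for p in text.split("\n") if p]        # split by paragraph
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(call_model, chunks))  # buffered, ordered calls
    return "".join(partials)                           # splice cached answers
```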
8. The knowledge base and large model based document generation method of claim 1, wherein constructing the vertical domain knowledge base comprises:
acquiring document data in a local knowledge base in advance, splitting the document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments;
establishing the vertical domain knowledge base according to the correspondence between the text fragments and the text vectors; dividing the document data into a plurality of text data blocks according to a preset document division granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the document data into the plurality of text fragments according to the text segmentation positions.
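Claim 8's two-level splitting (fixed-size blocks whose cut points are snapped back to preset segmentation characters) might look like this; the granularity value and separator set are illustrative choices, not values from the patent.

```python
SEPARATORS = ".!?\n"  # hypothetical preset segmentation characters

def split_document(text: str, granularity: int = 50):
    """Cut text into blocks of at most `granularity` characters, snapping
    each cut back to the last separator inside the block when one exists."""
    fragments, start = [], 0
    while start < len(text):
        end = min(start + granularity, len(text))
        if end < len(text):
            # last separator position inside [start, end), or -1 if none
            cut = max((text.rfind(s, start, end) for s in SEPARATORS), default=-1)
            if cut > start:
                end = cut + 1  # keep the separator with its fragment
        fragments.append(text[start:end].strip())
        start = end
    return [f for f in fragments if f]
```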
9. A knowledge base and large model based document generation system, comprising:
the configuration module is used for acquiring configuration information carrying personalized requirements, configuring the knowledge base, the large model, the prompt word template and the security policy respectively according to the configuration information, constructing a vertical domain knowledge base, and causing the large model to produce its output based on the vertical domain knowledge base, the prompt word template and the security policy;
the problem determining module is used for acquiring input question information, extracting features of text information corresponding to the question information, and determining a problem feature vector;
the document vector determining module is used for performing a primary verification on the problem feature vector according to the security policy, inputting the problem feature vector into the vertical domain knowledge base for matching after the verification is passed, and outputting a document vector if a document vector whose degree of association with the problem feature vector reaches a preset threshold is matched in the vertical domain knowledge base;
the prompt word determining module is used for filling the document vector and the problem feature vector into the prompt word template to generate a prompt word;
and the document generation module is used for inputting the prompt words into the large model for reasoning, generating an answer document, carrying out secondary verification on the answer document according to the security policy, and outputting the answer document after the verification is passed.
10. An electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to cause the electronic device to perform the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any of claims 1 to 8.
CN202311521400.2A 2023-11-13 2023-11-13 Knowledge base and large model-based document generation system, method, equipment and medium Pending CN117556010A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311521400.2A CN117556010A (en) 2023-11-13 2023-11-13 Knowledge base and large model-based document generation system, method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311521400.2A CN117556010A (en) 2023-11-13 2023-11-13 Knowledge base and large model-based document generation system, method, equipment and medium

Publications (1)

Publication Number Publication Date
CN117556010A true CN117556010A (en) 2024-02-13

Family

ID=89814148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311521400.2A Pending CN117556010A (en) 2023-11-13 2023-11-13 Knowledge base and large model-based document generation system, method, equipment and medium

Country Status (1)

Country Link
CN (1) CN117556010A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743315A (en) * 2024-02-20 2024-03-22 浪潮软件科技有限公司 Method for providing high-quality data for multi-mode large model system


Similar Documents

Publication Publication Date Title
US11093707B2 (en) Adversarial training data augmentation data for text classifiers
US11887010B2 (en) Data classification for data lake catalog
US9923860B2 (en) Annotating content with contextually relevant comments
US11189269B2 (en) Adversarial training data augmentation for generating related responses
US10885276B2 (en) Document clearance using blockchain
US11030402B2 (en) Dictionary expansion using neural language models
US11429352B2 (en) Building pre-trained contextual embeddings for programming languages using specialized vocabulary
US11416539B2 (en) Media selection based on content topic and sentiment
US11586816B2 (en) Content tailoring for diverse audiences
US11182545B1 (en) Machine learning on mixed data documents
CN117556010A (en) Knowledge base and large model-based document generation system, method, equipment and medium
US11954138B2 (en) Summary generation guided by pre-defined queries
US20230092274A1 (en) Training example generation to create new intents for chatbots
US11968224B2 (en) Shift-left security risk analysis
US20200110834A1 (en) Dynamic Linguistic Assessment and Measurement
AU2020364386B2 (en) Rare topic detection using hierarchical clustering
US20230418859A1 (en) Unified data classification techniques
WO2022048535A1 (en) Reasoning based natural language interpretation
US11675980B2 (en) Bias identification and correction in text documents
US20220207384A1 (en) Extracting Facts from Unstructured Text
US11361761B2 (en) Pattern-based statement attribution
CN112131378A (en) Method and device for identifying categories of civil problems and electronic equipment
US20230004761A1 (en) Generating change request classification explanations
US11314931B2 (en) Assistant dialog model generation
US11675822B2 (en) Computer generated data analysis and learning to derive multimedia factoids

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination