CN117252251A - Private domain data generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN117252251A
Authority
CN
China
Legal status
Granted
Application number
CN202311546225.2A
Other languages
Chinese (zh)
Other versions
CN117252251B (en)
Inventor
王洋
Current Assignee
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Application filed by New H3C Technologies Co Ltd filed Critical New H3C Technologies Co Ltd
Priority to CN202311546225.2A
Publication of CN117252251A
Application granted
Publication of CN117252251B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/09: Supervised learning
    • G06N7/00: Computing arrangements based on specific mathematical models
    • G06N7/02: Computing arrangements based on specific mathematical models using fuzzy logic
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method, a device, equipment and a storage medium for generating private domain data. The method comprises: acquiring an API data source; combining a plurality of entries according to the private domain API data, and obtaining at least one first QA pair according to the entries and a data structure; performing blurring (obfuscation) processing on the at least one first QA pair to obtain at least one second QA pair and a mapping relation table; performing deepening expansion on the at least one second QA pair to obtain at least one third QA pair; performing sharpening (de-obfuscation) processing on the at least one third QA pair according to the mapping relation table to obtain at least one fourth QA pair; and transmitting the fourth QA pair, as a supervised fine tuning (SFT) data set conforming to the private domain API, to a training model so as to train a large language model with the SFT data set. The method randomly generates a large number of SFT cases covering multiple service ranges, which increases the breadth of the QA pairs, and improves the security of private domain code by blurring the generated first QA pairs.

Description

Private domain data generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a method, an apparatus, a device, and a storage medium for generating private domain data.
Background
OpenAI developed the chat robot chatGPT (covering GPT-4 and GPT-3.5, collectively referred to below as chatGPT). When applied to the private code generation of enterprises, chatGPT encounters two major problems that prevent it from being adopted in practice, embodied in the following two aspects:
On the one hand, chatGPT, as a closed-source model, cannot be deployed locally. An enterprise that wants to use chatGPT must hand user data (e.g., private code/information) over to the operator of chatGPT (e.g., OpenAI or Microsoft). Such data may thus be compromised: for example, it may be used by OpenAI/Microsoft to continue training chatGPT or other models, and an attacker may then obtain private information of the enterprise through the trained model.
On the other hand, chatGPT is trained on public domain data (such as open source code) and lacks knowledge of the enterprise's private domain, so it performs poorly on private-domain-specific tasks. For example, chatGPT cannot generate code correctly because it does not know the private domain API definitions, business logic, and similar information.
In the code generation field, as shown in FIG. 1, a private domain API data source can be continuously evolved, on the basic data set it contains, through chatGPT (in particular GPT-4) to construct a large number of high-quality SFT (Supervised Fine Tuning) data sets. This approach has disadvantages for private domain code that needs to be kept secret, because the private domain code information is at risk of leakage when stored externally and is easy to intercept and misappropriate. In addition, chatGPT lacks private domain knowledge and cannot perform breadth expansion, i.e., it cannot combine and expand multiple private domain modules.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus, device and storage medium for generating private domain data, which are used to improve the security of private domain code and to increase the breadth of private domain data sources.
In a first aspect, the present invention provides a method for generating private domain data, the method comprising:
acquiring an Application Programming Interface (API) data source, wherein the API data source comprises API data of a private domain and a data structure for describing the API data;
combining a plurality of entries according to the private domain API data, and obtaining at least one first QA pair according to the entries and the data structure;
according to a preset rule, carrying out blurring processing on at least one first QA pair to obtain at least one second QA pair and a mapping relation table, wherein the mapping relation table is used for describing the corresponding relation between the first QA pair and the second QA pair before and after blurring processing;
performing deepening expansion on at least one second QA pair to obtain at least one third QA pair;
according to the mapping relation table, performing sharpening processing on the at least one third QA pair to obtain at least one fourth QA pair, wherein the at least one fourth QA pair constitutes a supervised fine tuning (SFT) data set conforming to the private domain API;
transmitting the at least one fourth QA pair as a supervised fine tuning SFT dataset conforming to the private domain API to a training model to enable the training model to train a large language model with the SFT dataset.
With reference to the first aspect, in one possible implementation manner, the API data of the private domain includes: the system comprises first API basic data and second API basic data, wherein the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set comprises one or more API subsets;
combining a plurality of entries according to the private domain API data specifically comprises: combining one or more API subsets selected from the first API set with one or more API subsets selected from the second API set to generate an entry having the functions of all the selected API subsets.
With reference to the first aspect, in another possible implementation manner, generating at least one first QA pair according to the plurality of entries and the data structure specifically includes: filling the combined entries into a preset template to generate a first instruction; and inputting the first instruction into an AI algorithm model and outputting a first QA pair.
With reference to the first aspect, in yet another possible implementation manner, performing blurring processing on the at least one first QA pair according to the preset rule to obtain at least one second QA pair specifically includes: acquiring the code corresponding to each first QA pair; and changing the condition space in which the code of each first QA pair is located, or changing the function style of the code, or changing the function parameters of the code, to obtain the at least one second QA pair.
With reference to the first aspect, in a further possible implementation manner, changing the condition space in which the code of each first QA pair is located includes: transforming, by using a local large language model LLM, the condition space of the code of the first QA pair into a field different from the current field.
Changing the function style of the code includes: changing the function naming style of the code.
Changing the function parameters of the code includes: randomly adding at least one non-essential parameter to the function parameters and/or changing the order of the function parameters.
With reference to the first aspect, in yet another possible implementation manner, the performing, according to the mapping relation table, a sharpening process on at least one third QA pair to obtain at least one fourth QA pair includes: and performing inverse transformation on at least one third QA pair by using the mapping relation table to obtain at least one fourth QA pair, and taking the at least one fourth QA pair as an SFT data set.
In a second aspect, the present invention provides a device for generating private domain data, the device comprising:
an acquisition module, used for acquiring an Application Programming Interface (API) data source, wherein the API data source comprises API data of a private domain and a data structure for describing the API data;
the mixing module is used for combining a plurality of entries according to the private domain API data, and obtaining at least one first QA pair according to the entries and the data structure, wherein each first QA pair comprises a question and an answer;
the blurring module is used for carrying out blurring processing on the at least one first QA pair according to a preset rule to obtain at least one second QA pair and a mapping relation table, wherein the mapping relation table is used for describing the relation between the first QA pair and the second QA pair before and after the blurring processing;
the expansion module is used for deeply expanding the at least one second QA pair and generating at least one third QA pair;
the sharpening module is used for performing sharpening processing on the at least one third QA pair according to the mapping relation table to obtain at least one fourth QA pair, wherein the at least one fourth QA pair constitutes a supervised fine tuning SFT data set conforming to the private domain API;
and the output module is used for transmitting the at least one fourth QA pair, as the supervised fine tuning SFT data set conforming to the private domain API, to a training model so that the training model trains a large language model with the SFT data set.
With reference to the second aspect, in one possible implementation manner, the API data of the private domain includes: the system comprises first API basic data and second API basic data, wherein the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set comprises one or more API subsets;
The mixing module is specifically configured to combine one or more API subsets selected from the first API set with one or more API subsets selected from the second API set to generate an entry, where the entry has the functions of all the selected API subsets.
With reference to the second aspect, in another possible implementation manner, the mixing module is further configured to populate a preset template with the entries generated by the combination to generate a first instruction, input the first instruction into the AI algorithm model, and output a first QA pair.
With reference to the second aspect, in another possible implementation manner, the blurring module is specifically configured to obtain a code corresponding to each first QA pair; at least one second QA pair is generated by changing the conditional space in which the codes of each first QA pair are located, or by changing the functional style of the codes, or by changing the functional parameters of the codes.
With reference to the second aspect, in a further possible implementation manner, the blurring module is specifically configured to transform, by using a local large language model LLM, a conditional space where a code of the first QA pair is located into a field different from the current field; changing the function naming style of the code; and randomly adding at least one non-essential parameter to the function parameters and/or changing the order between the function parameters.
With reference to the second aspect, in a further possible implementation manner, the sharpening module is specifically configured to reverse-transform at least one third QA pair by using a mapping relationship to obtain at least one fourth QA pair, and use the at least one fourth QA pair as the SFT data set.
In a third aspect, the present invention provides a computer device, comprising: a processor and a memory storing computer instructions, the processor executing the computer instructions so as to perform the method for generating private domain data of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the method for generating private domain data of the first aspect or any of the embodiments corresponding thereto.
According to the method, device and equipment provided by the invention, the private domain API data are combined to generate a plurality of entries, which increases the breadth of the chatGPT private domain data source, expands the basic data of the private domain in breadth, and yields a high-quality private domain data set. The method improves the security of private domain code by performing blurring processing on the at least one generated QA pair, increasing the complexity and diversity of each QA pair and preventing the private domain API data from being stolen and its information leaked during the evolution process; in the construction stage, the blurred QA pairs are sharpened to generate an SFT data set conforming to the private domain API, providing a basis for subsequent model training.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of private domain data training in the prior art;
FIG. 2 is a flow chart of a method for generating private domain data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a method for generating private domain data according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a device for generating private domain data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme provided by the embodiment is applied to the field of computers, and particularly is used for software development based on the chatGPT function.
There are some open-source, locally deployable LLMs (Large Language Models) that serve as alternatives to chatGPT, for example Meta's LLaMa family. When applied in the private domain, they are primarily characterized by the following two points:
1. The capability of the open source models falls far short of chatGPT.
2. The open source models are trained on public domain data and likewise lack private domain knowledge. If an enterprise wants to add private domain knowledge to an open source large language model LLM (e.g., LLaMa or ChatGLM), a large number of high-quality private-domain-related supervised fine-tuning SFT data sets are needed. Typically, the enterprise does not have these SFT data sets. Alternatively, enterprises often inject knowledge by self-supervision; but in terms of effect, this approach does not work well for smaller amounts of lower-quality data when the private domain data follows a long-tail distribution (as code data typically does).
An LLM refers to a language model of huge scale; it is an artificial intelligence model based on deep learning, aimed at processing and generating natural language text. Training of an LLM is typically based on a large text corpus, learning the probability distribution and semantic relationships of the language by predicting the next word or character.
In this embodiment, the large language model LLM may be a generic name for models such as LLaMa, ChatGLM, and the like.
For private domain tasks with a long-tail distribution, an LLM deployed in the private domain and obtained through the "open source model + fine tuning" approach performs poorly. Taking code generation as an example, suppose there is an API call library of a niche module X for obtaining data, and code using this API accounts for a very low proportion of the overall code; the "open source model + fine tuning" approach then gives the model a very poor mastery of module X's API. However, for the developers of that module, the API of module X is what they mainly encounter in daily development, and the LLM's proficiency with this API directly affects their development efficiency. That is, LLMs are often inadequate for such minority groups (modules with a small amount of code), and their capability in this regard needs to be enhanced.
The most important way to improve the capability of artificial intelligence represented by machine learning (including the LLMs mentioned above) is to improve the data, which has two aspects: the quality of the data and the amount of the data. As Microsoft's Orca study shows, adding Chain of Thought (CoT) content helps the reasoning ability of the trained model, which places very high demands on the quality of the data set. Likewise, the diversity of the data set is critical to the model's ability to generalize. Only with large amounts of high-quality data can the model learn the correct patterns. For the minority cases mentioned above, it is precisely because the quality and quantity of the related data are low that the model cannot learn their characteristics and apply them correctly.
The chain of thought (CoT) refers to a Large Language Model (LLM) decomposing a question into multiple steps and analyzing them step by step so as to arrive at the correct answer. It should be noted that for complex questions, the probability that an LLM directly gives a wrong answer is relatively high. The chain of thought may be seen as a kind of instruction fine-tuning (Instruction FineTuning), in which additional instructions or annotated data sets are given to an existing pre-trained model to enhance its performance.
The traditional manual method of constructing a high-quality SFT data set requires a large investment of human resources, consumes huge resources, and has low output efficiency. For this reason, a method of constructing high-quality data sets using chatGPT needs to be proposed.
For the code generation field, the current state-of-the-art method is Evo-instruct. Evo-instruct evolves continuously through GPT-4 on an underlying data set, building a large number of high-quality SFT data sets. The core method is: for a QA pair (Question-Answer) in the original data set, GPT-4 is made to increase the difficulty of Q to generate a new question Q1; GPT-4 then answers Q1 to generate a corresponding answer A1, forming a new QA pair, such as the Q1-A1 pair formed by question Q1 and answer A1. Evolution then continues on the basis of Q1 and A1. The SFT data set is internally composed of a series of such QA pairs. For the code field, Evo-instruct increases the difficulty in the following ways:
First, increasing the time complexity and/or the space complexity of the original problem.
Second, adding reasoning steps to the original problem.
Third, adding irrelevant interference conditions to the original problem.
The Evo-instruct method can construct a large number of high-quality QA pairs, which serve as the raw material of supervised fine tuning SFT to build a high-quality SFT data set. Evo-instruct relies on the strong capability of chatGPT (in particular GPT-4) to achieve this evolution, but that is not suitable for private code that needs to be kept secret, because the private code knowledge is at risk of leakage; moreover, chatGPT lacks private domain knowledge and cannot perform breadth expansion.
In addition, the core principle of chatGPT and similar LLMs is a probability-based autoregressive model. Autoregressive means that chatGPT combines the previous input and output into a new round of input, predicts the next token on that basis, and outputs the token with the highest probability; this process continues until a stop condition is reached. Being probability-based, the model considers the previous tokens when outputting the next token, i.e., it calculates the conditional probability P(token_i | token_1, ..., token_{i-1}).
Specifically, given the first i-1 tokens, the model predicts the probability of every token in the vocabulary in this context and outputs the token with the highest probability. If user data handed to OpenAI is used for training even though the user does not want it disclosed, the model may memorize this private data. Under certain circumstances, such as under attack, the model may then reveal sensitive information such as the user's code, APIs, and the like.
For example, in the attack method mentioned above, the model's memory may be elicited through context, for instance by prompting the model with "the credit card account number and password are, respectively, ...". If the model's training data contains similar fragments and the model has memorized them, the model may output that training data, revealing the sensitive information.
In order to use chatGPT to safely and efficiently generate SFT training sets for the private domain, this embodiment proposes a Mix-Obfuscate (hybrid-obfuscation) Evo-instruct construction method, which overcomes the two drawbacks above, respectively: 1. with the traditional Evo-instruct, chatGPT cannot perform breadth expansion of a private domain; 2. the security problem.
The following describes in detail the technical solutions provided in the embodiments of the present application.
According to an embodiment of the present invention, there is provided an embodiment of a method for generating high-quality private domain data, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different from that herein.
Fig. 2 is a flow chart of a method for generating private domain data according to the present embodiment, which is applicable to a computer device, and specifically includes:
step S101: an Application Programming Interface (API) data source is obtained, wherein the API data source comprises API data of a private domain and a data structure for describing the API data.
The private domain API data is basic data; for example, it includes interface management related data, DTQ (doubly linked list) related data, hash data, and the like, which may correspond to different API sets and related data structures, respectively. The data structure for describing the API data refers to information such as the API interface index, the interface description, and the interface type, combined into a structure body; the content of this structure body can be understood as "the data structure of the API data".
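As a minimal sketch (the field names and types are illustrative assumptions, not the patent's actual definition), such a structure body might look like:
typedef struct tagApiInfo {
    ULONG ulApiIndex;            /* API interface index */
    const char *pcIfDescription; /* interface description */
    ULONG ulIfType;              /* interface type */
} ApiInfo_S;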
The private domain API data may be received and retrieved from an API domain module (Module API Domain). Specifically, the API domain module is categorized by module and function, by analogy with a general programming manual: Module API Domain is categorized the way a programming manual is organized into modules and chapters. The interface management API data and the associated data structures are obtained, for example, from the set of interface traversal APIs in an interface management programming manual.
Step S102: and combining a plurality of entries according to the API data of the private domain, and obtaining at least one first QA pair according to the plurality of entries and the data structure.
Wherein each first QA pair comprises a question and an answer, and the questions and the answers are in one-to-one correspondence.
Specifically, step S102 includes: combining and generating a plurality of entries according to the basic data in the private domain API data; and then deriving at least one first QA pair from the plurality of entries and the data structure. In one embodiment, the functions of multiple modules (typically 2-3) are randomly extracted from the private domain API data and then passed to the local LLM for combination design as an initial QA pair. For example, the interface traversal API function randomly extracted from interface management and a DLL (doubly linked list) function are taken as a combination, and the relevant API descriptions are sent to the local AI algorithm model, which then outputs one or more first QA pairs.
For example, one entry of the private domain API data is:
"As a C language test setter, please use the following APIs to set a problem, similar in form to HumanEval, requiring the examinee to complete a function. The related API information is as follows:
// API information for the DLL and interface traversal
Please give the question and the corresponding answer."
Based on the above information, in step S102 the LLM deployed in the private domain will feed back an example as follows.
One Q-A pair is:
Problem Q: traverse the interface information, and store and return the indexes of the interfaces in a DLL:
DLL_HEAD getAllInterfacesInfo()
Answer A: the corresponding implementation code.
the step utilizes a local AI algorithm model, and can obtain an initial QA pair mixed by a plurality of modules, namely at least one QA pair, under the condition of no disclosure.
Step S103: and according to a preset rule, blurring the at least one first QA pair to obtain at least one second QA pair and a mapping relation table.
The mapping relation table is used for describing the corresponding relation between the first QA pair and the second QA pair before and after the blurring processing. Blurring processing means that the first QA pair is hidden, encrypted or obfuscated, so that the output second QA pair cannot be visually recognized as a transformation of the first QA pair; this improves the security of the first QA pair, much as encrypting plaintext produces ciphertext. The mapping table can be understood as the key for restoring the second QA pair back into the first QA pair.
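As a sketch of what one entry of such a mapping relation table could look like (the structure and the example strings are assumptions for illustration):
typedef struct tagBlurMapEntry {
    const char *pcOriginal; /* form before blurring, e.g. "interface" */
    const char *pcBlurred;  /* form after blurring, e.g. "flight" */
} BlurMapEntry_S;
/* a 1:1 table later used as the "key" for the sharpening (de-obfuscation) step */
static const BlurMapEntry_S g_astBlurMap[] = {
    { "interface",       "flight"             },
    { "IF_IteratorInit", "FLIGHT_ViewerStart" },
};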
In this step, because the LLM predicts by conditional probability, the code information needs to be changed in order to avoid revealing private code information and to prevent others from acquiring it; this change is called blurring (in English, "Obfuscate") and is also called camouflage.
Specifically, in one implementation, the code corresponding to each first QA pair is acquired, and at least one second QA pair is generated by changing at least one of: the condition space in which the code of each first QA pair is located, the function style of the code, or the function parameters of the code.
Wherein changing the condition space in which the code of each first QA pair is located includes: transforming, by using the local large language model LLM, the condition space of the code of the first QA pair into a field different from the current one, referred to simply as "metaphor rewriting". Specifically, the business space is changed by metaphor: from the at least one first QA pair and the plurality of entries (denoted "a1"), at least one second QA pair and a corresponding plurality of entries (denoted "a2") are generated, and at the same time a mapping relation between each second QA pair and its first QA pair is established.
Blurring processing 1: metaphor rewriting
Taking the above interface traversal as an example, the local LLM is used to transform the interface traversal into a representation form of other fields, for example:
Prompt: If the router/switch is likened to an airport, what may the interface be likened to? Please answer in the following form: "router <-> airport".
After step S103, the example returned by the local LLM after the blurring processing is as follows:
interface <-> terminal building
Here "interface <-> terminal building" is a mapping relation between the forms before and after the blurring processing.
In this embodiment, the traversal task over the interfaces of the original network device is changed into a traversal task over airport terminals, disguising the original interface-related APIs as "air-transport-related APIs". For example, IF_IsValidIfIndex(), which determines whether an interface index is valid, is changed to TERMINAL_IsValidTerminalIndex(). This improves the security of the private domain API data.
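As a sketch (both signatures are assumed for illustration), the disguise amounts to renaming the API while preserving its semantics:
BOOL_T IF_IsValidIfIndex(IF_INDEX ifIndex);                         /* original private domain API */
BOOL_T TERMINAL_IsValidTerminalIndex(TERMINAL_INDEX terminalIndex); /* blurred (disguised) form */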
Optionally, other camouflage relationships may be established, such as:
"interface management (IF) <-> flight management (FLIGHT)"
"device <-> airport"
"interface <-> flight"
"iterator <-> flight viewer"
"initialization <-> start"
"traversal <-> view"
"destroy <-> close"
wherein "<->" connects the information before and after the blurring processing. Before the blurring processing, the acquired entry and the first QA pair are:
original [ interface management (IF) API ]
ULONG IF_IteratorInit(VOID);
The API is used to initialize and return an iterator for traversing the interfaces on the device.
IF_INDEX IF_GetNextIfIndexByIter(ULONG ulIter);
The API traverses the return interface through the ulIter iterator.
VOID IF_IteratorFini(ULONG ulIter);
This API is used to destroy the iterator returned by IF_IteratorInit().
Rewritten [ FLIGHT management (FLIGHT) API ]
ULONG FLIGHT_ViewerStart(VOID);
The API is used to launch and return a flight viewer for viewing flights at the airport.
FLIGHT_INDEX FLIGHT_GetNextFlightIndexByViewer(ULONG ulViewer);
The API looks at the return flight through the ulViewer flight viewer.
VOID FLIGHT_ViewerStop(ULONG ulViewer);
This API is used to close the flight viewer returned by FLIGHT_ViewerStart().
Please rewrite the following question and answer according to the camouflage relationship, so that they fit the rewritten API.
The problem Q1 before blurring is:
"You need to complete a function void StoreInterfaceInfo(), whose function is to obtain all interface information on the device and store it in a DTQ doubly linked list. You need to add a DTQ_NODE_S field in the interface information structure to store the interface information."
Correspondingly, the interface information structure is defined as:
typedef struct tagInterfaceInfo {
IF_INDEX ifIndex;
DTQ_NODE_S stDtqNode;
} InterfaceInfo_S;
Answer A1 before blurring:
void StoreInterfaceInfo() {
    DTQ_HEAD_S stList;
    ULONG ulIter = IF_IteratorInit();
    IF_INDEX ifIndex;
    ......
    IF_IteratorFini(ulIter);
    /* print interface information */
    DTQ_NODE_S *pstNode;
    InterfaceInfo_S *pstInterfaceInfo;
    DTQ_FOREACH(&stList, pstNode) {
        pstInterfaceInfo = DTQ_ENTRY(pstNode, InterfaceInfo_S, stDtqNode);
        printf("Interface index: %d\n", pstInterfaceInfo->ifIndex);
    }
}
After step S103, the blurred result returned by the local LLM is as follows.
The blurred problem, i.e. Q2 in the second QA pair:
You need to complete a function void StoreFlightInfo(), whose function is to obtain all flight information on the airport and store it in a DTQ doubly linked list according to the size order of the flight index (FlightIndex). You need to add a DTQ_NODE_S field in the flight information structure to store the flight information.
The flight information structure is defined as:
typedef struct tagFlightInfo {
FLIGHT_INDEX FlightIndex;
DTQ_NODE_S stDtqNode;
} FlightInfo_S;
The blurred answer, i.e. answer A2:
void StoreFlightInfo() {
    DTQ_HEAD_S stList;
    ULONG ulViewer = FLIGHT_ViewerStart();
    FLIGHT_INDEX FlightIndex;
    while ((FlightIndex = FLIGHT_GetNextFlightIndexByViewer(ulViewer)) != 0) {
        FlightInfo_S *pstFlightInfo = (FlightInfo_S *)malloc(sizeof(FlightInfo_S));
        pstFlightInfo->FlightIndex = FlightIndex;
        if (DTQ_IsEmpty(&stList)) {
            DTQ_AddTail(&stList, &(pstFlightInfo->stDtqNode));
        } else {
            ......pstCurrentFlightInfo = DTQ_ENTRY(pstCurrentNode, FlightInfo_S, stDtqNode);
        }
    }
    FLIGHT_ViewerStop(ulViewer);
    /* print flight information */
    DTQ_NODE_S *pstNode;
    FlightInfo_S *pstFlightInfo;
    DTQ_FOREACH(&stList, pstNode) {
        pstFlightInfo = DTQ_ENTRY(pstNode, FlightInfo_S, stDtqNode);
        printf("Flight index: %d\n", pstFlightInfo->FlightIndex);
    }
}
The above example is one example of obtaining the second QA pair after blurring the first QA pair.
Blurring processing 2: function style rewriting
Changing the function style of the code includes changing the function naming style of the code, referred to simply as "function style rewriting". In this embodiment, the naming style of functions is replaced, for example switching between the Hungarian naming convention and the underscore naming convention, and the mapping relation before and after the replacement is established, so as to obtain a mapping relation table containing this mapping relation.
For example, after the at least one second QA pair and the corresponding plurality of entries ("a2") are subjected to function style rewriting, at least one new QA pair and a corresponding plurality of entries (denoted "a3") are generated, and a new mapping relation is obtained.
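A sketch of such a style rewrite (the underscore-style counterpart is an assumed illustration, not an API defined in this document):
FLIGHT_INDEX FLIGHT_GetNextFlightIndexByViewer(ULONG ulViewer);             /* Hungarian naming convention */
flight_index flight_get_next_flight_index_by_viewer(unsigned long viewer); /* underscore naming convention */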
Blurring processing 3: function parameter rewriting
Changing the function parameters of the code includes randomly adding at least one non-essential parameter to the function parameters and/or changing the order of the function parameters, referred to simply as "function parameter rewriting". For example, for the original API data, zero or more useless parameters are randomly added through the LLM model, or the order of the parameters is changed, so that the at least one second QA pair and the corresponding plurality of entries are not easy to identify and decipher.
For example, after the QA pairs and the plurality of entries (a3) that have undergone function style rewriting are subjected to function parameter rewriting, new QA pairs and a new plurality of entries (denoted "a4") are generated, a new mapping relation is produced, and the mapping relation table is updated.
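A sketch of such a parameter rewrite (the added ulReserved parameter is an assumed non-essential parameter):
FLIGHT_INDEX FLIGHT_GetNextFlightIndexByViewer(ULONG ulViewer);                    /* original */
FLIGHT_INDEX FLIGHT_GetNextFlightIndexByViewer(ULONG ulReserved, ULONG ulViewer); /* a non-essential parameter added first, also changing the parameter order */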
Optionally, the updated mapping relation table is a 1:1 mapping, for example containing the mapping relation between one Q and one A, so that it is convenient to restore and use in the subsequent defuzzification (i.e., sharpening) process.
After at least one of the above three kinds of blurring-based rewriting is applied, the appearance of the API information can be greatly changed, so that others cannot acquire the private domain knowledge through the model, which improves the security of the private domain data.
Step S104: and carrying out deepening expansion on the at least one second QA pair to obtain at least one third QA pair.
Specifically, in one implementation, the at least one second QA pair is deeply expanded by the Evo-instruct method, so that the third QA pair has a more complex and diversified structure than the second QA pair, yielding high-quality API data information.
The Evo-instruct method is derived from the code-oriented Evo-instruct literature proposed by WizardCoder; it uses a large model to generate instructions and can produce relatively complex and diverse instruction data sets. The second QA pair generated by blurring is further evolved to generate a third QA pair with a complex and diversified structure. "Complex and diversified" means that the difficulty (including the direction of increase and the conditions) is raised on the basis of the original QA pair, increasing the processing difficulty, the processing requirements, and the like.
For example, in step S103, a second QA pair is output, wherein a problem Q2 is:
"you need to complete a function void StoreFlightInfo (), which functions to obtain all flight information on the airport and store it in a DTQ doubly linked list according to the size order of the flight index (FlightIndex). You need to add a dtq_node_s field in the flight information structure to store the flight information. "
After the problem Q2 is processed by the Evo-instruct method of step S104, a more complex and diversified problem Q3 is output, where the problem Q3 is:
"you need to complete a function void StoreFlightInfo (), which functions to obtain all flight information on the airport and store it in a DTQ doubly linked list according to the size order of the flight index (FlightIndex). Furthermore, you need to store flights with even-numbered flights in another DTQ doubly linked list. You need to add a dtq_node_s field in the flight information structure to store the flight information. "
Comparing the problems Q2 and Q3, after the construction by Evo-instruct method, the condition that "you need to store the flights with even number of flight indexes in another DTQ bidirectional chain table" is added more than the problem Q3, so the complexity of the problem Q2 is increased. More specifically, the code form may be expressed in various ways, and the present embodiment does not limit the concrete expression form of the code layer.
In this step, the problems Q2 and A2 in the second QA pair are deepened and expanded to generate a new third QA pair, and the mapping relationship between Q2 and A2 in the blurring process is not changed, so that the mapping relationship table can still be used in the next process of the blurring process.
Step S105: and according to the mapping relation table, performing the sharpening process on the at least one third QA pair to obtain at least one fourth QA pair, wherein the fourth QA pair is a supervised fine tuning SFT data set conforming to a private domain API.
The sharpening processing corresponds to the blurring processing in step S103 and is its defuzzification (in English: De-Obfuscate).
Specifically, using the mapping relation table of step S103, which contains the at least one QA pair before and after blurring, the at least one third QA pair of step S104 is inversely transformed to obtain code or data, evolved by Evo-instruct, that conforms to the private domain API; for example, a Q4-A4 pair is obtained, comprising question Q4 and answer A4.
For example, before the sharpening process, the input information is as follows:
the following is an original API description of a module, and the corresponding domain-rewritten API.
The field is rewritten as: an API in a particular domain is disguised (metaphed) as an API in other domains, with similarities between the two domains. The purpose is to make it unknown to others that this is an API in this particular field. For the general concept, no overwriting may be used.
For example, the network device may masquerade (metaphor) as an airport and the corresponding data packet may masquerade (metaphor) as a flight. The ID of the data packet can be directly compared with the flight ID, and the general concept of the ID is not needed to be rewritten and only the modifier of 'data packet' is needed to be modified.
Camouflage relationship:
"interface management (IF) <-> flight management (FLIGHT)"
"device <-> airport"
"interface <-> flight"
"iterator <-> flight viewer"
"initialization <-> start"
"traversal <-> view"
"destroy <-> close"
Original [ interface management (IF) API ]
ULONG IF_IteratorInit(VOID);
The API is used to initialize and return an iterator for traversing the interfaces on the device.
IF_INDEX IF_GetNextIfIndexByIter(ULONG ulIter);
The API traverses the return interface through the ulIter iterator.
VOID IF_IteratorFini(ULONG ulIter);
This API is used to destroy the iterator returned by IF_IteratorInit().
Rewritten [ FLIGHT management (FLIGHT) API ]
ULONG FLIGHT_ViewerStart(VOID);
The API is used to launch and return a flight viewer for viewing flights at the airport.
FLIGHT_INDEX FLIGHT_GetNextFlightIndexByViewer(ULONG ulViewer);
The API looks at the return flight through the ulViewer flight viewer.
VOID FLIGHT_ViewerStop(ULONG ulViewer);
This API is used to close the flight viewer returned by FLIGHT_ViewerStart().
Please rewrite the following question and answer according to this camouflage relationship, making them fit the original API.
[Problem Q3]
You need to complete a function void StoreFlightInfo(), whose function is to obtain all flight information on the airport and store it in a DTQ doubly linked list according to the size order of the flight index (FlightIndex). Furthermore, you need to store the flights whose flight indexes are even in another DTQ doubly linked list. You need to add a DTQ_NODE_S field in the flight information structure to store the flight information.
The flight information structure is defined as:
typedef struct tagFlightInfo {
FLIGHT_INDEX FlightIndex;
DTQ_NODE_S stDtqNode;
DTQ_NODE_S stOddDtqNode;
} FlightInfo_S;
[ answer A3]:
void StoreFlightInfo() {
    DTQ_HEAD_S stList;
    DTQ_HEAD_S stEvenList;
    ULONG ulViewer = FLIGHT_ViewerStart();
    FLIGHT_INDEX FlightIndex;
    while ((FlightIndex = FLIGHT_GetNextFlightIndexByViewer(ulViewer)) != 0) {
        FlightInfo_S *pstFlightInfo = (FlightInfo_S *)malloc(sizeof(FlightInfo_S));
        pstFlightInfo->FlightIndex = FlightIndex;
        ......
    }
    FLIGHT_ViewerStop(ulViewer);
    /* print all flight information */
    DTQ_NODE_S *pstNode;
    FlightInfo_S *pstFlightInfo;
    DTQ_FOREACH(&stList, pstNode) {
        pstFlightInfo = DTQ_ENTRY(pstNode, FlightInfo_S, stDtqNode);
        printf("Flight index: %d\n", pstFlightInfo->FlightIndex);
    }
    /* print even-index flight information */
    DTQ_FOREACH(&stEvenList, pstNode) {
        pstFlightInfo = DTQ_ENTRY(pstNode, FlightInfo_S, stOddDtqNode);
        printf("Even index Flight: %d\n", pstFlightInfo->FlightIndex);
    }
}
After the process of step S105, a fourth QA pair, i.e., Q4-A4, is output, for example:
[ problem Q4]
You need to complete a function void StoreInterfaceInfo(), whose function is to obtain all interface information on the device and store it in a DTQ doubly linked list according to the size order of the interface index (IfIndex). Furthermore, you need to store the interfaces whose indexes are even in another DTQ doubly linked list. You need to add a DTQ_NODE_S field in the interface information structure to store the interface information.
The interface information structure is defined as:
typedef struct tagInterfaceInfo {
IF_INDEX IfIndex;
DTQ_NODE_S stDtqNode;
DTQ_NODE_S stOddDtqNode;
} InterfaceInfo_S;
[ answer A4]:
void StoreInterfaceInfo() {
    DTQ_HEAD_S stList;
    DTQ_HEAD_S stEvenList;
    ULONG ulIter = IF_IteratorInit();
    IF_INDEX IfIndex;
    while ((IfIndex = IF_GetNextIfIndexByIter(ulIter)) != 0) {
        InterfaceInfo_S *pstInterfaceInfo = (InterfaceInfo_S *)malloc(sizeof(InterfaceInfo_S));
        pstInterfaceInfo->IfIndex = IfIndex;
        ......
    }
    IF_IteratorFini(ulIter);
    /* print all interface information */
    DTQ_NODE_S *pstNode;
    InterfaceInfo_S *pstInterfaceInfo;
    DTQ_FOREACH(&stList, pstNode) {
        pstInterfaceInfo = DTQ_ENTRY(pstNode, InterfaceInfo_S, stDtqNode);
        printf("Interface index: %d\n", pstInterfaceInfo->IfIndex);
    }
    /* print even-index interface information */
    DTQ_FOREACH(&stEvenList, pstNode) {
        pstInterfaceInfo = DTQ_ENTRY(pstNode, InterfaceInfo_S, stOddDtqNode);
        printf("Even index interface: %d\n", pstInterfaceInfo->IfIndex);
    }
}
In this example, the third QA pair produced by the blurring processing is defuzzified to obtain a fourth QA pair matching the original first QA pair. For example, Q3 in the third QA pair reads: "store the flight information in a DTQ doubly linked list according to the size order of the flight index (FlightIndex); in addition, store the flights with even indexes in another DTQ doubly linked list". According to the mapping relation "interface management (IF) <-> flight management (FLIGHT)", an inverse transformation restores the flight information to interface information, yielding problem Q4.
Part of problem Q4 reads: "the interface information is stored in a DTQ doubly linked list according to the size order of the interface index (IfIndex); furthermore, you need to store the interfaces with even indexes in another DTQ doubly linked list."
Step S106: at least one fourth QA pair is transmitted as a supervised fine tuning SFT dataset conforming to a private domain API to a training model to enable the training model to train a large language model with the SFT dataset.
The fourth QA pairs, deeply expanded by Evo-instruct, are transmitted to the training model so that the training model can use the evolved data for training. Specifically, the at least one fourth QA pair, as high-quality SFT raw material, constitutes the SFT data set input into the training model; after training, a private LLM with GPT-4-like capability is obtained.
According to the method provided by this embodiment, the private domain API data are combined to generate a plurality of entries, and a large number of SFT cases covering multiple service ranges are randomly generated, thereby increasing the breadth of the chatGPT private domain data source and expanding the basic data of the private domain in breadth. This solves the problem that chatGPT's private domain knowledge is insufficient and cannot be expanded in breadth.
In addition, by performing blurring processing on the at least one QA pair, the complexity and diversity of each QA pair are increased, the private domain API data is prevented from being stolen and its information from being leaked during the evolution process, and the private domain data can be safely sent to a high-performance public domain model for expansion, so that a high-quality private domain SFT data set is obtained, providing a basis for subsequent model training.
Optionally, in this embodiment, it is assumed that the private domain API data includes first API basic data and second API basic data, where the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set includes one or more API subsets. For example, the first API set includes API subset 1, API subset 2, and API subset 3, i.e., 3 subsets in total. Similarly, the second API set includes API subset 4 through API subset 8, i.e., 5 subsets. Each API subset implements one function, so the first API set has 3 functions and the second API set has 5 functions.
In step S102, combining a plurality of entries according to the private domain API data specifically includes: combining one or more API subsets selected from the first API set with one or more API subsets selected from the second API set to generate an entry, the entry having the functions of all the selected API subsets.
The selection may be random; for example, API subset 1 is selected from the first API set and combined with API subset 4 selected from the second API set to generate a first entry containing the two functions of API subset 1 and API subset 4. Compared with the data before combination, the entry thus adds API functions and expands the API data.
Optionally, other API subsets may be selected for combination, such as API subset 3 and API subset 4 to generate a second entry, or API subset 1, API subset 3, and API subset 5 to generate a third entry; other further combinations are also possible, and this embodiment is not limited in this respect.
In addition, generating at least one first QA pair according to the plurality of entries generated by combination (such as the first to third entries) and the data structure describing the API data specifically includes: filling the combined entries into a preset template to generate a first instruction; and inputting the first instruction into an AI algorithm model and outputting the first QA pair.
The preset template is used as follows:
First, a basic template sample is obtained and its placeholder information is filled in with the first entry to generate a first instruction; the first instruction is input into the local AI algorithm model, and a first QA pair is output. Similarly, the second entry is filled into the preset template to generate a second instruction, which is input into the local AI algorithm model to output another first QA pair.
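A minimal sketch of such a preset template (the placeholder layout is an assumption; it mirrors the example prompt shown in step S102):
/* %s is the placeholder filled with the API descriptions of one combined entry */
static const char *g_pcQaTemplate =
    "As a C language test setter, please use the following APIs to set a problem, "
    "similar in form to HumanEval, requiring the examinee to complete a function. "
    "The related API information is as follows:\n%s\n"
    "Please give the question and the corresponding answer.";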
According to the embodiment, the API basic data of the private domain is divided into different API subsets, and the different API subsets are combined to generate a plurality of entries, so that the breadth of the chatGPT private domain data source is increased, and the basic data of the private domain is expanded in breadth.
Alternatively, the method provided in the foregoing embodiment may be implemented by a plurality of modules, for example, as shown in fig. 3, and may be implemented by the following modules:
the obtaining module is configured to receive an API data source from the outside, and send the API data source to the mixing module, corresponding to step S101.
The mixing module (Mix module) is responsible for the extension of the breadth of the private domain data, and can execute the step S102, so as to solve the problem that the traditional Evo-instruct cannot extend the breadth of the private domain data.
And the obfuscation module (obfuscation module) is configured to disguise the private domain knowledge by using the LLM of the private deployment, and generate at least one second QA pair, specifically referred to in the foregoing step S103.
The expansion module, which may be also referred to as an Evo-instruct module, is configured to perform depth expansion on the code data based on the second QA pair output by the blurring module, and output a third QA pair, so that the third QA pair has a more complex and diversified structure compared with the second QA pair, and the specific process refers to step S104.
And the sharpening module (De-obfuscate module) is used for performing defuzzification processing on the third QA pair output by the expansion module, restoring it to a form related to the private domain API data and obtaining a fourth QA pair, corresponding to the aforementioned step S105.
And the output module is used for outputting the fourth QA pair as the SFT data set conforming to the private domain API to the training model, and providing training data for the next training model module, which corresponds to the step S106.
In addition, in the step of providing the API data source, the API data of the private domain may also be generated and provided to the mixing module by the API domain module (Module API Domain) before step S101.
Specifically, the functions of the above modules refer to steps S101 to S106 of the foregoing method embodiment, which is not described in detail in this embodiment.
The embodiment also provides a device for generating private domain data, which is used for implementing the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a device for generating private domain data, as shown in fig. 4, where the device includes: an acquisition module 410, a blending module 420, a blurring module 430, an expansion module 440, a sharpening module 450, and an output module 460. In addition, other more or fewer modules, such as a memory module, may be included in the apparatus, which is not limited in this embodiment.
Wherein, the obtaining module 410 is configured to obtain an API data source.
The mixing module 420 is configured to combine a plurality of entries according to the API data of the private domain, and obtain at least one first QA pair according to the plurality of entries and the data structure.
The blurring module 430 is configured to perform blurring processing on the at least one first QA pair according to a preset rule, to obtain at least one second QA pair and a mapping relationship table, where the mapping relationship table is used to describe a relationship between the first QA pair and the second QA pair before and after the blurring processing.
And an expansion module 440, configured to deeply expand the at least one second QA pair to generate at least one third QA pair.
And the sharpening module 450 is configured to sharpen the at least one third QA pair according to the mapping relationship table to obtain at least one fourth QA pair, where the fourth QA pair is a supervised fine tuning SFT data set conforming to a private domain API.
An output module 460 for transmitting at least one fourth QA pair as a data set of the supervised fine tuning SFT compliant with the privacy domain API to a training model to enable the training model to train a large language model using the data set of the SFT.
Optionally, in a specific embodiment, the API data of the private domain includes: the system comprises first API basic data and second API basic data, wherein the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set comprises one or more API subsets.
The mixing module 420 is specifically configured to combine the one or more API subsets selected from the first API set with the one or more API subsets selected from the second API set to generate an entry, where the entry has the functions of all the selected API subsets.
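As a minimal sketch of this combination, assuming each API set is represented as a mapping from subset names to API names (a data-structure choice this embodiment does not prescribe), the mixing step could look as follows; all identifiers are hypothetical.

```python
# Sketch of how the mixing module 420 might combine API subsets from
# two API sets into one entry. Field names and the random-sampling
# strategy are illustrative assumptions.
import random
from typing import Dict, List

def combine_entry(first_api_set: Dict[str, List[str]],
                  second_api_set: Dict[str, List[str]]) -> List[str]:
    """Select one or more API subsets from each set and merge them."""
    picked_first = random.sample(
        list(first_api_set), k=random.randint(1, len(first_api_set)))
    picked_second = random.sample(
        list(second_api_set), k=random.randint(1, len(second_api_set)))
    entry: List[str] = []
    for name in picked_first:
        entry.extend(first_api_set[name])   # APIs of the chosen subset
    for name in picked_second:
        entry.extend(second_api_set[name])
    return entry

# Example: an entry that combines a "vlan" subset with an "acl" subset.
first = {"vlan": ["create_vlan", "bind_port"], "route": ["add_route"]}
second = {"acl": ["create_acl", "apply_acl"]}
print(combine_entry(first, second))
```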
Optionally, in another specific embodiment, the mixing module 420 is further specifically configured to populate a preset template with the entries generated by the combination to generate a first instruction, input the first instruction into an AI algorithm model, and output the first QA pair.
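A minimal sketch of the template-filling step follows, assuming an illustrative template text; the actual preset template and the AI algorithm model interface are not specified by this embodiment.

```python
# Sketch of generating the first instruction by filling a preset
# template with a combined entry. PRESET_TEMPLATE is an assumption.
PRESET_TEMPLATE = (
    "Using only the following private-domain APIs: {apis}, write a "
    "question a user might ask and the code that answers it."
)

def build_first_instruction(entry):
    """Fill the preset template with the APIs of one entry."""
    return PRESET_TEMPLATE.format(apis=", ".join(entry))

instruction = build_first_instruction(["create_vlan", "apply_acl"])
# first_qa_pair = ai_model(instruction)  # assumed model interface
```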
Optionally, in another specific embodiment, the blurring module 430 is further specifically configured to acquire the code corresponding to each first QA pair, and to change the condition space in which the code of each first QA pair is located, or change the function style of the code, or change the function parameters of the code, to generate the at least one second QA pair.
Optionally, in another specific embodiment, the blurring module 430 is further specifically configured to use a local large language model (LLM) to transform the condition space in which the code of the first QA pair is located into a domain different from the current domain; and/or change the function naming style of the code; and/or randomly add at least one non-essential parameter to the function parameters and/or change the order of the function parameters. A sketch of these transformations is given below.
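The following minimal sketch shows the function-renaming transform together with the mapping relationship table it records; the condition-space transform via the local LLM is left as an assumed call, and the `h3c_` prefix and all other names are hypothetical.

```python
# Sketch of the blurring (obfuscation) step of the blurring module 430.
# Only function renaming is shown concretely; the preset rules and the
# private-domain API prefix are illustrative assumptions.
import re
from typing import Dict, Tuple

def blur_code(code: str) -> Tuple[str, Dict[str, str]]:
    """Rename private-domain function names and record the mapping."""
    mapping: Dict[str, str] = {}

    def rename(match: re.Match) -> str:
        original = match.group(0)
        # Assign a neutral alias and remember the correspondence.
        alias = mapping.setdefault(original, f"func_{len(mapping)}")
        return alias

    blurred = re.sub(r"\bh3c_\w+", rename, code)
    return blurred, mapping

code = "h3c_create_vlan(vlan_id=10)\nh3c_bind_port(port='GE1/0/1')"
blurred, table = blur_code(code)
# `table` is the mapping relationship table ({original: alias}) that
# the sharpening module later uses for the inverse transform.
```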
Optionally, in another specific embodiment, the sharpening module 450 is specifically configured to inverse-transform the at least one third QA pair by using the mapping relationship table to obtain at least one fourth QA pair conforming to the private API, and to output the at least one fourth QA pair, as the SFT data set, to the output module 460, as sketched below.
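A minimal sketch of the inverse transform, continuing the hypothetical `blur_code` example above:

```python
# Sketch of the sharpening (de-obfuscation) step of the sharpening
# module 450: invert the mapping relationship table and restore the
# original private-domain API names.
from typing import Dict

def sharpen(text: str, mapping: Dict[str, str]) -> str:
    """Replace each blurred alias with its original private API name."""
    inverse = {alias: original for original, alias in mapping.items()}
    # Replace longer aliases first so shared prefixes are not clobbered.
    for alias in sorted(inverse, key=len, reverse=True):
        text = text.replace(alias, inverse[alias])
    return text

fourth_answer = sharpen("func_0(vlan_id=10)",
                        {"h3c_create_vlan": "func_0"})
```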
The private domain data generating means in this embodiment are presented in the form of functional units, where a unit refers to an ASIC (application-specific integrated circuit), a processor and memory executing one or more pieces of software or firmware, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The device provided by this embodiment builds, on top of a public-domain LLM, a framework that solves the problem that the traditional Evo-instruct approach is unsuitable for private data. The mixing module mixes private domain data to combine a plurality of entries, expanding the breadth of the QA pairs. The blurring module 430 and the sharpening module 450 respectively perform blurring and sharpening processing on at least one QA pair, improving data security, while the existing Evo-instruct approach is used for depth expansion. In this way, a large number of high-quality SFT data sets are constructed and output, providing a basis for subsequent model training to produce an LLM with ChatGPT-like conversational capabilities.
An embodiment of the present invention also provides a computer device equipped with the private domain data generating device shown in fig. 4.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 5, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface.
In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Multiple computer devices may also be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 5.
The processor 10 may be a central processing unit (CPU), a network processor, or a combination thereof. In addition, the processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the method for generating private domain data shown in the above embodiment.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for communicating with other devices or communication networks. Optionally, there may be one or more communication interfaces 30.
The computer device further includes an input device and an output device (not shown in fig. 5). The processor 10, the memory 20, the input device, and the output device may be connected by a bus or in another manner.
The input device may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that may be recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored on a local storage medium, so that the method described herein may be carried out by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, a flash memory, a hard disk, a solid-state disk, or the like; further, the storage medium may also include a combination of the above kinds of memories. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the method of generating private domain data shown in the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (12)

1. A method for generating private domain data, the method comprising:
acquiring an Application Programming Interface (API) data source, wherein the API data source comprises API data of a private domain and a data structure for describing the API data;
combining a plurality of entries according to the API data of the private domain, and obtaining at least one first QA pair according to the plurality of entries and the data structure;
according to a preset rule, carrying out blurring processing on the at least one first QA pair to obtain at least one second QA pair and a mapping relation table, wherein the mapping relation table is used for describing the corresponding relation between the first QA pair and the second QA pair before and after blurring processing;
performing depth expansion on the at least one second QA pair to obtain at least one third QA pair;
according to the mapping relation table, performing a sharpening process on the at least one third QA pair to obtain at least one fourth QA pair, wherein the fourth QA pair is a supervised fine tuning SFT data set conforming to a private domain API;
transmitting the at least one fourth QA pair as a supervised fine tuning SFT dataset conforming to a private domain API to a training model to enable the training model to train a large language model with the SFT dataset.
2. The method of claim 1, wherein the API data of the private domain comprises first API basic data and second API basic data, wherein the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set comprises one or more API subsets;
the method comprises the steps of combining a plurality of items according to the API data of the private domain, and specifically comprises the following steps:
combining one or more API subsets selected from the first API set with one or more API subsets selected from the second API set to generate an entry, the entry having the functions of all the selected API subsets.
3. The method of claim 2, wherein the obtaining at least one first QA pair according to the plurality of entries and the data structure comprises:
filling the combined entries into a preset template to generate a first instruction;
inputting the first instruction into an AI algorithm model, and outputting the first QA pair.
4. The method of claim 1, wherein the blurring the at least one first QA pair according to a preset rule to obtain at least one second QA pair specifically comprises:
acquiring a code corresponding to each first QA pair; and
changing the condition space in which the code of each first QA pair is located, or changing the function style of the code, or changing the function parameters of the code, to obtain the at least one second QA pair.
5. The method of claim 4, wherein the changing the condition space in which the code of each first QA pair is located comprises:
transforming, by using a local large language model (LLM), the condition space in which the code of the first QA pair is located into a domain different from the current domain;
the changing the function style of the code comprises: changing the function naming style of the code; and
the changing the function parameters of the code comprises: randomly adding at least one non-essential parameter to the function parameters and/or changing the order of the function parameters.
6. The method of any one of claims 1 to 5, wherein the performing a sharpening process on the at least one third QA pair according to the mapping relation table to obtain at least one fourth QA pair comprises:
inversely transforming the at least one third QA pair by using the mapping relation table to obtain the at least one fourth QA pair, and using the at least one fourth QA pair as the SFT data set.
7. A device for generating private domain data, the device comprising:
the system comprises an acquisition module, a storage module and a storage module, wherein the acquisition module is used for acquiring an Application Programming Interface (API) data source, wherein the API data source comprises API data of a private domain and a data structure for describing the API data;
the mixing module is used for combining a plurality of entries according to the API data of the private domain and obtaining at least one first QA pair according to the entries and the data structure;
the blurring module is used for carrying out blurring processing on the at least one first QA pair according to a preset rule to obtain at least one second QA pair and a mapping relation table, wherein the mapping relation table is used for describing the relation between the first QA pair and the second QA pair before and after the blurring processing;
the expansion module is used for performing depth expansion on the at least one second QA pair and generating at least one third QA pair;
the sharpening module is used for sharpening the at least one third QA pair according to the mapping relation table to obtain at least one fourth QA pair, wherein the fourth QA pair is a supervised fine adjustment SFT data set conforming to a private domain API;
and the output module is used for transmitting the at least one fourth QA pair serving as a supervised fine tuning SFT data set conforming to a private domain API to a training model so that the training model trains a large language model by utilizing the SFT data set.
8. The apparatus of claim 7, wherein the API data of the private domain comprises first API basic data and second API basic data, wherein the first API basic data corresponds to a first API set, the second API basic data corresponds to a second API set, and each API set comprises one or more API subsets;
and wherein the mixing module is specifically configured to combine one or more API subsets selected from the first API set with one or more API subsets selected from the second API set to generate an entry, where the entry has the functions of all the selected API subsets.
9. The apparatus of claim 8, wherein the mixing module is specifically configured to populate a preset template with the combined entries to generate a first instruction, and to input the first instruction into an AI algorithm model to output the first QA pair.
10. The apparatus of claim 7, wherein the blurring module is specifically configured to acquire a code corresponding to each first QA pair, and to change the condition space in which the code of each first QA pair is located, or change the function style of the code, or change the function parameters of the code, to generate the at least one second QA pair.
11. The apparatus of claim 10, wherein the blurring module is specifically configured to transform, by using a local large language model (LLM), the condition space in which the code of the first QA pair is located into a domain different from the current domain; change the function naming style of the code; and/or randomly add at least one non-essential parameter to the function parameters and/or change the order of the function parameters.
12. The apparatus of any one of claims 7 to 11, wherein the sharpening module is specifically configured to inversely transform the at least one third QA pair by using the mapping relation table to obtain at least one fourth QA pair, and to use the at least one fourth QA pair as the SFT data set.
CN202311546225.2A 2023-11-20 2023-11-20 Private domain data generation method, device, equipment and storage medium Active CN117252251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311546225.2A CN117252251B (en) 2023-11-20 2023-11-20 Private domain data generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117252251A true CN117252251A (en) 2023-12-19
CN117252251B CN117252251B (en) 2024-03-12

Family

ID=89137351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311546225.2A Active CN117252251B (en) 2023-11-20 2023-11-20 Private domain data generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117252251B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130268357A1 (en) * 2011-09-15 2013-10-10 Stephan HEATH Methods and/or systems for an online and/or mobile privacy and/or security encryption technologies used in cloud computing with the combination of data mining and/or encryption of user's personal data and/or location data for marketing of internet posted promotions, social messaging or offers using multiple devices, browsers, operating systems, networks, fiber optic communications, multichannel platforms
CN113569263A (en) * 2021-07-30 2021-10-29 拉扎斯网络科技(上海)有限公司 Secure processing method and device for cross-private-domain data and electronic equipment
CN114970522A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Language model pre-training method, device, equipment and storage medium
CN115062003A (en) * 2022-05-26 2022-09-16 电子科技大学 Cloud ERP community generation type question-answering method based on GPT2
CN116910535A (en) * 2023-06-07 2023-10-20 清华大学 Programming-based large language model fine tuning-free pre-training method and device
CN116910217A (en) * 2023-07-24 2023-10-20 上海数珩信息科技股份有限公司 Natural language question-answering method, device and medium based on small language model cluster
CN117009490A (en) * 2023-08-31 2023-11-07 之江实验室 Training method and device for generating large language model based on knowledge base feedback
CN117033608A (en) * 2023-09-28 2023-11-10 中国电子科技集团公司第十研究所 Knowledge graph generation type question-answering method and system based on large language model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TING FANG TAN ET AL: "Generative Artificial Intelligence Through ChatGPT and Other Large Language Models in Ophthalmology: Clinical Applications and Challenges", Ophthalmology Science, pages 1-9 *
XIAO Yanghua (肖仰华): "Generative Language Models and General Artificial Intelligence: Connotation, Paths and Implications" (生成式语言模型与通用人工智能:内涵、路径与启示), Academic Frontiers (《学术前沿》), pages 50-57 *

Also Published As

Publication number Publication date
CN117252251B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
US11902413B2 (en) Secure machine learning analytics using homomorphic encryption
Liu et al. From distributed machine learning to federated learning: A survey
US20220269942A1 (en) Privacy Enhancing Deep Learning Cloud Service Using a Trusted Execution Environment
EP3168771B1 (en) Poly-logarythmic range queries on encrypted data
Eckles et al. Design and analysis of experiments in networks: Reducing bias from interference
Thornton et al. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms
Mugunthan et al. Privacyfl: A simulator for privacy-preserving and secure federated learning
US20200082272A1 (en) Enhancing Data Privacy in Remote Deep Learning Services
CN110110163A (en) Safe substring search is with filtering enciphered data
CN111400504B (en) Method and device for identifying enterprise key people
CN110597963A (en) Expression question-answer library construction method, expression search method, device and storage medium
DE112021004694T5 (en) TRAINING A QUESTION AND ANSWER DIALOGUE SYSTEM TO AVOID ENEMY ATTACKS
Mitra et al. Toward mining of temporal roles
EP4147414A1 (en) Quantum computing machine learning for security threats
Takeda et al. Molecular inverse-design platform for material industries
Omolara et al. A comprehensive review of honey encryption scheme
DE112021000689T5 (en) ATTESTATION OF NEURAL PROCESSES
Chen et al. Boosting with online binary learners for the multiclass bandit problem
CN117252251B (en) Private domain data generation method, device, equipment and storage medium
Benson et al. Learning multifractal structure in large networks
Yesseyeva et al. Tri-Pass: A new graphical user authentication scheme
Davidson et al. Security gets smart with AI
Syed et al. Performance analysis of cloud computing for distributed data center using Cloud-Sim
CN109657456B (en) Tcam verification method and system
Levin Submodular Optimization Under Uncertainty

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant