CN117786737A - Question-answer data generation method and device, electronic equipment and medium

Publication number: CN117786737A
Application number: CN202311709710.7A
Applicant (current assignee): Tianyi Safety Technology Co Ltd
Inventors: 周涛, 常力元, 童则余, 张鑫, 马尚荣
Original language: Chinese (zh)
Legal status: Pending
Abstract

The application provides a question and answer data generation method, a question and answer data generation device, electronic equipment and a medium, and relates to the technical field of information processing. The method comprises the steps of obtaining data information to be processed; inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model is trained by the following steps: training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; training the initial control sub-model based on the condition samples in the initial training sample set to obtain a control sub-model; the output text of the control submodel is one of the output of the first conditional submodel and the output of the second conditional submodel, which has lower similarity with the input sample. The method can improve the privacy protection efficiency in the question and answer data generation process.

Description

Question-answer data generation method and device, electronic equipment and medium
Technical Field
The embodiment of the application relates to the technical field of information processing, in particular to a question and answer data generation method, a question and answer data generation device, electronic equipment and a medium.
Background
With the rapid development of information technology, big data and machine learning models are becoming more and more widely used in various fields. Enterprises and organizations accumulate large amounts of private data that are critical to training and improving machine learning models. However, this also raises a serious problem: how to ensure that privacy is protected while making use of these sensitive data.
In this context, large language models (Large Language Model, LLM) have been developed that are specifically trained for a particular industry or domain to improve the utility and performance of the model. However, with the widespread use of these models, a new problem has emerged: the model may generate data that closely resembles the training samples, thereby violating the privacy of the service provider.
In existing solutions, sensitive data desensitization can only identify sensitive fragments rather than all sensitive data, and preventing the model from generating data similar to the training samples requires matching patterns to be defined in advance; defining a matching pattern for every sensitive data item is an enormous workload for a generative language model, so the privacy protection efficiency of large language models in the question-answer data generation process is low. Providing a question-answer data generation method that improves the efficiency of privacy protection in the question-answer data generation process therefore has important practical significance.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the application provides a question and answer data generation method, a question and answer data generation device, electronic equipment and a medium, which can reduce the security risk of privacy data and improve the privacy protection efficiency in the question and answer data generation process.
In order to achieve the above purpose, the technical solution of the embodiments of the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for generating question-answer data, where the method includes:
acquiring data information to be processed;
inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model;
the desensitization generation model is trained by the following steps:
training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items are not repeated data items and are provided with sensitive word labels; the second class of data items do not carry the sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set;
Training an initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with an input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
In an alternative embodiment, before the first basic large language model is trained with the first sliced sample set to obtain the first condition sub-model and the second basic large language model is trained with the second sliced sample set to obtain the second condition sub-model, the method further comprises:
performing data deduplication on the initial training sample set based on the hash value of the sample data in the initial training sample set to obtain a first training sample set;
acquiring sample data of the first training sample set one by one, and judging whether the currently acquired sample data has a sensitive word label or not every time one sample data is acquired;
and distributing the acquired sample data with the sensitive word labels to the first set and the second set alternately in turn, and copying the acquired sample data without the sensitive word labels to the first set and the second set simultaneously to obtain the first sliced sample set and the second sliced sample set.
In an alternative embodiment, before the training of the initial control sub-model to obtain the control sub-model based on the condition samples in the initial training sample set, the method further includes:
sample data in an initial training sample set are selected one by one;
inputting the currently selected sample data to the first condition sub-model and the second condition sub-model every time one sample data is selected;
calculating the probability ratio of any one of the first generation probabilities of the currently selected sample data to the first conditional sub-model to any one of the second generation probabilities of the currently selected sample data to the second conditional sub-model;
and if the numerical value of the logarithm of the probability ratio is smaller than or equal to a preset ratio threshold value, taking the currently selected sample data as a condition sample.
In an alternative embodiment, the objective function of the initial control sub-model is constructed based on a conditional model minimum probability, a difference coefficient mean, and a sensitive data penalty term;
the minimum probability of the condition model is a smaller value in the generation probabilities of the two sub-models when the same sample is input to the first condition sub-model and the second condition sub-model respectively; the difference coefficient mean value is an average value of the difference coefficients of the first condition sub-model and the second condition sub-model on the same input sample; the difference coefficient is the maximum value of the text distance of the output text when the same sample is respectively input into the first condition sub-model and the second condition sub-model; the sensitive data penalty term includes an additional output function; the additional output function is the inverse of the probability of generation when the target condition sample is input to the third basic large language model; the target condition sample is the same sample as the condition sample in the sensitive data item.
In an alternative embodiment, the sensitive data penalty term is the product of the additional output function and a scalar coefficient; the scalar coefficients are used to adjust the value of the sensitive data penalty term derived based on the additional output function.
In an optional embodiment, the training the initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model includes:
optimizing an objective function of the initial control sub-model according to the generated information corresponding to the condition sample, and adjusting model parameters of the initial control sub-model until a preset model convergence condition is reached, so as to obtain the control sub-model; the generated information comprises first generated information and second generated information; the first generation information is information generated by inputting a conditional sample into a first conditional sub-model; the second generation information is information generated by inputting a condition sample into a second condition sub-model.
In a second aspect, an embodiment of the present application further provides a question-answer data generating device, where the device includes:
the data information acquisition unit is used for acquiring data information to be processed;
the questioning and answering text generating unit is used for inputting the data information to be processed into a desensitization generating model to obtain a desensitization questioning and answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model;
The desensitization model training unit is used for training to obtain the desensitization generation model through the following steps:
training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items are not repeated data items and are provided with sensitive word labels; the second class of data items do not carry the sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set;
training an initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with an input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
In an alternative embodiment, the apparatus further comprises a training sample slicing unit; the training sample slicing unit is used for:
performing data deduplication on the initial training sample set based on the hash value of the sample data in the initial training sample set to obtain a first training sample set;
acquiring sample data of the first training sample set one by one, and judging whether the currently acquired sample data has a sensitive word label or not every time one sample data is acquired;
and distributing the acquired sample data with the sensitive word labels to the first set and the second set alternately in turn, and copying the acquired sample data without the sensitive word labels to the first set and the second set simultaneously to obtain the first sliced sample set and the second sliced sample set.
In an alternative embodiment, the apparatus further comprises a conditional sample acquisition unit; the conditional sample acquisition unit is configured to:
sample data in an initial training sample set are selected one by one;
inputting the currently selected sample data to the first condition sub-model and the second condition sub-model every time one sample data is selected;
calculating the probability ratio of any one of the first generation probabilities of the currently selected sample data to the first conditional sub-model to any one of the second generation probabilities of the currently selected sample data to the second conditional sub-model;
And if the numerical value of the logarithm of the probability ratio is smaller than or equal to a preset ratio threshold value, taking the currently selected sample data as a condition sample.
In an alternative embodiment, the objective function of the initial control sub-model is constructed based on a conditional model minimum probability, a difference coefficient mean, and a sensitive data penalty term;
the minimum probability of the condition model is a smaller value in the generation probabilities of the two sub-models when the same sample is input to the first condition sub-model and the second condition sub-model respectively; the difference coefficient mean value is an average value of the difference coefficients of the first condition sub-model and the second condition sub-model on the same input sample; the difference coefficient is the maximum value of the text distance of the output text when the same sample is respectively input into the first condition sub-model and the second condition sub-model; the sensitive data penalty term includes an additional output function; the additional output function is the inverse of the probability of generation when the target condition sample is input to the third basic large language model; the target condition sample is the same sample as the condition sample in the sensitive data item.
In an alternative embodiment, the sensitive data penalty term is the product of the additional output function and a scalar coefficient; the scalar coefficients are used to adjust the value of the sensitive data penalty term derived based on the additional output function.
In an alternative embodiment, the desensitization model training unit is specifically configured to:
optimizing an objective function of the initial control sub-model according to the generated information corresponding to the condition sample, and adjusting model parameters of the initial control sub-model until a preset model convergence condition is reached, so as to obtain the control sub-model; the generated information comprises first generated information and second generated information; the first generation information is information generated by inputting a conditional sample into a first conditional sub-model; the second generation information is information generated by inputting a condition sample into a second condition sub-model.
In a third aspect, embodiments of the present application further provide a computer readable storage medium, in which a computer program is stored, which when executed by a processor, implements the method according to the first aspect.
In a fourth aspect, embodiments of the present application further provide an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and when the computer program is executed by the processor, causes the processor to implement the method according to the first aspect.
In the above embodiment of the present application, a question-answer data generation method includes: acquiring data information to be processed; inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model. The desensitization generation model is trained by the following steps: training a first basic large language model by adopting a first sliced sample set to obtain the first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain the second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items have no repeated data items between them and carry sensitive word labels; the second type data items do not carry sensitive word labels; the first type data items, the second type data items and the third type data items are obtained based on an initial training sample set. An initial control sub-model is then trained based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control sub-model is the one of the output of the first condition sub-model and the output of the second condition sub-model that has the lower similarity with the input sample; and the condition samples are such that the generation probabilities of the information items generated by the first condition sub-model and the second condition sub-model meet a preset probability stabilizing condition. In this way, a mechanism is provided that limits the similarity between the output target question-answer data and the sensitive data items during question-answer data generation, which reduces the correlation between the generated target question-answer data and the private data, makes privacy protection more comprehensive and effective, lowers the security risk to the private data, and improves the efficiency of privacy protection in the question-answer data generation process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a question-answer data generating method according to an embodiment of the present application;
FIG. 2 is a second flowchart of a method for generating question-answer data according to an embodiment of the present application;
fig. 3 is a schematic flow chart of slicing a training sample set according to the method for generating question-answer data provided in the embodiment of the present application;
fig. 4 is a schematic flow chart of a selection condition sample of a question-answer data generating method according to an embodiment of the present application;
FIG. 5 is a third flowchart of a method for generating question-answer data according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of a training method of a desensitization generation model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a question-answer data generating device according to an embodiment of the present application;
Fig. 8 is a second schematic structural diagram of a question-answer data generating device according to an embodiment of the present application;
fig. 9 is a third schematic structural diagram of a question-answer data generating device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, wherein it is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Some of the terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) Large language model (Large Language Model, LLM): a large language model, also called a large-scale language model, is an artificial intelligence model that aims to understand and generate human language. Such models are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and so on. LLMs are characterized by their large scale, containing billions of parameters, which helps them learn complex patterns in language data.
(2) Conditional generation model: a conditional generation model can be expressed as p(·|·) ∈ M, where M is the space of conditional generation models. The conditional generation model p may take a certain prompt x ∈ X as input and then output y ∈ Y with probability p(y|x).
(3) Adam algorithm (Adaptive Moment Estimation): the Adam algorithm is an optimization algorithm that extends gradient descent. Adam combines the advantages of the Momentum gradient descent method and root mean square propagation (RMSProp) to adaptively adjust the learning rate when training deep learning models. The Adam algorithm is computationally efficient, has low memory requirements, and is suitable for problems with large-scale data or parameters.
In order to improve the efficiency of privacy protection in the question and answer data generation process, the embodiment of the application provides a question and answer data generation method, a device, electronic equipment and a medium. In order to better understand the technical solution provided by the embodiments of the present application, a simple description is made here of the basic principle of the solution.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The following describes the technical scheme provided by the embodiment of the application with reference to the accompanying drawings.
With the rapid development of information technology, big data and machine learning models are becoming more and more widely used in various fields. Enterprises and organizations accumulate large amounts of private data that are critical to training and improving machine learning models. However, this also raises a serious problem: how to ensure that privacy is protected while making use of these sensitive data.
In this context, large language models (Large Language Model, LLM) have been developed that are specifically trained for a particular industry or domain to improve the utility and performance of the model. However, with the widespread use of these models, a new problem has emerged: the model may generate data that closely resembles the training samples, thereby violating the privacy of the service provider.
In existing solutions, sensitive data desensitization can only identify sensitive fragments rather than all sensitive data, and preventing the model from generating data similar to the training samples requires matching patterns to be defined in advance; defining a matching pattern for every sensitive data item is an enormous workload for a generative language model, so the privacy protection efficiency of large language models in the question-answer data generation process is low. Providing a question-answer data generation method that improves the efficiency of privacy protection in the question-answer data generation process therefore has important practical significance.
In view of this, embodiments of the present application provide a method, an apparatus, an electronic device, and a medium for generating question-answer data. The method includes: acquiring data information to be processed; inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model. The desensitization generation model is trained by the following steps: training a first basic large language model by adopting a first sliced sample set to obtain the first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain the second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items have no repeated data items between them and carry sensitive word labels; the second type data items do not carry sensitive word labels; the first type data items, the second type data items and the third type data items are obtained based on an initial training sample set. An initial control sub-model is then trained based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control sub-model is the one of the output of the first condition sub-model and the output of the second condition sub-model that has the lower similarity with the input sample; and the condition samples are such that the generation probabilities of the information items generated by the first condition sub-model and the second condition sub-model meet a preset probability stabilizing condition. In this way, a mechanism is provided that limits the similarity between the output target question-answer data and the sensitive data items during question-answer data generation, which reduces the correlation between the generated target question-answer data and the private data, makes privacy protection more comprehensive and effective, lowers the security risk to the private data, and improves the efficiency of privacy protection in the question-answer data generation process.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
The method for generating question-answer data provided in the embodiment of the present application is further explained below. As shown in fig. 1, the method comprises the following steps:
s101, obtaining data information to be processed.
S102, inputting data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model; the desensitization generation model is trained by the following steps: training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; training the initial control sub-model based on the condition samples in the initial training sample set to obtain a control sub-model; the output text of the control submodel is one of the output of the first conditional submodel and the output of the second conditional submodel, which has lower similarity with the input sample.
The first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises a second type of data item and a third type of data item; the first type data item and the third type data item have no repeated data item and are provided with sensitive word labels; the second class of data items do not have sensitive word tags; the first class data item, the second class data item and the third class data item are obtained based on an initial training sample set; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
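As an illustrative aid only (this sketch is not part of the patented implementation), the following Python snippet shows one way the inference path of S101-S102 could be wired up: both condition sub-models produce a candidate answer, and the control sub-model keeps the candidate with the lower similarity to the input. The callables generate_a, generate_b, and similarity are placeholder assumptions.

```python
from typing import Callable

def desensitized_answer(data_info: str,
                        generate_a: Callable[[str], str],
                        generate_b: Callable[[str], str],
                        similarity: Callable[[str, str], float]) -> str:
    candidate_a = generate_a(data_info)   # output of the first condition sub-model
    candidate_b = generate_b(data_info)   # output of the second condition sub-model
    # Control sub-model behaviour: keep the candidate that is less similar to
    # the input, which limits reproduction of (possibly sensitive) input data.
    if similarity(candidate_a, data_info) <= similarity(candidate_b, data_info):
        return candidate_a
    return candidate_b
```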
In an alternative embodiment, the objective functions of the first basic large language model and the second basic large language model are each likelihood functions.
In a specific implementation, a suitable LLM architecture is selected for the basic large language model, for example the GPT-3 architecture, the BERT architecture, and the like.
The likelihood function is defined as the objective function of the basic large language model so that the model can generate text similar to the training data.
In some embodiments of the present application, the first condition sub-model and the second condition sub-model may each be a conditional generation model. The output of the first condition sub-model and the output of the second condition sub-model may contain probability information.
In an alternative embodiment, the Adam algorithm is used to update the first model parameters of the first basic large language model and the second model parameters of the second basic large language model so as to optimize the objective function of the first basic large language model and the objective function of the second basic large language model respectively, and cross-validation is used for hyperparameter adjustment and early stopping, so as to obtain the first condition sub-model and the second condition sub-model.
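A minimal sketch of this training step is given below, assuming a Hugging Face-style causal language model; the model name "gpt2", the hyperparameters, and the plain-text sample format are placeholder assumptions, and the cross-validation described above is simplified to early stopping on a held-out validation set.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_conditional_submodel(sliced_samples, val_samples,
                               base_model="gpt2", lr=5e-5,
                               max_epochs=10, patience=2):
    """Fine-tune one basic large language model on one sliced sample set."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def nll(text):
        # Likelihood objective: the causal-LM loss is the negative
        # log-likelihood of the sample under the model.
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=512).input_ids
        return model(input_ids=ids, labels=ids).loss

    best_val, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for text in sliced_samples:
            loss = nll(text)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Early stopping on a held-out validation loss.
        model.eval()
        with torch.no_grad():
            val_loss = sum(nll(t).item() for t in val_samples) / len(val_samples)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model, tokenizer
```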
In an alternative embodiment, the first sliced sample set and the second sliced sample set are obtained by evenly dividing all the first type data items in the initial training sample set into two sets, such that the two sets contain different first type data items and the same second type data items; the difference between the first number of first type data items contained in the first sliced sample set and the second number of first type data items contained in the second sliced sample set is less than or equal to 1.
The question-answer data generation method of the above embodiment includes: acquiring data information to be processed; inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model. The desensitization generation model is trained by the following steps: training a first basic large language model by adopting a first sliced sample set to obtain the first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain the second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items have no repeated data items between them and carry sensitive word labels; the second type data items do not carry sensitive word labels; the first type data items, the second type data items and the third type data items are obtained based on an initial training sample set. An initial control sub-model is then trained based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control sub-model is the one of the output of the first condition sub-model and the output of the second condition sub-model that has the lower similarity with the input sample; and the condition samples are such that the generation probabilities of the information items generated by the first condition sub-model and the second condition sub-model meet a preset probability stabilizing condition. In this way, a mechanism is provided that limits the similarity between the output target question-answer data and the sensitive data items during question-answer data generation, which reduces the correlation between the generated target question-answer data and the private data, makes privacy protection more comprehensive and effective, lowers the security risk to the private data, and improves the efficiency of privacy protection in the question-answer data generation process.
In order to further explain the question-answer data generation method provided in the embodiment of the present application, in one embodiment, as shown in fig. 2, the question-answer data generation method includes the following steps:
s201, training a first basic large language model by adopting a first slicing sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model.
The first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises a second type of data item and a third type of data item; the first type data item and the third type data item have no repeated data item and are provided with sensitive word labels; the second class of data items do not have sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set.
S202, training an initial control sub-model based on a condition sample in an initial training sample set to obtain a control sub-model; the output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with the input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
In a specific implementation, for any condition sample, the first generation probability of the first output information item generated when the condition sample is input into the first condition sub-model and the second generation probability of the second output information item generated when the condition sample is input into the second condition sub-model can meet the preset probability stabilizing condition.
S203, obtaining data information to be processed.
S204, inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model.
In an alternative embodiment, before the first basic large language model is trained with the first sliced sample set to obtain the first condition sub-model and the second basic large language model is trained with the second sliced sample set to obtain the second condition sub-model, as shown in fig. 3, the method further comprises:
s301, performing data deduplication on the initial training sample set based on hash values of sample data in the initial training sample set to obtain a first training sample set.
In a specific implementation, the sample data in the initial training sample set are taken as data items to be processed, each data item to be processed is mapped to a unique hash value using a hash map, and the hash values are then compared to detect duplicate items. If two data items have the same hash value, one of the duplicates is deleted; this deduplication ensures that each data item exists in only one of the first and second sliced sample sets.
S302, sample data of a first training sample set are acquired one by one, and whether the currently acquired sample data has a sensitive word label is judged every time one sample data is acquired.
In a specific implementation, sensitive data may be identified for the sample data in the initial training sample set according to preset sensitive data, and marked by adding a unique identifier or attribute information to each piece of sample data, so that sensitive word labels are added to the data items.
S303, distributing the acquired sample data with the sensitive word labels to the first set and the second set alternately in turn, and copying the acquired sample data without the sensitive word labels to the first set and the second set simultaneously to obtain a first sliced sample set and a second sliced sample set.
In a specific implementation, assume the first training sample set is a data set D. Two empty sets, set D1 and set D2, are preset. The entire data set D is traversed and each data item is examined one by one. For each data item, the sensitive word tag is used to check whether it contains sensitive data c ∈ C from the sensitive data set C; if so, a rotation allocation strategy is applied so that the data items of data set D containing sensitive data c are allocated alternately in turn to set D1 and set D2, and the data items containing sensitive data are evenly distributed. The remaining data items that do not contain sensitive data c are then copied into both D1 and D2, yielding the first sliced sample set and the second sliced sample set.
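A minimal pure-Python sketch of steps S301-S303 follows; the sample representation (a dict with "text" and "sensitive" fields standing in for the sensitive word label) is an assumption made for illustration.

```python
import hashlib

def slice_sample_set(initial_samples):
    # S301: deduplicate on the hash value of the sample data.
    seen, deduped = set(), []
    for sample in initial_samples:
        h = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            deduped.append(sample)

    # S302/S303: alternately allocate sensitive-labelled samples to D1 and D2,
    # and copy the non-sensitive samples into both sets.
    d1, d2, turn = [], [], 0
    for sample in deduped:
        if sample.get("sensitive"):          # carries a sensitive word label
            (d1 if turn == 0 else d2).append(sample)
            turn = 1 - turn                  # rotation allocation strategy
        else:
            d1.append(sample)
            d2.append(sample)
    return d1, d2

# Example: the sensitive items split evenly; shared items appear in both sets.
samples = [
    {"text": "user id card 1234", "sensitive": True},
    {"text": "generic FAQ entry", "sensitive": False},
    {"text": "customer phone 5678", "sensitive": True},
]
D1, D2 = slice_sample_set(samples)
```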
In an alternative embodiment, before training the initial control sub-model to obtain the control sub-model based on the condition samples in the initial training sample set, as shown in fig. 4, the method further includes:
s401, sample data in an initial training sample set are selected one by one.
S402, inputting the currently selected sample data into the first condition sub-model and the second condition sub-model every time one sample data is selected.
S403, calculating the probability ratio of any one of the first generation probabilities of the currently selected sample data to the first conditional sub-model to any one of the second generation probabilities of the currently selected sample data to the second conditional sub-model.
S404, if the logarithm value of the probability ratio is smaller than or equal to the preset ratio threshold value, taking the currently selected sample data as a condition sample.
In this way, it can be ensured that, for any sample data among the selected condition samples, when the sample data is input into the first condition sub-model and the second condition sub-model, the similarity between the output information generated by the two sub-models and the input sample data is lower than a preset similarity threshold.
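A sketch of the selection loop in S401-S404 is shown below. The callables prob_a and prob_b stand in for obtaining the generation probability of the currently selected sample under the first and second condition sub-models (one way of computing such probabilities is sketched at the end of the training-method section), and the use of the absolute value of the log ratio is one reading of the "ratio of any first generation probability to any second generation probability" wording.

```python
import math

def select_condition_samples(samples, prob_a, prob_b, ratio_threshold=0.1):
    """Keep the samples whose generation probabilities are stable across the
    first and second condition sub-models (S401-S404)."""
    condition_samples = []
    for x in samples:
        p_a = prob_a(x)   # first generation probability of x
        p_b = prob_b(x)   # second generation probability of x
        # S403/S404: the sample becomes a condition sample when the logarithm
        # of the probability ratio stays within the preset ratio threshold
        # (taken here in absolute value so the ratio is bounded both ways).
        if abs(math.log(p_a / p_b)) <= ratio_threshold:
            condition_samples.append(x)
    return condition_samples
```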
In an alternative embodiment, the objective function of the initial control sub-model is also used to limit the similarity of the question-answer data output by the model to sensitive data items.
In the implementation, the objective function of the initial control sub-model can limit the similarity between the question-answer data output by the initial control sub-model and the input sensitive data item, and one of the output of the first condition sub-model and the output of the second condition sub-model is selected as the question-answer data.
In an alternative embodiment, the objective function of the initial control sub-model is constructed based on the conditional model minimum probability, the mean of the difference coefficients, and the sensitive data penalty term;
the condition model minimum probability is the smaller of the generation probabilities of the two sub-models when the same sample is input to the first condition sub-model and the second condition sub-model respectively; the difference coefficient mean is the average of the difference coefficients of the first condition sub-model and the second condition sub-model for the same input samples; the difference coefficient is the maximum value of the text distance between the output texts when the same sample is respectively input into the first condition sub-model and the second condition sub-model; the sensitive data penalty term includes an additional output function; the additional output function is the inverse of the generation probability obtained when the target condition sample is input to the third basic large language model; the target condition sample is a condition sample that also appears among the sensitive data items.
In a specific implementation, the objective function of the initial control sub-model can be constructed from the condition model minimum probability min(safeA(y|x), safeB(y|x)), the difference coefficient mean diffV(A, B), and the sensitive data penalty term F(P(y1|x1)^(-1)).
The condition model minimum probability min(safeA(y|x), safeB(y|x)) denotes the smaller of the generation probabilities safeA(y|x) and safeB(y|x) of the two sub-models when the same sample x is input to the first condition sub-model safeA and the second condition sub-model safeB respectively;
the difference coefficient mean diffV(A, B) denotes the mean value of the difference coefficients δ(A, B) of the first condition sub-model safeA and the second condition sub-model safeB over the same input samples x, and can be determined by the following formula:
diffV(A, B) = (1/m) · Σ δ(A, B), summed over the m input samples x,
wherein:
x represents the same sample input to the first condition sub-model safeA and the second condition sub-model safeB;
m represents the number of samples;
δ(A, B) represents the difference coefficient of the first condition sub-model safeA and the second condition sub-model safeB for the same input sample x.
In some embodiments, given the model safeA and the model safeB, a difference metric function δ(a, b) is defined, which may be selected according to the needs of the specific application; it may be, for example, the Hamming distance, the edit distance, a semantic similarity measure, or another suitable distance metric. The output combinations of model safeA and model safeB under the same input x are then obtained, the distance between the output combinations is calculated using the difference metric function δ(a, b), and the maximum distance obtained is taken as the difference coefficient.
In some embodiments, the coefficient of difference may be a maximum value of a text distance of the text output when the same sample is input into the first conditional sub-model and the second conditional sub-model, respectively.
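As an illustration of the difference coefficient and its mean diffV(A, B), the sketch below uses the edit distance as the difference metric δ(a, b) (the text equally allows Hamming distance, semantic similarity, or other metrics); generate_a and generate_b stand in for drawing one or more outputs from safeA and safeB for the same input x and are assumptions of this example.

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance, single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def difference_coefficient(x, generate_a, generate_b) -> int:
    # Maximum distance over all output combinations of safeA and safeB for x.
    outputs_a, outputs_b = generate_a(x), generate_b(x)
    return max(edit_distance(ya, yb) for ya in outputs_a for yb in outputs_b)

def diff_v(samples, generate_a, generate_b) -> float:
    # diffV(A, B): average difference coefficient over the m input samples.
    m = len(samples)
    return sum(difference_coefficient(x, generate_a, generate_b)
               for x in samples) / m
```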
The sensitive data penalty term F(P(y1|x1)^(-1)) includes an additional output function P(y1|x1)^(-1); the additional output function P(y1|x1)^(-1) is the inverse of the generation probability P(y1|x1) obtained when the target condition sample x1 is input into the third basic large language model; the target condition sample x1 is a condition sample that also appears among the sensitive data items.
Some embodiments of the present application achieve a reduction in the probability of generating an output that matches the copyrighted data by setting additional output functions. The inverse of P (y|x) will be greater when the model generates an output that is close to sensitive data, thereby reducing the tendency of the model to generate such an output.
In an alternative embodiment, the sensitive data penalty term is the product of the additional output function and the scalar coefficient; scalar coefficients are used to adjust the value of the sensitive data penalty term derived based on the additional output function.
Illustratively, the sensitive data penalty term F(P(y1|x1)^(-1)) can be the product of the additional output function P(y1|x1)^(-1) and a scalar coefficient yc; the scalar coefficient yc is used to adjust the value of the sensitive data penalty term derived based on the additional output function P(y1|x1)^(-1).
In practice, the scalar coefficient yc may have a range of values (0, 1).
Some embodiments of the present application implement control protection levels by setting scalar coefficients. The scalar coefficient yc serves to fine tune the protection level to balance the relationship between generating the desired output and avoiding generating sensitive data during training. A larger yc value will enhance the protection of sensitive data, while a smaller yc value reduces the protection level.
In an alternative embodiment, the objective function of the initial control sub-model may be expressed in terms of the condition model minimum probability min(safeA(y|x), safeB(y|x)), the difference coefficient mean diffV(A, B), and the sensitive data penalty term F(P(y1|x1)^(-1)), wherein:
x1 is an element of the intersection of the condition sample set X and the sensitive data set C;
y1 is the output data obtained by inputting x1 into the LLM base model, which may be the third basic large language model described previously.
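The translated text does not reproduce the closed-form expression of this objective, so the sketch below only illustrates how the three named components could be combined into a single score; the averaging, the signs, and the callables prob_a, prob_b, generate_a, generate_b, and base_generation_prob are assumptions, and diff_v refers to the helper sketched earlier.

```python
def control_objective(condition_samples, target_condition_samples,
                      prob_a, prob_b, generate_a, generate_b,
                      base_generation_prob, yc=0.5):
    """Illustrative value of an objective built from the three components
    named in the text; not the patent's exact formula."""
    # Condition model minimum probability, averaged over the condition samples.
    min_prob = sum(min(prob_a(x), prob_b(x)) for x in condition_samples) \
               / len(condition_samples)
    # Difference coefficient mean diffV(A, B) over the same samples
    # (diff_v as sketched above).
    diff_mean = diff_v(condition_samples, generate_a, generate_b)
    # Sensitive data penalty term: scalar coefficient yc times the additional
    # output function P(y1|x1)^(-1) for each target condition sample x1
    # (a condition sample that is also a sensitive data item).
    penalty = sum(yc / base_generation_prob(x1)
                  for x1 in target_condition_samples)
    # Sign convention (an assumption): reward stable, mutually different
    # outputs and penalise closeness to sensitive data.
    return min_prob + diff_mean - penalty
```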
In an optional embodiment, based on a condition sample in an initial training sample set, training an initial control sub-model to obtain a control sub-model, specifically, optimizing an objective function of the initial control sub-model according to generation information corresponding to the condition sample, and adjusting model parameters of the initial control sub-model until a preset model convergence condition is reached to obtain the control sub-model; the generated information comprises first generated information and second generated information; the first generated information is information generated by inputting a conditional sample into a first conditional sub-model; the second generation information is information generated by inputting the condition sample into the second condition sub model.
In an alternative embodiment, in the process of optimizing the objective function of the initial control sub-model according to the generated information corresponding to the condition samples and adjusting the model parameters of the initial control sub-model until the preset model convergence condition is reached to obtain the control sub-model, the generated information further comprises third generated information; the third generated information is information generated by inputting the condition sample into the third basic large language model.
In an alternative embodiment, as shown in fig. 5, the question-answer data generating method includes the steps of:
s501, training a first basic large language model by adopting a first slicing sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model.
The first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises a second type of data item and a third type of data item; the first type data item and the third type data item have no repeated data item and are provided with sensitive word labels; the second class of data items do not have sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set.
S502, optimizing an objective function of the initial control sub-model according to the generated information corresponding to the condition sample, and adjusting model parameters of the initial control sub-model until a preset model convergence condition is reached, so as to obtain the control sub-model.
Wherein the generated information includes first generated information and second generated information; the first generated information is information generated by inputting a conditional sample into a first conditional sub-model; the second generation information is information generated by inputting the condition sample into the second condition sub model.
The output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with the input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
By training to obtain a first condition sub-model, a second condition sub-model and a control sub-model, a desensitization generation model comprising the first condition sub-model, the second condition sub-model and the control sub-model can be constructed.
S503, obtaining data information to be processed.
S504, inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model.
Further, the embodiment of the application also provides a training method of the desensitization generation model, which is used for training the desensitization generation model used in the question-answer data generation method. Fig. 6 shows a training method of a desensitization generation model provided in an embodiment of the present application, where the training method of the desensitization generation model may be performed by a server or may be performed by a terminal device. The present embodiment will be described by taking a server executing the training method as an example.
As shown in fig. 6, the training method of the desensitization generation model specifically includes the following steps:
s601, training a first basic large language model by adopting a first slicing sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model.
The first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises a second type of data item and a third type of data item; the first type data item and the third type data item have no repeated data item and are provided with sensitive word labels; the second class of data items do not have sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set.
S602, training the initial control sub-model based on the condition samples in the initial training sample set to obtain a control sub-model.
The output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with the input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
In specific implementation, the desensitization generation model can be obtained through the training process of the steps. The desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model.
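The condition sample selection and the penalty term above both rely on "the generation probability" of a sample under a (sub-)model. For a causal language model of the kind sketched in the earlier training step, one common way to obtain such a probability is the product of per-token probabilities under teacher forcing; the sketch below is an assumption about how that quantity could be computed, not a definition taken from the patent.

```python
import torch

def sequence_probability(model, tokenizer, text: str) -> float:
    """Probability a causal LM assigns to `text`, computed as the product of
    its per-token next-token probabilities (teacher forcing)."""
    ids = tokenizer(text, return_tensors="pt").input_ids           # (1, n)
    with torch.no_grad():
        logits = model(input_ids=ids).logits                        # (1, n, V)
    log_probs = torch.log_softmax(logits[0, :-1, :], dim=-1)        # (n-1, V)
    targets = ids[0, 1:]                                            # next tokens
    token_log_probs = log_probs[torch.arange(targets.shape[0]), targets]
    # For long texts this product underflows towards zero; in practice the
    # summed log-probability is usually compared instead of the raw product.
    return float(torch.exp(token_log_probs.sum()))
```

With the earlier training sketch, prob_a could then be supplied as, for example, `lambda x: sequence_probability(model_a, tokenizer_a, x)`.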
Based on the same inventive concept as the question-answer data generation method shown in fig. 1, a question-answer data generation device is also provided in the embodiment of the present application. Because the device is a device corresponding to the question-answer data generation method, and the principle of the device for solving the problem is similar to that of the method, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
Fig. 7 shows a schematic structural diagram of a question-answer data generating device provided in an embodiment of the present application, and as shown in fig. 7, the question-answer data generating device includes a data information acquisition unit 701, a question-answer text generating unit 702, and a desensitization model training unit 703.
The data information acquiring unit 701 is configured to acquire data information to be processed;
a question-answer text generating unit 702, configured to input data information to be processed into a desensitization generating model, and obtain a desensitization question-answer text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model;
a desensitization model training unit 703, configured to train to obtain a desensitization generation model through the following steps:
training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; the first sliced sample set comprises a first class of data items and a second class of data items, and the second sliced sample set comprises the second class of data items and a third class of data items; the first class of data items and the third class of data items share no repeated data items, and both carry sensitive word labels; the second class of data items carries no sensitive word labels; the first class of data items, the second class of data items and the third class of data items are derived based on an initial training sample set;
Training the initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control sub-model is whichever of the output of the first condition sub-model and the output of the second condition sub-model has the lower similarity to the input sample; the condition samples are samples for which the generation probabilities of the information items generated by the first condition sub-model and the second condition sub-model satisfy the preset probability stabilizing condition.
In an alternative embodiment, as shown in fig. 8, the apparatus further comprises a training sample slicing unit 801; the training sample slicing unit 801 is configured to:
based on the hash value of the sample data in the initial training sample set, performing data deduplication on the initial training sample set to obtain a first training sample set;
acquiring sample data of the first training sample set one by one, and, each time one piece of sample data is acquired, judging whether the currently acquired sample data carries a sensitive word label;
and distributing the acquired sample data carrying sensitive word labels to a first set and a second set in alternation, and copying the acquired sample data without sensitive word labels into both the first set and the second set, so as to obtain the first sliced sample set and the second sliced sample set; a minimal sketch of this slicing procedure is given below.
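In the sketch, each sample is assumed to be a dict with a `text` field and a boolean `has_sensitive_tag` field, and SHA-256 serves as the hash; both are hypothetical choices, since the embodiment only requires a hash value of the sample data and a sensitive word label.

```python
import hashlib

def slice_training_samples(initial_samples):
    """De-duplicate by hash, then alternate sensitive-tagged samples between
    two sets while copying untagged samples into both sets."""
    # Step 1: data de-duplication based on the hash value of the sample data.
    seen_hashes, deduplicated = set(), []
    for sample in initial_samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            deduplicated.append(sample)

    # Step 2: build the first and second sliced sample sets.
    first_slice, second_slice = [], []
    assign_to_first = True
    for sample in deduplicated:
        if sample["has_sensitive_tag"]:
            # Sensitive-tagged samples are distributed alternately.
            (first_slice if assign_to_first else second_slice).append(sample)
            assign_to_first = not assign_to_first
        else:
            # Samples without sensitive word labels are copied to both sets.
            first_slice.append(sample)
            second_slice.append(sample)
    return first_slice, second_slice
```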
In an alternative embodiment, as shown in fig. 9, the apparatus further comprises a conditional sample acquiring unit 901; the conditional sample acquiring unit 901 is configured to:
selecting sample data from the initial training sample set one by one;
each time one piece of sample data is selected, inputting the currently selected sample data into the first condition sub-model and the second condition sub-model;
calculating the ratio between any one of the first generation probabilities output by the first condition sub-model for the currently selected sample data and any one of the second generation probabilities output by the second condition sub-model for the currently selected sample data;
and, if the logarithm of the probability ratio is less than or equal to a preset ratio threshold, taking the currently selected sample data as a condition sample; a minimal sketch of this selection rule is given below.
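The snippet checks one pair of generation probabilities; the probability values, the threshold, and the use of the signed logarithm (rather than its absolute value) follow the literal wording above and are otherwise assumptions.

```python
import math

def is_condition_sample(p_first: float, p_second: float, ratio_threshold: float) -> bool:
    """Keep a sample as a condition sample when the logarithm of the ratio
    between a first generation probability and a second generation probability
    does not exceed the preset ratio threshold."""
    log_ratio = math.log(p_first / p_second)
    return log_ratio <= ratio_threshold

# Hypothetical probabilities for one candidate sample:
print(is_condition_sample(p_first=0.042, p_second=0.040, ratio_threshold=0.1))  # True
```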
In an alternative embodiment, the objective function of the initial control sub-model is constructed based on a conditional model minimum probability, a difference coefficient mean, and a sensitive data penalty term;
the conditional model minimum probability is the smaller of the generation probabilities of the two sub-models when the same sample is input into the first condition sub-model and the second condition sub-model respectively; the difference coefficient mean is the average of the difference coefficients of the first condition sub-model and the second condition sub-model over the same input samples; the difference coefficient is the maximum value of the text distances between the output texts when the same sample is respectively input into the first condition sub-model and the second condition sub-model; the sensitive data penalty term includes an additional output function; the additional output function is the reciprocal of the generation probability when the target condition sample is input into a third basic large language model; the target condition sample is a condition sample that is identical to a sample in the sensitive data items.
In an alternative embodiment, the sensitive data penalty term is the product of the additional output function and a scalar coefficient; the scalar coefficient is used to adjust the value of the sensitive data penalty term derived based on the additional output function. One illustrative composition of these terms is sketched below.
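Since the embodiment does not disclose how the three terms are weighted or signed, the linear combination, the weights `w_prob` and `w_diff`, and the averaging over samples in the following sketch are illustrative assumptions only.

```python
from statistics import mean

def difference_coefficient(text_distances):
    # Maximum text distance among the output texts produced for one input sample.
    return max(text_distances)

def control_objective(min_probs, diff_coeffs, p_third_on_target_samples,
                      scalar_coeff, w_prob=1.0, w_diff=1.0):
    """Illustrative objective built from the conditional model minimum
    probability, the difference coefficient mean and the sensitive data
    penalty term."""
    min_prob_term = mean(min_probs)   # min(p_first, p_second) per sample, averaged
    diff_term = mean(diff_coeffs)     # difference coefficient mean
    # Sensitive data penalty term: scalar coefficient times the additional output
    # function, i.e. the reciprocal of the third model's generation probability
    # on each target condition sample.
    penalty_term = scalar_coeff * sum(1.0 / p for p in p_third_on_target_samples)
    return w_prob * min_prob_term - w_diff * diff_term - penalty_term
```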
In an alternative embodiment, the desensitization model training unit 703 is specifically configured to:
optimizing the objective function of the initial control sub-model according to the generated information corresponding to the condition samples, and adjusting the model parameters of the initial control sub-model until a preset model convergence condition is reached, so as to obtain the control sub-model; the generated information comprises first generated information and second generated information; the first generated information is information generated by inputting a condition sample into the first condition sub-model; the second generated information is information generated by inputting the condition sample into the second condition sub-model. A skeleton of this optimization loop is sketched below.
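In the following skeleton, the hooks `generate_info`, `evaluate_objective` and `update_params` are hypothetical placeholders for running both condition sub-models on a condition sample, scoring the objective of the previous section, and taking one parameter-update step; the convergence test on successive objective values is likewise only one possible realization of the preset model convergence condition.

```python
def train_control_sub_model(initial_params, condition_samples,
                            generate_info, evaluate_objective, update_params,
                            max_steps=1000, tol=1e-4):
    """Optimize the control sub-model's objective on the condition samples
    until a convergence condition is met, then return the final parameters."""
    params, previous_value = initial_params, float("inf")
    for _ in range(max_steps):
        # First and second generated information for every condition sample.
        generated = [generate_info(sample, params) for sample in condition_samples]
        value = evaluate_objective(generated, params)
        if abs(previous_value - value) < tol:  # preset model convergence condition
            break
        params = update_params(params, value)  # adjust model parameters of the control sub-model
        previous_value = value
    return params
```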
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method. The electronic device may be used for question-answer data generation. In one embodiment, the electronic device may be a server, a terminal device, or other electronic device. In this embodiment, the electronic device may be configured as shown in fig. 10, including a memory 1001, a communication module 1003, and one or more processors 1002.
The memory 1001 is configured to store the computer program executed by the processor 1002. The memory 1001 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, programs required for running an instant messaging function, and the like; the data storage area may store various kinds of instant messaging information, operation instruction sets, and the like.
The memory 1001 may be a volatile memory, such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1001 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1001 may also be a combination of the above memories.
The processor 1002 may include one or more central processing units (central processing unit, CPU), digital processing units, or the like. The processor 1002 is configured to implement the above question-answer data generation method when invoking the computer program stored in the memory 1001.
The communication module 1003 is used for communicating with a terminal device and other servers.
The specific connection medium among the memory 1001, the communication module 1003, and the processor 1002 is not limited in the embodiments of the present application. In fig. 10, the memory 1001 and the processor 1002 are connected by a bus 1004, which is indicated by a thick line in fig. 10; the connection manner between other components is merely illustrative and is not limited thereto. The bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is drawn in fig. 10, but this does not mean that there is only one bus or only one type of bus.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the question-answer data generation method in the above embodiments. The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application.

Claims (10)

1. A question-answer data generation method, the method comprising:
acquiring data information to be processed;
inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model;
the desensitization generation model is trained by the following steps:
training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items are not repeated data items and are provided with sensitive word labels; the second class of data items do not carry the sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set;
Training an initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with an input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
2. The method of claim 1, wherein before the training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model and the training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model, the method further comprises:
performing data deduplication on the initial training sample set based on the hash value of the sample data in the initial training sample set to obtain a first training sample set;
acquiring sample data of the first training sample set one by one, and judging whether the currently acquired sample data has a sensitive word label or not every time one sample data is acquired;
and distributing the acquired sample data with the sensitive word labels to the first set and the second set alternately in turn, and copying the acquired sample data without the sensitive word labels to the first set and the second set simultaneously to obtain the first sliced sample set and the second sliced sample set.
3. The method of claim 1, wherein the method further comprises, prior to training an initial control sub-model to obtain the control sub-model based on the condition samples in the initial training sample set:
sample data in an initial training sample set are selected one by one;
inputting the currently selected sample data to the first condition sub-model and the second condition sub-model every time one sample data is selected;
calculating the ratio between any one of the first generation probabilities output by the first condition sub-model for the currently selected sample data and any one of the second generation probabilities output by the second condition sub-model for the currently selected sample data;
and if the numerical value of the logarithm of the probability ratio is smaller than or equal to a preset ratio threshold value, taking the currently selected sample data as a condition sample.
4. The method of claim 1, wherein the objective function of the initial control sub-model is constructed based on a conditional model minimum probability, a coefficient of difference mean, a sensitive data penalty term;
the minimum probability of the condition model is a smaller value in the generation probabilities of the two sub-models when the same sample is input to the first condition sub-model and the second condition sub-model respectively; the difference coefficient mean value is an average value of the difference coefficients of the first condition sub-model and the second condition sub-model on the same input sample; the difference coefficient is the maximum value of the text distance of the output text when the same sample is respectively input into the first condition sub-model and the second condition sub-model; the sensitive data penalty term includes an additional output function; the additional output function is the inverse of the probability of generation when the target condition sample is input to the third basic large language model; the target condition sample is the same sample as the condition sample in the sensitive data item.
5. The method of claim 4, wherein the sensitive data penalty term is a product of the additional output function and a scalar coefficient; the scalar coefficients are used to adjust the value of the sensitive data penalty term derived based on the additional output function.
6. The method of claim 1, wherein training an initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model comprises:
optimizing an objective function of the initial control sub-model according to the generated information corresponding to the condition sample, and adjusting model parameters of the initial control sub-model until a preset model convergence condition is reached, so as to obtain the control sub-model; the generated information comprises first generated information and second generated information; the first generation information is information generated by inputting a conditional sample into a first conditional sub-model; the second generation information is information generated by inputting a condition sample into a second condition sub-model.
7. A question-answer data generation device, the device comprising:
the data information acquisition unit is used for acquiring data information to be processed;
The question-answer text generating unit is used for inputting the data information to be processed into a desensitization generation model to obtain a desensitization question-answering text corresponding to the data information to be processed; the desensitization generation model comprises a first condition sub-model, a second condition sub-model and a control sub-model, wherein the outputs of the first condition sub-model and the second condition sub-model are connected with the input of the control sub-model;
the desensitization model training unit is used for training to obtain the desensitization generation model through the following steps:
training a first basic large language model by adopting a first sliced sample set to obtain a first condition sub-model; training a second basic large language model by adopting a second sliced sample set to obtain a second condition sub-model; the first sliced sample set comprises a first type of data item and a second type of data item, and the second sliced sample set comprises the second type of data item and a third type of data item; the first type data items and the third type data items are not repeated data items and are provided with sensitive word labels; the second class of data items do not carry the sensitive word tags; the first class of data items, the second class of data items, and the third class of data items are derived based on an initial training sample set;
Training an initial control sub-model based on the condition samples in the initial training sample set to obtain the control sub-model; the output text of the control submodel is one of the output of the first condition submodel and the output of the second condition submodel, which has lower similarity with an input sample; the condition sample enables the generation probability of the information items generated by the first condition sub-model and the second condition sub-model to meet the preset probability stabilizing condition.
8. The apparatus of claim 7, wherein the apparatus further comprises a training sample slicing unit; the training sample slicing unit is used for:
performing data deduplication on the initial training sample set based on the hash value of the sample data in the initial training sample set to obtain a first training sample set;
acquiring sample data of the first training sample set one by one, and judging whether the currently acquired sample data has a sensitive word label or not every time one sample data is acquired;
and distributing the acquired sample data with the sensitive word labels to the first set and the second set alternately in turn, and copying the acquired sample data without the sensitive word labels to the first set and the second set simultaneously to obtain the first sliced sample set and the second sliced sample set.
9. A computer-readable storage medium having a computer program stored therein, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1-6.
10. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, which when executed by the processor causes the processor to implement the method of any of claims 1-6.
CN202311709710.7A 2023-12-13 2023-12-13 Question-answer data generation method and device, electronic equipment and medium Pending CN117786737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311709710.7A CN117786737A (en) 2023-12-13 2023-12-13 Question-answer data generation method and device, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN117786737A true CN117786737A (en) 2024-03-29

Family

ID=90400977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311709710.7A Pending CN117786737A (en) 2023-12-13 2023-12-13 Question-answer data generation method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117786737A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination