CN118014011A - Training method and apparatus for large language model, training data construction method and apparatus, device, and medium
- Publication number: CN118014011A
- Application number: CN202410405159.5A
- Authority: CN (China)
- Prior art keywords: data, privacy, real, privacy data, training
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F40/157—Transformation using dictionaries or tables
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06N3/08—Learning methods
Abstract
This specification provides a training method, a training data construction method, and corresponding apparatus, device, and medium for a large language model, relating to the technical field of artificial intelligence. The large language model training method comprises the following steps: determining real privacy data in an initial training data set; performing data conversion on the real privacy data to obtain simulated privacy data, where the data structure of the simulated privacy data is the same as that of the real privacy data; constructing a target training data set from the simulated privacy data and the initial training data set; and training a pre-constructed large language model on the target training data set to obtain a trained large language model. This technical scheme protects privacy data at the source, prevents leakage of privacy data contained in the training data, improves the security of personal privacy data, preserves the large language model's ability to understand privacy data, and improves the accuracy of the model's output.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a large language model training method, a large language model training apparatus, a training data construction method, a training data construction apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of science and technology, large language models (Large Language Model, LLM) have attracted increasing attention. A large language model is a deep learning model trained on massive amounts of text data that can generate natural language text or understand the meaning of text; such models can perform a variety of natural language processing (Natural Language Processing, NLP) tasks, including but not limited to text classification, question answering, and dialogue. Because a large language model is trained on massive text data, it memorizes its training data to a significant degree. Once the training data contains privacy data such as mailboxes, mobile phone numbers, and other personal identification information (Personal Identifiable Information, PII), an attacker has some probability of recovering part of that privacy data from the model's answers through question-and-answer interaction with the large language model.
At present, related privacy protection schemes for large language models either detect the request or response content of the model and filter or block requests and responses containing privacy data, or directly delete or desensitize the privacy data in the training data. However, filtering or blocking measures are easily bypassed and cannot effectively solve the problem of privacy information disclosure for open-source large language models. Deleting or desensitizing the privacy data in the training data does prevent disclosure, but the model then never learns to identify or understand privacy data such as personal identification information, so its understanding capability degrades and the accuracy of its output is reduced.
It should be noted that the information disclosed in the foregoing background section is only for enhancement of understanding of the background of the present specification and thus may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a large language model training method, a large language model training apparatus, a training data construction method, a training data construction apparatus, an electronic device, and a computer-readable storage medium, so as to effectively solve the problem of disclosure of private information by a large language model while preserving the large language model's ability to understand and identify privacy data, thereby improving the accuracy of its output.
Additional features and advantages of the present description will be set forth in the detailed description which follows, or in part will be apparent from the practice of the present description.
According to a first aspect of embodiments of the present specification, there is provided a large language model training method, including: acquiring an initial training data set and determining real privacy data in the initial training data set; performing data conversion on the real privacy data to obtain simulated privacy data, wherein the data structure of the simulated privacy data is the same as that of the real privacy data; constructing a target training data set according to the simulated privacy data and the initial training data set; and carrying out model training on the pre-constructed large language model through the target training data set to obtain a trained large language model.
In some example embodiments of the present disclosure, based on the foregoing solution, the performing data conversion on the real privacy data to obtain simulated privacy data includes: and inputting the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data so as to obtain the simulated privacy data.
In some example embodiments of the present specification, based on the foregoing solution, the privacy data conversion model includes an input conversion network, an autoencoder, and an output conversion network, and the inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data to obtain the simulated privacy data includes: inputting the real privacy data into the input conversion network, and outputting a first digital sequence; inputting the first digital sequence into the autoencoder, and outputting a second digital sequence; and converting the second digital sequence through the output conversion network to obtain the simulated privacy data.
In some example embodiments of the present specification, based on the foregoing solution, the autoencoder includes an input layer, at least one hidden layer, and an output layer, and the inputting the first digital sequence into the autoencoder and outputting the second digital sequence includes: performing input conversion on the first digital sequence through the input layer to obtain a converted first digital sequence; inputting the converted first digital sequence into the hidden layer to obtain hidden variables corresponding to the first digital sequence; and reconstructing the hidden variables through the output layer, and outputting the second digital sequence; wherein the number of neurons of the input layer and the output layer is the same and is larger than the number of neurons of the hidden layer.
In some example embodiments of the present specification, based on the foregoing solution, the input conversion network includes a word segmentation sub-network and a first bag-of-words dictionary mapping sub-network, and the inputting the real privacy data into the input conversion network and outputting the first digital sequence includes: inputting the real privacy data into the word segmentation sub-network to obtain a first character string sequence; and mapping the first character string sequence through the first bag-of-words dictionary mapping sub-network to obtain the first digital sequence.
In some example embodiments of the present disclosure, based on the foregoing solution, the output conversion network includes a second bag-of-words dictionary mapping sub-network and a character string concatenation sub-network, and the converting the second digital sequence through the output conversion network to obtain the simulated privacy data includes: inputting the second digital sequence into the second bag-of-words dictionary mapping sub-network to obtain a second character string sequence; and concatenating the second character string sequence through the character string concatenation sub-network to obtain the simulated privacy data.
In some example embodiments of the present specification, based on the foregoing, the method further includes: constructing privacy conversion sample data, wherein the privacy conversion sample data comprises real privacy sample data and simulated privacy sample data corresponding to the real privacy sample data; and carrying out model training on the constructed privacy data conversion model through the privacy conversion sample data to obtain a trained privacy data conversion model.
In some example embodiments of the present specification, based on the foregoing scheme, the constructing privacy conversion sample data includes: acquiring real personal identification information, and performing de-duplication processing on the real personal identification information to obtain real privacy sample data; randomly generating different types of replacement character strings, wherein the replacement character strings correspond to virtual personal identification information; replacing part of character strings in the real privacy sample data belonging to the same type by the replacement character strings to obtain simulated privacy sample data; the data structure of the simulated privacy sample data is the same as the data structure of the real privacy sample data.
In some example embodiments of the present specification, based on the foregoing, the determining the true privacy data in the initial training data set includes: matching from the initial training data set based on a preset regular expression, and determining the real privacy data; or inputting the initial training data set into a pre-trained named entity recognition model to obtain entity classification labels of all data in the initial training data set, and determining the real privacy data through the entity classification labels.
According to a second aspect of embodiments of the present specification, there is provided a training data construction method, including: acquiring an initial training data set and determining real privacy data in the initial training data set; inputting the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data to obtain simulated privacy data, wherein the data structure of the simulated privacy data is the same as that of the real privacy data; and constructing a target training data set according to the simulated privacy data and the initial training data set.
In some example embodiments of the present specification, based on the foregoing solution, the privacy data conversion model includes an input conversion network, an autoencoder, and an output conversion network, and the inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data to obtain the simulated privacy data includes: inputting the real privacy data into the input conversion network, and outputting a first digital sequence; inputting the first digital sequence into the autoencoder, and outputting a second digital sequence; and converting the second digital sequence through the output conversion network to obtain the simulated privacy data.
According to a third aspect of embodiments of the present specification, there is provided a large language model training apparatus, comprising: the real privacy data determining module is used for acquiring an initial training data set and determining real privacy data in the initial training data set; the real privacy data reconstruction module is used for carrying out data conversion on the real privacy data to obtain simulated privacy data, and the data structure of the simulated privacy data is the same as that of the real privacy data; the target training data set construction module is used for constructing a target training data set according to the simulated privacy data and the initial training data set; and the large language model training module is used for carrying out model training on the pre-constructed large language model through the target training data set to obtain a trained large language model.
According to a fourth aspect of embodiments of the present specification, there is provided a training data construction apparatus comprising: the privacy data acquisition module is used for acquiring an initial training data set and determining real privacy data in the initial training data set; the privacy data reconstruction module is used for inputting the real privacy data into a trained privacy data conversion model so as to reconstruct the real privacy data to obtain simulated privacy data, and the data structure of the simulated privacy data is the same as that of the real privacy data; and the privacy data replacing module is used for constructing a target training data set according to the simulated privacy data and the initial training data set.
According to a fifth aspect of embodiments of the present specification, there is provided an electronic device comprising: a processor; and a memory having stored thereon computer readable instructions which when executed by the processor implement the large language model training method of the first aspect or implement the training data construction method of the second aspect.
According to a sixth aspect of embodiments of the present specification, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the large language model training method in the first aspect, or implements the training data construction method in the second aspect.
The technical scheme provided by the embodiment of the specification can comprise the following beneficial effects:
According to the large language model training method in the example embodiments of this specification, real privacy data in an initial training data set can be determined and converted into simulated privacy data whose data structure is the same as that of the real privacy data; a target training data set can then be constructed from the simulated privacy data and the initial training data set, and a pre-constructed large language model can be trained on the target training data set to obtain a trained large language model. On the one hand, converting the real privacy data and constructing the target training data set from the simulated privacy data protects privacy data at the source, prevents leakage of privacy data in the training data, effectively solves the privacy information disclosure problem of large language models, and improves the security of personal privacy data. On the other hand, because the simulated privacy data obtained by conversion has the same data structure as the real privacy data, training the large language model on a target training data set built from it effectively preserves the model's ability to understand and recognize privacy data and improves the accuracy of its output.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification. It is obvious that the drawings in the following description are only some embodiments of the present specification, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a system architecture of an exemplary application environment to which the large language model training method and apparatus, training data construction method and apparatus of the embodiments of the present specification may be applied.
FIG. 2 schematically illustrates a schematic diagram of a large language model training method flow in accordance with some embodiments of the present description.
Fig. 3 schematically illustrates a schematic diagram of a training data construction method flow according to some embodiments of the present description.
FIG. 4 schematically illustrates a schematic diagram of a large language model training apparatus according to some embodiments of the present description.
Fig. 5 schematically shows a schematic diagram of a training data construction device according to some embodiments of the present description.
Fig. 6 schematically illustrates a structural schematic diagram of a computer system of an electronic device according to some embodiments of the present description.
Fig. 7 schematically illustrates a schematic diagram of a computer-readable storage medium according to some embodiments of the present description.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information and, similarly, the second information may also be referred to as first information, without departing from the scope of the present description. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present specification. One skilled in the relevant art will recognize, however, that the aspects of the specification can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the description.
Moreover, the drawings are only schematic illustrations and are not necessarily drawn to scale. The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 is a schematic diagram of a system architecture of an exemplary application environment to which the large language model training method and apparatus, training data construction method and apparatus of the embodiments of the present specification may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables. The terminal devices 101, 102, 103 may be a variety of electronic devices having artificial intelligence (Artificial Intelligence, AI) computing capabilities, including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The large language model training method or training data constructing method provided in the embodiments of the present specification is generally performed by the terminal devices 101, 102, 103, and accordingly, the large language model training apparatus or training data constructing apparatus is generally provided in the terminal devices 101, 102, 103. However, it is easily understood by those skilled in the art that the large language model training method or the training data constructing method provided in the present embodiment may be executed by the server 105, and accordingly, the large language model training device or the training data constructing device may be provided in the server 105, which is not particularly limited in the present exemplary embodiment.
Because the large language model is obtained by training based on massive text data, the large language model has stronger memory capacity for training data. Once the training data contains private data such as mailbox, mobile phone number and other personal identification information, an attacker has a certain probability to restore part of the private data from the answers of the large language model through question-answer interaction with the large language model. This is a typical privacy data disclosure attack against large language models.
The common privacy protection modes mainly fall into two types. One is plug-in privacy protection, that is, detecting the request and response content of a large language model and filtering or blocking requests and responses that contain privacy data; the other is to directly delete or desensitize the PII and similar data in the training data. The plug-in privacy protection scheme cannot effectively solve the problem of privacy information disclosure for open-source large language models, and its filtering rules also run the risk of being bypassed. Deleting or desensitizing the PII data in the training data can solve the privacy disclosure problem, but the large language model then either never sees privacy data such as PII or only sees the desensitized data. Once the large language model is required to process text containing privacy data, it cannot correctly understand that data (for example, when extracting the PII data in a passage of text), so its understanding capability is reduced and the accuracy of its output is greatly affected.
Based on one or more of the problems in the related art, the present disclosure first provides a large language model training method, which may be applied to a terminal device or a server; the present exemplary embodiment is not limited in this regard, and the following description takes execution by a server as an example. FIG. 2 schematically illustrates a flow of a large language model training method in accordance with some embodiments of the present description. Referring to fig. 2, the large language model training method may include the following steps, with an illustrative sketch after the list:
Step S210, an initial training data set is obtained, and real privacy data in the initial training data set is determined;
Step S220, performing data conversion on the real privacy data to obtain simulated privacy data, wherein the data structure of the simulated privacy data is the same as that of the real privacy data;
Step S230, constructing a target training data set according to the simulated privacy data and the initial training data set;
And step S240, performing model training on the pre-constructed large language model through the target training data set to obtain a trained large language model.
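For orientation only, the data-facing part of these steps (S210 to S230) can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the claimed method: the detection patterns stand in for the regular-expression library described later, the toy conversion stands in for the privacy data conversion model, all names are hypothetical, and step S240 (model training) is omitted.

```python
import re

def build_target_dataset(initial_texts, convert_fn):
    """Steps S210-S230: find real privacy spans and swap in simulated ones."""
    patterns = [
        re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),      # CN mobile number (assumed format)
        re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # e-mail address
    ]
    target_texts = []
    for text in initial_texts:
        for pattern in patterns:
            text = pattern.sub(lambda m: convert_fn(m.group()), text)
        target_texts.append(text)
    return target_texts

# Toy conversion: keep the data structure, zero out the identifying digits.
def toy_convert(span: str) -> str:
    return span[:3] + "0" * (len(span) - 3) if span[0].isdigit() else "user@sim.example"

print(build_target_dataset(
    ["Contact Zhang San at 13912345678 or zs@example.com."], toy_convert))
```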
According to the large language model training method of this specification, on the one hand, converting the real privacy data and constructing the target training data set from the simulated privacy data protects privacy data at the source, prevents leakage of privacy data in the training data, effectively solves the privacy information disclosure problem of large language models, and improves the security of personal privacy data; on the other hand, because the simulated privacy data has the same data structure as the real privacy data, training the large language model on a target training data set built from it effectively preserves the model's ability to understand and recognize privacy data and improves the accuracy of its output.
Next, a large language model training method in the embodiment of the present specification will be further described.
In step S210, an initial training data set is acquired, and real privacy data in the initial training data set is determined.
In an exemplary embodiment of the present specification, the initial training data set refers to a sample data set, not yet subjected to any processing, for training a large language model. For example, the initial training data set may be a text abstract extraction sample data set, an intelligent question-answer sample data set, or a language translation sample data set; the type of the initial training data set is not particularly limited in this exemplary embodiment.
The real privacy data refers to data containing personal identification information in the initial training data set. For example, the real privacy data may be personal identification information such as a mobile phone number, a certificate number, or a mailbox, personal text information such as a name or a communication address, or even personal image information such as an identity document image or a driver's license; the type of the real privacy data is not limited in this example embodiment.
The real privacy data may be obtained by screening the initial training data set. For example, the real privacy data may be matched from the initial training data set through a pre-constructed regular expression library, or may be obtained through identification and classification by a pre-trained data classification recognition model or a named entity recognition model; of course, the real privacy data may also be determined from the initial training data set through key-field detection, third-party detection tools, or the like, which is not limited in this example embodiment.
In step S220, data conversion is performed on the real privacy data to obtain simulated privacy data, where the data structure of the simulated privacy data is the same as the data structure of the real privacy data.
In an exemplary embodiment of the present disclosure, the simulated privacy data refers to data that has the same data structure as the real privacy data (such as the length and encoding of an ID card number, the encoding of a mobile phone number, or the province-city-district structure of a communication address) but whose personal privacy information is virtual. For example, digits 4-7 of a mobile phone number are the region code; the simulated privacy data may be obtained by replacing the region code in the real privacy data with a reserved or unused region code segment. Such a segment still identifies region information corresponding to the mobile phone number, but that region does not exist in any real scene. Likewise, the simulated privacy data may be obtained by replacing address information in the real privacy data, such as province, city, and district, with province, city, and district information that does not exist in the real scene. These are only schematic illustrations, and this embodiment places no special limitation on the type and form of the simulated privacy data. The simulated privacy data can still be recognized by a large language model or other tools to obtain corresponding identification information, but that information is converted, virtual information, so the real privacy data is hidden while the integrity of the data structure of the simulated privacy data is preserved.
The simulated privacy data may be obtained by performing data conversion on the real privacy data. For example, the real privacy data may be converted by a pre-trained generative deep learning model (such as a variational autoencoder (VAE) or a generative adversarial network (GAN)), or the personal identification information portion of the real privacy data may be recognized by a regular expression and then converted; the conversion method is not particularly limited in this example embodiment.
In step S230, a target training data set is constructed from the simulated privacy data and the initial training data set.
In an example embodiment of the present specification, the target training data set refers to a training data set constructed from the initial training data set that no longer contains real privacy data; after the simulated privacy data is obtained, the target training data set may be constructed from the simulated privacy data and the initial training data set.
The target training data set may be obtained by directly replacing each piece of real privacy data in the initial training data set with its corresponding simulated privacy data; alternatively, the privacy data type of the simulated privacy data may be determined, and real privacy data of the same type may be randomly replaced with the simulated privacy data. Of course, the replacement of the real privacy data in the initial training data set by the simulated privacy data may also be implemented in other ways, which this example embodiment does not specifically limit.
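As an illustration of both replacement strategies, a minimal substitution routine might look as follows; the function and variable names are assumptions for this sketch, and the mapping from real to simulated values is assumed to come from the conversion step above.

```python
import random

def substitute(samples, real_to_sim, randomize_within_type=False):
    """Replace real privacy strings in a corpus with simulated counterparts.

    real_to_sim: mapping produced by the data conversion step, e.g.
    {"13912345678": "13000017356"} (all values assumed to be one privacy type).
    """
    sim_pool = list(real_to_sim.values())
    out = []
    for text in samples:
        for real, sim in real_to_sim.items():
            # Either the one-to-one counterpart, or any simulated value
            # of the same privacy data type, chosen at random.
            repl = random.choice(sim_pool) if randomize_within_type else sim
            text = text.replace(real, repl)
        out.append(text)
    return out

print(substitute(["Call 13912345678 for details."],
                 {"13912345678": "13000017356"}))
```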
By performing data conversion on the real privacy data and constructing the target training data set from the simulated privacy data, privacy data is protected at the source and leakage of privacy data in the training data is avoided, which effectively solves the privacy information disclosure problem of large language models and improves the security of personal privacy data.
In step S240, model training is performed on the pre-constructed large language model through the target training data set, so as to obtain a trained large language model.
In an example embodiment of the present disclosure, the pre-constructed large language model refers to a large language model whose model parameters are still initialization parameters. For example, the pre-constructed large language model may be a pre-constructed deep learning model for a text classification task or for an intelligent question-answering task; the type of the pre-constructed large language model is not particularly limited in this example embodiment.
Because the real privacy data is converted into simulated privacy data with the same data structure, training the pre-constructed large language model on a target training data set built from the simulated privacy data effectively preserves the trained model's ability to understand and recognize privacy data and improves the accuracy of its output.
Next, step S210 to step S240 will be described in detail.
In an example embodiment of the present disclosure, the data conversion of the real privacy data may be implemented to obtain the simulated privacy data by the following steps, which may specifically include:
The real privacy data can be input into a trained privacy data conversion model to reconstruct the real privacy data to obtain simulated privacy data.
The privacy data conversion model may be a generative deep learning model for reconstructing real privacy data. For example, the privacy data conversion model may be constructed based on an autoencoder (Autoencoder, AE), a variational autoencoder (Variational Autoencoder, VAE), or a generative adversarial network (Generative Adversarial Network, GAN); its type is not limited here.
Efficient conversion from real privacy data to simulated privacy data can be achieved through the trained privacy data conversion model. Compared with converting the real privacy data manually or through a third-party tool, this approach improves conversion efficiency while guaranteeing the accuracy of the simulated privacy data obtained, and thus improves the training efficiency of the large language model.
Optionally, the privacy data conversion model may be a deep learning model constructed based on an autoencoder, an unsupervised neural-network architecture usable for tasks such as dimensionality reduction and feature learning. Specifically, the privacy data conversion model may include an input conversion network, an autoencoder, and an output conversion network, and inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data into the simulated privacy data may include:
the real privacy data may be input into the input conversion network, which outputs a first digital sequence; the first digital sequence may be input into the autoencoder, which outputs a second digital sequence; and the second digital sequence may then be converted through the output conversion network to obtain the simulated privacy data.
The input conversion network refers to a network that pre-processes the input real privacy data. For example, the input conversion network may be a front-end data processing network that performs size conversion and word segmentation on the real privacy data to produce input usable by the autoencoder network; it may also be a front-end data processing network that encodes the real privacy data to reduce its complexity.
The first digital sequence is the digital sequence obtained by pre-processing the real privacy data through the input conversion network, and the second digital sequence is the digital sequence obtained by data reconstruction in the trained autoencoder. It should be noted that, in the present embodiment, "first" and "second" in "first digital sequence" and "second digital sequence" are used only to distinguish the digital sequences before and after the autoencoder's data conversion; they have no special meaning and should not impose any special limitation on the present exemplary embodiment.
The output conversion network is a network that restores the input second digital sequence to its original data form. For example, the real privacy data may be text information such as real address information: the input conversion network obtains a first digital sequence corresponding to the address information, and the autoencoder reconstructs it into a second digital sequence, which at that point does not represent any information by itself. The output conversion network then restores the second digital sequence to obtain simulated privacy data containing virtual address information, and the data structure of the simulated privacy data output by the privacy data conversion model is identical to the data structure of the input real privacy data.
Through the input conversion network and the output conversion network, the input real privacy data can be converted into a digital sequence and reconstructed by the autoencoder. Compared with reconstructing the real privacy data directly in the autoencoder, this effectively reduces the complexity of the data the autoencoder must process, lowers the consumption of computing resources, and improves data processing efficiency, thereby improving the conversion efficiency of the real privacy data. At the same time, during the training stage it improves the autoencoder's feature learning efficiency, reduces training difficulty, and improves the training efficiency of the privacy data conversion model.
In an alternative embodiment, the autoencoder in the privacy data conversion model may include an input layer, at least one hidden layer, and an output layer, and inputting the first digital sequence into the autoencoder and outputting the second digital sequence may be implemented as follows:
The first digital sequence can be converted through the input layer to obtain a converted first digital sequence; the converted first digital sequence can then be input into the hidden layer to obtain hidden variables corresponding to the first digital sequence, and the hidden variables can be reconstructed through the output layer to output the second digital sequence.
The input layer is the part of the autoencoder that interacts with the input data and can receive high-dimensional input data, such as an image pixel matrix or a text vector representation. The number of neurons in the input layer generally matches the feature dimension of the input data, and the layer is responsible for converting the input data into a form or size that the autoencoder can handle.
The hidden layer is the part of the autoencoder that extracts data features; it performs nonlinear transformations on the input data and gradually compresses the information to generate a low-dimensional latent representation, namely the hidden variables, thereby achieving dimensionality reduction and expressing the internal features of the input data.
The output layer is the network in the autoencoder that reconstructs or generates data close to the original input; the reconstructed data is obtained by minimizing the difference (e.g., the mean square error) between the input data and the reconstructed data.
The number of neurons in the input layer and the output layer is the same, so that output data of the same size as the input data can be reconstructed, and it is larger than the number of neurons in the hidden layer. This prevents the hidden layer from simply copying the input signal and forces it to learn a compact, meaningful latent representation (latent space or feature space) of the data. For example, the number of neurons in the input and output layers may be denoted w and the number of neurons in the hidden layer denoted h, with w larger than h.
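For illustration only, a minimal sketch of such a bottleneck autoencoder, written here in PyTorch, might look as follows; the widths w and h, the single hidden layer, and all names are assumptions for this sketch rather than parameters fixed by the method.

```python
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    """Autoencoder over fixed-length digital sequences: w input/output neurons,
    h hidden neurons, with w > h so the hidden layer cannot copy its input."""
    def __init__(self, w: int = 32, h: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(w, h), nn.ReLU())  # input -> hidden variables
        self.decoder = nn.Linear(h, w)                            # hidden variables -> output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(x)     # low-dimensional latent representation
        return self.decoder(latent)  # reconstructed "second digital sequence"

model = BottleneckAutoencoder()
first_sequence = torch.rand(1, 32)   # stand-in for a normalized first digital sequence
second_sequence = model(first_sequence)
print(second_sequence.shape)         # torch.Size([1, 32])
```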
In an alternative embodiment, the input conversion network may include a word segmentation sub-network and a first bag-of-words dictionary mapping sub-network, and inputting the real privacy data into the input conversion network and outputting the first digital sequence may include:
The real privacy data may be input into the word segmentation sub-network to obtain a first character string sequence, and the first character string sequence may be mapped through the first bag-of-words dictionary mapping sub-network to obtain the first digital sequence.
The word segmentation sub-network is a network that splits the real privacy data. For example, taking real privacy data input to the privacy data conversion model in the form of text, the word segmentation sub-network segments the real privacy data into a character string sequence composed of Chinese characters, lowercase English words, or common symbols. The first character string sequence refers to the character string sequence obtained by splitting the real privacy data through the word segmentation sub-network.
The first bag-of-words dictionary mapping sub-network refers to a network that maps the first character string sequence to a target representation. For example, the first bag-of-words dictionary mapping sub-network may be a data mapping network constructed based on a bag-of-words model (Bag-of-Words Model), or a data mapping network constructed based on a word embedding model; its type is not particularly limited in this example embodiment. The first character string sequence obtained by word segmentation can be mapped into the first digital sequence through the first bag-of-words dictionary mapping sub-network.
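As a sketch of this input conversion, the following assumes a simple regular-expression tokenizer and a dictionary built on the fly; both are illustrative stand-ins for the word segmentation and bag-of-words dictionary mapping sub-networks.

```python
import re

def tokenize(text: str):
    """Word-segmentation stand-in: split into CJK characters, lowercase words,
    single digits, and other non-space symbols."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z]+|\d|\S", text.lower())

vocab = {}  # bag-of-words dictionary: token -> integer id

def to_ids(tokens):
    """Map each token to its id, assigning new ids to unseen tokens."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

first_strings = tokenize("Tel: 13912345678")
first_digits = to_ids(first_strings)
print(first_strings)  # ['tel', ':', '1', '3', '9', ...]
print(first_digits)   # [0, 1, 2, 3, 4, ...]
```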
Optionally, the output conversion network may include a second bag-of-words dictionary mapping sub-network and a character string concatenation sub-network, and converting the second digital sequence through the output conversion network to obtain the simulated privacy data may specifically include:
The second digital sequence may be input into the second bag-of-words dictionary mapping sub-network to obtain a second character string sequence, and the second character string sequence may then be concatenated through the character string concatenation sub-network to obtain the simulated privacy data.
The second bag-of-words dictionary mapping sub-network may have the same network structure as the first bag-of-words dictionary mapping sub-network; it is mainly used to map the second digital sequence obtained by data reconstruction back into the data form of the real privacy data, i.e., it maps the reconstructed second digital sequence into the second character string sequence.
The character string concatenation sub-network is a network that concatenates the second character string sequence in order of string position to obtain the simulated privacy data.
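A matching sketch of the output conversion, continuing the assumptions above: the dictionary is inverted to map ids back to tokens, and the tokens are concatenated in sequence order.

```python
def to_strings(ids, vocab):
    """Second bag-of-words dictionary mapping: invert token -> id into id -> token."""
    inverse = {i: tok for tok, i in vocab.items()}
    return [inverse[i] for i in ids]

def concatenate(strings):
    """Character string concatenation: rejoin the tokens in sequence order."""
    return "".join(strings)

demo_vocab = {"139": 0, "0": 1, "7": 2}              # hypothetical dictionary
second_digits = [0, 1, 1, 1, 1, 2, 2, 2, 2]          # a reconstructed id sequence
print(concatenate(to_strings(second_digits, demo_vocab)))  # 13900007777
```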
It should be noted that, in the present embodiment, "first" and "second" in "first character string sequence", "second character string sequence", "first bag-of-words dictionary mapping sub-network", and "second bag-of-words dictionary mapping sub-network" are used only to distinguish different character string sequences and bag-of-words dictionary mapping sub-networks; they have no special meaning and should not impose any special limitation on the present exemplary embodiment.
Through the word segmentation sub-network and the first bag-of-words dictionary mapping sub-network, high-complexity real privacy data can be converted into a low-complexity first digital sequence. Compared with reconstructing the real privacy data directly in the autoencoder, reconstructing the first digital sequence effectively reduces the complexity of the data the autoencoder must process, lowers the consumption of computing resources, and improves data processing efficiency. Meanwhile, the reconstructed second digital sequence can be restored through the second bag-of-words dictionary mapping sub-network and the character string concatenation sub-network into simulated privacy data with the same structure as the real privacy data, ensuring the accuracy of the output data structure while reducing the complexity of the data processed by the autoencoder, and thus ensuring the accuracy of the output of the privacy data conversion model.
In an example embodiment of the present specification, training of the privacy data conversion model may be achieved as follows:
Privacy conversion sample data can be constructed, comprising real privacy sample data and simulated privacy sample data corresponding to the real privacy sample data; the constructed privacy data conversion model can then be trained on the privacy conversion sample data to obtain the trained privacy data conversion model.
Optionally, the construction of the privacy conversion sample data may be implemented by the following steps, which may specifically include:
Real personal identification information may be collected and de-duplicated to obtain the real privacy sample data; then, replacement character strings of different types may be randomly generated, each corresponding to virtual personal identification information, and part of the character strings of the same type in the real privacy sample data may be replaced by the replacement character strings to obtain the simulated privacy sample data, whose data structure is the same as that of the real privacy sample data.
For example, taking a mobile phone number as the real privacy sample data: digits 1-3 are the network identification code, digits 4-7 are the region code, and digits 8-11 are the subscriber number. Unused network identification codes, region code segments, and subscriber numbers may be randomly generated as replacement character strings, and the corresponding code segments in the real personal identification information may be replaced by them to obtain the simulated privacy sample data. Taking address information as another example: assuming the address includes province, city, and district information, non-existent provinces, cities, and districts may be randomly generated as replacement character strings and substituted for the corresponding parts of the real privacy sample data to obtain the simulated privacy sample data. These are simple illustrative examples only and should not be construed as limiting the present exemplary embodiment in any way.
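As a sketch of this sample construction, the following generates a simulated mobile phone number under the digit layout just described; the pool of "unused" region codes is a pure assumption for illustration.

```python
import random

def simulate_mobile(real: str) -> str:
    """Keep the 11-digit structure of a mobile number while replacing its
    region code (digits 4-7) and subscriber number (digits 8-11)."""
    unused_region_codes = ["0000", "0001", "0002"]  # hypothetical reserved segments
    network_id = real[:3]                           # keep the network identification code
    region = random.choice(unused_region_codes)
    subscriber = f"{random.randint(0, 9999):04d}"   # random virtual subscriber number
    return network_id + region + subscriber

print(simulate_mobile("13912345678"))  # e.g. 13900017356: same structure, virtual PII
```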
De-duplicating the real personal identification information to obtain the real privacy sample data ensures data validity, reduces the amount of data to be processed, and improves the efficiency of constructing the privacy conversion sample data. Meanwhile, randomly generating replacement character strings of different types containing virtual personal identification information, and replacing part of the character strings in real privacy sample data of the same type with them, ensures that the simulated privacy sample data hides the real personal identification information while keeping the same, undamaged data structure as the real privacy sample data, thereby preserving the privacy data conversion model's ability to recognize and understand privacy data.
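With such (real, simulated) pairs in hand, the conversion model can be trained to reconstruct real inputs into their simulated counterparts. A minimal training-loop sketch follows, reusing the autoencoder sketch above and assuming the paired sample data are already encoded as tensors; the function name, shapes, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def train_conversion_model(model, real_batch, sim_batch, epochs=200, lr=1e-3):
    """Fit the model so that encoded real privacy sequences reconstruct
    into their simulated counterparts (names and shapes are assumptions)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # minimize the gap between output and simulated target
    for _ in range(epochs):
        optimizer.zero_grad()
        output = model(real_batch)         # reconstruct real -> simulated
        loss = loss_fn(output, sim_batch)
        loss.backward()
        optimizer.step()
    return model

# e.g. train_conversion_model(BottleneckAutoencoder(), torch.rand(64, 32), torch.rand(64, 32))
```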
In an example embodiment of the present specification, determining the real privacy data in the initial training data set may be achieved as follows:
The real privacy data may be determined by matching against the initial training data set based on preset regular expressions. A regular expression (Regular Expression, RE) is used to describe and match text conforming to a certain pattern (rule); it is composed of ordinary characters (e.g., letters and digits) and special characters (called meta-characters). These meta-characters include character classes, predefined matching patterns, quantifiers, boundary matchers, and the like, and regular expressions can be used to retrieve, replace, or extract substrings in text that fit a certain pattern.
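For illustration, a few such patterns and a matching routine are sketched below; the patterns are simplified examples, not a complete PII rule library.

```python
import re

# Illustrative patterns only; a production regular-expression library
# would cover many more personal-identification-information types.
PII_PATTERNS = {
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}

def find_real_privacy(text: str):
    """Return (type, match) pairs for every privacy hit in the text."""
    return [(kind, m.group()) for kind, pat in PII_PATTERNS.items()
            for m in pat.finditer(text)]

print(find_real_privacy("Reach Li at 13812340000 or li@mail.example.com"))
```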
Optionally, the initial training data set may be input into a pre-trained named entity recognition model to obtain an entity classification label for each piece of data in the initial training data set, and the real privacy data may then be determined from these labels. A named entity recognition model is a deep learning model that identifies entities with specific meanings in text, such as person names, place names, institution names, and proper nouns, by labeling the words to be recognized within the text sequence.
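The optional NER route can be sketched as follows, assuming a HuggingFace token-classification pipeline is available; the checkpoint name and the set of labels treated as privacy-sensitive are placeholders rather than choices made by this embodiment.

```python
from transformers import pipeline

# Hypothetical checkpoint; substitute any token-classification (NER) model.
ner = pipeline("token-classification",
               model="some-org/some-ner-model",
               aggregation_strategy="simple")

def label_privacy_entities(dataset, privacy_labels=("PER", "LOC", "ORG")):
    """Return (entity text, entity class) pairs whose entity classification
    label is treated as privacy-sensitive."""
    found = []
    for text in dataset:
        for entity in ner(text):
            if entity["entity_group"] in privacy_labels:
                found.append((entity["word"], entity["entity_group"]))
    return found
```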
Through the regular expression and the named entity recognition model, the real privacy data can be efficiently matched and filtered out of the initial training data set, which improves both the screening efficiency of the real privacy data and the construction efficiency of the target training data set.
In summary, the real privacy data in the initial training data set can be determined and converted into simulated privacy data whose data structure is the same as that of the real privacy data; a target training data set can then be constructed from the simulated privacy data and the initial training data set, and the pre-constructed large language model can be trained on it to obtain the trained large language model. On the one hand, converting the real privacy data and building the target training data set from simulated privacy data protects privacy data at the source, prevents privacy data in the training data from being leaked, effectively addresses the privacy-disclosure problem of large language models, and improves the security of personal privacy data. On the other hand, because the simulated privacy data keeps the same data structure as the real privacy data, training the large language model on the target training data set preserves the model's ability to understand and recognize privacy data and improves the accuracy of its outputs.
In addition, the embodiments of the present disclosure further provide a training data construction method, which may be applied to a terminal device or a server; the present exemplary embodiment is not limited in this respect, and the method is described below taking execution by a server as an example. Fig. 3 schematically illustrates the flow of a training data construction method according to some embodiments of the present description. Referring to fig. 3, the training data construction method may include the following steps:
Step S310: an initial training data set is acquired, and the real privacy data in the initial training data set is determined;
Step S320: the real privacy data is input into a trained privacy data conversion model to reconstruct the real privacy data and obtain simulated privacy data, where the data structure of the simulated privacy data is the same as that of the real privacy data;
Step S330: a target training data set is constructed according to the simulated privacy data and the initial training data set.
In an example embodiment of the present disclosure, the privacy data conversion model includes an input conversion network, a self-codec, and an output conversion network, and inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data and obtain the simulated privacy data may specifically include:
The real privacy data can be input into the input conversion network to output a first digital sequence; the first digital sequence is input into the self-codec to output a second digital sequence; and the second digital sequence is converted through the output conversion network to obtain the simulated privacy data.
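Viewed purely as data flow, the three stages chain together as in the sketch below; the three callables are stand-ins for the trained sub-networks, not concrete implementations.

```python
def convert_privacy_data(real_privacy_data, input_net, self_codec, output_net):
    """Run the three-stage privacy data conversion model end to end."""
    first_sequence = input_net(real_privacy_data)  # text -> first digital sequence
    second_sequence = self_codec(first_sequence)   # reconstructed digital sequence
    return output_net(second_sequence)             # digital sequence -> simulated privacy data
```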
The specific details of each step in the training data construction method are described in detail in the corresponding large language model training method, and are not described herein.
Converting the real privacy data into simulated privacy data with the same data structure, and using the simulated privacy data to form the target training data set, allows the target training data set to protect privacy data while keeping its data structure intact. When other deep learning models are trained on the target training data set, the risk that those models leak privacy data is avoided, and their ability to recognize and understand privacy data is effectively preserved.
It should be noted that although the steps of the methods in the present specification are illustrated in a particular order in the figures, this does not require or imply that the steps must be performed in that particular order or that all of the illustrated steps must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
In addition, in the present exemplary embodiment, a large language model training apparatus is also provided. Referring to fig. 4, the large language model training apparatus 400 includes: the real privacy data determination module 410, the real privacy data reconstruction module 420, the target training data set construction module 430, and the large language model training module 440. Wherein:
The real privacy data determination module 410 is configured to acquire an initial training data set and determine the real privacy data in the initial training data set;
the real privacy data reconstruction module 420 is configured to perform data conversion on the real privacy data to obtain simulated privacy data, where the data structure of the simulated privacy data is the same as that of the real privacy data;
the target training data set construction module 430 is configured to construct a target training data set from the simulated privacy data and the initial training data set; and
the large language model training module 440 is configured to perform model training on the pre-constructed large language model through the target training data set to obtain a trained large language model.
In an example embodiment of the present description, the real privacy data reconstruction module 420 is configured to:
input the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data and obtain the simulated privacy data.
In an example embodiment of the present description, the privacy data conversion model includes an input conversion network, a self-codec, and an output conversion network, and the real privacy data reconstruction module 420 includes:
an input conversion unit, configured to input the real privacy data into the input conversion network and output a first digital sequence;
a data reconstruction unit, configured to input the first digital sequence into the self-codec and output a second digital sequence; and
an output conversion unit, configured to convert the second digital sequence through the output conversion network to obtain the simulated privacy data.
In an example embodiment of the present specification, the self-codec includes an input layer, at least one hidden layer, and an output layer, the data reconstruction unit is configured to:
perform input conversion on the first digital sequence through the input layer to obtain a converted first digital sequence;
input the converted first digital sequence into the hidden layer to obtain hidden variables corresponding to the first digital sequence; and
reconstruct the hidden variables through the output layer to output the second digital sequence.
The input layer and the output layer have the same number of neurons, and this number is larger than the number of neurons of the hidden layer; a minimal sketch of such a self-codec follows.
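The sketch below illustrates this layer structure in PyTorch: the input and output layers share a width that exceeds the hidden (bottleneck) width. The concrete sizes and the single hidden layer are assumptions for illustration only.

```python
import torch
from torch import nn

class SelfCodec(nn.Module):
    """Self-codec with input and output layers of equal width (seq_dim
    neurons) and a narrower hidden layer (hidden_dim neurons)."""
    def __init__(self, seq_dim: int = 64, hidden_dim: int = 16):
        super().__init__()
        assert hidden_dim < seq_dim  # bottleneck narrower than input/output
        # input layer (seq_dim neurons) -> hidden layer (hidden_dim neurons)
        self.to_hidden = nn.Linear(seq_dim, hidden_dim)
        # hidden layer -> output layer (seq_dim neurons, same as the input layer)
        self.to_output = nn.Linear(hidden_dim, seq_dim)

    def forward(self, first_sequence: torch.Tensor) -> torch.Tensor:
        hidden_variables = torch.relu(self.to_hidden(first_sequence))
        return self.to_output(hidden_variables)  # reconstructed second sequence

codec = SelfCodec()
first_sequence = torch.randn(1, 64)  # a padded first digital sequence
print(codec(first_sequence).shape)   # torch.Size([1, 64]), same width as the input
```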
In an example embodiment of the present specification, the input conversion network includes a word segmentation sub-network and a first word bag dictionary mapping sub-network, and the input conversion unit is configured to:
input the real privacy data into the word segmentation sub-network to obtain a first character string sequence; and
map the first character string sequence through the first word bag dictionary mapping sub-network to obtain the first digital sequence, as sketched below.
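A minimal sketch of these two sub-networks follows, using character-level segmentation and a word bag dictionary built on the fly; a trained model would instead carry a fixed dictionary, and the character-level tokenization is an assumption made for the example.

```python
def segment(text: str) -> list[str]:
    """Word segmentation sub-network stand-in: one token per character."""
    return list(text)

def build_word_bag_dictionary(samples: list[str]) -> dict[str, int]:
    """Assign each distinct token a number; 0 is reserved for unknown tokens."""
    vocab = sorted({token for s in samples for token in segment(s)})
    return {token: i + 1 for i, token in enumerate(vocab)}

def to_first_digital_sequence(text: str, dictionary: dict[str, int]) -> list[int]:
    """First word bag dictionary mapping sub-network stand-in."""
    return [dictionary.get(token, 0) for token in segment(text)]

dictionary = build_word_bag_dictionary(["13912345678"])
print(to_first_digital_sequence("139", dictionary))  # [1, 3, 9] for this toy dictionary
```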
In an example embodiment of the present specification, the output conversion network includes a second word bag dictionary mapping sub-network and a character string splicing sub-network, and the output conversion unit is configured to:
input the second digital sequence into the second word bag dictionary mapping sub-network to obtain a second character string sequence; and
splice the second character string sequence through the character string splicing sub-network to obtain the simulated privacy data, as sketched below.
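The mirror-image sketch for the output side: a reversed word bag dictionary maps the second digital sequence back to a string sequence, and simple concatenation rebuilds the simulated privacy data. The toy dictionary is an assumption; in practice it would be shared with the input conversion network.

```python
def to_simulated_privacy_data(second_sequence: list[int],
                              dictionary: dict[str, int]) -> str:
    """Second word bag dictionary mapping + string splicing stand-ins."""
    reverse = {index: token for token, index in dictionary.items()}
    second_string_sequence = [reverse.get(i, "") for i in second_sequence]
    return "".join(second_string_sequence)  # character string splicing

toy_dictionary = {"1": 1, "3": 3, "9": 9}  # assumed, shared with the encoder side
print(to_simulated_privacy_data([1, 3, 9], toy_dictionary))  # "139"
```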
In an example embodiment of the present specification, the large language model training apparatus 400 further includes a privacy data conversion model training module, which includes:
A privacy conversion sample data construction unit configured to construct privacy conversion sample data including real privacy sample data and simulated privacy sample data corresponding to the real privacy sample data;
The model training unit is used for carrying out model training on the constructed privacy data conversion model through the privacy conversion sample data to obtain a trained privacy data conversion model.
In an example embodiment of the present specification, the privacy conversion sample data constructing unit is configured to:
acquiring real personal identification information, and performing de-duplication processing on the real personal identification information to obtain real privacy sample data;
randomly generating different types of replacement character strings, wherein the replacement character strings correspond to virtual personal identification information;
replacing the character strings of the same type in the real privacy sample data with the replacement character strings to obtain the simulated privacy sample data;
The data structure of the simulated privacy sample data is the same as the data structure of the real privacy sample data.
In an example embodiment of the present description, the real privacy data determination module 410 is configured to:
matching against the initial training data set based on a preset regular expression to determine the real privacy data; or
inputting the initial training data set into a pre-trained named entity recognition model to obtain entity classification labels for the data in the initial training data set, and determining the real privacy data through the entity classification labels.
In addition, in the present exemplary embodiment, a training data construction apparatus is also provided. Referring to fig. 5, the training data construction apparatus 500 includes: a privacy data acquisition module 510, a privacy data reconstruction module 520, and a privacy data replacement module 530. Wherein:
The privacy data acquisition module 510 is configured to acquire an initial training data set, and determine real privacy data in the initial training data set;
The privacy data reconstruction module 520 is configured to input the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data, so as to obtain simulated privacy data, where a data structure of the simulated privacy data is the same as a data structure of the real privacy data;
A privacy data replacement module 530, configured to construct a target training data set according to the simulated privacy data and the initial training data set.
In one exemplary embodiment of the present description, based on the foregoing scheme, the privacy data conversion model includes an input conversion network, a self-codec, and an output conversion network, and the privacy data reconstruction module 520 is configured to:
input the real privacy data into the input conversion network and output a first digital sequence;
input the first digital sequence into the self-codec and output a second digital sequence; and
convert the second digital sequence through the output conversion network to obtain the simulated privacy data.
The specific details of each module of the above large language model training apparatus and training data construction apparatus have been described in detail in the corresponding large language model training method and training data construction method, and are therefore not repeated here.
It should be noted that although in the above detailed description several modules or units of a large language model training apparatus or training data construction apparatus are mentioned, this division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
In addition, in the exemplary embodiments of the present specification, an electronic device capable of implementing the above-described large language model training method or training data constructing method is also provided.
Those skilled in the art will appreciate that the various aspects of the specification may be implemented as a system, method, or program product. Accordingly, aspects of the present specification may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to such an embodiment of the present specification is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 6, the electronic device 600 is in the form of a general purpose computing device. Components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit stores program code executable by the processing unit 610, such that the processing unit 610 performs the steps according to various exemplary embodiments of the present specification described in the "exemplary methods" section above. For example, the processing unit 610 may perform step S210 shown in fig. 2: acquiring an initial training data set and determining the real privacy data in the initial training data set; step S220: performing data conversion on the real privacy data to obtain simulated privacy data, where the data structure of the simulated privacy data is the same as that of the real privacy data; step S230: constructing a target training data set according to the simulated privacy data and the initial training data set; and step S240: performing model training on the pre-constructed large language model through the target training data set to obtain a trained large language model. Alternatively, the processing unit 610 may perform step S310 shown in fig. 3: acquiring an initial training data set and determining the real privacy data in the initial training data set; step S320: inputting the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data and obtain simulated privacy data, where the data structure of the simulated privacy data is the same as that of the real privacy data; and step S330: constructing a target training data set according to the simulated privacy data and the initial training data set.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 621 and/or cache memory 622, and may further include Read Only Memory (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. As shown, network adapter 660 communicates with other modules of electronic device 600 over bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solutions according to the embodiments of the present specification may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and include several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present specification.
In an exemplary embodiment of the present specification, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible embodiments, the various aspects of the present description may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the present description as described in the "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above-described large language model training method according to an embodiment of the present specification is described, which may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of this specification is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present specification may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present specification, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solutions according to the embodiments of the present specification may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and include several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present specification.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof.
Claims (15)
1. A large language model training method, comprising:
acquiring an initial training data set and determining real privacy data in the initial training data set;
performing data conversion on the real privacy data to obtain simulated privacy data, wherein the data structure of the simulated privacy data is the same as that of the real privacy data;
constructing a target training data set according to the simulated privacy data and the initial training data set; and
performing model training on a pre-constructed large language model through the target training data set to obtain a trained large language model.
2. The large language model training method according to claim 1, wherein the performing data conversion on the real privacy data to obtain simulated privacy data comprises:
inputting the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data to obtain the simulated privacy data.
3. The large language model training method according to claim 2, wherein the privacy data conversion model includes an input conversion network, a self-codec and an output conversion network, and the inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data to obtain the simulated privacy data includes:
inputting the real privacy data into the input conversion network and outputting a first digital sequence;
inputting the first digital sequence into the self-codec and outputting a second digital sequence; and
converting the second digital sequence through the output conversion network to obtain the simulated privacy data.
4. The large language model training method according to claim 3, wherein the self-codec comprises an input layer, at least one hidden layer, and an output layer, and the inputting the first digital sequence into the self-codec and outputting a second digital sequence comprises:
performing input conversion on the first digital sequence through the input layer to obtain a converted first digital sequence;
inputting the converted first digital sequence into the hidden layer to obtain hidden variables corresponding to the first digital sequence; and
reconstructing the hidden variables through the output layer to output the second digital sequence;
wherein the input layer and the output layer have the same number of neurons, which is larger than the number of neurons of the hidden layer.
5. The large language model training method according to claim 3, wherein the input conversion network comprises a word segmentation sub-network and a first word bag dictionary mapping sub-network, and the inputting the real privacy data into the input conversion network and outputting a first digital sequence comprises:
inputting the real privacy data into the word segmentation sub-network to obtain a first character string sequence; and
mapping the first character string sequence through the first word bag dictionary mapping sub-network to obtain the first digital sequence.
6. The large language model training method according to claim 3, wherein the output conversion network includes a second word bag dictionary mapping sub-network and a character string splicing sub-network, and the converting the second digital sequence through the output conversion network to obtain the simulated privacy data includes:
inputting the second digital sequence into the second word bag dictionary mapping sub-network to obtain a second character string sequence;
and splicing the second character string sequence through the character string splicing sub-network to obtain the simulated privacy data.
7. The large language model training method of any one of claims 2 to 6, further comprising:
Constructing privacy conversion sample data, wherein the privacy conversion sample data comprises real privacy sample data and simulated privacy sample data corresponding to the real privacy sample data;
And carrying out model training on the constructed privacy data conversion model through the privacy conversion sample data to obtain a trained privacy data conversion model.
8. The large language model training method of claim 7, the constructing privacy conversion sample data comprising:
acquiring real personal identification information, and performing de-duplication processing on the real personal identification information to obtain real privacy sample data;
randomly generating different types of replacement character strings, wherein the replacement character strings correspond to virtual personal identification information;
replacing the character strings of the same type in the real privacy sample data with the replacement character strings to obtain simulated privacy sample data;
The data structure of the simulated privacy sample data is the same as the data structure of the real privacy sample data.
9. The large language model training method of claim 1, the determining real privacy data in the initial training data set comprising:
matching against the initial training data set based on a preset regular expression to determine the real privacy data; or
inputting the initial training data set into a pre-trained named entity recognition model to obtain entity classification labels for the data in the initial training data set, and determining the real privacy data through the entity classification labels.
10. A training data construction method comprising:
acquiring an initial training data set and determining real privacy data in the initial training data set;
inputting the real privacy data into a trained privacy data conversion model to reconstruct the real privacy data to obtain simulated privacy data, wherein the data structure of the simulated privacy data is the same as that of the real privacy data;
constructing a target training data set according to the simulated privacy data and the initial training data set.
11. The training data construction method according to claim 10, wherein the privacy data conversion model includes an input conversion network, a self-codec, and an output conversion network, and the inputting the real privacy data into the trained privacy data conversion model to reconstruct the real privacy data to obtain the simulated privacy data includes:
inputting the real privacy data into the input conversion network and outputting a first digital sequence;
inputting the first digital sequence into the self-codec and outputting a second digital sequence; and
converting the second digital sequence through the output conversion network to obtain the simulated privacy data.
12. A large language model training apparatus, comprising:
the real privacy data determining module is used for acquiring an initial training data set and determining real privacy data in the initial training data set;
the real privacy data reconstruction module is used for carrying out data conversion on the real privacy data to obtain simulated privacy data, and the data structure of the simulated privacy data is the same as that of the real privacy data;
the target training data set construction module is used for constructing a target training data set according to the simulated privacy data and the initial training data set;
and the large language model training module is used for carrying out model training on the pre-constructed large language model through the target training data set to obtain a trained large language model.
13. A training data construction apparatus comprising:
The privacy data acquisition module is used for acquiring an initial training data set and determining real privacy data in the initial training data set;
The privacy data reconstruction module is used for inputting the real privacy data into a trained privacy data conversion model so as to reconstruct the real privacy data to obtain simulated privacy data, and the data structure of the simulated privacy data is the same as that of the real privacy data;
and the privacy data replacing module is used for constructing a target training data set according to the simulated privacy data and the initial training data set.
14. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the large language model training method of any one of claims 1 to 9, or implement the training data construction method of claim 10 or 11.
15. A computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the large language model training method according to any one of claims 1 to 7, or implements the training data construction method according to claim 10 or 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410405159.5A CN118014011B (en) | 2024-04-07 | 2024-04-07 | Training method, training device, training data construction method, training device, training data construction equipment and training data construction medium for large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118014011A true CN118014011A (en) | 2024-05-10 |
CN118014011B CN118014011B (en) | 2024-07-05 |
Family ID: 90947340
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |