CN116975927A - LLM language user privacy information protection method based on natural language prompt


Info

Publication number
CN116975927A
Authority
CN
China
Prior art keywords
natural language
model
prompt
information
training
Prior art date
Legal status
Pending
Application number
CN202311042595.2A
Other languages
Chinese (zh)
Inventor
李雨晨
宫晓利
张金
李浩然
邹先予
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University
Priority to CN202311042595.2A
Publication of CN116975927A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an LLM model user privacy information protection method based on natural language prompts. The method comprises: constructing a natural language prompt model with a p-tuning soft template at its core, and automatically generating continuous natural language prompts through the p-tuning soft template; pre-training and learning the natural language prompts; and, for the key information in the input sentence, generating synthetic data by combining the input and output information to achieve data augmentation. Compared with the prior art, prompt engineering is used to generate prompt information and to replace the user's input with semantically equivalent text, so that user privacy is better protected without degrading the generated results.

Description

LLM language user privacy information protection method based on natural language prompt
Technical Field
The application relates to the technical field of data privacy protection, and in particular to a prompt-based LLM model user privacy information protection method.
Background
The prior art related to the present application is as follows:
(I) Language model data privacy protection:
In recent years, the capabilities of natural language models have improved significantly, and the models have been deployed at scale in several real-world scenarios. Training these models on domain-specific user data can further improve their utility. However, the amount of data the models require, together with the inherent sparsity of natural language (which generally means that every data record is unique), has given rise to a series of privacy attacks against the models and their training data. Language models memorize training samples to a high degree, and this memorization enables model-inversion attacks: by querying the pre-trained language model on arbitrary data records, an attacker can reconstruct parts of the training samples and thereby obtain users' private data. How to effectively protect the privacy of user data is therefore one of the hot research topics around large language models.
Existing privacy-protection research focuses on "preventing the disclosure of personally harmful information", but showing that a model prevents information disclosure under a variety of attacks is not sufficient to conclude that it protects privacy comprehensively. As attack techniques keep improving, defended models in general still face a nontrivial risk of privacy disclosure. To achieve better privacy protection, a more thorough understanding of the definition of privacy is therefore required. Rather than simply never recording private information, people decide what to keep private based on the current conversation and its socio-cultural context. Such a decision, however, requires information beyond the current dialog, and without an understanding of the context a proper decision cannot be made. The scope within which users are willing to share their data must therefore be considered when deciding to use that data for model training. Based on Nissenbaum's theory of contextual integrity, privacy exposure can be understood as "information being shared beyond its acceptable scope". A privacy violation is then no longer a binary concept but a matter of degree.
The representation of privacy is also complex. Even private information with fixed formats (e.g., telephone numbers, email addresses, credit card numbers) can appear in many different forms, and private information embedded in free text is harder still to identify. Moreover, the repeated appearance of a piece of information does not mean that it is not private. For example, a corporate bank card number may appear many times inside a company, yet to people outside the company it is private information. To guarantee data privacy, the model must therefore be able to identify private information and locate all content related to the private data within the training data.
In addition, as social idioms change, the ways in which people talk about private matters change as well. A system used to detect the private parts of input text must therefore also track these semantic shifts. Language models, however, are typically trained on static datasets, so over time those datasets, and the language models trained on them, become less effective at understanding the evolving language.
Building a machine learning model that "fully takes context into account" when judging the privacy of information is therefore very challenging in practice. Beyond privacy-protection research, identifying the implicit context in text and responding appropriately has been a popular topic in language-model research. Current related studies mainly include evaluating whether chatbots can respond appropriately to complex situations, generating context-aware responses, and producing long-text responses consistent with a moral persona.
(II) Prompt Learning:
For Natural Language Understanding (NLU) tasks, when a user asks a question, the language model identifies the relevant key information from the context (the user's input records) to generate an answer that better matches expectations. Because these records often contain some of the user's personal private information, a privacy-disclosure problem exists. Moreover, since the definition of private information is complicated, decisions must be made in combination with natural language understanding, and DP technology, owing to its drawbacks, is insufficient to support broad privacy protection. To improve the performance of a pre-trained language model across a wide range of NLU tasks, the common approach is fine-tuning, which optimizes the generated results by updating the model's entire parameter set for the target task. Although fine-tuning can achieve good performance, it consumes substantial memory during training because the gradients and optimizer states of all parameters must be stored. Furthermore, since pre-trained models are typically large, keeping a copy of the model parameters for each task during inference is highly inconvenient.
Prompt learning, by contrast, can freeze all parameters of the pre-trained model and query the language model using natural language prompts. For example, as shown in FIG. 1, for sentiment analysis a sample (e.g., "A wonderful movie!") can be concatenated with the prompt "This movie is [MASK]", and the pre-trained language model is asked to predict the probability of the [MASK] position being "positive" or "negative" in order to determine the sample's label. Prompt learning can thus normalize all tasks into the pre-trained language model's own task format, needs no additional training data, and can even exceed fine-tuning on datasets with few samples, making the methodology consistent across all NLU tasks. Prompt learning helps the language model better understand the input text. In this application, prompt engineering is used to realize the masked-language-model pretext task: the key information in the input text is masked, and the model predicts, for the configured masks, replacement information with the same semantics, thereby obfuscating the input text and protecting the privacy of the user's key information.
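The masking-and-prediction pattern just described can be sketched with an off-the-shelf masked language model. The snippet below is a minimal illustration only; the model name, example sentence, and candidate labels are assumptions for demonstration and are not specified by this application:

    # Minimal sketch of prompt-style masked prediction (illustrative only).
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # Concatenate the sample with a natural language prompt containing a [MASK] slot,
    # then let the frozen pre-trained model score candidate fillers.
    sample = "A wonderful movie!"
    prompt = f"{sample} This movie is [MASK]."

    for candidate in fill_mask(prompt, targets=["positive", "negative"]):
        print(candidate["token_str"], round(candidate["score"], 4))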
(III) Prior art and its problems: Differential Privacy (DP)
Differential Privacy (DP) is one of the standard methods for addressing model privacy disclosure, owing to its strong and strict privacy guarantees. DP's guarantees, however, do not cover every case: DP cannot effectively protect population-level data, and it incurs a significant loss of model utility as the population size increases. Training DP models is also much slower because of cumbersome hyperparameter tuning and development. In addition, DP's utility loss is far more severe for under-represented populations, potentially causing financial losses and social consequences.
The main current technical challenges of DP are:
1) Hiding individual records is not sufficient to ensure that user privacy is not violated (the private information of a single user may appear in the data of multiple users).
2) The boundaries of private information cannot be precisely delimited.
3) The difference in data granularity between the word and sentence levels leaves DP unable to hide most private information.
Disclosure of Invention
The application provides an LLM model user privacy information protection method based on natural language prompts, which masks the key text information of the input and, through model prediction over the configured masks, generates replacement information with the same semantics, thereby obfuscating the input text and achieving the goal of protecting the privacy of the user's key information.
In order to achieve the above object, the present application provides the following technical solutions:
a LLM model user privacy information protection method based on natural language prompt comprises the following steps:
step 1, constructing a natural language prompt model with a p-training soft template as a core, and automatically generating continuous addresses through natural language prompt of the p-training soft template;
step 2, pre-training and learning natural language prompt are carried out, and the specific process is described as follows:
pre-training process of natural language prompt model: inputting natural language text information, extracting keyword information in the text, implementing shielding by synonymous replacement of the keyword information through a wordNet network model, reconstructing the text information by using a natural language prompt model for the shielded keyword information, and initializing parameters of the natural language prompt model by executing more than one pre-training process;
prompting a learning process of a model: prompting to learn a discrete natural language task instruction which is ready to be trained in advance, and inputting a natural language task to a pre-training language model;
and 3, generating synthetic data by combining the input information and the output information for realizing data enhancement on key information in the input sentence, generating a prompt word P after fine tuning a natural language prompt model, extracting a key word and a label from the prompt word P, and feeding the key word and the label into a pre-training language model to generate new synthetic training data.
Compared with the prior art, the application can achieve the following beneficial technical effects:
1) By combining language-understanding tasks, potential private information in the text is mined more effectively;
2) Prompt engineering realizes the masked-language-model pretext task, so user privacy protection is accomplished without degrading result generation.
Drawings
FIG. 1 is a schematic overall flow diagram of a LLM model user privacy information protection method based on natural language prompt of the present application;
FIG. 2 is a block diagram of a p-tuning soft template;
FIG. 3 is an exemplary diagram of natural language prompts;
FIG. 4 is a block diagram of a data enhancement embodiment.
Detailed Description
The technical scheme will be described in detail below with reference to the accompanying drawings and examples.
For Natural Language Understanding (NLU) tasks, when a user asks a question, the language model identifies the relevant key information from the context (the user's input records) to generate an answer that better matches expectations; because these records often contain some of the user's personal private information, a privacy-disclosure problem exists. Moreover, since the definition of private information is complicated, decisions must be made in combination with natural language understanding, and DP technology, owing to its drawbacks, is insufficient to support broad privacy protection.
Prompt learning is a popular technique that can help a language model better understand the input text. Prompt engineering is used to realize the masked-language-model pretext task: the key text information of the input is masked, and the model predicts, for the configured masks, replacement information with the same semantics, thereby obfuscating the input text and protecting the privacy of the user's key information.
As shown in FIG. 1, the overall flow of the LLM (Large Language Model) user privacy information protection method based on natural language prompts of the present application includes the following steps:
step 1, constructing a natural language prompting model with a p-prompting soft template as a core, and realizing automatic generation of a natural language prompting continuous address through the p-prompting soft template;
as shown in FIG. 3, in Natural Language Understanding (NLU) tasks, natural language hints (Prompts) [ P ] 0 ],...,[P i ]、[P i+1 ],...,[P m ]Coding, namely prompting P of ith natural language in template T i Regarding as token, obtaining a discrete token set token { X } of input X, mapping a given group of discrete input token set token { X } into e (X) through an embedding layer, taking e (X) as given context information, and forming a template T together with the token set token { X } corresponding to the natural language prompt, wherein the output target of the e (Y) is e (Y).
The p-tuning soft template is $T = \{[P_{0:i}],\, x,\, [P_{i+1:m}],\, y\}$; $T$ is first mapped to $\{e([P_{0:i}]),\, e(x),\, e([P_{i+1:m}]),\, e(y)\}$ and then to $\{h_0, \dots, h_i,\, e(x),\, h_{i+1}, \dots, h_m,\, e(y)\}$, where $h_i$ denotes a trainable embedding tensor. The p-tuning soft template is the template type to be constructed by the natural language prompt model.
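The mapping above can be sketched as a trainable prompt-embedding module in PyTorch. This is a minimal illustration under assumed names and dimensions; the vocabulary size, embedding width, and prompt lengths below are arbitrary, and a real system would feed the concatenated sequence into a frozen pre-trained language model:

    # Sketch of a p-tuning-style soft template: trainable tensors h_0..h_m around e(x).
    import torch
    import torch.nn as nn

    class SoftTemplate(nn.Module):
        def __init__(self, vocab_size=30522, embed_dim=768, n_prefix=8, n_suffix=8):
            super().__init__()
            self.word_embed = nn.Embedding(vocab_size, embed_dim)   # e(.): word embeddings
            self.word_embed.weight.requires_grad = False            # kept frozen
            # h_0..h_i and h_{i+1}..h_m: trainable continuous prompt tensors
            self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)
            self.suffix = nn.Parameter(torch.randn(n_suffix, embed_dim) * 0.02)

        def forward(self, input_ids):
            e_x = self.word_embed(input_ids)                 # e(x) for the tokens of X
            batch = input_ids.size(0)
            pre = self.prefix.unsqueeze(0).expand(batch, -1, -1)
            suf = self.suffix.unsqueeze(0).expand(batch, -1, -1)
            # {h_0..h_i, e(x), h_{i+1}..h_m}, to be fed to the frozen pre-trained LM
            return torch.cat([pre, e_x, suf], dim=1)

    template = SoftTemplate()
    ids = torch.randint(0, 30522, (2, 16))                   # dummy batch of token ids
    print(template(ids).shape)                               # (2, 8 + 16 + 8, 768)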
Step 2, pre-training and learning the natural language prompts (Prompts); the specific process is as follows:
1) Pre-training process of the natural language prompt model: natural language text is input and keyword information is extracted from it; the keywords are replaced with synonyms through the WordNet network (a sketch of this synonym replacement is given after item 2 below); the natural language prompt model then reconstructs the text information from the replaced keywords; and the parameters of the natural language prompt model are initialized by executing this pre-training process more than once;
2) Learning process of the prompt model: prompt learning prepares the discrete natural language task instructions to be trained in advance, inputs the natural language task into the pre-trained language model, and updates the parameters of the natural language prompts using the natural language prompt model while keeping all parameters of the pre-trained language model (PLM) fixed. As shown in FIG. 2, the parameters of the sample-encoder section are those of the pre-trained model.
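The synonym-masking step in the pre-training process above can be illustrated with WordNet through NLTK. This is a hedged sketch of the idea only; it assumes the NLTK wordnet corpus has been downloaded (e.g. via nltk.download("wordnet")), and the keywords are hypothetical:

    # Sketch of WordNet-based synonym replacement used to mask extracted keywords.
    from nltk.corpus import wordnet as wn

    def synonym_replace(word):
        """Return a WordNet synonym of `word`, or the word itself if none exists."""
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                candidate = lemma.name().replace("_", " ")
                if candidate.lower() != word.lower():
                    return candidate
        return word

    # Mask keywords by same-meaning replacement before text reconstruction.
    keywords = ["physician", "salary"]
    print({w: synonym_replace(w) for w in keywords})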
Step 3, for the key information in the input sentence, generating synthetic data by combining the input and output information to achieve data augmentation, which specifically comprises the following:
the method comprises the steps of generating synthetic training data by using a pre-training language model (PLM) based on a seq2seq transducer architecture, wherein the pre-training language model based on the seq2seq transducer architecture comprises defined time steps, outputting a background variable through an encoder, and encoding information of an input sequence as a hiding state of the last time step.
To fit the model-training process, training data are synthesized by the data-augmentation method: whenever the data are insufficient in quantity or poor in quality, additional training data are synthesized by data augmentation and then input into the pre-trained language model;
setting upAs natural language hint at layer j, the i hidden state of layer j in the pre-trained language model based on the seq2seq Transformer architecture ≡>The following formula is shown:
wherein Trans () represents the forward function of the transducer layer, w i The fixed word embedding vector representing the input layer,an ith natural language hint template representing a jth layer;
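The per-layer prompt recurrence can be roughly illustrated in PyTorch by injecting trainable prompts in front of each layer's input. The layer count, prompt length, dimensions, and the simple encoder-only stack below are assumptions for illustration, not the application's actual architecture:

    # Sketch of layer-wise trainable prompts: h^(j) = Trans([p^(j); h^(j-1)]).
    import torch
    import torch.nn as nn

    n_layers, n_prompts, d_model = 4, 8, 256

    layers = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        for _ in range(n_layers)
    )
    # p^(j): one trainable prompt tensor per layer
    layer_prompts = nn.ParameterList(
        nn.Parameter(torch.randn(n_prompts, d_model) * 0.02) for _ in range(n_layers)
    )

    def forward_with_prompts(w):                       # w: fixed input embeddings h^(0)
        h = w
        for layer, p in zip(layers, layer_prompts):
            p_batch = p.unsqueeze(0).expand(h.size(0), -1, -1)
            h = layer(torch.cat([p_batch, h], dim=1))  # Trans([p^(j); h^(j-1)])
            h = h[:, n_prompts:, :]                    # keep the token positions only
        return h

    out = forward_with_prompts(torch.randn(2, 16, d_model))
    print(out.shape)                                   # (2, 16, 256)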
updating gradient of each layer in the transducer model is realized through gradient feedback, so that a learning task is better completed;
however, finding a suitable discrete natural language task (generation) is not easy to introduce, and optimizing in an end-to-end fashion requires additional labor. Thus, freezing parameters of the natural language hint model helps generalize model parameters during training;
after fine tuning the natural language hint model, a hint word P is generated by the natural language hint model, and then keywords and labels are extracted from the hint word P, and these new keywords and labels are fed into the pre-trained language model to generate new synthetic data.
As shown in fig. 4, after fine tuning the natural language model, a hint word T is generated by the natural language hint model, and then keywords and tags are extracted from the hint word T, and these new keywords and tags are fed into the pre-trained language model to generate new synthetic data. The output text thus obtained remains of higher diversity.
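A schematic sketch of this keyword-and-label-conditioned generation step is given below, using a small seq2seq model from the transformers library. The model choice, prompt format, and the keywords and label are illustrative assumptions rather than the application's specifics:

    # Sketch of prompt-driven synthetic-data generation from keywords and a label.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Keywords and a label extracted from the generated prompt word P (hypothetical).
    keywords, label = ["appointment", "clinic"], "schedule_request"
    prompt = f"generate sentence: keywords: {', '.join(keywords)} label: {label}"

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # one synthetic sample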
The foregoing is merely an embodiment of the present application and is not intended to limit its scope. Any modification, equivalent substitution, or improvement made within the spirit and principles of the application shall be covered by the scope of protection of the application.

Claims (3)

1. An LLM model user privacy information protection method based on natural language prompts, characterized by comprising the following steps:
step 1, constructing a natural language prompt model with a p-tuning soft template as its core, and automatically generating continuous natural language prompts through the p-tuning soft template;
step 2, pre-training and learning the natural language prompts, the specific process being as follows:
pre-training process of the natural language prompt model: inputting natural language text, extracting keyword information from the text, implementing masking by replacing the keywords with synonyms through the WordNet network, reconstructing the text information from the masked keywords using the natural language prompt model, and initializing the parameters of the natural language prompt model by executing this pre-training process more than once;
learning process of the natural language prompt model: prompt learning prepares the discrete natural language task instructions to be trained in advance, and the natural language task is input into the pre-trained language model;
and step 3, for the key information in the input sentence, generating synthetic data by combining the input and output information to achieve data augmentation: after the natural language prompt model is fine-tuned, a prompt word P is generated, keywords and labels are extracted from the prompt word P, and the keywords and labels are fed into the pre-trained language model to generate new synthetic training data.
2. The LLM model user privacy information protection method based on natural language prompt as set forth in claim 1, wherein the p-tuning soft template comprises:
in natural language understanding tasks, the natural language prompts $[P_0],\dots,[P_i],[P_{i+1}],\dots,[P_m]$ are encoded: the $i$-th natural language prompt $P_i$ in the template $T$ is treated as a token; for an input $X$, its discrete token set $\{X\}$ is obtained; the given set of discrete input tokens $\{X\}$ is mapped through an embedding layer to $e(X)$, which serves as the given context information and, together with the tokens corresponding to the natural language prompts, forms the template $T$, whose output target is $e(Y)$;
the p-tuning soft template is $T = \{[P_{0:i}],\, x,\, [P_{i+1:m}],\, y\}$; $T$ is first mapped to $\{e([P_{0:i}]),\, e(x),\, e([P_{i+1:m}]),\, e(y)\}$ and then to $\{h_0, \dots, h_i,\, e(x),\, h_{i+1}, \dots, h_m,\, e(y)\}$, where $h_i$ denotes a trainable embedding tensor.
3. The LLM model user privacy information protection method based on natural language prompts as set forth in claim 1, wherein the generation of the synthetic training data comprises:
generating the synthetic training data using a pre-trained language model based on the seq2seq Transformer architecture;
setting upAs natural language hint at layer j, the i hidden state of layer j in the pre-trained language model based on the seq2seq Transformer architecture ≡>The following formula is shown:
wherein Trans () represents the forward function of the transducer layer, w i The fixed word embedding vector representing the input layer,an ith natural language hint template representing a jth layer;
and updating the gradient of each layer in the transducer model through gradient feedback, so as to complete the learning task.
CN202311042595.2A 2023-08-17 2023-08-17 LLM language user privacy information protection method based on natural language prompt Pending CN116975927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311042595.2A CN116975927A (en) 2023-08-17 2023-08-17 LLM language user privacy information protection method based on natural language prompt

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311042595.2A CN116975927A (en) 2023-08-17 2023-08-17 LLM language user privacy information protection method based on natural language prompt

Publications (1)

Publication Number Publication Date
CN116975927A 2023-10-31

Family

ID=88485088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311042595.2A Pending CN116975927A (en) 2023-08-17 2023-08-17 LLM language user privacy information protection method based on natural language prompt

Country Status (1)

Country Link
CN (1) CN116975927A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251473A (en) * 2023-11-20 2023-12-19 摩斯智联科技有限公司 Vehicle data query analysis method, system, device and storage medium
CN117251473B (en) * 2023-11-20 2024-03-15 摩斯智联科技有限公司 Vehicle data query analysis method, system, device and storage medium
CN117725610A (en) * 2023-11-23 2024-03-19 中金金融认证中心有限公司 Privacy protection proxy method for third party large language model
CN117521116A (en) * 2024-01-04 2024-02-06 卓世科技(海南)有限公司 Large language model privacy information protection method
CN117521116B (en) * 2024-01-04 2024-04-19 卓世科技(海南)有限公司 Large language model privacy information protection method
CN117972024A (en) * 2024-02-06 2024-05-03 佛山科学技术学院 Automatic selection construction prompting method and system based on reinforcement learning
CN117972024B (en) * 2024-02-06 2024-07-05 佛山科学技术学院 Automatic selection construction prompting method and system based on reinforcement learning
CN118095359A (en) * 2024-04-25 2024-05-28 蚂蚁科技集团股份有限公司 Large language model training method and device for privacy protection, medium and equipment

Similar Documents

Publication Publication Date Title
CN116975927A (en) LLM language user privacy information protection method based on natural language prompt
JP7346609B2 (en) Systems and methods for performing semantic exploration using natural language understanding (NLU) frameworks
Guo et al. Long text generation via adversarial training with leaked information
CN110134968B (en) Poem generation method, device, equipment and storage medium based on deep learning
JP7346610B2 (en) Deriving multiple semantic representations for utterances in a natural language understanding framework
CN109740053B (en) Sensitive word shielding method and device based on NLP technology
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN111401037A (en) Natural language generation method and device, electronic equipment and storage medium
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
CN115906815A (en) Error correction method and device for modifying one or more types of wrong sentences
CN111241843B (en) Semantic relation inference system and method based on composite neural network
Prakash et al. Chatterbot implementation using transfer learning and LSTM encoder-decoder architecture
CN112417118B (en) Dialog generation method based on marked text and neural network
CN114265921A (en) Question-answer knowledge base construction method and device, equipment, medium and product thereof
CN117746186A (en) Training method of low-rank adaptive model, text image generation method and system
CN117725610A (en) Privacy protection proxy method for third party large language model
Oh et al. BERTAC: Enhancing transformer-based language models with adversarially pretrained convolutional neural networks
CN115357720B (en) BERT-based multitasking news classification method and device
CN111400484B (en) Keyword extraction method and system
Gupta A review of generative AI from historical perspectives
CN115589446A (en) Meeting abstract generation method and system based on pre-training and prompting
CN115409078A (en) Sample attack resisting defense method based on integrated reconstruction mechanism
Dasgupta et al. A Review of Generative AI from Historical Perspectives
CN112434143A (en) Dialog method, storage medium and system based on hidden state constraint of GRU (generalized regression Unit)
CN112509559A (en) Audio recognition method, model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination