CN114841142A - Text generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114841142A
Authority
CN
China
Prior art keywords
text
training data
privacy
data set
occurrence probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210431002.0A
Other languages
Chinese (zh)
Inventor
龚笠
吴新维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210431002.0A
Publication of CN114841142A
Pending legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a text generation method, a text generation device, an electronic device, and a storage medium. One embodiment of the method comprises: acquiring a first text; inputting a text vector of the first text into a language model pre-trained using a differential-privacy model training method to obtain a first occurrence probability of each candidate word in a preset candidate word set appearing after the first text; and generating, based on the first occurrence probability of each candidate word, a second text corresponding to the first text in the current application scenario. This embodiment improves the degree of privacy protection of the second text generated in different application scenarios and reduces the risk of leaking information about the training data of the language model used to generate it.

Description

Text generation method and device, electronic equipment and storage medium
Technical Field
Embodiments of the disclosure relate to the technical field of natural language processing, and in particular to a text generation method and device, an electronic device, and a storage medium.
Background
In the field of natural language processing, large-scale language models are widely used. Most language models are pre-trained on a variety of training data. However, because large amounts of corpus data are used during training, there is a risk that information about the training data will leak. In practical applications, a language model therefore risks revealing training data information across its different application scenarios.
Disclosure of Invention
The embodiment of the disclosure provides a text generation method and device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text generation method. The method includes: acquiring a first text; inputting a text vector of the first text into a language model pre-trained using a differential-privacy model training method to obtain a first occurrence probability of each candidate word in a preset candidate word set appearing after the first text, where, during training of the language model with a training data set, the gradient corresponding to each piece of training data in the set is clipped, the noise to be added is determined according to the privacy degree of the sample texts in the set, and that noise is added to the average clipped gradient of the set; and generating, based on the first occurrence probability of each candidate word, a second text corresponding to the first text in the current application scenario.
In some optional embodiments, the method further comprises: presenting the second text.
In some optional embodiments, the current application scenario is a text input scenario or a speech recognition scenario, and the first text is an input text or a recognized text.
In some optional embodiments, the generating, based on the first occurrence probability of each candidate word, of a second text corresponding to the first text in the current application scenario includes: sorting the candidate words in descending order of first occurrence probability; generating a candidate word subset from the candidate words that fall within a first preset higher occurrence-probability range; and forming the second text based on each candidate word in the subset and its first occurrence probability.
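The ranking-and-subset step above can be sketched as follows; the function name, the toy probabilities, and the fixed top-k cutoff (standing in for the "first preset higher occurrence-probability range") are illustrative assumptions, not the patent's exact formulation.

```python
def top_candidates(first_probs, k=2):
    """Sort candidate words by first occurrence probability, descending,
    and keep those falling in a preset higher-probability range (top-k here)."""
    ranked = sorted(first_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Toy first occurrence probabilities for candidate words after some first text:
probs = {"there": 0.41, "world": 0.33, "wide": 0.18, "war": 0.08}
print(top_candidates(probs))  # [('there', 0.41), ('world', 0.33)]
```

The second text can then be formed from the retained words together with their probabilities, e.g. as a ranked candidate list shown to the user.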
In some optional embodiments, the current application scenario is an auxiliary writing scenario and the first text is an input text; and the generating, based on the first occurrence probability of each candidate word, of a second text corresponding to the first text in the current application scenario includes: splicing the candidate word with the highest first occurrence probability after the first text to form a spliced text; performing a preset number of the following splicing operations: inputting the text vector of the spliced text into the language model to obtain a second occurrence probability of each candidate word appearing after the spliced text, and splicing the candidate word with the highest second occurrence probability after the spliced text; and generating the second text from the portion of the spliced text other than the first text.
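The splicing operations above amount to a greedy decoding loop. A minimal sketch, assuming a `predict_probs` callable that stands in for the trained language model (it maps a text to candidate-word probabilities; the toy model below is purely illustrative):

```python
def generate_continuation(first_text, predict_probs, num_steps=3):
    """Greedy splicing loop: repeatedly append the candidate word with the
    highest occurrence probability after the current spliced text, then
    return everything except the original first text."""
    spliced = first_text
    for _ in range(num_steps):
        probs = predict_probs(spliced)    # language model stand-in
        best = max(probs, key=probs.get)  # highest-probability candidate word
        spliced = spliced + best          # splice it after the spliced text
    return spliced[len(first_text):]      # drop the first text

# Toy stand-in for the trained model: fixed candidate probabilities.
def toy_model(text):
    return {"a": 0.6, "b": 0.4}

print(generate_continuation("x", toy_model))  # aaa
```

A real implementation would re-vectorize the spliced text at each step before feeding it to the model; the string concatenation here is the simplest stand-in for that.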
In some optional embodiments, the current application scenario is a question-and-answer scenario and the first text is a question text; and the generating, based on the first occurrence probability of each candidate word, of a second text corresponding to the first text in the current application scenario includes: sorting the candidate words in descending order of first occurrence probability; generating a reply candidate word set from the candidate words that fall within a second preset higher occurrence-probability range; determining the reply text corresponding to each reply candidate word according to a preset correspondence between reply keywords and reply texts; and generating the second text based on the reply texts corresponding to the reply candidate words.
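The question-and-answer branch can be sketched as a ranking step followed by a lookup in a preset keyword-to-reply correspondence; the function name and the dictionary contents below are hypothetical illustrations:

```python
def answer_question(first_probs, keyword_to_reply, k=2):
    """Rank candidate words by first occurrence probability, keep a preset
    higher-probability range (top-k here) as reply candidates, and map each
    to a reply text via the preset keyword-to-reply correspondence."""
    ranked = sorted(first_probs.items(), key=lambda kv: kv[1], reverse=True)
    reply_candidates = [w for w, _ in ranked[:k]]
    return [keyword_to_reply[w] for w in reply_candidates if w in keyword_to_reply]
```

The returned reply texts can then be combined or presented as the second text.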
In some alternative embodiments, the language model is pre-trained by the following training steps: acquiring an initial language model and at least one training data set, where each piece of training data comprises a sample text, a sample text vector, and a labeled occurrence probability for each candidate word; performing the following parameter adjustment operation on the training data sets in the at least one training data set until a preset training end condition is met, the parameter adjustment operation comprising: adjusting preset random noise according to the privacy weight of the training data set to obtain privacy-adjusted random noise; clipping the gradient corresponding to each piece of training data in the training data set to obtain the corresponding clipped gradients, and determining the average clipped gradient of the training data set; adding the privacy-adjusted random noise to the average clipped gradient to obtain a noise gradient for the training data set; and adjusting the model parameters of the initial language model with a preset gradient descent optimization algorithm based on that noise gradient; and determining the resulting initial language model as the pre-trained language model.
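The parameter adjustment operation above resembles differentially private SGD: per-example gradient clipping, averaging, privacy-weighted Gaussian noise, and a descent step. The sketch below makes the noise scale proportional to the training set's privacy weight and the clipping norm, which is one plausible reading of the description rather than the patent's exact formula:

```python
import numpy as np

def dp_parameter_update(params, per_example_grads, privacy_weight,
                        clip_norm=1.0, base_sigma=1.0, lr=0.1, rng=None):
    """One parameter adjustment operation (a sketch): clip each per-example
    gradient to clip_norm, average, add Gaussian noise whose scale is
    adjusted by the training set's privacy weight, then take a descent step."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g / max(1.0, norm / clip_norm))  # per-example clipping
    avg = np.mean(clipped, axis=0)                      # average clipped gradient
    scale = base_sigma * privacy_weight * clip_norm / len(per_example_grads)
    noisy_grad = avg + rng.normal(0.0, scale, size=avg.shape)  # noise gradient
    return params - lr * noisy_grad                     # gradient descent step

# With base_sigma=0 (no noise): [3, 4] has norm 5, clips to [0.6, 0.8],
# and the step from [0, 0] with lr=0.1 yields [-0.06, -0.08].
updated = dp_parameter_update(np.array([0.0, 0.0]), [np.array([3.0, 4.0])],
                              privacy_weight=1.0, base_sigma=0.0)
```

A higher privacy weight (more private sample texts) thus injects more noise, matching the stated goal of protecting private text while leaving non-private text largely unaffected.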
In some optional embodiments, before the preset random noise is adjusted according to the privacy weight of the training data set, the parameter adjustment operation further includes: for each piece of training data in the training data set, performing the following privacy degree calculation operation: inputting the sample text vector of the training data into the initial language model to obtain the predicted occurrence probability of each candidate word appearing after the sample text in the training data, and determining, based on those predicted occurrence probabilities, a privacy degree characterizing how private the sample text is; and then determining, based on the privacy degrees corresponding to the individual pieces of training data, a privacy weight characterizing the overall privacy degree of the sample texts in the training data set.
In some optional embodiments, the determining, based on the privacy degree corresponding to each piece of training data in the training data set, a privacy weight of the training data set for characterizing an overall privacy degree of each sample text in the training data set includes: and determining the average value of the privacy degrees corresponding to the training data in the training data set as the privacy weight of the training data set.
In some optional embodiments, the clipping of the gradient corresponding to each piece of training data in the training data set to obtain the corresponding clipped gradient includes: for each piece of training data in the training data set, performing the following gradient clipping operation: determining a loss function value between the predicted occurrence probability and the labeled occurrence probability of each candidate word corresponding to the training data; determining the gradient corresponding to the training data based on that loss function value; and clipping that gradient according to a preset gradient clipping norm to obtain the clipped gradient corresponding to the training data.
In some optional embodiments, the adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise includes: and adjusting the preset random noise according to the privacy weight of the training data set and the preset gradient clipping norm to obtain the privacy-adjusted random noise of the training data set.
In some optional embodiments, the determining, based on the obtained predicted occurrence probabilities of the candidate words, of the privacy degree characterizing how private the sample text in the training data is includes: determining, from those predicted occurrence probabilities, the perplexity of the initial language model when predicting the sample text in the training data; and determining the privacy degree corresponding to the training data from that perplexity, where the privacy degree is positively and linearly correlated with the perplexity of the initial language model on the sample text.
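Perplexity can be computed from the probabilities the model assigned to each actual next token of the sample text, and the privacy degree then taken as a positive linear function of it. The coefficients `a` and `b` below are illustrative assumptions (the patent only requires positive linear correlation):

```python
import math

def perplexity(token_probs):
    """Perplexity of the model on a sample text, computed from the predicted
    probabilities it assigned to each actual next token of that text."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def privacy_degree(token_probs, a=1.0, b=0.0):
    """Privacy degree positively and linearly correlated with perplexity;
    a and b are hypothetical coefficients, not taken from the patent."""
    return a * perplexity(token_probs) + b
```

Intuitively, a sample text the model finds surprising (high perplexity) is treated as more private, so its training set receives a larger privacy weight and hence more noise.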
In a second aspect, an embodiment of the present disclosure provides a text generation apparatus, including: an acquisition unit configured to acquire a first text; an input unit configured to input a text vector of the first text into a language model pre-trained using a differential-privacy model training method to obtain a first occurrence probability of each candidate word in a preset candidate word set appearing after the first text, where, during training of the language model with a training data set, the gradient corresponding to each piece of training data in the set is clipped, the noise to be added is determined according to the privacy degree of the sample texts in the set, and that noise is added to the average clipped gradient of the set; and a generating unit configured to generate, based on the first occurrence probability of each candidate word, a second text corresponding to the first text in the current application scenario.
In some optional embodiments, the apparatus further comprises: a presentation unit configured to present the second text.
In some optional embodiments, the current application scenario is a text input scenario or a speech recognition scenario, and the first text is an input text or a recognized text.
In some optional embodiments, the generating unit is further configured to: sort the candidate words in descending order of first occurrence probability; generate a candidate word subset from the candidate words that fall within a first preset higher occurrence-probability range; and form the second text based on each candidate word in the subset and its first occurrence probability.
In some optional embodiments, the current application scenario is an auxiliary writing scenario and the first text is an input text; and the generating unit is further configured to: splice the candidate word with the highest first occurrence probability after the first text to form a spliced text; perform a preset number of the following splicing operations: inputting the text vector of the spliced text into the language model to obtain a second occurrence probability of each candidate word appearing after the spliced text, and splicing the candidate word with the highest second occurrence probability after the spliced text; and generate the second text from the portion of the spliced text other than the first text.
In some optional embodiments, the current application scenario is a question-and-answer scenario and the first text is a question text; and the generating unit is further configured to: sort the candidate words in descending order of first occurrence probability; generate a reply candidate word set from the candidate words that fall within a second preset higher occurrence-probability range; determine the reply text corresponding to each reply candidate word according to a preset correspondence between reply keywords and reply texts; and generate the second text based on the reply texts corresponding to the reply candidate words.
In some alternative embodiments, the language model is pre-trained by the following training steps: acquiring an initial language model and at least one training data set, where each piece of training data comprises a sample text, a sample text vector, and a labeled occurrence probability for each candidate word; performing the following parameter adjustment operation on the training data sets in the at least one training data set until a preset training end condition is met, the parameter adjustment operation comprising: adjusting preset random noise according to the privacy weight of the training data set to obtain privacy-adjusted random noise; clipping the gradient corresponding to each piece of training data in the training data set to obtain the corresponding clipped gradients, and determining the average clipped gradient of the training data set; adding the privacy-adjusted random noise to the average clipped gradient to obtain a noise gradient for the training data set; and adjusting the model parameters of the initial language model with a preset gradient descent optimization algorithm based on that noise gradient; and determining the resulting initial language model as the pre-trained language model.
In some optional embodiments, before the preset random noise is adjusted according to the privacy weight of the training data set, the parameter adjustment operation further includes: for each piece of training data in the training data set, performing the following privacy degree calculation operation: inputting the sample text vector of the training data into the initial language model to obtain the predicted occurrence probability of each candidate word appearing after the sample text in the training data, and determining, based on those predicted occurrence probabilities, a privacy degree characterizing how private the sample text is; and then determining, based on the privacy degrees corresponding to the individual pieces of training data, a privacy weight characterizing the overall privacy degree of the sample texts in the training data set.
In some optional embodiments, the determining, based on the privacy degree corresponding to each piece of training data in the training data set, a privacy weight of the training data set for characterizing an overall privacy degree of each sample text in the training data set includes: and determining the average value of the privacy degrees corresponding to the training data in the training data set as the privacy weight of the training data set.
In some optional embodiments, the clipping of the gradient corresponding to each piece of training data in the training data set to obtain the corresponding clipped gradient includes: for each piece of training data in the training data set, performing the following gradient clipping operation: determining a loss function value between the predicted occurrence probability and the labeled occurrence probability of each candidate word corresponding to the training data; determining the gradient corresponding to the training data based on that loss function value; and clipping that gradient according to a preset gradient clipping norm to obtain the clipped gradient corresponding to the training data.
In some optional embodiments, the adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise includes: and adjusting the preset random noise according to the privacy weight of the training data set and the preset gradient clipping norm to obtain the privacy-adjusted random noise of the training data set.
In some optional embodiments, the determining, based on the obtained predicted occurrence probabilities of the candidate words, of the privacy degree characterizing how private the sample text in the training data is includes: determining, from those predicted occurrence probabilities, the perplexity of the initial language model when predicting the sample text in the training data; and determining the privacy degree corresponding to the training data from that perplexity, where the privacy degree is positively and linearly correlated with the perplexity of the initial language model on the sample text.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect.
To reduce the risk of leaking training data information in existing text generation, in a scenario where a language model is applied, a first text is obtained; a text vector of the first text is then input into a language model pre-trained using a differential-privacy model training method to obtain a first occurrence probability of each candidate word in a preset candidate word set appearing after the first text, where, during training of the language model with a training data set, the gradient corresponding to each piece of training data in the set is clipped, the noise to be added is determined according to the privacy degree of the sample texts in the set, and that noise is added to the average clipped gradient of the set; finally, a second text corresponding to the first text in the current application scenario is generated based on the first occurrence probability of each candidate word. Because the language model is pre-trained with the differential-privacy model training method, and the magnitude of the added noise is adjusted during training according to the privacy degree of the sample texts in the training data, the risk that the language model reveals training data in a specific application scenario is reduced, protecting private text while allowing the language model to use non-private text normally.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a text generation method according to the present disclosure;
FIG. 3 is an exploded flow diagram for one embodiment of step 203 according to the present disclosure;
FIG. 4 is an exploded flow diagram of yet another embodiment of step 203 according to the present disclosure;
FIG. 5 is an exploded flow diagram of another embodiment of step 203 according to the present disclosure;
FIG. 6 is a flow chart of one embodiment of training steps according to the present disclosure;
FIG. 7 is a schematic structural diagram of one embodiment of a text generation apparatus according to the present disclosure;
FIG. 8 is a schematic structural diagram of a computer system suitable for use with the electronic device used to implement embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the text generation method, apparatus, electronic device, and storage medium of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a text editing application, an auxiliary writing application, an intelligent question and answer application, a natural language processing application, a voice recognition application, a short video social application, an audio and video conference application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a sound collecting device (e.g., a microphone), a video collecting device (e.g., a camera), and a display screen, including but not limited to a smartphone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide text generation services) or as a single piece of software or software module. No specific limitation is made here.
In some cases, the text generation method provided by the present disclosure may be executed by the terminal devices 101, 102, 103, and accordingly, the text generation apparatus may be provided in the terminal devices 101, 102, 103. In this case, the system architecture 100 may not include the server 105.
In some cases, the text generation method provided by the present disclosure may be performed jointly by the terminal devices 101, 102, 103 and the server 105. For example, the step of "acquiring a first text" may be performed by the terminal devices 101, 102, 103, while the step of inputting the text vector of the first text into the language model pre-trained using the differential-privacy model training method may be performed by the server 105. The present disclosure is not limited in this respect. Accordingly, parts of the text generation apparatus may be provided in the terminal devices 101, 102, and 103 and in the server 105, respectively.
In some cases, the text generation method provided by the present disclosure may be executed by the server 105, and accordingly, the text generation apparatus may also be disposed in the server 105, and in this case, the system architecture 100 may also not include the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a text generation method according to the present disclosure is shown, the text generation method comprising the steps of:
step 201, a first text is obtained.
In this embodiment, an execution subject of the text generation method (for example, the terminal device shown in fig. 1) may acquire the first text by using a corresponding method according to a difference of a specific current application scenario.
In some alternative embodiments, the executing entity (e.g., the terminal device shown in fig. 1) may obtain the first text locally.
In some optional embodiments, the execution subject (e.g., the server shown in fig. 1) may also remotely obtain the first text from other electronic devices (e.g., the terminal devices shown in fig. 1) connected to it through a network.
In some alternative embodiments, the first text may be text that the user has already entered on the terminal device. For example, the first text may be on-screen text (e.g., text displayed in the input field of an input method) that the user has typed with an input method application installed on the terminal device but has not yet confirmed. As another example, the first text may be text that the user has both entered and confirmed with such an input method application.
In some optional embodiments, the first text may also be text that has already been recognized during speech recognition of certain speech data.
Step 202, inputting a text vector of the first text into a language model pre-trained using a differential privacy model training method, to obtain a first occurrence probability of each of the different candidate words in a preset candidate word set appearing after the first text.
In this embodiment, the executing entity may first perform vector representation on the first text to obtain a text vector of the first text. And then, inputting the text vector of the first text into a language model trained in advance by using a differential privacy model training method to obtain a first occurrence probability of different candidate words in a preset candidate word set after the first text.
It should be noted that how to perform vector representation on text is a widely researched and applied existing technology, and is not particularly limited here. For example, the text may first be subjected to word segmentation processing to obtain a word segmentation sequence; then each word in the word segmentation sequence may be represented as a vector by any of various word vector representation methods, yielding a word vector sequence; finally, the mean vector of the word vectors in the word vector sequence may be computed to obtain the text vector. Various methods may be used to represent each word as a vector. For example, a One-Hot method or a distributed representation method may be employed, where the distributed representation method may include a matrix-based distributed representation, a cluster-based distributed representation, a neural-network-based distributed representation, and the like. Of course, existing word vector representation tools, such as GloVe, word2vec, fastText, or WordRank, may also be employed in practice.
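As an illustration of the averaging approach described above, the following sketch computes a text vector as the mean of its word vectors. The toy word-vector table is purely hypothetical; in practice the vectors would come from a trained tool such as word2vec, GloVe, or fastText.

```python
import numpy as np

# Hypothetical toy word-vector table; real vectors would come from a
# trained word-embedding tool.
WORD_VECTORS = {
    "hello": np.array([0.2, 0.4, 0.1]),
    "world": np.array([0.6, 0.0, 0.3]),
}

def text_vector(words):
    """Mean of the word vectors of a segmented text, as described above."""
    vecs = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    return np.mean(vecs, axis=0)

v = text_vector(["hello", "world"])
print(v)  # element-wise mean of the two word vectors: [0.4 0.2 0.2]
```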
Here, the language model may be various models trained using text data and used for representing a correspondence between a text vector of the text and an occurrence probability of each candidate word in the preset candidate word set after the text.
For example, the language model may be an N-Gram model, BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-Training), and the like.
In the process of training the language model, a differential privacy model training method may be adopted in advance. Specifically, when the language model is trained with a training data set (which may also be understood as a batch of training data), the average gradient of the training data in the batch is not directly back-propagated to optimize the model parameters of the language model. Instead, the following is performed: in the training process of each batch of training data, gradient clipping is performed on the gradient corresponding to each training data in the batch to obtain a clipped gradient; the noise to be added is determined according to the privacy degree of the sample texts in the batch of training data; the noise to be added is then added to the average clipped gradient corresponding to the batch; finally, the noise-added average clipped gradient of the batch is back-propagated using a parameter optimization method to update the model parameters. In order to protect the privacy information of the sample text in the training data, when determining the noise to be added, a higher noise amplitude is determined for sample texts with a higher privacy degree, so that the language model learns little from such texts, improving the protection the language model affords them; conversely, a lower noise amplitude is determined for sample texts with a lower privacy degree, so that the language model can readily learn from them, improving the degree to which the language model learns from low-privacy sample texts. In this way, private texts are protected while non-private texts are used normally. In addition, since random noise is added to the gradient in this process, the risk that the language model leaks training data information can be reduced.
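The batch-level procedure described above (per-example clipping, averaging, then privacy-weighted noise) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation; all names are illustrative, and the clipping rule and noise scaling follow the formulas detailed later in this disclosure.

```python
import numpy as np

def dp_batch_gradient(per_example_grads, privacy_weight, clip_norm, rng):
    """One noisy-gradient computation for a batch: clip each per-example
    gradient, average the clipped gradients, then add Gaussian noise whose
    amplitude grows with the batch's privacy weight."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g / max(1.0, norm / clip_norm))  # bound the L2 norm
    avg = np.mean(clipped, axis=0)
    # Higher privacy weight -> louder noise -> less learned from the batch.
    noise = privacy_weight * clip_norm * rng.normal(size=avg.shape)
    return avg + noise

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
noisy = dp_batch_gradient(grads, privacy_weight=0.5, clip_norm=1.0, rng=rng)
```

With a privacy weight of zero the function degenerates to the plain average clipped gradient, which makes the role of the weight easy to see.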
And step 203, generating a second text corresponding to the first text in the current application scene based on the first occurrence probability of each candidate word.
In this embodiment, the execution subject may generate, according to different specific current application scenarios, a second text corresponding to the first text in the current application scenario by using a corresponding method based on the first occurrence probability of each candidate word.
In some optional embodiments, the current application scenario is a text input scenario or a speech recognition scenario. In the text input scenario, the first text is entered text; in the speech recognition scenario, the first text is recognized text. Accordingly, step 203 may include steps 2031a to 2033a as shown in fig. 3:
step 2031a, sorting the candidate words according to the order of the first occurrence probability from large to small.
Step 2032a, generating a candidate word subset from the candidate words ranked within the first preset higher occurrence probability range among the candidate words.
Here, the first preset higher occurrence probability range may correspond to the candidate words ranked before a first preset ratio among the candidate words, for example, the candidate words whose first occurrence probability ranks in the top 5% of all candidate words. Accordingly, the candidate word subset may be composed of the candidate words whose first occurrence probability is ranked before the first preset ratio.

The first preset higher occurrence probability range may also correspond to the candidate words ranked at or before a first preset position among the candidate words, for example, the candidate words whose first occurrence probability ranks in the top three. Accordingly, the candidate word subset may be composed of the candidate words whose first occurrence probability is ranked at or before the first preset position.
In summary, the candidate word subset is composed of candidate words with a higher first occurrence probability in the preset candidate word set.
Step 2033a, a second text is formed based on each candidate word in the subset of candidate words and the corresponding first probability of occurrence.
For example, the second text may be composed of N second sub-texts, where N may be the number of candidate words in the candidate word subset. Each second sub-text may correspond to a candidate word in the candidate word subset. For example, a second sub-text may include only the candidate word. As another example, a second sub-text may also include the ranking number of the first occurrence probability corresponding to the candidate word in the candidate word subset together with the corresponding candidate word. In addition, the second sub-texts in the second text may also be arranged in the order of the first occurrence probabilities of the corresponding candidate words in the candidate word subset.
For a text input scenario, in the above manner, the words with a higher first occurrence probability can be ranked in front, which is convenient for the user to select and improves the user's text input efficiency.

For a speech recognition scenario, in the above manner, the words with a higher first occurrence probability can be ranked in front, which facilitates the selection among homophones or near-homophones during speech recognition and improves speech recognition accuracy.
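Steps 2031a to 2032a can be sketched as follows, assuming the first occurrence probabilities are given as a simple word-to-probability mapping (a hypothetical interface used only for illustration):

```python
def candidate_subset(probs, top_ratio=None, top_k=None):
    """Rank candidate words by first occurrence probability (descending)
    and keep either the top `top_ratio` fraction (first preset ratio) or
    the top `top_k` words (first preset position)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is None:
        top_k = max(1, int(len(ranked) * top_ratio))
    return [w for w, _ in ranked[:top_k]]

probs = {"cat": 0.1, "dog": 0.5, "fox": 0.3, "emu": 0.1}
print(candidate_subset(probs, top_k=2))  # ['dog', 'fox']
```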
In some alternative embodiments, the current application scenario may be an assisted composition scenario, the first text being entered text. Accordingly, step 203 may include steps 2031b to 2033b as shown in fig. 4:
Step 2031b, the candidate word with the highest first occurrence probability among the candidate words is spliced after the first text to form a spliced text.

After step 2031b, the spliced text consists of the first text followed by the candidate word with the highest first occurrence probability.
Step 2032b, a preset number of splicing operations are performed.
Here, the splicing operation may include:
First, inputting the text vector of the spliced text into the language model to obtain a second occurrence probability of each candidate word appearing after the spliced text.

Second, splicing the candidate word with the highest second occurrence probability among the candidate words after the spliced text.
It can be understood that after each splicing operation, the spliced text is extended, on its previous basis, by the candidate word with the highest second occurrence probability.
Assume the preset number is a positive integer M. After step 2031b, the spliced text has one candidate word appended to the first text; after step 2032b, the spliced text has (1 + M) candidate words appended to the first text. Assisted writing is thus realized on the basis of the first text, which can reduce the user's writing burden and improve the user's writing efficiency.
Step 2033b, generating a second text by using the texts except the first text in the spliced text.
That is, here, the second text is generated from the text spliced after the first text in step 2031b and step 2032b.
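Steps 2031b to 2033b amount to a greedy decoding loop, which may be sketched as follows. The `predict_probs` callable stands in for the language model, and the toy model below is purely illustrative:

```python
def assisted_compose(first_text, predict_probs, m):
    """Greedy continuation: repeatedly append the candidate word with the
    highest occurrence probability; perform 1 + M append steps in total
    (step 2031b plus M splicing operations), then return only the part
    spliced after the first text (step 2033b)."""
    spliced = list(first_text)
    for _ in range(1 + m):
        probs = predict_probs(spliced)
        best = max(probs, key=probs.get)
        spliced.append(best)
    return spliced[len(first_text):]

# Toy "model": always favours the next word of a fixed phrase.
phrase = ["good", "morning", "to", "you"]
def toy_model(prefix):
    nxt = phrase[len(prefix)] if len(prefix) < len(phrase) else "."
    return {w: (0.9 if w == nxt else 0.1 / 3) for w in set(phrase) | {"."}}

print(assisted_compose(["good"], toy_model, m=2))  # ['morning', 'to', 'you']
```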
In some optional embodiments, the current application scenario may be a question-and-answer scenario, and the first text is a question text. Accordingly, step 203 may include steps 2031c to 2034c as shown in fig. 5:
step 2031c, sorting the candidate words according to the order of the first occurrence probability from large to small.
Step 2032c, generating a reply candidate word set from the candidate words ranked within the second preset higher occurrence probability range among the candidate words.
Here, the second preset higher occurrence probability range may correspond to the candidate words ranked before a second preset ratio among the candidate words, for example, the candidate words whose first occurrence probability ranks in the top 5% of all candidate words. Accordingly, the reply candidate word set may be composed of the candidate words whose first occurrence probability is ranked before the second preset ratio.

The second preset higher occurrence probability range may also correspond to the candidate words ranked at or before a second preset position among the candidate words, for example, the candidate words whose first occurrence probability ranks in the top three. Accordingly, the reply candidate word set may be composed of the candidate words whose first occurrence probability is ranked at or before the second preset position.
In summary, the reply candidate word set is composed of candidate words with a higher first occurrence probability in the preset candidate word set.
Step 2033c, determining the reply texts corresponding to the reply candidate words according to the correspondence between the preset reply keywords and the reply texts.
Here, a correspondence table between reply keywords and reply texts may be preset, and the reply text corresponding to each reply candidate word in step 2032c may be looked up according to the correspondence table.
Step 2034c, generating a second text based on the reply text corresponding to each reply candidate word.
For example, the second text may be composed of S second sub-texts, where S may be the number of candidate words in the reply candidate word set. Each second sub-text may correspond to a respective reply candidate word in the reply candidate word set. For example, a second sub-text may include only the reply text corresponding to the reply candidate word. As another example, a second sub-text may also include the ranking number of the first occurrence probability corresponding to the reply candidate word in the reply candidate word set together with the reply text corresponding to that reply candidate word. In addition, the second sub-texts in the second text may also be arranged in the order of the first occurrence probabilities of the corresponding reply candidate words in the reply candidate word set.
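Step 2033c and step 2034c can be sketched as a simple table lookup. The correspondence table and the reply texts below are hypothetical placeholders:

```python
# Hypothetical correspondence table between reply keywords and reply texts;
# a real table would be application-specific.
REPLY_TABLE = {
    "price": "Our pricing page lists all plans.",
    "hours": "We are open 9am-6pm on weekdays.",
}

def build_second_text(reply_candidates):
    """Look up the reply text for each reply candidate word (step 2033c)
    and keep the results in ranking order (step 2034c)."""
    return [REPLY_TABLE[w] for w in reply_candidates if w in REPLY_TABLE]

print(build_second_text(["price", "hours"]))
```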
In some optional embodiments, the above flow 200 may further include the following step 204:
step 204, presenting the second text.
For example, for a text input scenario, an assisted composition scenario, or a question-and-answer scenario, the second text may be displayed in a preset area relative to the first text. For a speech recognition scenario, the second text may be displayed continuously following the display position of the first text.
In some cases, this embodiment may have the following optional implementations:
Alternative embodiment (one): the language model may be pre-trained through a training step 600 as shown in fig. 6; the training step 600 includes the following steps 601 to 603:
step 601, obtaining an initial language model and at least one training data set.
Here, the initial language model may be various models for characterizing correspondence between a text vector of the text and an occurrence probability of each candidate word in the preset candidate word set occurring after the text. This is not a particular limitation of the present application.
Assume the initial language model is Lm, and that a positive integer T of training data sets are obtained in step 601, denoted B_t, where t is a positive integer between 1 and T. Assume each training data set B_t includes a positive integer J of training data, denoted x_j, where j is a positive integer between 1 and J. Assume the preset candidate word set includes a positive integer I of candidate words w_i, where i is a positive integer between 1 and I.

Here, the training data x_j may include a sample text s, the sample text vector v_s of the sample text s, and, for each of the I candidate words w_i, the label occurrence probability y_{i,j} that the candidate word w_i appears after the sample text s.

In practice, the label occurrence probabilities y_{i,j} may be obtained through manual annotation or through statistical analysis of corpus data.
Step 602, performing a parameter adjustment operation on each training data set of the at least one training data set until a preset training end condition is met.

Here, the parameter adjustment operation may be performed on each training data set B_t obtained in step 601 until the preset training end condition is met.
Here, the parameter adjustment operation may include the following steps 6021 to 6024:
Step 6021, adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise.

Here, each training data set B_t has a corresponding privacy weight σ_t for representing the privacy degree of the sample text content in each training data of the training data set. The higher the privacy degree of the sample text content in the training data of B_t, the higher the privacy weight σ_t of B_t; and, to reduce to a greater extent the leakage of B_t by the trained language model, noise of a higher amplitude may be added to the average clipped gradient corresponding to B_t. That is, in step 6021, when the preset random noise is adjusted, the privacy-adjusted random noise may be positively correlated with the privacy weight σ_t of the training data set B_t. Alternatively, step 6021 may be performed according to the following formula (1):

Ñ_t = σ_t · N    (1)

where N is the preset random noise, and Ñ_t is the privacy-adjusted random noise obtained by adjusting the preset random noise N according to the privacy weight σ_t of the training data set.

The privacy weight σ_t of the training data set B_t may be determined in various ways. For example, the privacy weight σ_t may be manually specified based on the privacy degree of the sample text content in each training data of the training data set.

Alternatively, here, the preset random noise N may be any of various random noises, for example, Gaussian noise.
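Assuming Gaussian preset noise, the noise-scaling step of this kind (privacy-adjusted noise positively correlated with the privacy weight) may be sketched as follows; the function name and interface are illustrative:

```python
import numpy as np

def privacy_adjusted_noise(privacy_weight, shape, rng):
    """Scale preset Gaussian noise N by the batch's privacy weight, so that
    higher-privacy batches receive louder noise."""
    preset_noise = rng.normal(loc=0.0, scale=1.0, size=shape)  # the preset N
    return privacy_weight * preset_noise

rng = np.random.default_rng(42)
n_low = privacy_adjusted_noise(0.1, (4,), rng)   # quiet noise
n_high = privacy_adjusted_noise(0.9, (4,), rng)  # loud noise
```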
Step 6022, clipping the gradient corresponding to each training data in the training data set to obtain the corresponding clipping gradient, and determining the average clipping gradient corresponding to the training data set.
Here, assume that training data set B t Training data x in (1) j After inputting the initial language model, the corresponding output result is obtained, i.e. the training data x for prediction is obtained j Each candidate word w appears after the middle sample text i Predicted probability of occurrence of
Figure BDA0003610504780000167
Next, various loss function calculation methods can be used to determine the training data x j Corresponding candidate words w i Predicted probability of occurrence of
Figure BDA0003610504780000168
And label probability of occurrence
Figure BDA0003610504780000169
Value of loss function in between
Figure BDA00036105047800001610
Again, the value of the loss function may be based on the determined value
Figure BDA00036105047800001611
Determining and training data x j Corresponding gradient
Figure BDA00036105047800001612
Here, various gradient clipping methods may be employed on the training data set B t Of (1) training data x j Corresponding gradient
Figure BDA00036105047800001613
Cutting to obtain corresponding cutting gradient
Figure BDA00036105047800001614
Here, the training data x j Gradient of cutting
Figure BDA00036105047800001615
Usually less than or equal to its original gradient
Figure BDA00036105047800001616
To avoid gradient explosions.
Obtaining a training data set B t Of (1) training data x j Gradient of cutting
Figure BDA00036105047800001617
Thereafter, a training data set B may be calculated t Of (1) training data x j Gradient of cutting
Figure BDA00036105047800001618
As training data set B t Average clipping gradient of
Figure BDA00036105047800001619
Step 6023, adding the obtained privacy-adjusted random noise to the average clipped gradient corresponding to the training data set, to obtain the noise gradient corresponding to the training data set.

That is, the privacy-adjusted random noise Ñ_t of the training data set B_t obtained in step 6021 is added to the average clipped gradient ḡ_t corresponding to B_t obtained in step 6022, to obtain the noise gradient g̃_t corresponding to B_t. Specifically, the following formula (2) may be adopted:

g̃_t = ḡ_t + Ñ_t    (2)

In practice, considering the magnitude difference between ḡ_t and Ñ_t (in practice, the gradient g(x_j) corresponding to the training data x_j is of the order of 10^-4; after clipping, the clipped gradient ḡ(x_j) is of an even smaller magnitude, so the average clipped gradient ḡ_t corresponding to B_t is of a small magnitude, while the privacy-adjusted random noise Ñ_t of B_t is of a relatively large magnitude), the noise addition may also be performed using the following formula (3):

g̃_t = ḡ_t + Ñ_t / J    (3)
step 6024, based on the noise gradient corresponding to the training data set, adjusting the model parameters of the initial language model by adopting a preset gradient descent optimization algorithm.
For example, Gradient Descent (GD) optimization algorithms including, but not limited to, the following may be employed: Batch Gradient Descent (BGD), Mini-Batch Gradient Descent (MBGD), Stochastic Gradient Descent (SGD), Gradient Descent with Momentum (GDM), Nesterov Accelerated Gradient (NAG), the RMSProp (Root Mean Square Propagation) algorithm, the Adaptive Moment Estimation (Adam) algorithm, and the like.
The preset training end condition in step 602 may be any of various preset conditions for determining convergence of the language model. For example, the preset training end condition may include at least one of the following conditions 1 to 4:

Condition 1: the number of times the parameter adjustment operation has been performed is greater than or equal to a preset number of times. For example, the preset number of times may be the number of training data sets acquired in step 601.

Condition 2: the time spent performing the parameter adjustment operation exceeds a preset training duration.

Condition 3: in the parameter adjustment operation, the sum of the loss function values corresponding to the training data in the training data set is smaller than a preset difference sum threshold, or the mean of the loss function values corresponding to the training data in the training data set is smaller than a preset difference mean threshold.

Condition 4: a verification data set is acquired in advance before step 602. The verification data in the verification data set includes a verification text, a verification text feature vector, and label occurrence probabilities for representing the probability of each candidate word appearing after the verification text. The difference between the sums of the loss function values of the initial language model over the verification data set before and after this parameter adjustment operation is then calculated; condition 4 is that the calculated difference is smaller than a preset loss function difference threshold. That is, the loss function of the initial language model does not drop over the verification data set, or drops only by a small amount.
Through step 602, model parameters of the initial language model are optimized, and random noise is added during the optimization process, so that the risk of leakage of training data information can be reduced.
Step 603, determining the initial language model as a pre-trained language model.
Alternative embodiment (b): in the parameter adjustment operation, the following step 6021' and step 6021 ″ may be further included before step 6021:
Step 6021', performing a privacy degree calculation operation on each training data in the training data set.

That is, the privacy degree calculation operation is performed on each training data x_j in the training data set B_t. Here, the privacy degree calculation operation includes:

First, the sample text vector v_s in the training data x_j of the training data set B_t is input into the initial language model Lm, obtaining the predicted occurrence probability p_{i,j}, corresponding to x_j, of each candidate word w_i appearing after the sample text s in x_j.

Then, based on the obtained predicted occurrence probabilities p_{i,j} of the candidate words w_i, the privacy degree pri_j corresponding to the training data x_j is determined, which characterizes the privacy degree of the sample text s in x_j within the training data set B_t.
Specifically, the applicant found through research that text data containing private information does not appear in large quantities in a large-scale corpus. That is, since the probability that common text data is private text may be considered low, the judgment of the privacy degree of a text can be converted into a judgment of its commonness degree. In other words, the degree pri(doc) to which a text doc is private text is inversely proportional to its commonness degree ord(doc):

pri(doc) ∝ 1 / ord(doc)

Because the language model is a probability generation model, given the text vector of a prefix text, it can predict the occurrence probability that the next word is each candidate word in the preset candidate word set. Perplexity is an index for measuring the accuracy with which a language model predicts a text sequence; if the perplexity of the language model Lm for the text doc is ppl(doc), then a very low perplexity ppl(doc) means the average probability with which Lm predicts the text doc is high. In practice, a language model is trained using massive text data, and the probability distribution the language model predicts for a text matches the distribution of texts in real language scenarios; that is, a higher predicted probability for a text means a higher commonness degree of that text.

In summary, the perplexity ppl(doc) of the language model Lm for the text doc is inversely proportional to the commonness degree ord(doc) of the text doc:

ppl(doc) ∝ 1 / ord(doc)

This leads to the further conclusion that the perplexity ppl(doc) of a text doc is proportional to its privacy degree pri(doc). Thus, the privacy degree pri(doc) of a text doc can be calculated from its perplexity ppl(doc). That is, the privacy degree pri_j of the sample text s in the training data x_j within the training data set B_t is determined based on the obtained predicted occurrence probabilities p_{i,j} of the candidate words w_i.
This can be done as follows:
First, based on the obtained predicted occurrence probabilities of the candidate words, the perplexity with which the initial language model predicts the sample text in the training data is determined.

Here, the perplexity ppl(doc) of a text doc can be calculated in various perplexity calculation manners.
Assume the text doc is the word sequence c_1, c_2, …, c_n, i.e., doc = (c_1, c_2, …, c_n), where n is the number of words included in doc. The perplexity ppl(doc) of the text doc under the language model Lm can be calculated by the following formula (4) or formula (5):

ppl(doc) = ( Π_{k=1..n} 1 / p(c_k | c_1, …, c_{k−1}) )^(1/n)    (4)

ppl(doc) = 2^( −(1/n) · Σ_{k=1..n} log₂ p(c_k | c_1, …, c_{k−1}) )    (5)

Here, p(c_k | c_1, …, c_{k−1}) is the occurrence probability value corresponding to c_k among the occurrence probabilities of the candidate words obtained by inputting the text vector of the text composed of c_1, c_2, …, c_{k−1} into the language model Lm.
Alternatively, the perplexity ppl(doc) of the text doc under the language model Lm can also be calculated using cross entropy.
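Formulas (4) and (5) may be sketched as follows; the two forms are mathematically equivalent, as the assertion inside the sketch checks:

```python
import math

def perplexity(cond_probs):
    """Perplexity from the conditional probabilities p(c_k | c_1..c_{k-1})
    that the language model assigns to each word of the text. Computes both
    formula (4) (geometric-mean form) and formula (5) (log form)."""
    n = len(cond_probs)
    form4 = math.prod(1.0 / p for p in cond_probs) ** (1.0 / n)
    form5 = 2 ** (-sum(math.log2(p) for p in cond_probs) / n)
    assert math.isclose(form4, form5)  # the two formulas agree
    return form4

print(perplexity([0.25, 0.25]))  # 4.0: uniformly 1-in-4 predictions
```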
Then, the privacy degree corresponding to the training data is determined according to the perplexity with which the initial language model predicts the sample text in the training data. A positive linear correlation exists between the privacy degree corresponding to the training data and the perplexity with which the initial language model predicts the sample text in the training data.

Here, to obtain the privacy degree pri_j of the sample text s in the training data x_j within the training data set B_t, the perplexity ppl_j of the sample text s in x_j under the initial language model may first be calculated using the above formula (4), formula (5), or cross entropy; the privacy degree pri_j may then be calculated from ppl_j.

In practice, the privacy degree pri_j of the sample text s in the training data x_j within the training data set B_t may be calculated by the normalization method shown in the following formula (6):

pri_j = (ppl_j − ppl_min) / (ppl_max − ppl_min)    (6)

where ppl_min and ppl_max are respectively the minimum and maximum of the perplexities ppl_j corresponding to the training data x_j in the training data set B_t.
By adopting this optional implementation, the perplexity is first calculated from the occurrence probabilities of the sample text in the training data, and the privacy degree is then calculated from the perplexity, so that the privacy degree of the sample text can be calculated automatically without manual annotation, reducing labor cost.
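The min-max normalization of formula (6) may be sketched as follows (with a guard, not stated in the text, for the degenerate case where all perplexities in the batch are equal):

```python
def privacy_degrees(perplexities):
    """Min-max normalise the per-sample perplexities of a batch into
    privacy degrees in [0, 1], per formula (6)."""
    lo, hi = min(perplexities), max(perplexities)
    if hi == lo:  # degenerate batch: no spread, treat all as lowest privacy
        return [0.0 for _ in perplexities]
    return [(p - lo) / (hi - lo) for p in perplexities]

print(privacy_degrees([2.0, 6.0, 10.0]))  # [0.0, 0.5, 1.0]
```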
Step 6021 ", based on the privacy degree corresponding to each training data in the training data set, determining the privacy weight of the training data set for representing the overall privacy degree of each sample text in the training data set.
Here, the privacy weight of the training data set for representing the overall privacy degree of each sample text in the training data set may be determined in various manners based on the privacy degree corresponding to each training data in the training data set.
Optionally, an average value of the privacy degrees corresponding to the training data in the training data set may be determined as the privacy weight of the training data set. In this way, when the privacy degree range corresponding to each training data is between 0 and 1, the privacy weight of the obtained training data set is also between 0 and 1.
Optionally, the training data in the training data set may be sorted in the order of the privacy degrees from high to low. And then, carrying out weighted summation on the privacy degrees of the training data according to preset weight coefficients corresponding to the sequencing positions of the training data, and normalizing the summation result to obtain the privacy weight of the training data set.
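The two aggregation options above may be sketched as follows; the per-position weight coefficients and the normalization by their sum are illustrative assumptions, one plausible way to keep the result in [0, 1]:

```python
def privacy_weight_mean(degrees):
    """First option: the batch privacy weight is the mean of the
    per-sample privacy degrees (stays in [0, 1] when they do)."""
    return sum(degrees) / len(degrees)

def privacy_weight_ranked(degrees, coeffs):
    """Second option: sort degrees high-to-low, weight by preset
    per-position coefficients, then normalise the weighted sum."""
    ranked = sorted(degrees, reverse=True)
    total = sum(c * d for c, d in zip(coeffs, ranked))
    return total / sum(coeffs)

print(privacy_weight_mean([0.2, 0.4, 0.6]))              # 0.4
print(privacy_weight_ranked([0.2, 0.4, 0.6], [3, 2, 1])) # ~0.467
```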
Alternative embodiment (c): based on the above-mentioned optional embodiment (two), in step 6022, the gradient corresponding to each training data in the training data set is clipped to obtain a corresponding clipping gradient, which may be performed as follows:
for training data set B t Training data x in (1) j The following gradient clipping operations are performed:
first, various loss functions can be used to determine a loss function value L (x) between the predicted occurrence probability and the labeled occurrence probability of each candidate word corresponding to the training data j ). For example, including but not limited to L1 norm, L2 norm, and the like.
Secondly, based on the determined loss function values, determining and training data x j The corresponding gradient.
That is, the current model parameter θ of the initial language model Lm may be paired with the determined loss function value t Calculating the partial derivative to obtain the training data x j Gradient of each model parameter of corresponding initial language model
Figure BDA0003610504780000211
Specifically, the following can be expressed by equation 7:
Figure BDA0003610504780000212
Finally, the gradient g_t(x_j) corresponding to the training data x_j is clipped according to the preset gradient clipping norm C, so as to obtain the clipping gradient ḡ_t(x_j) corresponding to the training data x_j.
Specifically, this can be expressed by formula 8:
ḡ_t(x_j) = g_t(x_j) / max(1, ‖g_t(x_j)‖₂ / C)   (formula 8)
It will be appreciated that, using the above formula, when C is less than 1 the clipping gradient ḡ_t(x_j) corresponding to the training data x_j is clipped to between 0 and C, i.e., the norm of the clipping gradient ḡ_t(x_j) is within the preset gradient clipping norm C. If C is greater than or equal to 1 and the gradient norm does not exceed C, the clipping gradient ḡ_t(x_j) is the same as g_t(x_j). Therefore, in order to actually clip the gradient in practice, C may be set to a value greater than 0 and less than 1.
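The clipping operation described above can be sketched as follows (an illustrative sketch only, not the claimed implementation; the function name is hypothetical):

```python
import numpy as np

def clip_gradient(grad, C):
    """Clip a per-sample gradient to L2 norm at most C:
    g_bar = g / max(1, ||g||_2 / C)."""
    norm = np.linalg.norm(grad)
    return grad / max(1.0, norm / C)

g = np.array([3.0, 4.0])                       # ||g||_2 = 5
print(np.linalg.norm(clip_gradient(g, 0.5)))   # ≈ 0.5: norm reduced to C
small = np.array([0.1, 0.0])                   # ||g||_2 = 0.1 <= C
print(clip_gradient(small, 0.5))               # left unchanged
```

As stated above, with 0 < C < 1 every gradient whose norm exceeds C is scaled down so that the clipped norm lies within C.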
Optional embodiment (four): based on the above optional embodiment (three), in step 6021, the preset random noise is adjusted according to the privacy weight of the training data set, so as to obtain the privacy-adjusted random noise, which may be performed as follows:
The preset random noise is adjusted according to the privacy weight of the training data set and the preset gradient clipping norm, so as to obtain the privacy-adjusted random noise of the training data set. Specifically, this can be expressed by formula 9:
ñ_t = w_t · N(0, σ²C²·I)   (formula 9)
where w_t is the privacy weight of the training data set B_t, N(0, σ²C²·I) is the preset Gaussian random noise whose standard deviation is the noise multiplier σ scaled by the gradient clipping norm C, and ñ_t is the privacy-adjusted random noise.
By adopting the above optional embodiment (four), the magnitudes of the clipping gradient and of the privacy-adjusted random noise can both be unified to between 0 and C, i.e., kept at a consistent magnitude. This facilitates the execution of the subsequent step 6023: when the privacy-adjusted random noise is added to the average clipping gradient corresponding to the training data set, the two are of the same magnitude, and a non-uniform distribution between them can be avoided.
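Steps 6021 and 6023 together can be sketched as follows, assuming the preset noise is Gaussian with standard deviation sigma * C and the privacy adjustment multiplies it by the privacy weight w_t; all names here are hypothetical illustrations, not the claimed implementation:

```python
import numpy as np

def noisy_gradient(clipped_grads, w_t, C, sigma, rng):
    """Average the per-sample clipping gradients, then add the
    privacy-adjusted random noise w_t * N(0, (sigma * C)^2 * I)."""
    avg = np.mean(clipped_grads, axis=0)
    noise = w_t * rng.normal(0.0, sigma * C, size=avg.shape)
    return avg + noise

rng = np.random.default_rng(0)
grads = [np.array([0.3, 0.4]), np.array([0.1, 0.0])]
g = noisy_gradient(grads, w_t=0.6, C=0.5, sigma=1.0, rng=rng)
print(g.shape)  # (2,)
```

The resulting noise gradient would then be fed to the preset gradient descent optimization algorithm of step 6023 to adjust the model parameters θ_t.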
In the text generation method provided by the embodiments of the present disclosure, as applied to a language model in a specific scenario, a first text is first acquired; the first text is then input into a language model pre-trained by the differential-privacy-based model training method, so as to obtain the first occurrence probability of each of the different candidate words in a preset candidate word set appearing after the first text; finally, a second text corresponding to the first text in the current application scenario is generated based on the first occurrence probability of each candidate word. Because the language model is pre-trained by the differential privacy model training method, the risk that the language model reveals its training data in a specific application scenario can be reduced.
With further reference to fig. 7, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a text generation apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 7, the text generation apparatus 700 of the present embodiment includes: an acquisition unit 701, an input unit 702, and a generation unit 703. Wherein, the obtaining unit 701 is configured to obtain a first text; an input unit 702, configured to input a text vector of the first text into a language model pre-trained by using a differential privacy model-based training method, so as to obtain a first occurrence probability that different candidate words in a preset candidate word set appear after the first text; a generating unit 703 configured to generate a second text corresponding to the first text in the current application scenario based on the first occurrence probability of each candidate word.
In this embodiment, specific processes of the obtaining unit 701, the input unit 702, and the generating unit 703 of the text generating apparatus 700 and technical effects brought by the processes can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional embodiments, the apparatus 700 may further include: a presentation unit 704 configured to present the second text.
In some optional embodiments, the current application scenario may be a text input scenario or a speech recognition scenario, and the first text may be an input text or a recognized text; and the generating unit 703 may be further configured to: sorting the candidate words according to the sequence of the first occurrence probability from large to small; generating a candidate word subset by using the candidate words ranked in a first preset higher occurrence probability range in each candidate word; forming the second text based on each candidate word in the subset of candidate words and the respective first probability of occurrence.
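For the text input or speech recognition scenario just described, the top-ranked candidate selection can be sketched as follows (illustrative only; the names and probabilities are hypothetical):

```python
def top_k_candidates(first_probs, k):
    """Sort candidate words by their first occurrence probability in
    descending order and keep the top k as the candidate word subset."""
    ranked = sorted(first_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

first_probs = {"weather": 0.5, "time": 0.3, "news": 0.15, "stock": 0.05}
print(top_k_candidates(first_probs, 2))  # [('weather', 0.5), ('time', 0.3)]
```

The second text would then be formed from these candidate words together with their respective first occurrence probabilities, for example as a ranked suggestion list.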
In some optional embodiments, the current application scenario may be an auxiliary writing scenario, and the first text may be an input text; and the generating unit 703 may be further configured to: splicing the candidate word with the highest first occurrence probability in all the candidate words after the first text to form a spliced text; performing a preset number of the following splicing operations: inputting the text vector of the spliced text into the language model to obtain a second occurrence probability of each candidate word appearing after the spliced text; splicing the candidate word with the highest second occurrence probability in all the candidate words behind the spliced text; and generating the second text by using texts except the first text in the spliced text.
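The splicing loop of the auxiliary writing scenario can be sketched as follows; `predict_probs` is a hypothetical stand-in for the pre-trained language model call, not the actual model interface:

```python
def greedy_continue(first_text, predict_probs, num_steps):
    """Repeatedly splice the highest-probability candidate word after the
    spliced text, then return only the generated continuation."""
    spliced = first_text
    for _ in range(num_steps):
        probs = predict_probs(spliced)     # {candidate word: probability}
        best = max(probs, key=probs.get)   # highest occurrence probability
        spliced = spliced + best
    return spliced[len(first_text):]       # second text excludes first text

# Toy stand-in model that always ranks "a" highest.
toy_model = lambda text: {"a": 0.9, "b": 0.1}
print(greedy_continue("x", toy_model, 3))  # aaa
```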
In some optional embodiments, the current application scenario may be a question and answer scenario, and the first text may be a question text; and the generating unit 703 may be further configured to: sorting the candidate words according to the sequence of the first occurrence probability from large to small; generating a reply candidate word set by using the candidate words ranked in a second preset higher occurrence probability range in each candidate word; determining reply texts corresponding to the reply candidate words according to the corresponding relation between preset reply keywords and the reply texts; generating the second text based on the response text corresponding to each of the response candidate words.
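The question and answer scenario can be sketched similarly; the reply table below is a hypothetical example of the preset correspondence between reply keywords and reply texts:

```python
def generate_reply(first_probs, k, reply_table):
    """Take the top-k reply candidate words by first occurrence probability
    and look up the preset reply text for each of them."""
    ranked = sorted(first_probs, key=first_probs.get, reverse=True)[:k]
    return [reply_table[w] for w in ranked if w in reply_table]

first_probs = {"refund": 0.6, "shipping": 0.3, "other": 0.1}
reply_table = {"refund": "Refunds arrive within 3 days.",
               "shipping": "Orders ship within 24 hours."}
print(generate_reply(first_probs, 2, reply_table))
```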
In some alternative embodiments, the language model may be pre-trained by the following training steps: acquiring an initial language model and at least one training data set, wherein the training data comprises a sample text, a sample text vector and the label occurrence probability of each candidate word; performing the following parameter adjustment operations on training data sets in the at least one training data set until a preset training end condition is met, wherein the parameter adjustment operations comprise: adjusting preset random noise according to the privacy weight of the training data set to obtain privacy-adjusted random noise; cutting the gradient corresponding to each training data in the training data set to obtain a corresponding cutting gradient, and determining an average cutting gradient corresponding to the training data set; adding the obtained privacy-adjusted random noise to the average clipping gradient corresponding to the training data set to obtain a noise gradient corresponding to the training data set; based on the noise gradient corresponding to the training data set, adjusting the model parameters of the initial language model by adopting a preset gradient descent optimization algorithm; determining the initial language model as the pre-trained language model.
In some optional embodiments, before adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise, the parameter adjusting operation may further include: for the training data in the set of training data, performing the following privacy degree calculation operation: inputting the sample text vector in the training data into the initial language model to obtain the predicted occurrence probability corresponding to the training data and used for predicting each candidate word after the sample text in the training data; based on the obtained predicted occurrence probability of each candidate word, determining the privacy degree, corresponding to the training data, for representing the privacy degree of the sample text in the training data set; and determining privacy weight of the training data set, which is used for representing the integral privacy degree of each sample text in the training data set, based on the privacy degree corresponding to each training data in the training data set.
In some optional embodiments, the determining, based on the privacy degree corresponding to each piece of training data in the training data set, a privacy weight of the training data set for characterizing an overall privacy degree of each sample text in the training data set may include: and determining the average value of the privacy degrees corresponding to the training data in the training data set as the privacy weight of the training data set.
In some optional embodiments, the clipping the gradient corresponding to each training data in the training data set to obtain a corresponding clipping gradient may include: for the training data in the set of training data, the following gradient clipping operations are performed: determining a loss function value between the predicted occurrence probability and the labeled occurrence probability of each candidate word corresponding to the training data; determining a gradient corresponding to the training data based on the determined loss function value; and cutting the gradient corresponding to the training data according to a preset gradient cutting norm to obtain a cutting gradient corresponding to the training data.
In some optional embodiments, the adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise may include: and adjusting the preset random noise according to the privacy weight of the training data set and the preset gradient clipping norm to obtain the privacy-adjusted random noise of the training data set.
In some optional embodiments, the determining, based on the obtained predicted occurrence probability of each candidate word, a privacy degree corresponding to the training data and used for characterizing a privacy degree of a sample text in the training data set may include: determining the perplexity of the initial language model for predicting the sample text in the training data based on the obtained predicted occurrence probability of each candidate word; and predicting the confusion degree of the sample text in the training data according to the initial language model, and determining the privacy degree corresponding to the training data, wherein the privacy degree corresponding to the training data is in positive linear correlation with the confusion degree of the sample text in the training data predicted by the initial language model.
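The perplexity-based privacy degree above can be sketched as follows; the linear coefficients a and b are hypothetical, chosen only to keep the result in [0, 1]:

```python
import math

def perplexity(next_word_probs):
    """Perplexity of the initial language model on a sample text, computed
    from the predicted probabilities of the actual next word at each position."""
    n = len(next_word_probs)
    log_sum = sum(math.log(p) for p in next_word_probs)
    return math.exp(-log_sum / n)

def privacy_degree(ppl, a=0.01, b=0.0):
    """Privacy degree in positive linear correlation with perplexity,
    clamped to [0, 1]."""
    return min(1.0, max(0.0, a * ppl + b))

probs = [0.5, 0.25, 0.125]
print(round(perplexity(probs), 6))  # 4.0 (geometric mean probability 0.25)
print(privacy_degree(4.0))
```

A sample text the model finds more surprising (higher perplexity) is treated as more private, so its training data set receives a larger privacy weight and correspondingly more noise.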
It should be noted that, for details of implementation and technical effects of each unit in the text generation apparatus provided in the embodiments of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
Referring now to FIG. 8, a block diagram of a computer system 800 suitable for use in implementing the electronic device of the present disclosure is shown. The computer system 800 illustrated in fig. 8 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 8, a computer system 800 may include a processing device (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage device 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the computer system 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication device 809 may allow the computer system 800 to communicate with other devices wirelessly or by wire to exchange data. While fig. 8 illustrates a computer system 800 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the text generation method shown in the embodiment shown in fig. 2 and its alternative embodiments, and/or the text generation method shown in the embodiment shown in fig. 3 and its alternative embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the acquiring unit may also be described as "a unit acquiring the first text".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and the technical features disclosed in the present disclosure (but not limited to) having similar functions are replaced with each other to form the technical solution.

Claims (15)

1. A text generation method, comprising:
acquiring a first text;
inputting a text vector of the first text into a language model pre-trained by a training method based on a differential privacy model to obtain a first occurrence probability of different candidate words in a preset candidate word set after the first text, wherein in the process of training the language model by using a training data set, a gradient corresponding to training data in the training data set is cut, noise to be added is determined according to the privacy degree of a sample text in the training data set, and the noise to be added is added to an average cutting gradient corresponding to the training data set;
and generating a second text corresponding to the first text in the current application scene based on the first occurrence probability of each candidate word.
2. The method of claim 1, wherein the method further comprises:
presenting the second text.
3. The method of claim 1, wherein the current application scenario is a text entry scenario or a speech recognition scenario, and the first text is an entered text or a recognized text.
4. The method of claim 3, wherein generating, based on the first probability of occurrence of each of the candidate words, a second text corresponding to the first text in the current application scenario comprises:
sorting the candidate words according to the sequence of the first occurrence probability from large to small;
generating a candidate word subset by using the candidate words ranked in a first preset higher occurrence probability range in each candidate word;
forming the second text based on each candidate word in the subset of candidate words and the respective first probability of occurrence.
5. The method of claim 1, wherein the current application scenario is an auxiliary composition scenario, the first text is an entered text; and
generating a second text corresponding to the first text in the current application scene based on the first occurrence probability of each candidate word includes:
splicing the candidate word with the highest first occurrence probability in all the candidate words after the first text to form a spliced text;
performing a preset number of the following splicing operations: inputting the text vector of the spliced text into the language model to obtain a second occurrence probability of each candidate word appearing after the spliced text; splicing the candidate word with the highest second occurrence probability in all the candidate words behind the spliced text;
and generating the second text by using texts except the first text in the spliced text.
6. The method of claim 1, wherein the current application scenario is a question and answer scenario, and the first text is a question text; and
generating a second text corresponding to the first text in the current application scene based on the occurrence probability of each candidate word comprises:
sorting the candidate words according to the sequence of the first occurrence probability from large to small;
generating a reply candidate word set by using the candidate words ranked in a second preset higher occurrence probability range in each candidate word;
determining reply texts corresponding to the reply candidate words according to the corresponding relation between preset reply keywords and the reply texts;
generating the second text based on the response text corresponding to each of the response candidate words.
7. The method of claim 1, wherein the language model is pre-trained by the following training steps:
acquiring an initial language model and at least one training data set, wherein the training data comprises a sample text, a sample text vector and the label occurrence probability of each candidate word;
performing the following parameter adjustment operations on training data sets in the at least one training data set until a preset training end condition is met, wherein the parameter adjustment operations comprise: adjusting preset random noise according to the privacy weight of the training data set to obtain privacy-adjusted random noise; cutting the gradient corresponding to each training data in the training data set to obtain a corresponding cutting gradient, and determining an average cutting gradient corresponding to the training data set; adding the obtained privacy-adjusted random noise to the average clipping gradient corresponding to the training data set to obtain a noise gradient corresponding to the training data set; based on the noise gradient corresponding to the training data set, adjusting the model parameters of the initial language model by adopting a preset gradient descent optimization algorithm;
determining the initial language model as the pre-trained language model.
8. The method of claim 7, wherein before adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise, the parameter adjusting operation further comprises:
for the training data in the set of training data, performing the following privacy degree calculation operation: inputting the sample text vector in the training data into the initial language model to obtain the predicted occurrence probability corresponding to the training data and used for predicting each candidate word after the sample text in the training data; based on the obtained predicted occurrence probability of each candidate word, determining the privacy degree, corresponding to the training data, for representing the privacy degree of the sample text in the training data set; and
and determining privacy weights of the training data set, which are used for representing the overall privacy degree of each sample text in the training data set, based on the privacy degrees corresponding to each training data in the training data set.
9. The method of claim 8, wherein determining the privacy weight of the training data set for characterizing the overall privacy level of each sample text in the training data set based on the privacy level corresponding to each training data in the training data set comprises:
and determining the average value of the privacy degrees corresponding to the training data in the training data set as the privacy weight of the training data set.
10. The method of claim 8, wherein the clipping the gradient corresponding to each training data in the training data set to obtain a corresponding clipping gradient comprises:
for the training data in the set of training data, the following gradient clipping operations are performed: determining a loss function value between the predicted occurrence probability and the labeled occurrence probability of each candidate word corresponding to the training data; determining a gradient corresponding to the training data based on the determined loss function value; and cutting the gradient corresponding to the training data according to a preset gradient cutting norm to obtain a cutting gradient corresponding to the training data.
11. The method of claim 10, wherein the adjusting the preset random noise according to the privacy weight of the training data set to obtain the privacy-adjusted random noise comprises:
and adjusting the preset random noise according to the privacy weight of the training data set and the preset gradient clipping norm to obtain the privacy-adjusted random noise of the training data set.
12. The method of claim 8, wherein determining, based on the obtained predicted occurrence probability of each candidate word, a privacy degree corresponding to the training data for characterizing a privacy degree of a sample text in the training data set comprises:
determining the perplexity of the initial language model for predicting the sample text in the training data based on the obtained predicted occurrence probability of each candidate word;
and predicting the confusion degree of the sample text in the training data according to the initial language model, and determining the privacy degree corresponding to the training data, wherein the privacy degree corresponding to the training data is in positive linear correlation with the confusion degree of the sample text in the training data predicted by the initial language model.
13. A text generation apparatus comprising:
an acquisition unit configured to acquire a first text;
the input unit is configured to input a text vector of the first text into a language model pre-trained by using a differential privacy model training method, so as to obtain a first occurrence probability of different candidate words in a preset candidate word set after the first text, wherein in the process of training the language model by using a training data set, a gradient corresponding to training data in the training data set is cut, noise to be added is determined according to the privacy degree of sample texts in the training data set, and the noise to be added is added to an average cutting gradient corresponding to the training data set;
and the generating unit is configured to generate a second text corresponding to the first text in the current application scene based on the first occurrence probability of each candidate word.
14. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-12.
15. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-12.
CN202210431002.0A 2022-04-22 2022-04-22 Text generation method and device, electronic equipment and storage medium Pending CN114841142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210431002.0A CN114841142A (en) 2022-04-22 2022-04-22 Text generation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114841142A true CN114841142A (en) 2022-08-02

Family

ID=82565374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210431002.0A Pending CN114841142A (en) 2022-04-22 2022-04-22 Text generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114841142A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640611A (en) * 2022-11-25 2023-01-24 荣耀终端有限公司 Method for updating natural language processing model and related equipment
CN115909354A (en) * 2022-11-11 2023-04-04 北京百度网讯科技有限公司 Training method of text generation model, and text acquisition method and device
CN116108157A (en) * 2023-04-11 2023-05-12 阿里巴巴达摩院(杭州)科技有限公司 Method for training text generation model, text generation method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination