CN114297693A - Model pre-training method and device, electronic equipment and storage medium - Google Patents

Model pre-training method and device, electronic equipment and storage medium

Info

Publication number
CN114297693A
Authority
CN
China
Prior art keywords
ciphertext
text
character
plaintext
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111656897.XA
Other languages
Chinese (zh)
Other versions
CN114297693B (en)
Inventor
张阳
安晓江
蒋红宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haitai Fangyuan High Technology Co Ltd
Original Assignee
Beijing Haitai Fangyuan High Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haitai Fangyuan High Technology Co Ltd
Priority to CN202111656897.XA
Publication of CN114297693A
Application granted
Publication of CN114297693B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of information security, and in particular to a model pre-training method and device, electronic equipment, and a storage medium. In response to a pre-training instruction for a model to be trained, a ciphertext corpus uploaded by a client with data viewing authority is obtained and each ciphertext text in the ciphertext corpus is determined, where the ciphertext corpus is obtained by encrypting a corpus to be processed, and the corpus to be processed is text containing sensitive information. Each ciphertext character in each ciphertext text is replaced with a plaintext character from a preset public text to obtain the replaced texts, where the public text contains a plurality of plaintext characters and does not contain sensitive information. The model to be trained is trained on the replaced texts to obtain a pre-training model, and corresponding processing is performed based on the pre-training model. In this way, the pre-training model can be trained and used while data security is preserved.

Description

Model pre-training method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information security technologies, and in particular, to a model pre-training method and apparatus, an electronic device, and a storage medium.
Background
Currently, pre-training models are applied in many different scenarios, and training them generally requires a large amount of text data. In specific industries and application scenarios, however, the relevant text data may involve sensitive information, so the security of the text data may be compromised during training.
In the related art, the pre-training environment is usually deployed at the client and the model is trained in a closed environment, so that unrelated persons cannot obtain the text data. However, the operator who trains the pre-training model can still obtain the text data, so this approach still reduces data security.
Disclosure of Invention
The embodiment of the application provides a model pre-training method and device, electronic equipment and a storage medium, so as to improve data security.
The embodiment of the application provides the following specific technical scheme:
a model pre-training method, comprising:
responding to a pre-training instruction aiming at a model to be trained, acquiring ciphertext corpora uploaded by a client with data viewing authority, and determining each ciphertext text in the ciphertext corpora, wherein the ciphertext corpora are obtained by encrypting the corpora to be processed, and the corpora to be processed are texts containing sensitive information;
respectively adopting each plaintext character in a preset public text to replace each ciphertext character in each ciphertext text to obtain each replaced text, wherein the public text comprises a plurality of plaintext characters, and is a public text which does not comprise sensitive information;
and training the model to be trained based on the replaced texts to obtain a pre-training model, and performing corresponding processing based on the pre-training model.
Optionally, determining each ciphertext text in the ciphertext corpus specifically includes:
segmenting the ciphertext corpus based on a preset paragraph ending identifier to obtain each ciphertext paragraph;
and based on preset sentence-division encrypted characters, respectively carrying out sentence division on each ciphertext paragraph to obtain corresponding each ciphertext text.
Optionally, each plaintext character in the preset public text is respectively adopted to replace each ciphertext character in each ciphertext text, so as to obtain each replaced text, and the method specifically includes:
determining the ciphertext word frequency corresponding to each ciphertext character in each ciphertext text and determining the plaintext word frequency corresponding to each plaintext character in a preset public text, wherein each ciphertext word frequency is the occurrence frequency of the corresponding ciphertext character in the ciphertext texts, and each plaintext word frequency is the occurrence frequency of the corresponding plaintext character in the public text;
and respectively replacing each ciphertext character in each ciphertext text with a corresponding plaintext character based on each ciphertext word frequency and each plaintext word frequency to obtain each replaced text.
Optionally, the replacing each ciphertext character in each ciphertext text with a corresponding plaintext character to obtain each replaced text, specifically includes:
sorting the ciphertext characters according to their ciphertext word frequencies to obtain the sorted ciphertext characters, and sorting the plaintext characters according to their plaintext word frequencies to obtain the sorted plaintext characters;
and respectively executing the following operations for each ciphertext text: based on the sequence information (rank) of each ciphertext character in the ciphertext text, determining from the sorted plaintext characters the plaintext character having the same sequence information, and replacing the corresponding ciphertext character with the determined plaintext character to obtain a replaced text.
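The rank-matching replacement described above can be illustrated with a minimal Python sketch. The function names `rank_by_frequency` and `substitute_by_rank` are hypothetical, and the tie-breaking rule (first appearance wins) is an assumption, since the text does not say how characters with equal frequency are ordered.

```python
from collections import Counter


def rank_by_frequency(texts):
    """Return the distinct characters of `texts`, most frequent first.

    Ties keep first-appearance order (an assumption of this sketch):
    Counter preserves insertion order and sorted() is stable.
    """
    counts = Counter()
    for text in texts:
        counts.update(text)
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: -kv[1])]


def substitute_by_rank(cipher_texts, public_text):
    """Replace each ciphertext character with the plaintext character of equal rank."""
    cipher_ranked = rank_by_frequency(cipher_texts)
    plain_ranked = rank_by_frequency([public_text])
    if len(plain_ranked) < len(cipher_ranked):
        raise ValueError("public text has too few distinct characters")
    # Pair rank 0 with rank 0, rank 1 with rank 1, and so on.
    table = dict(zip(cipher_ranked, plain_ranked))
    return ["".join(table[ch] for ch in text) for text in cipher_texts]


replaced = substitute_by_rank(["qqqz", "qzw"], "eeeetta")
```

Here `q` is the most frequent ciphertext character and `e` the most frequent plaintext character, so `q` is replaced with `e`, and likewise for the lower ranks.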
Optionally, each plaintext character in the preset public text is respectively adopted to replace each ciphertext character in each ciphertext text, so as to obtain each replaced text, and the method specifically includes:
determining ciphertext word vectors corresponding to the corresponding ciphertext characters by respectively taking the ciphertext characters as input parameters based on the trained ciphertext word vector model, and determining plaintext word vectors corresponding to the corresponding plaintext characters by respectively taking the plaintext characters as input parameters based on the trained ciphertext word vector model;
and respectively executing the following operations for each ciphertext text: based on the vector similarity between the ciphertext word vector of each ciphertext character in the ciphertext text and each plaintext word vector, determining the plaintext character that meets the vector similarity condition, and replacing the corresponding ciphertext character with the determined plaintext character to obtain the replaced text.
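As a rough illustration of the similarity-based replacement, the sketch below assumes that the "vector similarity condition" means "highest cosine similarity" and that the word vectors are already available as plain Python lists. Both are assumptions; the text fixes neither the similarity measure nor the condition. The names `cosine` and `substitute_by_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def substitute_by_similarity(cipher_text, cipher_vecs, plain_vecs):
    """Replace each ciphertext character with its most similar plaintext character.

    cipher_vecs / plain_vecs map a character to its word vector.
    """
    out = []
    for ch in cipher_text:
        best = max(plain_vecs, key=lambda p: cosine(cipher_vecs[ch], plain_vecs[p]))
        out.append(best)
    return "".join(out)


cipher_vecs = {"q": [1.0, 0.0], "z": [0.0, 1.0]}
plain_vecs = {"e": [0.9, 0.1], "t": [0.1, 0.9]}
replaced = substitute_by_similarity("qzq", cipher_vecs, plain_vecs)
```

Note that a nearest-neighbour choice like this does not guarantee a one-to-one mapping: two ciphertext characters may be replaced with the same plaintext character.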
Optionally, training the model to be trained based on the replaced texts to obtain a pre-training model, specifically including:
respectively performing mask processing on the replaced texts to obtain mask texts;
respectively determining mask text vectors corresponding to the mask texts;
and respectively inputting the mask text vectors into a model to be trained, training the model to be trained based on the mask text vectors until a preset training cutoff condition is met, and obtaining a pre-training model after training.
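The mask processing step can be sketched as follows. The mask token `"[MASK]"`, the 15% default ratio, and the function name `mask_text` are assumptions borrowed from common masked-language-model practice; the text does not specify them.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not named in the source


def mask_text(text, mask_ratio=0.15, seed=None):
    """Randomly mask characters of a replaced text.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original character the model should predict.
    """
    rng = random.Random(seed)
    chars = list(text)
    if not chars:
        return [], {}
    # Mask at least one character, never more than the text contains.
    n_mask = min(len(chars), max(1, round(len(chars) * mask_ratio)))
    positions = rng.sample(range(len(chars)), n_mask)
    labels = {i: chars[i] for i in positions}
    masked = [MASK if i in labels else c for i, c in enumerate(chars)]
    return masked, labels


masked, labels = mask_text("abcdefghij", mask_ratio=0.2, seed=0)
```

Converting the masked tokens into mask text vectors and running the training loop depend on the chosen model and are outside this sketch.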
Optionally, the mask text vectors are respectively input into a model to be trained, the model to be trained is trained based on the mask text vectors until a preset training cutoff condition is met, and obtaining a trained pre-training model specifically includes:
for each mask text vector, respectively performing the following operations:
determining a preset number of plaintext characters corresponding to each mask position in the mask text of the mask text vector, wherein these plaintext characters are either the preset number of plaintext characters whose plaintext word vectors are most similar to the plaintext word vector of the plaintext character at the mask position, or the preset number of plaintext characters ranked immediately before and the preset number of plaintext characters ranked immediately after the plaintext character at the mask position according to its sequence information;
generating training samples based on the mask text vector and the plaintext word vectors corresponding to the preset number of plaintext characters;
and inputting the training samples into the model to be trained, calculating loss values through an attention mechanism with the preset plaintext characters as output parameters, and adjusting the model parameters of the model to be trained based on the loss values, so that the model parameters converge toward the expected plaintext characters.
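The second way of choosing candidate characters for a mask position (taking neighbours in the frequency ranking) can be sketched as below. The function name `rank_neighbors`, the parameter `k`, and the decision to exclude the target character itself are assumptions of this sketch.

```python
def rank_neighbors(ranked_chars, target, k=2):
    """Return up to k characters ranked just before and k just after `target`.

    ranked_chars is the list of plaintext characters sorted by frequency.
    """
    i = ranked_chars.index(target)
    before = ranked_chars[max(0, i - k):i]
    after = ranked_chars[i + 1:i + 1 + k]
    return before + after


ranked = list("etaoin")
candidates = rank_neighbors(ranked, "a", k=1)
```

Characters near the top or bottom of the ranking simply receive fewer neighbours on the truncated side.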
Optionally, performing corresponding processing based on the pre-training model specifically includes:
and sending the pre-training model to a client with data viewing permission so that the client performs text processing by adopting the pre-training model.
A model pre-training method, comprising:
encrypting the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed;
sending the ciphertext corpus to a training client, so that the training client obtains, based on the ciphertext corpus, the replaced texts corresponding to the ciphertext texts in the ciphertext corpus, trains a model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with data viewing permission;
and performing corresponding processing based on the received pre-training model.
Optionally, encrypting the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed, specifically including:
determining the word frequency of the character to be processed corresponding to each character to be processed in the corpus to be processed, wherein each word frequency of the character to be processed is the occurrence frequency of the corresponding character to be processed in the corpus to be processed;
based on the character frequency of each character to be processed, combining a preset encryption algorithm, respectively encrypting each character to be processed to obtain a ciphertext character corresponding to the corresponding character to be processed, and establishing a mapping relation between each character to be processed and the corresponding ciphertext character;
and respectively replacing each character to be processed with a corresponding ciphertext character based on the mapping relation between each character to be processed and the corresponding ciphertext character to obtain ciphertext corpora.
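A minimal sketch of the client-side encryption steps follows. The "preset encryption algorithm" is not named in the text, so a keyed HMAC-SHA256 digest truncated to eight hex digits stands in for it here purely for illustration. How the word frequency enters the encryption is also unspecified; this sketch only counts it and uses it to order the mapping table. The names `build_cipher_mapping` and `encrypt_corpus` are hypothetical.

```python
import hashlib
import hmac
from collections import Counter


def build_cipher_mapping(corpus, key):
    """Count character frequencies and map each character to a ciphertext token."""
    freq = Counter(corpus)
    mapping = {}
    for ch, _ in freq.most_common():
        # Stand-in for the unspecified preset encryption algorithm.
        digest = hmac.new(key, ch.encode("utf-8"), hashlib.sha256).hexdigest()
        mapping[ch] = digest[:8]
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("ciphertext token collision; use a longer token")
    return mapping, freq


def encrypt_corpus(corpus, mapping):
    """Replace every character with its ciphertext token."""
    return [mapping[ch] for ch in corpus]


mapping, freq = build_cipher_mapping("abca", b"secret")
cipher = encrypt_corpus("abca", mapping)
```

Because the mapping is deterministic for a given key, equal characters always yield equal ciphertext tokens, which is what allows the training side to compute ciphertext word frequencies.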
Optionally, performing corresponding processing based on the received pre-training model specifically includes:
encrypting the text to be converted to obtain a corresponding ciphertext text;
replacing the ciphertext text by adopting each plaintext character in a preset public corpus to obtain a replaced text;
and inputting the replaced text into the pre-training model to obtain an output result of the pre-training model.
A model pre-training apparatus comprising:
the acquisition module is used for responding to a pre-training instruction aiming at a model to be trained, acquiring ciphertext corpora uploaded by a client with data viewing authority, and determining each ciphertext text in the ciphertext corpora, wherein the ciphertext corpora are obtained by encrypting the corpora to be processed, and the corpora to be processed are texts containing sensitive information;
the processing module is used for replacing each ciphertext character in each ciphertext text by respectively adopting each plaintext character in a preset public text to obtain each replaced text, wherein the public text comprises a plurality of plaintext characters, and is a public text which does not comprise sensitive information;
and the training module is used for training the model to be trained based on the replaced texts to obtain a pre-training model and carrying out corresponding processing based on the pre-training model.
Optionally, when determining each ciphertext text in the ciphertext corpus, the obtaining module is further configured to:
segmenting the ciphertext corpus based on a preset paragraph ending identifier to obtain each ciphertext paragraph;
and based on preset sentence-division encrypted characters, respectively carrying out sentence division on each ciphertext paragraph to obtain corresponding each ciphertext text.
Optionally, the processing module is further configured to:
determining the ciphertext word frequency corresponding to each ciphertext character in each ciphertext text and determining the plaintext word frequency corresponding to each plaintext character in a preset public text, wherein each ciphertext word frequency is the occurrence frequency of the corresponding ciphertext character in the ciphertext texts, and each plaintext word frequency is the occurrence frequency of the corresponding plaintext character in the public text;
and respectively replacing each ciphertext character in each ciphertext text with a corresponding plaintext character based on each ciphertext word frequency and each plaintext word frequency to obtain each replaced text.
Optionally, when each ciphertext character in each ciphertext text is replaced by a corresponding plaintext character, and each replaced text is obtained, the processing module is further configured to:
sorting the ciphertext characters according to their ciphertext word frequencies to obtain the sorted ciphertext characters, and sorting the plaintext characters according to their plaintext word frequencies to obtain the sorted plaintext characters;
and respectively executing the following operations for each ciphertext text: based on the sequence information (rank) of each ciphertext character in the ciphertext text, determining from the sorted plaintext characters the plaintext character having the same sequence information, and replacing the corresponding ciphertext character with the determined plaintext character to obtain a replaced text.
Optionally, the processing module is further configured to:
determining ciphertext word vectors corresponding to the corresponding ciphertext characters by respectively taking the ciphertext characters as input parameters based on the trained ciphertext word vector model, and determining plaintext word vectors corresponding to the corresponding plaintext characters by respectively taking the plaintext characters as input parameters based on the trained ciphertext word vector model;
and respectively executing the following operations for each ciphertext text: based on the vector similarity between the ciphertext word vector of each ciphertext character in the ciphertext text and each plaintext word vector, determining the plaintext character that meets the vector similarity condition, and replacing the corresponding ciphertext character with the determined plaintext character to obtain the replaced text.
Optionally, training the model to be trained based on the replaced texts, and when a pre-training model is obtained, the training module is further configured to:
respectively performing mask processing on the replaced texts to obtain mask texts;
respectively determining mask text vectors corresponding to the mask texts;
and respectively inputting the mask text vectors into a model to be trained, training the model to be trained based on the mask text vectors until a preset training cutoff condition is met, and obtaining a pre-training model after training.
Optionally, the mask text vectors are respectively input into a model to be trained, the model to be trained is trained based on the mask text vectors, until a preset training cutoff condition is met, and when a pre-training model after training is obtained, the training module is further configured to:
for each mask text vector, respectively performing the following operations:
determining a preset number of plaintext characters corresponding to each mask position in the mask text of the mask text vector, wherein these plaintext characters are either the preset number of plaintext characters whose plaintext word vectors are most similar to the plaintext word vector of the plaintext character at the mask position, or the preset number of plaintext characters ranked immediately before and the preset number of plaintext characters ranked immediately after the plaintext character at the mask position according to its sequence information;
generating training samples based on the mask text vector and the plaintext word vectors corresponding to the preset number of plaintext characters;
and inputting the training samples into the model to be trained, calculating loss values through an attention mechanism with the preset plaintext characters as output parameters, and adjusting the model parameters of the model to be trained based on the loss values, so that the model parameters converge toward the expected plaintext characters.
Optionally, when performing corresponding processing based on the pre-training model, the training module is further configured to:
and sending the pre-training model to a client with data viewing permission so that the client performs text processing by adopting the pre-training model.
A model pre-training apparatus comprising:
the encryption module is used for encrypting the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed;
the obtaining module is used for sending the ciphertext corpus to a training client, so that the training client obtains, based on the ciphertext corpus, the replaced texts corresponding to the ciphertext texts in the ciphertext corpus, trains a model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with data viewing permission;
and the processing module is used for carrying out corresponding processing based on the received pre-training model.
Optionally, the encryption module is further configured to:
determining the word frequency of the character to be processed corresponding to each character to be processed in the corpus to be processed, wherein each word frequency of the character to be processed is the occurrence frequency of the corresponding character to be processed in the corpus to be processed;
based on the character frequency of each character to be processed, combining a preset encryption algorithm, respectively encrypting each character to be processed to obtain a ciphertext character corresponding to the corresponding character to be processed, and establishing a mapping relation between each character to be processed and the corresponding ciphertext character;
and respectively replacing each character to be processed with a corresponding ciphertext character based on the mapping relation between each character to be processed and the corresponding ciphertext character to obtain ciphertext corpora.
Optionally, the processing module is further configured to:
encrypting the text to be converted to obtain a corresponding ciphertext text;
replacing the ciphertext text by adopting each plaintext character in a preset public corpus to obtain a replaced text;
and inputting the replaced text into the pre-training model to obtain an output result of the pre-training model.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the model pre-training method when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned model pre-training method.
In the embodiment of the application, in response to a pre-training instruction for a model to be trained, a ciphertext corpus uploaded by a client with data viewing permission is obtained and each ciphertext text in the ciphertext corpus is determined, where the ciphertext corpus is obtained by encrypting a corpus to be processed, and the corpus to be processed is text containing sensitive information. Each ciphertext character in each ciphertext text is replaced with a plaintext character from a preset public text to obtain the replaced texts, where the public text contains a plurality of plaintext characters and is publicly available text that does not contain sensitive information. The model to be trained is trained on the replaced texts to obtain a pre-training model, and corresponding processing is performed based on the pre-training model. Because the ciphertext corpus is sent by the client with data viewing permission, the training client, which has no data viewing permission, can obtain only the encrypted ciphertext corpus and not the original corpus to be processed. It therefore cannot recover the sensitive information in the corpus to be processed from the ciphertext corpus during training, so data security can be ensured.
Drawings
FIG. 1 is a schematic flow chart illustrating a model pre-training method according to an embodiment of the present disclosure;
FIG. 2 is another schematic flow chart illustrating a model pre-training method according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a model pre-training apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of another model pre-training apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, pre-training models are applied in many different scenarios, and training them usually requires a large amount of text data. In specific industries and application scenarios, however, the text data may involve sensitive information, so training the pre-training model may cause data leakage.
In the related art, the model is usually pre-trained in a relatively closed environment. However, the operator who trains the model can still obtain the data containing sensitive information, so data security is reduced.
To solve the above problem, embodiments of the present application provide a model pre-training method, apparatus, electronic device, and storage medium. In response to a pre-training instruction for a model to be trained, a ciphertext corpus uploaded by a client with data viewing permission is obtained, and each ciphertext text in the ciphertext corpus is determined, where the ciphertext corpus is obtained by encrypting a corpus to be processed and the corpus to be processed is text containing sensitive information. Each ciphertext character in each ciphertext text is replaced with a plaintext character from a preset public text to obtain the replaced texts, where the public text contains a plurality of plaintext characters and does not contain sensitive information. The model to be trained is then trained on the replaced texts to obtain a pre-training model, and corresponding processing is performed based on the pre-training model. Therefore, the training client cannot obtain the sensitive information behind the ciphertext corpus while training on the data, and data security can be ensured.
Based on the above embodiment, FIG. 1 shows a schematic flow chart of a model pre-training method in the embodiment of the present application, which specifically includes:
S10: In response to a pre-training instruction for the model to be trained, acquire the ciphertext corpus uploaded by the client with data viewing authority, and determine each ciphertext text in the ciphertext corpus.
The ciphertext corpus is obtained by encrypting the corpus to be processed.
In the embodiment of the application, when a model to be trained needs to be trained, a client with data viewing permission can trigger to generate a pre-training instruction for the model to be trained, and the generated pre-training instruction is sent to a training client, so that the training client responds to the pre-training instruction for the model to be trained, acquires ciphertext corpora uploaded by the client with data viewing permission, and performs sentence division processing on the ciphertext corpora to acquire ciphertext texts corresponding to the ciphertext corpora.
It should be noted that, in the embodiment of the present application, the client with data viewing permission may alternatively package the ciphertext corpus inside the pre-training instruction and send it to the training client, so that the training client obtains the ciphertext corpus by parsing the pre-training instruction.
Optionally, in this embodiment of the present application, a possible implementation manner is provided for determining each ciphertext text in the ciphertext corpus, and specifically includes:
S101: Segment the ciphertext corpus based on the preset paragraph ending identifier to obtain each ciphertext paragraph.
In the embodiment of the application, the ciphertext corpus includes at least one ciphertext paragraph, and adjacent ciphertext paragraphs are joined by a paragraph ending identifier. The ciphertext corpus can therefore be segmented at the preset paragraph ending identifier to obtain each ciphertext paragraph of the ciphertext corpus.
For example, assuming that the ciphertext corpus is a@b@c and the preset paragraph ending identifier is @, segmenting the ciphertext corpus at @ yields the ciphertext paragraphs of the corpus, namely ciphertext paragraph a, ciphertext paragraph b, and ciphertext paragraph c.
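Step S101 amounts to splitting a string at a known delimiter. A minimal sketch, in which the function name `split_paragraphs` and the sample corpus are hypothetical and empty segments (for instance after a trailing identifier) are dropped by assumption:

```python
def split_paragraphs(cipher_corpus, paragraph_end="@"):
    """Split a ciphertext corpus into ciphertext paragraphs at the ending identifier."""
    return [p for p in cipher_corpus.split(paragraph_end) if p]


paragraphs = split_paragraphs("P1@P2@P3")
```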
S102: Split each ciphertext paragraph into sentences based on the preset clause ciphertext character to obtain the corresponding ciphertext texts.
In the embodiment of the present application, when the client with data viewing permission encrypts the corpus to be processed, every clause character to be processed (such as sentence-ending punctuation) is encrypted into one uniform clause ciphertext character; that is, different clause characters to be processed correspond to the same clause ciphertext character. A ciphertext paragraph can therefore be split at the clause ciphertext character to obtain the corresponding ciphertext texts. Specifically, the following operation is performed for each ciphertext paragraph: the ciphertext paragraph is split at the preset clause ciphertext character to obtain each ciphertext text corresponding to that paragraph.
For example, assuming that the ciphertext paragraph is AxBxCx and the preset clause ciphertext character is x, splitting the ciphertext paragraph at x yields ciphertext text A, ciphertext text B, and ciphertext text C.
It should be noted that, in the embodiment of the present application, when a client having a data viewing right encrypts a corpus to be processed, different clause characters to be processed may also be encrypted into different clause ciphertext characters, in other words, different clause characters to be processed correspond to different clause ciphertext characters.
For example, the clause character to be processed "." may be encrypted into the clause ciphertext character "x", and the clause character to be processed "?" into the clause ciphertext character "y"; this is not limited in the embodiment of the present application.
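Sentence splitting with either one uniform clause ciphertext character or several distinct ones can be sketched with a single helper. The name `split_clauses` is hypothetical, and dropping the clause characters from the output follows the AxBxCx example rather than an explicit rule in the text.

```python
def split_clauses(cipher_paragraph, clause_chars=("x",)):
    """Split a ciphertext paragraph into ciphertext texts at any clause ciphertext character."""
    texts, current = [], []
    for ch in cipher_paragraph:
        if ch in clause_chars:
            # A clause character closes the current ciphertext text.
            if current:
                texts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        texts.append("".join(current))
    return texts


uniform = split_clauses("AxBxCx")
distinct = split_clauses("AxByC", clause_chars=("x", "y"))
```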
In addition, in the embodiment of the present application, the client with data viewing permission sends the clause ciphertext characters to the training client after sending the ciphertext corpus. Because the clause ciphertext characters are then known to the training client, it can split each ciphertext paragraph at them to obtain the ciphertext texts.
Correspondingly, in the embodiment of the application, the client with the data viewing authority can also send the paragraph ending identifier to the training client, so that when the training client segments the ciphertext corpus, the training client can segment the ciphertext corpus based on the known paragraph ending identifier because the paragraph ending identifier is already known, and thus each ciphertext paragraph is obtained.
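The two-level segmentation described above (corpus into ciphertext paragraphs, then paragraphs into ciphertext texts) can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function names `split_paragraphs` and `split_clauses`, the use of a newline as the paragraph ending identifier, and the sample corpus are assumptions made for the example.

```python
import re

def split_paragraphs(cipher_corpus, paragraph_end):
    """Split the ciphertext corpus at the known paragraph ending identifier."""
    return [p for p in cipher_corpus.split(paragraph_end) if p]

def split_clauses(cipher_paragraph, clause_chars):
    """Split one ciphertext paragraph at any known clause ciphertext character."""
    pattern = "[" + "".join(re.escape(c) for c in clause_chars) + "]"
    return [t for t in re.split(pattern, cipher_paragraph) if t]

# Assumed sample: "x" and "y" are clause ciphertext characters, "\n" ends a paragraph.
corpus = "AxBxCx\nDyEx\n"
paragraphs = split_paragraphs(corpus, "\n")
texts = [split_clauses(p, {"x", "y"}) for p in paragraphs]
print(texts)  # [['A', 'B', 'C'], ['D', 'E']]
```

Passing a set of clause ciphertext characters covers both cases in the text: a single uniform character, or a different ciphertext character per clause character.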
S11: Replace each ciphertext character in each ciphertext text with a plaintext character from the preset public text, so as to obtain each replaced text.
The public text contains a plurality of plaintext characters and does not contain sensitive information.
In the embodiment of the application, because the public text contains a plurality of plaintext characters, and these plaintext characters are public characters that do not contain sensitive information, the plaintext characters in the preset public text are used to replace the ciphertext characters contained in each ciphertext text, so that each replaced text is obtained.
Optionally, in the embodiment of the present application, two possible implementation manners are provided for obtaining the text after the replacement, and the following respectively describes in detail a process of replacing the ciphertext text in the embodiment of the present application.
The first mode: replacement using the ciphertext word frequencies and the plaintext word frequencies.
When S11 is executed, the method specifically includes:
S11-1-1: Determine the ciphertext word frequency corresponding to each ciphertext character in the ciphertext texts, and determine the plaintext word frequency corresponding to each plaintext character in the preset public text.
Each ciphertext word frequency is the occurrence frequency of the corresponding ciphertext character in each ciphertext text, and each plaintext word frequency is the occurrence frequency of the corresponding plaintext character in the public text.
In the embodiment of the application, the following operations are executed respectively for each ciphertext character: and counting the occurrence frequency of any ciphertext character in each ciphertext text, and taking the occurrence frequency of the ciphertext character as the ciphertext character frequency corresponding to the ciphertext character. Through the method, the ciphertext word frequency corresponding to each ciphertext character can be obtained.
For example, assume that the ciphertext texts are ciphertext text A, ciphertext text B and ciphertext text C, where ciphertext text A is acddadb, ciphertext text B is aabbdd2, and ciphertext text C is a22db, so that the ciphertext characters contained in the ciphertext texts are a, b, c, d and 2. Ciphertext character a occurs 5 times across the ciphertext texts, so its ciphertext word frequency is determined to be 5; ciphertext character b occurs 4 times, so its ciphertext word frequency is 4; ciphertext character c occurs 1 time, so its ciphertext word frequency is 1; ciphertext character d occurs 6 times, so its ciphertext word frequency is 6; and ciphertext character 2 occurs 3 times, so its ciphertext word frequency is 3.
It should be noted that, in the embodiment of the present application, when determining the ciphertext word frequency, statistics is performed on the number of occurrences of the ciphertext character in all ciphertext texts.
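The counting in the example above can be reproduced with a short sketch. It uses Python's standard `collections.Counter`; the variable names are assumptions, and the three ciphertext texts are the ones given in the example.

```python
from collections import Counter

# Ciphertext texts A, B and C from the example above.
cipher_texts = ["acddadb", "aabbdd2", "a22db"]

# Occurrences are accumulated over ALL ciphertext texts, not per text.
cipher_freq = Counter()
for text in cipher_texts:
    cipher_freq.update(text)

print(dict(cipher_freq))  # {'a': 5, 'c': 1, 'd': 6, 'b': 4, '2': 3}
```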
In addition, in the embodiment of the present application, after the ciphertext word frequency corresponding to each ciphertext character is obtained, a ciphertext word frequency table may be generated from the ciphertext characters and their ciphertext word frequencies, so that the table contains each ciphertext character together with its ciphertext word frequency. For example, see Table 1, which is a ciphertext word frequency table in the embodiment of the present application.
Table 1.
Ciphertext character | Ciphertext word frequency
a | 6
b | 2
# | 4
Meanwhile, the following operations are executed respectively for each plaintext character: counting the occurrence frequency of any plaintext character in a preset open text, and taking the occurrence frequency of the plaintext character as a plaintext character frequency corresponding to the plaintext character. Therefore, through the method, the plaintext character frequency corresponding to each plaintext character can be obtained.
For example, suppose that the public text is specifically the sentence rendered as "i look up the weather today and the sunlight today is good", so that the plaintext characters contained in the public text include "i", "look up", "d", "day", "qi", "sun" and "light". The number of occurrences of each plaintext character in the public text is counted and taken as its plaintext word frequency: the plaintext character "i" occurs 1 time, so its plaintext word frequency is 1; "look up" occurs 1 time, so its plaintext word frequency is 1; "yes" occurs 1 time, so its plaintext word frequency is 1; "today" occurs 2 times, so its plaintext word frequency is 2; "day" occurs 3 times, so its plaintext word frequency is 3; a further plaintext character occurs 2 times, so its plaintext word frequency is 2; "qi" occurs 1 time, so its plaintext word frequency is 1; "sun" occurs 1 time, so its plaintext word frequency is 1; and "light" occurs 1 time, so its plaintext word frequency is 1.
It should be noted that, in the embodiment of the present application, after obtaining the plaintext word frequencies corresponding to the plaintext characters, a plaintext word frequency table may be generated based on the plaintext characters and the plaintext word frequencies corresponding to the plaintext characters, so that the plaintext word frequency table includes the plaintext characters and the plaintext word frequencies corresponding to the plaintext characters. For example, see table 2 for a plaintext word frequency table in the embodiment of the present application.
Table 2.
Plaintext character | Plaintext word frequency
Jinjin tea | 3
Sky | 2
Light (es) | 1
In addition, it should be noted that the public text in the embodiment of the present application is a general-purpose corpus, for example the Chinese corpus of the "X Daily" newspaper, which is not limited in the embodiment of the present application.
S11-1-2: Based on the ciphertext word frequencies and the plaintext word frequencies, replace each ciphertext character in each ciphertext text with a corresponding plaintext character, so as to obtain each replaced text.
In the embodiment of the application, each ciphertext character in each ciphertext text is replaced by a corresponding plaintext character based on each ciphertext character frequency and each plaintext character frequency, so that each replaced text is obtained.
Optionally, in the embodiment of the present application, a possible implementation manner is provided for replacing plaintext characters, and the following explains a process of replacing a ciphertext text in the embodiment of the present application, specifically including:
S11-1-21: Sort the ciphertext characters according to the ciphertext word frequency corresponding to each ciphertext character to obtain the sorted ciphertext characters, and sort the plaintext characters according to the plaintext word frequency corresponding to each plaintext character to obtain the sorted plaintext characters.
In the embodiment of the application, the ciphertext characters are sorted according to the ciphertext word frequency corresponding to each ciphertext character, so that the sorted ciphertext characters are obtained; meanwhile, the plaintext characters are sorted according to the plaintext word frequency corresponding to each plaintext character, so that the sorted plaintext characters are obtained.
It should be noted that, in the embodiment of the present application, when performing sorting, sorting may be performed according to an ascending order, or sorting may also be performed according to a descending order, which is not limited in the embodiment of the present application.
S11-1-22: Perform the following operations for each ciphertext text: based on the order information of each ciphertext character in the ciphertext text, determine from the sorted plaintext characters the plaintext character whose order information is the same as that of the ciphertext character, and replace the ciphertext character with the determined plaintext character, so as to obtain a replaced text.
In the embodiment of the present application, the replaced texts corresponding to the ciphertext texts are obtained respectively, and the following explains a process of obtaining the replaced texts with any ciphertext text (hereinafter, referred to as ciphertext text a), specifically including:
for each ciphertext character in the ciphertext text a, the following operations are respectively executed:
firstly, based on the order information of the ciphertext character among the sorted ciphertext characters and the order information of each sorted plaintext character, the plaintext character whose order information is the same as that of the ciphertext character is determined from the sorted plaintext characters.
Then, the ciphertext character is replaced by the determined plaintext character.
Through the method, each ciphertext character in the ciphertext text a is replaced by the corresponding plaintext character, so that the replaced text is obtained.
A specific example is used to describe the process of replacing ciphertext text a in the embodiment of the present application. Assume that the preset number is 3, that the sorted plaintext characters are "today is you me", and that ciphertext text a is specifically bdccd, so that ciphertext text a contains ciphertext characters b, c and d, and the sorted ciphertext characters are c, d, b. Taking ciphertext character d as an example, its order information among the sorted ciphertext characters is determined to be second; the plaintext character that is also second among the sorted plaintext characters is therefore determined and used to replace ciphertext character d.
Then, based on each ciphertext character and its determined plaintext character, the replaced text is obtained as "the current day and day".
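The rank-matching replacement of the first mode can be sketched as below. This is an illustration under stated assumptions, not the embodiment's code: the function names are hypothetical, descending order is chosen (the text allows either direction), ties are broken by character code so that the ciphertext text bdccd yields the order c, d, b as in the example, and the Latin-letter public text "XXXXYYYZZW" is a stand-in for a real public corpus.

```python
from collections import Counter

def rank_by_frequency(texts):
    """Return the distinct characters sorted by descending frequency over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    # Assumption: ties are broken by character code to keep the order deterministic.
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def replace_by_rank(cipher_texts, public_text):
    """Map the i-th ranked ciphertext character to the i-th ranked plaintext character."""
    cipher_ranked = rank_by_frequency(cipher_texts)
    plain_ranked = rank_by_frequency([public_text])
    if len(plain_ranked) < len(cipher_ranked):
        raise ValueError("public text has too few distinct characters")
    table = dict(zip(cipher_ranked, plain_ranked))
    replaced = ["".join(table[ch] for ch in text) for text in cipher_texts]
    return replaced, table

replaced, table = replace_by_rank(["bdccd"], "XXXXYYYZZW")
print(table)     # {'c': 'X', 'd': 'Y', 'b': 'Z'}
print(replaced)  # ['ZYXXY']
```

Because the table depends only on rank, the same table can be reused later to replace new ciphertext texts, provided both frequency orders are kept fixed.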
The second mode: replacement using the ciphertext word vectors and the plaintext word vectors.
When S11 is executed, the method specifically includes:
S11-2-1: Based on the trained ciphertext word vector model, take each ciphertext character as an input parameter and determine the ciphertext word vector corresponding to that ciphertext character; and, based on the trained plaintext word vector model, take each plaintext character as an input parameter and determine the plaintext word vector corresponding to that plaintext character.
In the embodiment of the application, based on the trained ciphertext word vector model, each ciphertext character is used as an input parameter, and the output parameter is a ciphertext word vector corresponding to each ciphertext character, and simultaneously, based on the trained plaintext word vector model, each plaintext character is used as an input parameter, and the output parameter is a plaintext word vector corresponding to each plaintext character.
The ciphertext character vector model is used for determining ciphertext character vectors corresponding to ciphertext characters, and is obtained by adopting an unsupervised word vector training method; the plaintext word vector model is used for determining a plaintext word vector corresponding to the plaintext character, and the plaintext word vector model is obtained by adopting an unsupervised word vector training method.
S11-2-2: Perform the following operations for each ciphertext text: based on the vector similarity between the ciphertext word vector of each ciphertext character in the ciphertext text and each plaintext word vector, determine the plaintext character that satisfies the vector similarity condition, and replace the corresponding ciphertext character with the determined plaintext character, so as to obtain a replaced text.
In the embodiment of the present application, the replaced text corresponding to each ciphertext text is obtained; the following explains the process of obtaining the replaced text by taking ciphertext text a as an example, specifically including:
for each ciphertext character in the ciphertext text a, the following operations are respectively executed:
firstly, according to a ciphertext word vector corresponding to any ciphertext character and a plaintext word vector corresponding to each plaintext character, respectively determining the vector similarity between the ciphertext character and each plaintext character.
And secondly, determining the plaintext characters meeting the preset vector similarity condition from the plaintext characters based on the vector similarity.
The plaintext character with the largest vector similarity can be determined from the plaintext characters and used as the replaced plaintext character.
And then, replacing the corresponding ciphertext character by using the determined plaintext character to obtain a replaced text.
Through the method, each ciphertext character in the ciphertext text a is replaced by the corresponding plaintext character, so that the replaced text is obtained.
The following describes the process of obtaining a replaced text in the embodiment of the present application by using a specific example. For example, assume that the preset number is 3, that there are four plaintext characters, the first three of which are "present", "day" and "yes", and that ciphertext text a is "!d2". Taking the ciphertext character "!" as an example, the vector similarity between the ciphertext word vector of "!" and the plaintext word vector of the plaintext character "present" is calculated to be 0.2; the vector similarity with the plaintext word vector of "day" is 0.3; the vector similarity with the plaintext word vector of "yes" is 0.15; and the vector similarity with the plaintext word vector of the fourth plaintext character is 0.4. The fourth plaintext character, which has the largest vector similarity, is therefore determined to satisfy the preset vector similarity condition and is used to replace the ciphertext character "!".
S12: Train the model to be trained based on each replaced text to obtain a pre-training model, and perform corresponding processing based on the pre-training model.
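The second mode can be sketched as follows. The source does not name the similarity measure, so cosine similarity is an assumption here, as are the function names and the toy two-dimensional vectors; in the embodiment the vectors would come from the trained ciphertext and plaintext word vector models.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_plain(cipher_vec, plain_vectors):
    """Pick the plaintext character whose vector is most similar to cipher_vec."""
    return max(plain_vectors, key=lambda ch: cosine(cipher_vec, plain_vectors[ch]))

def replace_by_similarity(cipher_text, cipher_vectors, plain_vectors):
    """Replace every ciphertext character by its most similar plaintext character."""
    return "".join(
        most_similar_plain(cipher_vectors[ch], plain_vectors) for ch in cipher_text
    )

# Toy vectors, purely illustrative.
plain_vectors = {"P": [1.0, 0.0], "Q": [0.0, 1.0], "R": [1.0, 1.0]}
cipher_vectors = {"!": [0.9, 0.1], "d": [0.1, 0.9], "2": [0.5, 0.5]}
print(replace_by_similarity("!d2", cipher_vectors, plain_vectors))  # PQR
```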
In the embodiment of the application, each replaced text is input into the model to be trained, the model to be trained is trained, a trained pre-training model is obtained, and corresponding processing is performed based on the pre-training model.
Optionally, in the embodiment of the present application, a possible implementation manner is provided for training a model to be trained, and the following description of the process of training the model to be trained in the embodiment of the present application specifically includes:
S121: Perform mask processing on each replaced text to obtain each mask text.
In the embodiment of the application, each replaced text is subjected to mask processing according to a preset mask strategy to obtain each mask text.
It should be noted that the preset masking policy may adopt a random masking manner, and the proportion of the plaintext characters subjected to the masking processing is not more than 20% of the total amount of each plaintext character contained in the replaced text.
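A random masking policy that respects the 20% ceiling could look like the sketch below. The function name `mask_text`, the `[MASK]` placeholder token and the seeded generator are assumptions for illustration; the source only specifies random masking and the upper bound on the masked proportion.

```python
import random

MASK = "[MASK]"  # assumed placeholder token

def mask_text(chars, max_ratio=0.2, rng=None):
    """Randomly mask at most max_ratio of the characters in one replaced text."""
    rng = rng or random.Random()
    n_mask = int(len(chars) * max_ratio)  # rounding down keeps the share <= max_ratio
    positions = sorted(rng.sample(range(len(chars)), n_mask))
    masked = list(chars)
    for p in positions:
        masked[p] = MASK
    return masked, positions

original = list("abcdefghij")
masked, positions = mask_text(original, rng=random.Random(0))
print(masked, positions)
```

Returning the masked positions alongside the mask text keeps the original characters recoverable for building training targets later.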
S122: Determine the mask text vector corresponding to each mask text.
In the embodiment of the present application, the following operations are performed for each mask text: the plaintext word vector corresponding to each plaintext character in the mask text is determined, and the mask text vector corresponding to the mask text is determined based on these plaintext word vectors.
S123: Input each mask text into the model to be trained, and train the model to be trained based on each mask text until a preset training cutoff condition is met, so as to obtain the trained pre-training model.
In the embodiment of the application, the model to be trained is trained based on each mask text, so that the trained pre-training model is obtained.
The preset training cutoff condition may be, for example, reaching the maximum number of training steps (for example, 1,000,000 steps) or reaching a set number of training rounds (for example, 100 rounds), which is not limited in the embodiment of the present application.
Optionally, in this embodiment of the present application, a possible implementation manner for training a model to be trained is provided, and a process for training the model to be trained in this embodiment of the present application is described below by taking any one mask text vector as an example, which specifically includes:
S1231: Determine the preset number of plaintext characters corresponding to each mask position in the mask text of any mask text vector.
The preset number of plaintext characters are either the preset number of plaintext characters having the highest similarity to the plaintext character at the mask position, selected according to the plaintext word vector of that character; or the preset number of plaintext characters arranged before, and the preset number of plaintext characters arranged after, the plaintext character at the mask position, determined according to the order information of that character.
In the embodiment of the application, a plurality of preset plaintext characters corresponding to each mask position in a mask text of any mask text vector are determined.
It should be noted that, in the embodiment of the present application, the predetermined number of plaintext characters may be determined in the following two ways.
The first mode is as follows: determined from the plaintext word vector.
In the embodiment of the application, the original plaintext character at the mask position in the mask text is determined; then, according to the vector similarity between the plaintext word vector of that plaintext character and the plaintext word vector of each of the other plaintext characters, the preset number of plaintext characters with the highest vector similarity are determined from the plaintext characters and used as the preset number of plaintext characters for replacing that plaintext character.
The second mode is as follows: determined according to the sequence information.
In the embodiment of the application, an original plaintext character at a mask position in a mask text is determined, then a preset number of plaintext characters arranged before the plaintext character and a preset number of plaintext characters arranged after the plaintext character are determined according to sequence information corresponding to the plaintext character, and the determined plaintext characters are used as the preset number of plaintext characters for replacing the plaintext character.
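Both ways of choosing the candidate plaintext characters for a mask position can be sketched briefly. The function names are hypothetical, cosine similarity is again an assumed measure, and the sample data are toy values.

```python
import math

def candidates_by_order(sorted_chars, original, k):
    """Second way: the k characters ranked just before and the k just after `original`."""
    i = sorted_chars.index(original)
    return sorted_chars[max(0, i - k):i] + sorted_chars[i + 1:i + 1 + k]

def candidates_by_vector(vectors, original, k):
    """First way: the k other characters whose vectors are most similar to `original`."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    others = [ch for ch in vectors if ch != original]
    others.sort(key=lambda ch: cosine(vectors[original], vectors[ch]), reverse=True)
    return others[:k]

sorted_chars = list("ABCDEFG")
print(candidates_by_order(sorted_chars, "D", 2))  # ['B', 'C', 'E', 'F']

vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.5, 0.5]}
print(candidates_by_vector(vectors, "A", 2))      # ['B', 'D']
```

Near the ends of the sorted list the order-based variant simply returns fewer candidates; how the embodiment treats that case is not stated in the source.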
S1232: Generate each training sample based on the mask text vector and the plaintext word vectors respectively corresponding to the preset number of plaintext characters.
In the embodiment of the present application, after obtaining each plaintext character for replacement at each mask position, the following operations are performed for each plaintext character respectively: training samples are generated based on the mask text vector and a plaintext word vector corresponding to any plaintext character, and therefore each training sample can be obtained.
S1233: Input each training sample into the model to be trained, calculate, through an attention mechanism, a loss value with the preset number of plaintext characters as output parameters, and adjust each model parameter of the model to be trained based on the loss value, so as to obtain model parameters that converge to the expected plaintext character.
In the embodiment of the application, each training sample is respectively input into a model to be trained, the loss value of the model is calculated by an attention mechanism when a plurality of preset plaintext characters are taken as output parameters, and each model parameter of the model to be trained is optimized according to the loss value of the model to obtain the model parameter converged in the expected plaintext character, so that the pre-training model after training is obtained.
Therefore, in the embodiment of the present application, the computational complexity of training the model to be trained is M × N, where M is the number of mask positions in each mask text and N is the number of preset plaintext characters corresponding to the original plaintext character at each mask position.
It should be noted that the model to be trained in the embodiment of the present application may be a BERT model or an ALBERT model, where the BERT model uses the standard 12-layer, 12-head, 768-dimension BERT structure; the model is optimized with the Adam optimizer, and the model parameters are adjusted using a cross-entropy (CE) loss function.
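The M × N figure follows directly from how the training samples are built: one sample per (mask position, candidate plaintext character) pair. The sketch below makes that count explicit; the sample layout (a tuple of mask text vector, position and candidate vector) and all names are assumptions, since the source does not fix a data structure.

```python
def build_samples(mask_text_vector, candidates_per_position, char_vectors):
    """Create one training sample per (mask position, candidate character) pair."""
    samples = []
    for position, candidates in candidates_per_position.items():
        for ch in candidates:
            samples.append((mask_text_vector, position, char_vectors[ch]))
    return samples

# Toy data: M = 2 mask positions, N = 3 candidates per position.
char_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
mask_text_vector = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # positions 0 and 2 are masked
candidates_per_position = {0: ["A", "B", "C"], 2: ["A", "B", "C"]}

samples = build_samples(mask_text_vector, candidates_per_position, char_vectors)
print(len(samples))  # 6, i.e. M * N
```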
Further, in the embodiment of the application, after the pre-training model is obtained, the pre-training model is sent to the client side with the data viewing permission, so that the client side performs text processing by using the pre-training model.
It should be noted that, in the embodiment of the present application, a process of performing text processing by using a pre-training model at a client is described in detail in the following embodiments, which is not described herein in any more detail.
In the embodiment of the application, the training data acquired by the training client is the ciphertext corpus, so that sensitive information in the ciphertext corpus cannot be acquired, and the data security is ensured.
Based on the foregoing embodiment, referring to fig. 2, another schematic flow chart of a model pre-training method in the embodiment of the present application is shown, which specifically includes:
S20: Encrypt the obtained corpus to be processed to obtain the ciphertext corpus corresponding to the corpus to be processed.
In the embodiment of the application, the corpus to be processed is obtained and encrypted, so as to obtain the ciphertext corpus corresponding to the corpus to be processed.
Optionally, in this embodiment of the present application, a possible implementation manner is provided for encrypting the corpus to be processed, and the following process of encrypting the corpus to be processed and obtaining the ciphertext corpus corresponding to the corpus to be processed in this embodiment of the present application is elaborated in detail, and specifically includes:
S201: Determine the word frequency corresponding to each character to be processed in the corpus to be processed.
The word frequency of each character to be processed is the number of occurrences of that character in the corpus to be processed.
In the embodiment of the application, the following operations are executed respectively for each character to be processed: counting the occurrence frequency of any character to be processed in the corpus to be processed, and taking the occurrence frequency of the character to be processed as the character frequency of the character to be processed corresponding to the character to be processed. By the method, the character frequency of the character to be processed corresponding to each character to be processed can be obtained.
Optionally, in this embodiment of the application, before determining the word frequency of the to-be-processed character corresponding to each to-be-processed character, the to-be-processed corpus needs to be preprocessed first, so that when determining the word frequency of each to-be-processed character, the word frequency of the to-be-processed character corresponding to each to-be-processed character in the preprocessed to-be-processed corpus is determined.
When the corpus to be processed is preprocessed, the corpus to be processed can be subjected to processing such as coding format conversion, noise reduction and illegal character removal, and the preprocessed corpus is obtained.
S202: Encrypt each character to be processed based on the word frequency of each character to be processed and a preset encryption algorithm to obtain the ciphertext character corresponding to that character, and establish a mapping relationship between each character to be processed and its corresponding ciphertext character.
In the embodiment of the application, the characters to be processed are first sorted by their word frequencies to obtain the sorted characters to be processed; the sorted characters to be processed are then encrypted one by one to obtain their corresponding ciphertext characters; and the mapping relationship between each character to be processed and its corresponding ciphertext character is then established.
For example, suppose the characters to be processed include "you" and "good". The ciphertext character corresponding to the character to be processed "you" is "d", a further character to be processed corresponds to the ciphertext character "x", and the ciphertext character corresponding to the character to be processed "good" is "9"; the mapping relationship between each character to be processed and its corresponding ciphertext character is established accordingly.
It should be noted that the preset encryption algorithm in the embodiment of the present application may be, for example, a national cryptographic algorithm (a national-standard commercial cipher), which is not limited in the embodiment of the present application.
S203: Based on the mapping relationship between each character to be processed and its corresponding ciphertext character, replace each character to be processed with its corresponding ciphertext character to obtain the ciphertext corpus.
In the embodiment of the application, after the mapping relationship between each character to be processed and the corresponding ciphertext character is obtained, each character to be processed in the corpus to be processed is replaced by the corresponding ciphertext character respectively based on the mapping relationship between each character to be processed and the corresponding ciphertext character, and the ciphertext corpus is obtained.
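Steps S201 to S203 can be sketched as a character-level mapping table. The source leaves the preset encryption algorithm open; the randomly shuffled substitution table used here is only a stand-in so that the example runs with the standard library, and it is not a national cryptographic algorithm. The function names and the symbol alphabet are assumptions as well.

```python
from collections import Counter
import secrets
import string

def build_mapping(corpus, alphabet=string.ascii_letters + string.digits):
    """S201/S202: count characters, sort by frequency, assign one ciphertext symbol each."""
    counts = Counter(corpus)
    ordered = [ch for ch, _ in counts.most_common()]  # descending frequency
    if len(ordered) > len(alphabet):
        raise ValueError("alphabet too small for the corpus")
    symbols = list(alphabet)
    # Stand-in for the preset encryption algorithm: a randomly shuffled table.
    secrets.SystemRandom().shuffle(symbols)
    return dict(zip(ordered, symbols))

def encrypt(corpus, mapping):
    """S203: replace every character to be processed by its ciphertext character."""
    return "".join(mapping[ch] for ch in corpus)

corpus = "hello world"
mapping = build_mapping(corpus)
cipher_corpus = encrypt(corpus, mapping)

# Only the holder of the mapping (the client with the data viewing right) can invert it.
inverse = {v: k for k, v in mapping.items()}
print("".join(inverse[c] for c in cipher_corpus))  # hello world
```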
S21: Send the ciphertext corpus to the training client, so that the training client obtains, based on the ciphertext corpus, the replaced text corresponding to each ciphertext text in the ciphertext corpus, trains the model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with the data viewing right.
In the embodiment of the application, after the ciphertext corpus is obtained, it is sent to the training client. After receiving the ciphertext corpus, the training client determines each ciphertext text in the ciphertext corpus, replaces each ciphertext text to obtain the corresponding replaced text, trains the model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with the data viewing right, which thereby receives the pre-training model.
It should be noted that, in the embodiment of the present application, the process of training the model to be trained by the training client may refer to the method in the foregoing embodiment, which is not described in detail herein.
S22: Perform corresponding processing based on the received pre-training model.
In the embodiment of the present application, a possible implementation manner is provided for S22, and the following describes a process of performing corresponding processing based on a received pre-training model in the embodiment of the present application, which specifically includes:
S221: Encrypt the text to be converted to obtain the corresponding ciphertext text.
In the embodiment of the application, the input text to be converted is converted to UTF-8 format and denoised to remove illegal characters; a preset ciphertext mapping table is then used to convert each character to be converted contained in the text into its corresponding ciphertext character, so as to obtain the ciphertext text corresponding to the text to be converted.
S222: Replace the ciphertext text using the plaintext characters in the preset public text to obtain a replaced text.
In the embodiment of the application, the ciphertext text is replaced by the corresponding plaintext characters based on the same replacement mode adopted in the training of the pre-training model, so that the replaced text is obtained.
S223: Input the replaced text into the pre-training model to obtain the output result of the pre-training model.
In the embodiment of the application, the replaced text is input into the pre-training model, and the output result is obtained by using the pre-training model.
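Steps S221 to S223 chain two table lookups before the model call. The sketch below assumes both tables are plain dictionaries (the ciphertext mapping table from encryption and the replacement table used during training) and treats the pre-training model as an arbitrary callable; all names and the toy tables are hypothetical.

```python
def prepare_input(text, cipher_map, replace_map):
    """S221 + S222: encrypt the text, then apply the same replacement used in training."""
    cipher_text = "".join(cipher_map[ch] for ch in text)
    return "".join(replace_map[ch] for ch in cipher_text)

def run_model(text, cipher_map, replace_map, model):
    """S223: feed the replaced text to the pre-training model and return its output."""
    return model(prepare_input(text, cipher_map, replace_map))

# Toy tables and a stand-in "model" that just reports the input length.
cipher_map = {"h": "q", "i": "7"}
replace_map = {"q": "X", "7": "Y"}
print(prepare_input("hi", cipher_map, replace_map))   # XY
print(run_model("hi", cipher_map, replace_map, len))  # 2
```

Characters absent from either table raise a `KeyError` in this sketch; how the embodiment handles unseen characters is not described in the source.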
Optionally, in the embodiment of the present application, training may also be performed directly based on the ciphertext text to obtain a model after further training, which is not limited in the embodiment of the present application.
In the embodiment of the application, the ciphertext corpus is generated by the client with the data viewing permission, which ensures that the corpus containing sensitive information is not disclosed to other users without that permission, thereby improving data security.
Based on the same inventive concept, the embodiment of the present application provides a model pre-training apparatus, which may be, for example, a training client in the foregoing embodiment, and the model pre-training apparatus may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiment, referring to fig. 3, a schematic structural diagram of a model pre-training apparatus in an embodiment of the present application is shown, which specifically includes:
the obtaining module 300 is configured to, in response to a pre-training instruction for a model to be trained, obtain a ciphertext corpus uploaded by a client having a data viewing right, and determine each ciphertext text in the ciphertext corpus, where the ciphertext corpus is obtained by encrypting a corpus to be processed, and the corpus to be processed is a text containing sensitive information;
the processing module 310 is configured to replace each ciphertext character in each ciphertext text with each plaintext character in a preset public text, so as to obtain each replaced text, where the public text includes a plurality of plaintext characters, and the public text is a public text that does not include sensitive information;
and a training module 320, configured to train the model to be trained based on the replaced texts, obtain a pre-training model, and perform corresponding processing based on the pre-training model.
Optionally, when determining each ciphertext text in the ciphertext corpus, the obtaining module 300 is further configured to:
segmenting the ciphertext corpus based on a preset paragraph ending identifier to obtain each ciphertext paragraph;
and splitting each ciphertext paragraph into sentences based on preset encrypted sentence-delimiter characters to obtain the corresponding ciphertext texts.
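The two-level segmentation above can be sketched as follows. The function name and parameters are hypothetical; the sketch assumes the paragraph ending identifier is a known string and that the encrypted sentence delimiters are single characters supplied as a set.

```python
def split_ciphertext_corpus(corpus: str, paragraph_end: str, sentence_delims: set) -> list:
    """Split a ciphertext corpus into paragraphs, then into sentence-level ciphertext texts."""
    texts = []
    for paragraph in corpus.split(paragraph_end):
        current = []
        for ch in paragraph:
            current.append(ch)
            # An encrypted sentence delimiter closes the current ciphertext text.
            if ch in sentence_delims:
                texts.append("".join(current))
                current = []
        # Keep a trailing fragment that has no closing delimiter.
        if current:
            texts.append("".join(current))
    return texts
```

Each delimiter is kept at the end of its sentence so that the later character replacement sees the same characters as the original ciphertext.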
Optionally, the processing module 310 is further configured to:
determining a ciphertext character frequency corresponding to each ciphertext character in the ciphertext texts and a plaintext character frequency corresponding to each plaintext character in the preset public text, wherein each ciphertext character frequency is the occurrence frequency of the corresponding ciphertext character in the ciphertext texts, and each plaintext character frequency is the occurrence frequency of the corresponding plaintext character in the public text;
and replacing each ciphertext character in each ciphertext text with a corresponding plaintext character based on the ciphertext character frequencies and the plaintext character frequencies to obtain each replaced text.
Optionally, when replacing each ciphertext character in each ciphertext text with a corresponding plaintext character to obtain each replaced text, the processing module 310 is further configured to:
sorting the ciphertext characters by their ciphertext character frequencies to obtain sorted ciphertext characters, and sorting the plaintext characters by their plaintext character frequencies to obtain sorted plaintext characters;
and performing the following operations for each ciphertext text: based on the sequence information of each ciphertext character in the ciphertext text, determining, from the sorted plaintext characters, the plaintext character whose sequence information is identical to that of the corresponding ciphertext character, and replacing the corresponding ciphertext character with the determined plaintext character to obtain a replaced text.
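The frequency-rank replacement described above can be sketched as follows. The function names are hypothetical, and two details are assumptions that the text does not state: characters are sorted in descending order of frequency, and ties are broken by code point so the result is deterministic.

```python
from collections import Counter


def rank_by_frequency(texts: list) -> list:
    """Return the distinct characters of the texts, most frequent first."""
    counts = Counter(ch for text in texts for ch in text)
    # Descending frequency; ties broken by character (an assumption).
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def build_replacement_table(cipher_texts: list, public_text: str) -> dict:
    """Pair the i-th ranked ciphertext character with the i-th ranked plaintext character."""
    cipher_ranked = rank_by_frequency(cipher_texts)
    plain_ranked = rank_by_frequency([public_text])
    if len(plain_ranked) < len(cipher_ranked):
        raise ValueError("public text has too few distinct characters")
    return dict(zip(cipher_ranked, plain_ranked))


def replace_texts(cipher_texts: list, table: dict) -> list:
    """Replace every ciphertext character with its same-rank plaintext character."""
    return ["".join(table[ch] for ch in text) for text in cipher_texts]
```

For example, with ciphertext texts `["QQW", "QE"]` and public text `"aaabbc"`, the character `Q` (rank 1) is paired with `a`, and the replaced texts are `["aac", "ab"]`.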
Optionally, the processing module 310 is further configured to:
based on the trained ciphertext word vector model, determining the ciphertext word vector corresponding to each ciphertext character by taking each ciphertext character as an input parameter, and, based on the trained ciphertext word vector model, determining the plaintext word vector corresponding to each plaintext character by taking each plaintext character as an input parameter;
and performing the following operations for each ciphertext text: based on the vector similarity between the ciphertext word vector of each ciphertext character in the ciphertext text and each plaintext word vector, determining the plaintext character that satisfies a vector similarity condition, and replacing the corresponding ciphertext character with the determined plaintext character to obtain the replaced text.
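A minimal sketch of the similarity-based replacement follows. The text does not define the vector similarity condition, so the sketch assumes "highest cosine similarity"; it also assumes the word vectors are already available as plain lists of floats in a comparable space. All names are hypothetical.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_plaintext(cipher_vec: list, plain_vectors: dict) -> str:
    """Pick the plaintext character whose vector is most similar to cipher_vec."""
    return max(plain_vectors, key=lambda ch: cosine(cipher_vec, plain_vectors[ch]))


def replace_by_similarity(cipher_text: str, cipher_vectors: dict, plain_vectors: dict) -> str:
    """Replace each ciphertext character with its most similar plaintext character."""
    return "".join(nearest_plaintext(cipher_vectors[ch], plain_vectors) for ch in cipher_text)
```

Unlike the frequency-rank variant, this choice does not by itself guarantee a one-to-one mapping: two ciphertext characters may select the same plaintext character.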
Optionally, when training the model to be trained based on the replaced texts to obtain a pre-training model, the training module 320 is further configured to:
respectively performing mask processing on the replaced texts to obtain mask texts;
respectively determining mask text vectors corresponding to the mask texts;
and inputting the mask text vectors into the model to be trained and training the model to be trained based on the mask text vectors until a preset training cutoff condition is met, to obtain the trained pre-training model.
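The mask processing step can be sketched as follows. The text does not give a masking rate or mask symbol, so the `[MASK]` token and the 15% default probability are assumptions borrowed from common masked-language-model practice; vectorization and the training loop are omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def mask_text(text: str, mask_prob: float = 0.15, rng: random.Random = None):
    """Randomly mask characters of a replaced text.

    Returns the token list with masked positions replaced by MASK, and a
    dict mapping each masked position to the original character (the label).
    """
    rng = rng or random.Random()
    tokens = list(text)
    labels = {}
    for i, ch in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = ch
            tokens[i] = MASK
    return tokens, labels
```

The labels record which plaintext character the model is expected to predict at each mask position.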
Optionally, when inputting the mask text vectors into the model to be trained and training the model to be trained based on the mask text vectors until the preset training cutoff condition is met to obtain the trained pre-training model, the training module 320 is further configured to:
for each mask text vector, respectively performing the following operations:
determining a preset number of plaintext characters corresponding to each mask position in the mask text of the mask text vector, wherein the preset number of plaintext characters are either the plaintext characters with the highest similarity to the plaintext character at the mask position, selected according to the plaintext word vector of that plaintext character, or the preset number of plaintext characters ranked before and the preset number of plaintext characters ranked after the plaintext character at the mask position, determined according to the sequence information of that plaintext character;
generating training samples based on the mask text vector and the plaintext word vectors corresponding to the preset number of plaintext characters;
and inputting the training samples into the model to be trained, calculating, through an attention mechanism, loss values with the preset number of plaintext characters as output parameters, and adjusting the model parameters of the model to be trained based on the loss values to obtain model parameters that converge to the expected plaintext characters.
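The two ways of choosing candidate plaintext characters for a mask position can be sketched as follows. The function names and the parameter `k` (the preset number) are hypothetical, and excluding the masked character itself from its similarity candidates is an assumption. Loss computation and parameter updates are not shown.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(target: str, plain_vectors: dict, k: int) -> list:
    """Option 1: the k plaintext characters most similar to the masked character."""
    others = [ch for ch in plain_vectors if ch != target]
    others.sort(key=lambda ch: cosine(plain_vectors[target], plain_vectors[ch]), reverse=True)
    return others[:k]


def rank_window_candidates(target: str, ranked_plain: list, k: int) -> list:
    """Option 2: up to k characters ranked before and up to k ranked after the masked one."""
    i = ranked_plain.index(target)
    before = ranked_plain[max(0, i - k):i]
    after = ranked_plain[i + 1:i + 1 + k]
    return before + after
```

Near either end of the sorted list the rank window simply returns fewer candidates rather than wrapping around.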
Optionally, when performing corresponding processing based on the pre-training model, the training module 320 is further configured to:
and sending the pre-training model to a client with data viewing permission so that the client performs text processing by adopting the pre-training model.
Based on the same inventive concept, another model pre-training device is provided in the embodiment of the present application. This device may be, for example, the client with data viewing permission in the foregoing embodiment, and may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiment, referring to fig. 4, a schematic structural diagram of another model pre-training device in the embodiment of the present application is shown, which specifically includes:
the encryption module 400 is configured to encrypt the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed;
an obtaining module 410, configured to send the ciphertext corpus to a training client, so that the training client obtains, based on the ciphertext corpus, a replaced text corresponding to each ciphertext text in the ciphertext corpus, trains a to-be-trained model based on each replaced text, obtains a pre-trained model, and returns the pre-trained model to the client having the data viewing permission;
and the processing module 420 is configured to perform corresponding processing based on the received pre-training model.
Optionally, the encryption module 400 is further configured to:
determining a character frequency corresponding to each character to be processed in the corpus to be processed, wherein each character frequency is the occurrence frequency of the corresponding character to be processed in the corpus to be processed;
encrypting each character to be processed based on its character frequency, in combination with a preset encryption algorithm, to obtain the ciphertext character corresponding to that character, and establishing a mapping relation between each character to be processed and the corresponding ciphertext character;
and replacing each character to be processed with the corresponding ciphertext character based on the mapping relation between each character to be processed and the corresponding ciphertext character to obtain the ciphertext corpus.
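The text does not name the preset encryption algorithm or say exactly how the character frequency enters it, so the following is only one hypothetical reading. Characters are processed in descending frequency order, and each is assigned a ciphertext character in the Unicode Private Use Area (U+E000 to U+F8FF), chosen by an HMAC-SHA256 keyed hash with linear probing on collisions. The key, the target code-point range, and all names are assumptions.

```python
import hashlib
import hmac
from collections import Counter

PUA_START, PUA_SIZE = 0xE000, 6400  # Private Use Area U+E000..U+F8FF


def build_cipher_table(corpus: str, key: bytes) -> dict:
    """Build a one-to-one mapping from each character to a ciphertext character."""
    counts = Counter(corpus)
    if len(counts) > PUA_SIZE:
        raise ValueError("too many distinct characters for the target range")
    table, used = {}, set()
    # Most frequent characters first, so they get their preferred slot.
    for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        digest = hmac.new(key, ch.encode("utf-8"), hashlib.sha256).digest()
        slot = int.from_bytes(digest[:4], "big") % PUA_SIZE
        while slot in used:  # linear probing keeps the mapping injective
            slot = (slot + 1) % PUA_SIZE
        used.add(slot)
        table[ch] = chr(PUA_START + slot)
    return table


def encrypt_corpus(corpus: str, table: dict) -> str:
    """Replace every character with its ciphertext character."""
    return "".join(table[ch] for ch in corpus)
```

Because the mapping is a one-to-one character substitution, the ciphertext corpus has the same character frequency distribution as the original corpus, which is what the frequency-based replacement on the training client relies on.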
Optionally, the processing module 420 is further configured to:
encrypting the text to be converted to obtain a corresponding ciphertext text;
replacing the ciphertext characters in the ciphertext text with plaintext characters from a preset public corpus to obtain a replaced text;
and inputting the replaced text into the pre-training model to obtain an output result of the pre-training model.
Based on the above embodiments, fig. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.
An embodiment of the present application provides an electronic device, which may include a processor 510 (CPU), a memory 520, an input device 530, an output device 540, and the like. The input device 530 may include a keyboard, a mouse, a touch screen, and the like, and the output device 540 may include a display device, such as a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT).
Memory 520 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 510 with program instructions and data stored in memory 520. In the embodiment of the present application, the memory 520 may be used to store a program of any one of the model pre-training methods in the embodiment of the present application.
Processor 510 is configured to call the program instructions stored in memory 520 and to execute any of the model pre-training methods of the embodiments of the present application according to the obtained program instructions.
Based on the foregoing embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the model pre-training method in any of the above method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of model pre-training, comprising:
responding to a pre-training instruction aiming at a model to be trained, acquiring ciphertext corpora uploaded by a client with data viewing authority, and determining each ciphertext text in the ciphertext corpora, wherein the ciphertext corpora are obtained by encrypting the corpora to be processed, and the corpora to be processed are texts containing sensitive information;
respectively adopting each plaintext character in a preset public text to replace each ciphertext character in each ciphertext text to obtain each replaced text, wherein the public text comprises a plurality of plaintext characters, and is a public text which does not comprise sensitive information;
and training the model to be trained based on the replaced texts to obtain a pre-training model, and performing corresponding processing based on the pre-training model.
2. The method of claim 1, wherein determining each ciphertext text in the ciphertext corpus comprises:
segmenting the ciphertext corpus based on a preset paragraph ending identifier to obtain each ciphertext paragraph;
and splitting each ciphertext paragraph into sentences based on preset encrypted sentence-delimiter characters to obtain the corresponding ciphertext texts.
3. The method according to claim 1, wherein replacing each ciphertext character in each ciphertext text with each plaintext character in a preset public text to obtain each replaced text specifically comprises:
determining a ciphertext character frequency corresponding to each ciphertext character in the ciphertext texts and a plaintext character frequency corresponding to each plaintext character in the preset public text, wherein each ciphertext character frequency is the occurrence frequency of the corresponding ciphertext character in the ciphertext texts, and each plaintext character frequency is the occurrence frequency of the corresponding plaintext character in the public text;
and replacing each ciphertext character in each ciphertext text with a corresponding plaintext character based on the ciphertext character frequencies and the plaintext character frequencies to obtain each replaced text.
4. The method according to claim 3, wherein the step of replacing each ciphertext character in each ciphertext text with a corresponding plaintext character to obtain each replaced text comprises:
sorting the ciphertext characters by their ciphertext character frequencies to obtain sorted ciphertext characters, and sorting the plaintext characters by their plaintext character frequencies to obtain sorted plaintext characters;
and performing the following operations for each ciphertext text: based on the sequence information of each ciphertext character in the ciphertext text, determining, from the sorted plaintext characters, the plaintext character whose sequence information is identical to that of the corresponding ciphertext character, and replacing the corresponding ciphertext character with the determined plaintext character to obtain a replaced text.
5. The method according to claim 1, wherein replacing each ciphertext character in each ciphertext text with each plaintext character in a preset public text to obtain each replaced text specifically comprises:
based on the trained ciphertext word vector model, determining the ciphertext word vector corresponding to each ciphertext character by taking each ciphertext character as an input parameter, and, based on the trained ciphertext word vector model, determining the plaintext word vector corresponding to each plaintext character by taking each plaintext character as an input parameter;
and performing the following operations for each ciphertext text: based on the vector similarity between the ciphertext word vector of each ciphertext character in the ciphertext text and each plaintext word vector, determining the plaintext character that satisfies a vector similarity condition, and replacing the corresponding ciphertext character with the determined plaintext character to obtain the replaced text.
6. A method of model pre-training, comprising:
encrypting the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed;
sending the ciphertext corpus to a training client, so that the training client obtains, based on the ciphertext corpus, replaced texts corresponding to the ciphertext texts in the ciphertext corpus, trains a model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with data viewing permission;
and performing corresponding processing based on the received pre-training model.
7. A model pre-training apparatus, comprising:
the acquisition module is used for responding to a pre-training instruction aiming at a model to be trained, acquiring ciphertext corpora uploaded by a client with data viewing authority, and determining each ciphertext text in the ciphertext corpora, wherein the ciphertext corpora are obtained by encrypting the corpora to be processed, and the corpora to be processed are texts containing sensitive information;
the processing module is used for replacing each ciphertext character in each ciphertext text by respectively adopting each plaintext character in a preset public text to obtain each replaced text, wherein the public text comprises a plurality of plaintext characters, and is a public text which does not comprise sensitive information;
and the training module is used for training the model to be trained based on the replaced texts to obtain a pre-training model and carrying out corresponding processing based on the pre-training model.
8. A model pre-training apparatus, comprising:
the encryption module is used for encrypting the obtained corpus to be processed to obtain a ciphertext corpus corresponding to the corpus to be processed;
the obtaining module is used for sending the ciphertext corpus to a training client, so that the training client obtains, based on the ciphertext corpus, replaced texts corresponding to the ciphertext texts in the ciphertext corpus, trains a model to be trained based on the replaced texts to obtain a pre-training model, and returns the pre-training model to the client with data viewing permission;
and the processing module is used for carrying out corresponding processing based on the received pre-training model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-5 or 6 are performed when the program is executed by the processor.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 5 or 6.
CN202111656897.XA 2021-12-30 2021-12-30 Model pre-training method and device, electronic equipment and storage medium Active CN114297693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111656897.XA CN114297693B (en) 2021-12-30 2021-12-30 Model pre-training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111656897.XA CN114297693B (en) 2021-12-30 2021-12-30 Model pre-training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114297693A true CN114297693A (en) 2022-04-08
CN114297693B CN114297693B (en) 2022-11-18

Family

ID=80974449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111656897.XA Active CN114297693B (en) 2021-12-30 2021-12-30 Model pre-training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114297693B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595566A (en) * 2023-07-17 2023-08-15 山东云天安全技术有限公司 Data encryption system and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571323A (en) * 2011-12-31 2012-07-11 深圳市永达电子股份有限公司 Non-standard character repertoire processing method and device, and system adopting same
CN106788993A (en) * 2016-12-02 2017-05-31 乐视控股(北京)有限公司 A kind of encryption communication method, device and electronic equipment
US20190197256A1 (en) * 2017-12-21 2019-06-27 Compugroup Medical Se Method for accessing a database stored on a server using a relation
CN110134953A (en) * 2019-05-05 2019-08-16 北京科技大学 Chinese medicine name entity recognition method and identifying system based on Chinese medical book document
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN111898135A (en) * 2020-02-12 2020-11-06 北京京东尚科信息技术有限公司 Data processing method, data processing apparatus, computer device, and medium
CN112560443A (en) * 2020-12-29 2021-03-26 平安银行股份有限公司 Choice question generation model training method, choice question generation method, device and medium
CN113011126A (en) * 2021-03-11 2021-06-22 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113065330A (en) * 2021-03-22 2021-07-02 四川大学 Method for extracting sensitive information from unstructured data
CN113742454A (en) * 2021-09-09 2021-12-03 平安科技(深圳)有限公司 Response corpus generation method based on artificial intelligence and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李亮: "基于ALBERT的藏文预训练模型及其应用", 《硕士电子期刊》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595566A (en) * 2023-07-17 2023-08-15 山东云天安全技术有限公司 Data encryption system and electronic equipment
CN116595566B (en) * 2023-07-17 2023-10-20 山东云天安全技术有限公司 Data encryption system and electronic equipment

Also Published As

Publication number Publication date
CN114297693B (en) 2022-11-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant