CN112199955A - Anti-named entity recognition encoder countermeasure training and privacy protection method and device - Google Patents

Anti-named entity recognition encoder countermeasure training and privacy protection method and device Download PDF

Info

Publication number
CN112199955A
Authority
CN
China
Prior art keywords
text
loss
recognition
network
reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011173866.4A
Other languages
Chinese (zh)
Other versions
CN112199955B (en)
Inventor
刘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011173866.4A priority Critical patent/CN112199955B/en
Publication of CN112199955A publication Critical patent/CN112199955A/en
Application granted granted Critical
Publication of CN112199955B publication Critical patent/CN112199955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of this specification provide a coding network training and privacy protection method and apparatus for anti-named entity recognition, wherein the method comprises: obtaining a first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for the named entities in the original text; inputting the original text of each first sample into a coding network to obtain a feature text; inputting the feature text into a pre-trained named entity recognition network model, obtaining a recognition result for the named entities, and determining a recognition loss according to the recognition result and the label corresponding to the first sample; inputting the feature text into a reconstruction network model to obtain a reconstructed text, and determining a reconstruction loss according to the reconstructed text and the original text; determining a comparison loss according to the original text and the feature text; determining a coding loss, wherein the coding loss is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss; and updating the coding network with the goal that the coding loss tends to decrease.

Description

Anti-named entity recognition encoder countermeasure training and privacy protection method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning and the field of data security, and in particular, to a method and an apparatus for encoding network training and privacy protection for anti-named entity recognition.
Background
Text serves as a carrier for recording human thought and language, and often embeds substantial labor value and important personal information. For example, a personal notepad often records names, mobile phone numbers, addresses, employers, and similar information, and text records of chats likewise retain private details of a person's social activities. With the development of artificial intelligence, named entity recognition (NER) technology can easily parse the proper nouns appearing in text, such as names of people, places, organizations, and dates and times. This, however, also threatens personal privacy: NER can easily extract personal information from large volumes of text and accurately identify names, mobile phone numbers, addresses, companies, and the like. Therefore, how to anonymize the sensitive information in text without affecting the user's ability to understand the text content has become a focus of attention for enterprises.
Disclosure of Invention
The embodiments provided in this specification aim to provide a more effective privacy protection method against named entity recognition, addressing the deficiencies of the prior art.
According to a first aspect, there is provided a method of training a coding network for anti-named entity recognition, the method comprising:
obtaining a labeled first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text;
for each first sample, inputting the original text corresponding to the first sample into a coding network to obtain a feature text of the original text;
inputting the feature text into a pre-trained named entity recognition network model, obtaining a recognition result for the named entity, and determining a recognition loss according to the recognition result and the label corresponding to the first sample;
inputting the feature text into a reconstruction network model to obtain a reconstructed text, and determining a reconstruction loss according to the reconstructed text and the original text;
determining a comparison loss according to the original text and the feature text;
determining a coding loss that is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss;
updating the coding network with the goal that the coding loss tends to decrease.
In one embodiment, the named entity recognition network model comprises a third party named entity recognition network model;
inputting the feature text into the named entity recognition network model to obtain a recognition result for the named entity comprises: inputting the feature text into an access interface of the third-party named entity recognition network model to obtain the recognition result.
In one embodiment, the reconstruction network is a pre-trained reconstruction network.
In one embodiment, the training method further comprises: updating the reconstruction network with the goal that the reconstruction loss tends to decrease.
In one embodiment, the coding network is based on one of a convolutional neural network and a long short-term memory model.
In one embodiment, the reconstruction network model is based on a recurrent neural network.
In one embodiment, determining the recognition loss according to the recognition result and the label corresponding to the first sample comprises determining the recognition loss according to a first text edit distance between the recognition result and the label, wherein the recognition loss is positively correlated with the first text edit distance;
determining the reconstruction loss according to the reconstructed text and the original text comprises determining the reconstruction loss according to a second text edit distance between the reconstructed text and the original text, wherein the reconstruction loss is positively correlated with the second text edit distance.
In one embodiment, determining the comparison loss according to the original text and the feature text comprises one of:
determining the comparison loss according to a user's score of the content difference between the original text and the feature text;
determining the comparison loss according to the mean square error between the encoded value of the original text and the encoded value of the feature text.
In one embodiment, determining the coding loss comprises:
subtracting the recognition loss and the reconstruction loss from the comparison loss to obtain the coding loss; or
subtracting the weighted recognition loss and reconstruction loss from the comparison loss to obtain the coding loss.
According to a second aspect, there is provided a method of training a coding network for anti-named entity recognition, the method comprising:
determining the recognition effect of the entity recognition network model;
and when the recognition effect reaches a preset condition, executing the method of the first aspect to train the coding network of the first aspect.
In one embodiment, determining the recognition effect of the entity recognition network model comprises:
inputting a predetermined number of labeled text samples into the entity recognition network model, and acquiring the recognition rate of the entity recognition network model for the text samples as the recognition effect;
wherein the preset condition is that the recognition rate reaches a predetermined threshold.
According to a third aspect, there is provided a privacy preserving method of anti-named entity recognition, the method comprising:
acquiring a text to be protected;
and inputting the text to be protected into a coding network trained according to the method of the first aspect, wherein the coding network generates a privacy-protected text.
In one embodiment, the coding network is deployed in a mobile terminal or a web client.
According to a fourth aspect, there is provided an apparatus for training a coding network for anti-named entity recognition, the apparatus comprising:
a sample acquiring unit configured to acquire a labeled first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text;
a feature text acquisition unit configured to, for each first sample, input the original text corresponding to the first sample into a coding network to obtain a feature text of the original text;
a recognition loss determining unit configured to input the feature text into a pre-trained named entity recognition network model, obtain a recognition result for the named entity, and determine a recognition loss according to the recognition result and the label corresponding to the first sample;
a reconstruction loss determining unit configured to input the feature text into a reconstruction network model to obtain a reconstructed text, and determine a reconstruction loss according to the reconstructed text and the original text;
a comparison loss determining unit configured to determine a comparison loss according to the original text and the feature text;
a coding loss determining unit configured to determine a coding loss that is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss;
and a coding network updating unit configured to update the coding network with the goal that the coding loss tends to decrease.
In one embodiment, the named entity recognition network model comprises a third party named entity recognition network model;
and the recognition loss determining unit is further configured to input the feature text into an access interface of the third-party named entity recognition network model to obtain the recognition result.
In one embodiment, the reconstruction network is a pre-trained reconstruction network.
In one embodiment, the reconstruction loss determining unit is further configured to update the reconstruction network with the goal that the reconstruction loss tends to decrease.
In one embodiment, the coding network is based on one of a convolutional neural network and a long short-term memory model.
In one embodiment, the reconstruction network model is based on a recurrent neural network.
In one embodiment, the recognition loss determining unit is configured to determine the recognition loss according to a first text edit distance between the recognition result and the label, wherein the recognition loss is positively correlated with the first text edit distance;
and the reconstruction loss determining unit is configured to determine the reconstruction loss according to a second text edit distance between the reconstructed text and the original text, wherein the reconstruction loss is positively correlated with the second text edit distance.
In one embodiment, the comparison loss determining unit is configured to perform one of:
determining the comparison loss according to a user's score of the content difference between the original text and the feature text;
determining the comparison loss according to the mean square error between the encoded value of the original text and the encoded value of the feature text.
In one embodiment, the coding loss determining unit is configured to:
subtract the recognition loss and the reconstruction loss from the comparison loss to obtain the coding loss; or
subtract the weighted recognition loss and reconstruction loss from the comparison loss to obtain the coding loss.
According to a fifth aspect, there is provided an apparatus for training a coding network for anti-named entity recognition, the apparatus comprising:
a recognition effect determination unit configured to determine a recognition effect of the entity recognition network model;
a training unit configured to, when the recognition effect reaches a preset condition, execute the method of the first aspect to train the coding network of the first aspect.
In one embodiment, determining the recognition effect of the entity recognition network model comprises: inputting a predetermined number of labeled text samples into the entity recognition network model, and acquiring the recognition rate of the entity recognition network model for the text samples as the recognition effect; the preset condition is that the recognition rate reaches a predetermined threshold.
According to a sixth aspect, there is provided a privacy preserving apparatus for anti-named entity recognition, the apparatus comprising:
the device comprises a to-be-protected text acquisition unit, a to-be-protected text acquisition unit and a to-be-protected text acquisition unit, wherein the to-be-protected text acquisition unit is configured to acquire a to-be-protected text;
a privacy protection unit configured to input the text to be protected into a coding network trained according to the method of the first aspect, wherein the coding network generates the privacy-protected text.
In one embodiment, the coding network is deployed in a mobile terminal or a web client.
According to a seventh aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed on a computer, causes the computer to perform the method of any one of the first to third aspects.
According to an eighth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of the first to third aspects.
By using one or more of the methods, apparatuses, computing devices, and storage media of the above aspects, the privacy protection problem caused by named entity recognition can be solved more effectively.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating a coding network training method for anti-named entity recognition according to an embodiment of this specification;
FIG. 2 is a schematic diagram illustrating the usage effect of a client-side encoder according to an embodiment of this specification;
FIG. 3 illustrates a flowchart of a coding network training method for anti-named entity recognition according to an embodiment of this specification;
FIG. 4 illustrates a flowchart of a coding network training method for anti-named entity recognition according to another embodiment of this specification;
FIG. 5 illustrates a flowchart of a privacy protection method for anti-named entity recognition according to an embodiment of this specification;
FIG. 6 illustrates a block diagram of a coding network training apparatus for anti-named entity recognition according to an embodiment of this specification;
FIG. 7 illustrates a block diagram of a coding network training apparatus for anti-named entity recognition according to another embodiment of this specification;
FIG. 8 illustrates a block diagram of a privacy protection apparatus for anti-named entity recognition according to an embodiment of this specification.
Detailed Description
The solution provided by the present specification will be described below with reference to the accompanying drawings.
As mentioned above, named entity recognition (NER) can easily extract information related to an individual from large volumes of text, such as personal notepads or chat records, and accurately identify the individual's name, phone number, address, company, and the like, thereby creating a risk of personal privacy leakage.
In view of the above problems, the embodiments of this specification provide a method and apparatus for anti-named entity recognition coding network training and privacy protection. The basic idea is an anti-named entity recognition (anti-NER) method based on adversarial learning, which anonymizes text by slightly changing the expression or word order of the text without changing its semantics. Colloquially, the method proposed here outputs a piece of text that people can easily understand but that a machine using NER technology cannot recognize. The scheme can serve as a tool for individual users to anonymize their personal text, as a privacy tool with which a website or app protects users' text data, and more generally in any scenario involving text information.
Fig. 2 is a schematic diagram illustrating the effect of using a client-side encoder according to an embodiment of this specification. The client-side encoder is a coding network trained according to the training method for anti-named entity recognition and downloaded to a mobile client. As can be seen from Fig. 2, the encoder encodes a piece of text into a new piece of text. From a human perspective, the new text, although perhaps somewhat odd, is still understandable; for a machine recognition model using named entity recognition technology, however, the personal information in the new text is difficult to recognize. The encoder thus serves the dual purpose of encoding the original text while preserving its meaning in the encoded text, achieving a good text privacy protection effect.
Fig. 1 is a schematic diagram illustrating a coding network training method for anti-named entity recognition according to an embodiment of this specification. The coding network (Encoder) encodes an original text into a feature text. The basic training idea is threefold: the larger the difference between the recognition result obtained on the feature text and the label of the original text, the better; the larger the difference between the text reconstructed from the feature text by a separate reconstruction network and the original text, the better; and, at the same time, the smaller the difference between the feature text and the original text as computed by a comparison method (for example, user scoring or an algorithmic substitute), the better. Expressed mathematically, the loss function for training the coding network is:
L_Encoder = L_Compare - (L_Reconstructor + L_Recognizer)   (1)
where L_Compare is the comparison loss, measuring the difference between the feature text and the original text as computed by the comparison method; L_Reconstructor is the reconstruction loss, measuring the difference between the text reconstructed by the reconstruction network and the original text; L_Recognizer is the recognition loss, measuring the difference between the recognition result of the named entity recognition model and the label of the original text; and L_Encoder is the coding loss. The coding network is trained so that L_Encoder tends to decrease. In different embodiments, an independently built and trained named entity recognition model or a model built and trained by a third party may be used, as long as the model has named entity recognition capability; this specification is not limited in this respect. In addition, the text difference may be computed in multiple ways: for example, in one embodiment the text edit distance between the two texts may be computed, while in another embodiment the difference may be measured based on a mean-square-error algorithm. The method of computing the text difference is likewise not limited by this specification.
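As a concrete illustration of equation (1), the following is a minimal Python sketch of how the coding loss could be assembled from the three component losses. The function name and the optional weights alpha and beta are illustrative assumptions (the weighted variant is described later in connection with step S36); the patent itself only fixes the sign structure of the loss.

```python
def encoder_loss(l_compare: float, l_reconstruct: float, l_recognize: float,
                 alpha: float = 1.0, beta: float = 1.0) -> float:
    """Coding loss of equation (1): positively correlated with the comparison
    loss, negatively correlated with the recognition and reconstruction losses.

    With alpha = beta = 1 this is exactly equation (1); other weights give the
    weighted variant of the coding loss.
    """
    return l_compare - (alpha * l_recognize + beta * l_reconstruct)
```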
Fig. 3 shows a flowchart of a coding network training method for anti-named entity recognition according to an embodiment of this specification. As shown in Fig. 3, the method comprises the following steps:
in step S31, a labeled first sample set is obtained, where the first sample set includes a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text.
As described above, the original text is, for example, a personal note in a personal notepad, which often records names, phone numbers, addresses, companies, and other information, or a text record of a chat, which also retains private information about personal social activities. These are, of course, merely illustrative examples of original text and are not intended to limit the scope of the embodiments of this specification. In addition, each original text in the first sample set has a corresponding label, and the label itself is also in text form; in one embodiment, the label may be a text consisting of the exact named entity information extracted from the original text. In different examples, the label may be generated by manual annotation, or produced by a named entity recognition model of verified accuracy.
In step S32, for each first sample, the original text corresponding to the first sample is input into the coding network, and a feature text of the original text is obtained.
The feature text is a piece of text obtained by encoding the original text. The goal is that people can easily understand its content, while a machine model using entity recognition technology cannot recognize it.
The coding network is essentially a model for feature extraction. In one embodiment, the coding network may be a convolutional neural network model or a long short-term memory model. It is understood that, in various embodiments, the coding network may be based on various neural network models suitable for natural language processing, such as a Transformer encoder or a BERT model. This specification does not limit the specific implementation of the coding network.
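For illustration only, the following PyTorch sketch shows one possible shape of such a coding network, assuming the text has already been tokenized into integer ids. The bidirectional LSTM, the dimensions, and the per-position argmax decoding are all assumptions, since the specification deliberately leaves the architecture open.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of an LSTM-based coding network mapping a token-id sequence to a
    same-length sequence of output token logits (decoded into the feature text)."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2 * hidden)
        return self.out(h)                       # per-position vocabulary logits

# Hypothetical usage: decode the feature text as the per-position argmax.
# logits = TextEncoder(vocab_size=21128)(token_ids)
# feature_ids = logits.argmax(dim=-1)
```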
In step S33, the feature text is input into the pre-trained named entity recognition network model, a recognition result for recognizing the named entity is obtained, and the recognition loss is determined according to the recognition result and the label corresponding to the first sample.
As with the original text label, the recognition result itself is in text form, and in one embodiment, the recognition result may be a text of named entity information (often key personal privacy information) in the feature text extracted by the entity recognition network model. By comparing the recognition result with the label of the original text, the difference between the two is determined, thereby determining the recognition loss, which is used in the subsequent training process, and the specific function of which will be explained later.
In one embodiment, the recognition loss may be determined based on the text edit distance between the recognition result and the label of the original text. The text edit distance (Edit Distance, also called the Levenshtein distance) is the minimum number of editing operations required to transform one string (text) into another; the larger the distance, the more different the two texts. The editing operations include replacing a character with another character, inserting a character, and deleting a character.
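For reference, the following is a textbook dynamic-programming implementation of this edit distance (standard code, not taken from the patent itself); the recognition loss can then be any quantity positively correlated with edit_distance(recognition_result, label).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances for the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete a[i-1]
                          curr[j - 1] + 1,      # insert b[j-1]
                          prev[j - 1] + cost)   # substitute a[i-1] -> b[j-1]
        prev = curr
    return prev[n]
```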
In actual production, named entity recognition models trained by various manufacturers or organizations (third parties) already exist, and their access interfaces are publicly available. A recognition result can be obtained by feeding the sample to be tested into such a public interface. In one embodiment, the named entity recognition model described above may be a third-party trained named entity recognition model. In that case, the feature text may be input into the access interface of the third-party named entity recognition network model to obtain the recognition result.
In step S34, the feature text is input into the reconstruction network model to obtain a reconstructed text, and the reconstruction loss is determined according to the reconstructed text and the original text.
The purpose of the reconstruction network model is to recover the original text from the feature text. It is included in the training method in order to increase the difficulty of restoring or reconstructing the original text from the feature text output by the coding network. If the original text could easily be reconstructed from the feature text, the user's personal privacy information could then easily be extracted from the reconstructed text using named entity recognition technology; increasing the difficulty of reconstructing the original text from the feature text is therefore essential to the goal of anti-named entity recognition. In one example, the reconstruction network model may be based on a recurrent neural network; this specification does not limit the specific form of the reconstruction network model.
The reconstruction network is an independent network and can be trained separately. For the training of the reconstruction network itself, the goal is an increasingly good reconstruction result. Thus, in one embodiment, the reconstruction network may be updated with the goal that the reconstruction loss tends to decrease. In that case, the training of the reconstruction network and the training of the coding network form an adversarial pair.
In another embodiment, the reconstruction network may also be a pre-trained reconstruction network. In this case, the reconstruction network is fixed and does not need to be updated throughout the coding network training process.
The reconstruction loss determined as described above is used in the subsequent training of the coding network; how it is used will be explained later, and only its determination is explained here. Since the reconstruction loss can be regarded as a measure of the difference between the reconstructed text and the original text, in one embodiment the reconstruction loss may be determined according to the text edit distance between the reconstructed text and the original text.
In step S35, a comparison loss is determined based on the original text and the feature text.
In different embodiments, according to the original text and the feature text, the text difference can be measured in different specific ways to determine the comparison loss. For example:
according to one embodiment, the comparison loss may be determined based on a user's content difference score for the original text and the feature text. Since the comparison loss itself is intended to measure the difference between the original text and the feature text from the perspective of human comprehension, the comparison loss can be determined based on the user score of the difference between the original text and the feature text, and the user score can be obtained by, for example, sending a sample survey to a random user, where the user is not limited to a direct user using the model, and any human user whose difference score can be obtained can be regarded as the user.
Although user scoring best matches the design intent of the comparison loss, and the comparison loss determined from it is also the most accurate, user scores are often not easy to obtain. Therefore, in other embodiments the comparison loss may also be determined algorithmically. According to one embodiment, the comparison loss may be determined by the mean square error between the encoded values of the original text and the feature text. The mean square error (MSE) is the expected value of the square of the difference between an estimated value and a true value, and it evaluates the degree of variation in data; the smaller the MSE, the closer the two sets of values. In this embodiment, the difference between the original text and the feature text may be measured by the mean square error to determine the comparison loss. In one example, the difference between the two may be measured by the following equation:
MSE = (1/n) * Σ_{i=1..n} (x_i - y_i)^2   (2)

where x_i is the encoded value of the feature text, y_i is the encoded value of the original text, n is the encoding length, and MSE is the mean square error between the encoded values of the original text and the feature text.
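A minimal sketch of equation (2) follows, assuming both texts have already been mapped to numeric encodings of equal length (the encoding scheme itself is not specified in this embodiment):

```python
def mse_comparison_loss(x, y):
    """Mean square error between the encoded feature text x and the encoded
    original text y, per equation (2)."""
    assert len(x) == len(y), "the two encodings must have equal length"
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
```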
It should be noted that the above step S33 of determining the recognition loss, step S34 of determining the reconstruction loss, and step S35 of determining the comparison loss may be executed in any reasonable relative order, or in parallel. The order described above is merely an example.
On the basis of determining the recognition loss, the reconstruction loss, and the comparison loss, respectively, in step S36, the coding loss is determined, which is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss.
In one example, the coding loss equals the comparison loss minus the recognition loss and the reconstruction loss, as shown in equation (1) described above in connection with Fig. 1. In another example, the weighted recognition loss and reconstruction loss may be subtracted from the comparison loss to obtain the coding loss. Weighting the recognition loss and the reconstruction loss when computing the coding loss allows the training process to be adjusted according to the recognition effect and the privacy protection effect of the encoding.
Next, in step S37, the coding network is updated with the goal that the coding loss tends to decrease.
In this step, the training of the coding network aims to reduce the coding loss. Because the coding loss is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss, the training of the coding network actually depends on the feedback given by the outputs of the named entity recognition model and the reconstruction network model; in essence, it works against the recognition capability of the named entity recognition model and against the reconstruction capability of the reconstruction network model. Specifically, since the coding loss is negatively correlated with the recognition loss and the reconstruction loss, the stronger the recognition capability of the named entity recognition model, the smaller the recognition loss and the larger the coding loss; likewise, the stronger the reconstruction capability of the reconstruction network, the smaller the reconstruction loss and the larger the coding loss. Meanwhile, the coding loss is positively correlated with the comparison loss, i.e., the smaller the comparison loss, the smaller the coding loss. Therefore, training the coding network so that the coding loss tends to decrease makes the encoded result (the feature text) harder for the named entity recognition model to recognize and harder for the reconstruction network to reconstruct, increasing the recognition loss and the reconstruction loss, while also making it easier for humans to understand, decreasing the comparison loss. This is the essence of the adversarial training in this method.
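Pulling steps S31 to S37 together, the following sketch shows one adversarial training round using the edit_distance and encoder_loss helpers above. It assumes the encoder, recognizer, and reconstructor are callables mapping text to text, and that the update functions wrap some optimizer able to consume scalar losses (the edit-distance losses are not differentiable, so a gradient-free or policy-gradient style update is an assumption here); all names are illustrative.

```python
def training_round(samples, encoder, recognizer, reconstructor,
                   compare_loss_fn, update_encoder, update_reconstructor=None):
    for original_text, label in samples:                                    # S31
        feature_text = encoder(original_text)                               # S32
        l_recognize = edit_distance(recognizer(feature_text), label)        # S33
        l_reconstruct = edit_distance(reconstructor(feature_text),
                                      original_text)                        # S34
        l_compare = compare_loss_fn(original_text, feature_text)            # S35
        l_encoder = encoder_loss(l_compare, l_reconstruct, l_recognize)     # S36
        update_encoder(l_encoder)        # S37: drive the coding loss down
        if update_reconstructor is not None:
            update_reconstructor(l_reconstruct)  # optional adversary update
```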
Therefore, the coding network obtained through the above training process achieves the effect that the encoded feature text remains secret with respect to named entity recognition technology, while ensuring that a person can still obtain the original meaning from the encoded feature text.
Fig. 4 illustrates a flowchart of a coding network training method for anti-named entity recognition according to another embodiment of this specification. As shown in Fig. 4, the method comprises the following steps:
in step 41, the recognition effect of the entity recognition network model is determined.
In one embodiment, a predetermined number of labeled text samples may be input into the entity recognition network model, and the recognition rate of the entity recognition network model for the text samples may be obtained as the recognition effect. In other embodiments, the recognition effect may also be determined from other performance metrics, such as the false recognition rate or the missed recognition rate.
In step 42, when the recognition effect reaches the preset condition, the method shown in Fig. 3 is executed to train the coding network.
In practice, the recognition capability of named entity recognition models usually keeps improving, but a stronger recognizer weakens the anti-named entity recognition capability of the coding network. Therefore, when it is detected that the recognition capability of the named entity recognition model has strengthened, adversarial training of the coding network can be started. In one embodiment, training of the coding network using the method shown in Fig. 3 may be initiated, for example, after the recognition rate of the entity recognition network model on the text samples exceeds a predetermined threshold, or after the false recognition rate or missed recognition rate falls to a set value.
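As an illustration of this trigger, the sketch below measures the recognition rate on labeled samples and starts (re)training once a threshold is crossed; the 0.9 threshold, the exact-match criterion, and all names are assumptions.

```python
def maybe_retrain(recognizer, labeled_samples, train_coding_network,
                  threshold: float = 0.9) -> float:
    """Run the training of Fig. 3 once the recognizer's recognition rate on
    the labeled text samples reaches the preset threshold."""
    hits = sum(1 for text, label in labeled_samples
               if recognizer(text) == label)
    recognition_rate = hits / len(labeled_samples)
    if recognition_rate >= threshold:      # preset condition reached
        train_coding_network()             # e.g. repeated calls to training_round
    return recognition_rate
```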
FIG. 5 illustrates a flow diagram of a privacy preserving method of anti-named entity recognition in accordance with an embodiment of the present description. As shown in fig. 5, the method includes:
in step 51, a text to be protected is obtained;
in step 52, the text to be protected is input into the coding network trained according to the method shown in fig. 3, and the coding network generates the privacy-protecting text.
In one embodiment, the trained coding network may be deployed in a mobile terminal or a web client. During training, the coding network may be deployed in the cloud; in one specific example, the coding network trained in the cloud is downloaded and deployed to the mobile terminal or web client. It is easy to understand that, since the coding network can restart the training process based on the threshold judgment described above, a retrained coding network can be redeployed.
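On the client, the protection step of Fig. 5 then reduces to applying the downloaded encoder to the text before it leaves the device. The call below is a sketch, with trained_encoder standing for any callable produced by the training above (an assumption, since the deployment API is not specified).

```python
def protect(text_to_protect: str, trained_encoder) -> str:
    """Steps 51-52 of Fig. 5: return the privacy-preserving feature text."""
    return trained_encoder(text_to_protect)

# e.g., on the mobile terminal or web client:
# privacy_text = protect(user_input, trained_encoder)
```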
Fig. 6 is a block diagram illustrating a coding network training apparatus for anti-named entity recognition according to an embodiment of this specification. As shown in Fig. 6, the coding network training apparatus 600 comprises:
a sample acquiring unit 61 configured to acquire a labeled first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text;
a feature text obtaining unit 62 configured to, for each first sample, input the original text corresponding to the first sample into a coding network to obtain a feature text of the original text;
a recognition loss determining unit 63 configured to input the feature text into a pre-trained named entity recognition network model, obtain a recognition result for the named entity, and determine a recognition loss according to the recognition result and the label corresponding to the first sample;
a reconstruction loss determining unit 64 configured to input the feature text into a reconstruction network model to obtain a reconstructed text, and determine a reconstruction loss according to the reconstructed text and the original text;
a comparison loss determining unit 65 configured to determine a comparison loss according to the original text and the feature text;
a coding loss determining unit 66 configured to determine a coding loss that is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss;
and a coding network updating unit 67 configured to update the coding network with the goal that the coding loss tends to decrease.
In one embodiment, the named entity recognition network model may be a third-party named entity recognition network model; and the recognition loss determining unit 63 may be further configured to input the feature text into an access interface of the third-party named entity recognition network model to obtain the recognition result.
In one embodiment, the reconstruction network may be a pre-trained reconstruction network.
In one embodiment, the reconstruction loss determining unit 64 may be further configured to update the reconstruction network with the goal that the reconstruction loss tends to decrease.
In one embodiment, the coding network may be based on one of a convolutional neural network and a long short-term memory model.
In one embodiment, the reconstruction network model may be based on a recurrent neural network.
In one embodiment, the recognition loss determining unit 63 may be further configured to determine the recognition loss according to a first text edit distance between the recognition result and the label, wherein the recognition loss is positively correlated with the first text edit distance;
and the reconstruction loss determining unit 64 may be further configured to determine the reconstruction loss according to a second text edit distance between the reconstructed text and the original text, wherein the reconstruction loss is positively correlated with the second text edit distance.
In one embodiment, the comparison loss determining unit 65 may be further configured to perform one of:
determining the comparison loss according to a user's score of the content difference between the original text and the feature text;
determining the comparison loss according to the mean square error between the encoded value of the original text and the encoded value of the feature text.
In one embodiment, the coding loss determining unit 66 may be further configured to subtract the recognition loss and the reconstruction loss from the comparison loss to obtain the coding loss; or
to subtract the weighted recognition loss and reconstruction loss from the comparison loss to obtain the coding loss.
Fig. 7 is a block diagram of a coding network training apparatus for anti-named entity recognition according to another embodiment of this specification. As shown in Fig. 7, the apparatus comprises:
a recognition effect determination unit 71 configured to determine a recognition effect of the entity recognition network model;
and a training unit 72 configured to execute the method shown in fig. 3 to train the coding network when the recognition effect reaches a preset condition.
In one embodiment, a predetermined number of labeled text samples may be input into the entity recognition network model, and the recognition rate of the entity recognition network model for the text samples may be obtained as the recognition effect; in this case, the preset condition is that the recognition rate reaches a predetermined threshold.
Fig. 8 illustrates a block diagram of a privacy preserving apparatus of anti-named entity recognition according to an embodiment of the present description. As shown in fig. 8, the apparatus includes:
a to-be-protected text acquisition unit 81 configured to acquire a to-be-protected text;
a privacy protection unit 82 configured to input the text to be protected into the coding network trained according to the method shown in Fig. 3, wherein the coding network generates the privacy-protected text.
In one embodiment, the trained coding network may be deployed in a mobile terminal or a web client.
Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.
Another aspect of the present specification provides a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements any of the methods described above.
It is to be understood that the terms "first," "second," and the like, herein are used for descriptive purposes only and not for purposes of limitation, to distinguish between similar concepts.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (28)

1. A method of training a coding network for anti-named entity recognition, the method comprising:
obtaining a labeled first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text;
for each first sample, inputting the original text corresponding to the first sample into a coding network to obtain a feature text of the original text;
inputting the feature text into a pre-trained named entity recognition network model, obtaining a recognition result for the named entity, and determining a recognition loss according to the recognition result and the label corresponding to the first sample;
inputting the feature text into a reconstruction network model to obtain a reconstructed text, and determining a reconstruction loss according to the reconstructed text and the original text;
determining a comparison loss according to the original text and the feature text;
determining a coding loss that is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss;
updating the coding network with the goal that the coding loss tends to decrease.
2. The training method of claim 1, wherein the named entity recognition network model comprises a third party named entity recognition network model;
wherein inputting the feature text into the named entity recognition network model to obtain a recognition result for the named entity comprises: inputting the feature text into an access interface of the third-party named entity recognition network model to obtain the recognition result.
3. The training method of claim 1, wherein the reconstruction network is a pre-trained reconstruction network.
4. The training method of claim 1, further comprising:
updating the reconstruction network with the goal that the reconstruction loss tends to decrease.
5. The training method of claim 1, wherein the coding network is based on one of a convolutional neural network and a long short-term memory model.
6. The training method of claim 1, wherein the reconstruction network model is based on a recurrent neural network.
7. The training method of claim 1, wherein determining the recognition loss according to the recognition result and the label corresponding to the first sample comprises determining the recognition loss according to a first text edit distance between the recognition result and the label, wherein the recognition loss is positively correlated with the first text edit distance;
and determining the reconstruction loss according to the reconstructed text and the original text comprises determining the reconstruction loss according to a second text edit distance between the reconstructed text and the original text, wherein the reconstruction loss is positively correlated with the second text edit distance.
8. The training method of claim 1, wherein determining the comparison loss according to the original text and the feature text comprises one of:
determining the comparison loss according to a user's score of the content difference between the original text and the feature text;
determining the comparison loss according to the mean square error between the encoded value of the original text and the encoded value of the feature text.
9. The training method of claim 1, wherein determining a coding loss comprises:
subtracting the recognition loss and the reconstruction loss from the comparison loss to obtain the coding loss; or
subtracting the weighted recognition loss and reconstruction loss from the comparison loss to obtain the coding loss.
10. A method of training a coding network for anti-named entity recognition, the method comprising:
determining the recognition effect of the entity recognition network model;
and when the recognition effect reaches a preset condition, executing the method of claim 1 to train the coding network.
11. The training method of claim 10, wherein determining the recognition effect of the entity recognition network model comprises:
inputting a predetermined number of labeled text samples into the entity recognition network model, and acquiring the recognition rate of the entity recognition network model for the text samples as the recognition effect;
wherein the preset condition is that the recognition rate reaches a predetermined threshold.
12. A privacy preserving method of anti-named entity recognition, the method comprising:
acquiring a text to be protected;
the text to be protected is entered into a coding network trained according to the method of claim 1, said coding network generating privacy protected text.
13. The privacy protection method of claim 12, wherein the coding network is deployed in a mobile terminal or a web client.
14. An apparatus for training a coding network for anti-named entity recognition, the apparatus comprising:
a sample acquiring unit configured to acquire a labeled first sample set, wherein the first sample set comprises a plurality of first samples, and each first sample corresponds to a piece of original text and a label for a named entity in the original text;
a feature text acquisition unit configured to, for each first sample, input the original text corresponding to the first sample into a coding network to obtain a feature text of the original text;
a recognition loss determining unit configured to input the feature text into a pre-trained named entity recognition network model, obtain a recognition result for the named entity, and determine a recognition loss according to the recognition result and the label corresponding to the first sample;
a reconstruction loss determining unit configured to input the feature text into a reconstruction network model to obtain a reconstructed text, and determine a reconstruction loss according to the reconstructed text and the original text;
a comparison loss determining unit configured to determine a comparison loss according to the original text and the feature text;
a coding loss determining unit configured to determine a coding loss that is positively correlated with the comparison loss and negatively correlated with the recognition loss and the reconstruction loss;
and a coding network updating unit configured to update the coding network with the goal that the coding loss tends to decrease.
15. The training apparatus of claim 14, wherein the named entity recognition network model comprises a third-party named entity recognition network model;
and the recognition loss determining unit is further configured to input the feature text into an access interface of the third-party named entity recognition network model to acquire the recognition result.
16. The training apparatus of claim 14, wherein the reconstruction network is a pre-trained reconstruction network.
17. The training apparatus of claim 14, wherein the reconstruction loss determining unit is further configured to update the reconstruction network with the goal that the reconstruction loss tends to decrease.
18. The training apparatus of claim 14, wherein the coding network is based on one of a convolutional neural network and a long short-term memory model.
19. The training apparatus of claim 14 wherein the reconstructed network model is based on a recurrent neural network.
20. The training apparatus of claim 14, wherein the recognition loss determining unit is configured to determine the recognition loss according to a first text edit distance between the recognition result and the label, the recognition loss being positively correlated with the first text edit distance;
and the reconstruction loss determining unit is configured to determine the reconstruction loss according to a second text edit distance between the reconstructed text and the original text, the reconstruction loss being positively correlated with the second text edit distance.
21. The training apparatus of claim 14, wherein the comparison loss determining unit is configured to perform one of:
determining the comparison loss according to a user's score of the content difference between the original text and the feature text;
determining the comparison loss according to the mean square error between the encoded value of the original text and the encoded value of the feature text.
22. The training apparatus of claim 14, wherein the coding loss determining unit is configured to:
subtract the recognition loss and the reconstruction loss from the comparison loss to obtain the coding loss; or
subtract the weighted recognition loss and reconstruction loss from the comparison loss to obtain the coding loss.
23. An apparatus for training a coding network for anti-named entity recognition, the apparatus comprising:
a recognition effect determination unit configured to determine a recognition effect of the entity recognition network model;
a training unit configured to, when the recognition effect reaches a preset condition, execute the method of claim 1 to train the coding network of claim 1.
24. The training apparatus of claim 23, wherein determining the recognition effect of the entity recognition network model comprises:
inputting a predetermined number of labeled text samples into the entity recognition network model, and acquiring the recognition rate of the entity recognition network model for the text samples as the recognition effect;
wherein the preset condition is that the recognition rate reaches a predetermined threshold.
25. A privacy preserving apparatus of anti-named entity recognition, the apparatus comprising:
the device comprises a to-be-protected text acquisition unit, a to-be-protected text acquisition unit and a to-be-protected text acquisition unit, wherein the to-be-protected text acquisition unit is configured to acquire a to-be-protected text;
a privacy protection unit configured to input a text to be protected into the coding network trained according to the method of claim 1; the encoding network generates privacy preserving text.
26. The privacy protection device of claim 25, wherein the coding network is deployed in a mobile terminal or a web client.
27. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-13.
28. A computing device comprising a memory and a processor, wherein the memory stores executable code that, when executed by the processor, causes the processor to implement the method of any one of claims 1-13.
CN202011173866.4A 2020-10-28 2020-10-28 Method and device for adversarial training and privacy protection of an anti-named-entity-recognition encoder Active CN112199955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011173866.4A CN112199955B (en) 2020-10-28 2020-10-28 Method and device for adversarial training and privacy protection of an anti-named-entity-recognition encoder


Publications (2)

Publication Number Publication Date
CN112199955A (en) 2021-01-08
CN112199955B (en) 2024-10-15

Family

ID=74011116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011173866.4A Active CN112199955B (en) 2020-10-28 2020-10-28 Method and device for adversarial training and privacy protection of an anti-named-entity-recognition encoder

Country Status (1)

Country Link
CN (1) CN112199955B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926559A * 2021-05-12 2021-06-08 Alipay (Hangzhou) Information Technology Co., Ltd. Face image processing method and device


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106464707A * 2014-04-25 2017-02-22 Nokia Technologies Oy Interaction between virtual reality entities and real entities
CN106415538A * 2014-05-11 2017-02-15 Microsoft Technology Licensing, LLC File service using a shared file access-REST interface
US20170091235A1 * 2015-09-25 2017-03-30 Netapp, Inc. Namespace hierarchy preservation with multiple object storage objects
CN105488466A * 2015-11-26 2016-04-13 China Shipbuilding Industry Systems Engineering Research Institute Deep neural network and underwater acoustic target voiceprint feature extraction method
US20200065374A1 * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN111241287A * 2020-01-16 2020-06-05 Alipay (Hangzhou) Information Technology Co., Ltd. Training method and device for a generation model for generating adversarial text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DING Yan; NURBOL: "Phishing web page detection method based on URL obfuscation technique recognition", Computer Engineering and Applications, no. 20, 15 October 2017 (2017-10-15) *
LIU Yiyang; YU Zhengtao; GAO Shengxiang; GUO Junjun; ZHANG Yafei; NIE Bingge: "Chinese named entity recognition method based on machine reading comprehension", Pattern Recognition and Artificial Intelligence, no. 07, 15 July 2020 (2020-07-15) *
YANG Heyu; DU Hongbo; ZHU Lijun: "Named entity recognition algorithm based on ordinally forgetting encoding and Bi-LSTM", Computer Applications and Software, no. 02, 12 February 2020 (2020-02-12) *


Also Published As

Publication number Publication date
CN112199955B (en) 2024-10-15

Similar Documents

Publication Publication Date Title
Wang et al. TSDAE: Using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning
CN108763445B (en) Construction method, device, computer equipment and the storage medium in patent knowledge library
CN110059320B (en) Entity relationship extraction method and device, computer equipment and storage medium
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN107391614A (en) A kind of Chinese question and answer matching process based on WMD
CN109766693A (en) A kind of cross-site scripting attack detection method based on deep learning
CN107633077B (en) System and method for cleaning social media text data by multiple strategies
CN112685739A (en) Malicious code detection method, data interaction method and related equipment
CN113190849A (en) Webshell script detection method and device, electronic equipment and storage medium
CN112861518B (en) Text error correction method and device, storage medium and electronic device
CN111091004B (en) Training method and training device for sentence entity annotation model and electronic equipment
CN112766485B (en) Named entity model training method, device, equipment and medium
CN114357190A (en) Data detection method and device, electronic equipment and storage medium
CN115146068A (en) Method, device and equipment for extracting relation triples and storage medium
CN113918936A (en) SQL injection attack detection method and device
CN112733140A (en) Detection method and system for model tilt attack
CN112199955B (en) 2024-10-15 Method and device for adversarial training and privacy protection of an anti-named-entity-recognition encoder
CN113807091B (en) Word mining method and device, electronic equipment and readable storage medium
CN114244795A (en) Information pushing method, device, equipment and medium
CN109359481A (en) It is a kind of based on BK tree anti-collision search about subtract method
CN115567306B (en) APT attack traceability analysis method based on bidirectional long-short-term memory network
CN108875591B (en) Text picture matching analysis method and device, computer equipment and storage medium
CN113420127B (en) Threat information processing method, threat information processing device, computing equipment and storage medium
CN112087473A (en) Document downloading method and device, computer readable storage medium and computer equipment
CN109981818B (en) Domain name semantic anomaly analysis method and device, computer equipment and storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40044673; Country of ref document: HK)
GR01 Patent grant