CN116384405A - Text processing method, text classification method and emotion recognition method - Google Patents


Info

Publication number
CN116384405A
Authority
CN
China
Prior art keywords
text
sample
semantic
model
desensitization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310147898.4A
Other languages
Chinese (zh)
Inventor
李进锋
刘翔宇
张�荣
薛晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310147898.4A priority Critical patent/CN116384405A/en
Publication of CN116384405A publication Critical patent/CN116384405A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this specification provide a text processing method, a text classification method, and an emotion recognition method. The text processing method includes: acquiring a text to be processed; performing sequence encoding and semantic encoding on the text to be processed, respectively, to obtain a sequence representation and a semantic representation of the text; generating a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts; adding the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and performing text processing on the desensitized semantic representation to obtain a target processing result. Because the desensitization model is adversarially trained against sensitive-information identification, it generates highly targeted noise perturbations and thereby yields desensitized semantic representations with high fairness; this improves the fairness and efficiency of text processing, reduces processing cost, and generalizes well across tasks.

Description

Text processing method, text classification method and emotion recognition method
Technical Field
The embodiment of the specification relates to the technical field of text processing, in particular to a text processing method.
Background
With the development of computer technology, obtaining a neural network model by pre-training (Pre-train) and fine-tuning (Fine-tune) has become a new paradigm of natural language processing, widely applied to machine translation, intelligent customer service, emotion recognition, content-security recognition, and the like.
At present, large-scale natural-language-processing models such as the BERT model (Bidirectional Encoder Representations from Transformers) and the RoBERTa model (A Robustly Optimized BERT Pretraining Approach) semantically encode a text and then perform text processing on the resulting semantic representation to obtain a corresponding processing result. However, because sample texts carry deviations and biases, a neural network model obtained by pre-training and fine-tuning carries those deviations and biases into the semantic encoding of the text to be processed; text processing based on a biased semantic representation then yields unfair processing results, affecting the fairness of text processing. A text processing method with high fairness is therefore needed.
Disclosure of Invention
In view of this, the embodiments of this specification provide a text processing method. One or more embodiments of this specification further relate to a text classification method, an emotion recognition method, a data processing method for text processing, a text processing apparatus, a text classification apparatus, an emotion recognition apparatus, a data processing apparatus for text processing, a computing device, a computer-readable storage medium, and a computer program, so as to remedy the technical defects in the prior art.
According to a first aspect of the embodiments of this specification, a text processing method is provided, including:
acquiring a text to be processed;
performing sequence encoding and semantic encoding on the text to be processed, respectively, to obtain a sequence representation and a semantic representation of the text to be processed;
generating a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
adding the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
performing text processing on the desensitized semantic representation to obtain a target processing result.
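The steps of the first aspect can be sketched end-to-end as follows. This is a minimal illustration only: the encoders are stand-ins (one-hot rows and mean-pooling), and `desensitization_model` is a fixed stub, whereas the patent's desensitization model is a trained generator.

```python
def sequence_encode(tokens, vocab):
    # Word-order encoding: one row per word (an M x N matrix).
    return [[1.0 if j == vocab[t] else 0.0 for j in range(len(vocab))] for t in tokens]

def semantic_encode(seq_repr):
    # Stand-in for the higher-level semantic encoder: mean-pool to 1 x N.
    return [sum(col) / len(seq_repr) for col in zip(*seq_repr)]

def desensitization_model(seq_repr):
    # Stand-in for the pre-trained generator; in practice this model is
    # obtained by adversarial training against sensitive-information
    # identification, as described in the text.
    return [0.05 * (-1.0) ** i for i in range(len(seq_repr[0]))]

def desensitize(sem_repr, noise):
    # Noise-adding step: element-wise addition of the perturbation.
    return [s + n for s, n in zip(sem_repr, noise)]

vocab = {"the": 0, "nurse": 1, "said": 2}
seq = sequence_encode(["the", "nurse", "said"], vocab)
sem = semantic_encode(seq)
noise = desensitization_model(seq)
desens = desensitize(sem, noise)
# `desens` would then be fed to the downstream text-processing task.
```

The downstream task (translation, classification, emotion recognition) only ever sees `desens`, which is what makes the desensitization step task-independent.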
According to a second aspect of the embodiments of this specification, a text classification method is provided, including:
acquiring a text to be classified;
performing sequence encoding and semantic encoding on the text to be classified, respectively, to obtain a sequence representation and a semantic representation of the text to be classified;
generating a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
adding the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
performing text classification on the desensitized semantic representation to obtain a text classification result.
According to a third aspect of the embodiments of this specification, an emotion recognition method is provided, including:
acquiring a text to be recognized;
performing sequence encoding and semantic encoding on the text to be recognized, respectively, to obtain a sequence representation and a semantic representation of the text to be recognized;
generating a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
adding the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
performing emotion recognition on the desensitized semantic representation to obtain an emotion recognition result.
According to a fourth aspect of the embodiments of this specification, a data processing method for text processing is provided, applied to a cloud-side device and including:
acquiring a sample text set, where the sample text set includes a plurality of sample texts;
extracting a first sample text from the sample text set, where the first sample text is any one of the plurality of sample texts;
performing sequence encoding and semantic encoding on the first sample text to obtain a sample sequence representation and a sample semantic representation;
generating a corresponding sample noise perturbation from the sample sequence representation using the generator of a generative adversarial network;
adding the sample noise perturbation to the sample semantic representation to obtain a sample desensitized semantic representation;
computing a sensitive-information discrimination loss value with the discriminator of the generative adversarial network from the sample semantic representation and the sample desensitized semantic representation;
adjusting the model parameters of the generator and the discriminator according to the discrimination loss value, returning to the step of extracting a first sample text from the sample text set, and obtaining the trained generator when a preset training-end condition is met; and
sending the model parameters of the generator to an end-side device.
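The alternating generator/discriminator updates of the cloud-side loop above can be sketched with one-dimensional toy models. Everything here is a simplification I have made for illustration: the "generator" is a single scaling parameter, the "discriminator" a single-parameter sigmoid scorer, and the gradients are written out by hand; the patent's models are neural networks with a proper adversarial loss.

```python
import math
import random

random.seed(0)

g_w = 0.0   # generator parameter (maps sequence feature -> noise perturbation)
d_w = 1.0   # discriminator parameter (scores residual sensitive signal)
lr = 0.1

def generator(seq_feat):
    return g_w * seq_feat

def discriminator(sem_feat):
    # Sigmoid score: estimated probability that sensitive information is present.
    return 1.0 / (1.0 + math.exp(-d_w * sem_feat))

for step in range(200):
    seq_feat = random.uniform(0.5, 1.5)   # sample sequence representation (toy)
    sem_feat = seq_feat                   # sample semantic representation (toy)
    noise = generator(seq_feat)
    desens = sem_feat + noise             # sample desensitized representation

    # Generator step: push the discriminator's score on the desensitized
    # representation down, i.e. hide the sensitive signal.
    score = discriminator(desens)
    grad_g = score * (1.0 - score) * d_w * seq_feat   # d(score)/d(g_w)
    g_w -= lr * grad_g

    # Discriminator step (simplified): raise the score on the original
    # representation, a gradient step on -log(score).
    s_orig = discriminator(sem_feat)
    grad_d = -(1.0 - s_orig) * sem_feat
    d_w -= lr * grad_d

# After training, g_w is negative: the generated noise cancels part of the
# sensitive signal carried by the semantic representation.
```

In the patent's setting, only the trained generator's parameters are then sent to the end-side device; the discriminator exists solely to drive training.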
According to a fifth aspect of the embodiments of this specification, a text processing apparatus is provided, including:
a first acquisition module configured to acquire a text to be processed;
a first encoding module configured to perform sequence encoding and semantic encoding on the text to be processed, respectively, to obtain a sequence representation and a semantic representation of the text to be processed;
a first generation module configured to generate a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
a first noise-adding module configured to add the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
a processing module configured to perform text processing on the desensitized semantic representation to obtain a target processing result.
According to a sixth aspect of the embodiments of this specification, a text classification apparatus is provided, including:
a second acquisition module configured to acquire a text to be classified;
a second encoding module configured to perform sequence encoding and semantic encoding on the text to be classified, respectively, to obtain a sequence representation and a semantic representation of the text to be classified;
a second generation module configured to generate a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
a second noise-adding module configured to add the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
a classification module configured to perform text classification on the desensitized semantic representation to obtain a text classification result.
According to a seventh aspect of the embodiments of this specification, an emotion recognition apparatus is provided, including:
a third acquisition module configured to acquire a text to be recognized;
a third encoding module configured to perform sequence encoding and semantic encoding on the text to be recognized, respectively, to obtain a sequence representation and a semantic representation of the text to be recognized;
a third generation module configured to generate a corresponding noise perturbation from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts;
a third noise-adding module configured to add the noise perturbation to the semantic representation to obtain a desensitized semantic representation; and
a recognition module configured to perform emotion recognition on the desensitized semantic representation to obtain an emotion recognition result.
According to an eighth aspect of the embodiments of this specification, a data processing apparatus for text processing is provided, applied to a cloud-side device and including:
a fourth acquisition module configured to acquire a sample text set, where the sample text set includes a plurality of sample texts;
an extraction module configured to extract a first sample text from the sample text set, where the first sample text is any one of the plurality of sample texts;
a fourth encoding module configured to perform sequence encoding and semantic encoding on the first sample text to obtain a sample sequence representation and a sample semantic representation;
a fourth generation module configured to generate a corresponding sample noise perturbation from the sample sequence representation using the generator of a generative adversarial network;
a fourth noise-adding module configured to add the sample noise perturbation to the sample semantic representation to obtain a sample desensitized semantic representation;
a computation module configured to compute a sensitive-information discrimination loss value with the discriminator of the generative adversarial network from the sample semantic representation and the sample desensitized semantic representation;
a training module configured to adjust the model parameters of the generator and the discriminator according to the discrimination loss value, return to the step of extracting a first sample text from the sample text set, and obtain the trained generator when a preset training-end condition is met; and
a sending module configured to send the model parameters of the generator to an end-side device.
According to a ninth aspect of the embodiments of this specification, a computing device is provided, including:
a memory and a processor;
where the memory is configured to store computer-executable instructions and the processor is configured to execute them, and the computer-executable instructions, when executed by the processor, implement the steps of the above text processing method, text classification method, emotion recognition method, or data processing method for text processing.
According to a tenth aspect of the embodiments of this specification, a computer-readable storage medium is provided, storing computer-executable instructions which, when executed by a processor, implement the steps of the above text processing method, text classification method, emotion recognition method, or data processing method for text processing.
According to an eleventh aspect of the embodiments of this specification, a computer program is provided which, when executed in a computer, causes the computer to perform the steps of the above text processing method, text classification method, emotion recognition method, or data processing method for text processing.
In one or more embodiments of this specification, a text to be processed is acquired; sequence encoding and semantic encoding are performed on it to obtain its sequence representation and semantic representation; a corresponding noise perturbation is generated from the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training for sensitive-information identification on sample semantic representations and sample desensitized semantic representations of sample texts; the noise perturbation is added to the semantic representation to obtain a desensitized semantic representation; and text processing is performed on the desensitized semantic representation to obtain a target processing result. The noise perturbation for the text to be processed is generated by the desensitization model obtained through adversarial training for sensitive-information identification, and adding it to the semantic representation effectively masks the sensitive information therein, yielding a desensitized semantic representation with high fairness; text processing on this representation then yields a target processing result with high fairness. Because the desensitization depends on the text to be processed rather than on the particular text processing task, the method generalizes well, reduces processing cost, and improves processing efficiency.
Drawings
FIG. 1 is a flow chart of a text processing method provided in one embodiment of the present disclosure;
FIG. 2 is a flow chart of a text classification method provided by one embodiment of the present description;
FIG. 3 is a flow chart of an emotion recognition method provided in one embodiment of the present disclosure;
FIG. 4 is a flow chart of a data processing method for text processing provided in one embodiment of the present disclosure;
FIG. 5 is a flow diagram of a pre-training method for the generator of a generative adversarial network in a text processing method according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of semantic characterization and desensitization semantic characterization in a text processing method according to one embodiment of the present description;
FIG. 7 is a process flow diagram of a text processing method applied to occupational discrimination in intelligent question answering according to one embodiment of the present disclosure;
fig. 8 is a schematic structural view of a text processing device according to an embodiment of the present disclosure;
fig. 9 is a schematic structural view of a text classification device according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an emotion recognition device according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a data processing apparatus for text processing according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of this specification. This specification can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited to the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is for describing particular embodiments only and is not intended to be limiting. As used in this specification, its one or more embodiments, and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of this specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second," and similarly "second" as "first." Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
First, terms related to one or more embodiments of the present specification will be explained.
Fairness: the text processing applied to the artificial intelligence algorithm or the system automation decision requires that the sensitive information corresponding to the inherent attribute or the acquired attribute is correspondingly obtained from the personal or the group independent of the protected natural attribute and the social attribute, so that the text processing result with the deviation and the bias is obtained.
Algorithmic debiasing: fairness-constraint techniques that eliminate the deviations and biases in a neural network model during pre-training or fine-tuning by means of pre-processing, in-processing, or post-processing, so that the semantic representations obtained by subsequent encoding have high fairness.
Natural language processing: an important direction in computer science and artificial intelligence that studies theories and methods for effective communication between humans and computers using natural-language data such as text and speech.
Pre-trained natural language model: a pre-trained neural network model for processing natural-language data, which obtains the corresponding Sequence Feature and Semantic Feature through Sequence Coding of word vectors and Semantic Coding, respectively. The core of pre-training is to train on specific self-supervised tasks using unlabeled sample text before using labeled sample text, so that the trained model learns latent knowledge independent of the labels.
Data rebalancing: a machine-learning model performs best when every label in the sample text set has a similar number of samples; data rebalancing adjusts the sample texts toward this goal so that the labels have similar sample counts.
Adversarial training: a training method that obtains adversarial samples by adding noise perturbations to real samples, and trains a model to distinguish real samples from adversarial samples.
MLP (Multilayer Perceptron) model: a fully connected neural network with an input layer, hidden layers, and an output layer, where adjacent layers are fully connected.
CNN (Convolutional Neural Network) model: a multi-layer neural network trained by forward and backward propagation, with convolution kernels (filters) that process feature data.
RNN (Recurrent Neural Network) model: a recurrent neural network that recurses along the processing direction of the vector representation and chains its intermediate layers together.
LSTM (Long Short-Term Memory) model: a recurrent neural network whose gating units give it the ability to memorize long- and short-term information.
Transformer model: a neural network based on the attention mechanism, which extracts and analyzes data features through attention.
BERT (Bidirectional Encoder Representations from Transformers) model: a neural network with bidirectional attention-based encoding of representations.
RoBERTa (A Robustly Optimized BERT Pretraining Approach) model: a BERT-derived model with character-level and word-level hybrid encoding and a dynamic masking mechanism.
ALBERT (A Lite BERT) model: a BERT-derived model with fewer model parameters.
Chinese BERT model: a BERT-derived model applied to Chinese-language processing.
GAN (Generative Adversarial Network): a neural network model commonly used in deep learning that includes a Generator and a Discriminator; a high-accuracy generator is obtained by training the generator and the discriminator alternately.
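Several of the models in this glossary (Transformer, BERT, RoBERTa, ALBERT) are built on the attention mechanism. As a generic illustration of scaled dot-product attention for a single query vector (a textbook sketch, not code from the patent):

```python
import math

def attention(query, keys, values):
    # Scaled dot-product attention: score each key against the query,
    # softmax the scores, and return the weighted sum of the values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]      # softmax over the sequence positions
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query aligns with the first key, so the first value dominates the output.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

In the full models, queries, keys, and values are linear projections of the token representations and many such heads run in parallel; this sketch shows only the core weighting step.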
At present, existing natural language models are designed and built for downstream text processing tasks so as to improve task performance, and the problems of model bias and deviation are hardly considered at the model design, pre-training, or fine-tuning stages. The semantic representations obtained by encoding therefore carry deviation and bias, and in downstream text processing tasks the automated decisions of a text decision model may be biased against individuals or groups, causing unfairness. For example, intelligent translation may produce occupationally discriminatory translations, and an intelligent question-answering system may generate gender-discriminatory utterances.
In view of the above problems, this specification provides a text processing method, a text classification method, an emotion recognition method, a data processing method for text processing, a text processing apparatus, a text classification apparatus, an emotion recognition apparatus, a data processing apparatus for text processing, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 shows a flowchart of a text processing method according to an embodiment of the present disclosure, including the following specific steps:
step 102: and acquiring a text to be processed.
The embodiments of this specification apply to a client or server of an application with a text processing function, for example, a client or server of an intelligent translation application, an intelligent question-answering application, a search application, a text emotion recognition application, or a content-security recognition application. The specific application platform may be an e-commerce platform, an online social media platform, or the like.
The text to be processed is natural-language text that needs processing; it may be natural-language text obtained directly, or natural-language text obtained by text recognition from data of other modalities such as speech data or video data. The text to be processed is a word sequence composed of a plurality of words. For example, when the text processing is intelligent translation, the text to be processed is the text to be translated; when it is intelligent question answering, the text to be processed is the question text; when it is search processing, the text to be processed is the index keyword text; when it is text emotion recognition, the text to be processed is natural-language text carrying emotion information; and when it is content-security recognition, the text to be processed is natural-language text with potential safety hazards.
The text to be processed is obtained by receiving a text processing request sent by the front end, where the request carries the text to be processed, the speech or video data corresponding to it, or an index of the text or of the speech or video data. Specifically, the front end may submit the text to be processed input by the user; text recognition may be performed on speech or video data input by the user; the text may be retrieved from a text database according to an index input by the user; or, after the corresponding speech or video data is retrieved from a speech or video database according to its index, text recognition is performed on it to obtain the text to be processed. No limitation is imposed here.
Illustratively, the user inputs the text to be processed Txt through the interactive interface of the front end, the front end generates a text processing Request according to the input text Txt, and the server receives the text processing Request sent by the front end, the Request carrying the text Txt to be processed.
And acquiring a text to be processed, and laying a text data foundation for the subsequent desensitization processing and text processing.
Step 104: and respectively carrying out sequence coding and semantic coding on the text to be processed to obtain sequence characterization and semantic characterization of the text to be processed.
Sequence coding encodes the words in the text one by one in the order of the word sequence. Semantic coding performs high-dimensional vector processing on the sequence representation, such as pooling processing, fully connected processing, and convolution processing, and can obtain more abstract natural language features in the text, such as emotion features (whether the emotion of the text is positive, negative, or neutral), attribute features (whether the text is dialect or general language), and character features (specific habits in a character's language expression, the character's identity, the character's education level). Because it encodes directly, sequence coding obtains neither the more abstract natural language features nor any deviation and bias; semantic coding obtains the more abstract natural language features through high-dimensional vector processing, but also introduces deviation and bias.
The sequence representation is a vector representation obtained by directly encoding the word sequence of the text to be processed, and the semantic representation is a vector representation obtained by performing high-dimensional vector processing on the sequence representation. It should be noted that, since the semantic representation is a more abstract vector representation, its vector dimension is lower than that of the sequence representation; for example, the sequence representation is an m x n vector representation while the semantic representation is a 1 x n vector representation.
Performing sequence coding and semantic coding on the text to be processed respectively to obtain the sequence representation and the semantic representation of the text to be processed specifically means using a pre-trained natural language model to perform the two encodings. The natural language model is a neural network model with a text encoding function, such as an MLP model, a CNN model, an RNN model, an LSTM model, a Transformer model, a BERT model, a RoBERTa model, an ALBERT model, a ChineseBERT model, and the like.
Illustratively, the text to be processed Txt is respectively sequence-coded and semantically coded using a pre-trained BERT model, resulting in a sequence representation Feature_Sequence and a semantic representation Feature_Semantic of the text to be processed Txt.
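As a minimal sketch of the relationship between the two representations, assuming mean pooling as the "high-dimensional vector processing" (the actual model may instead use fully connected or convolution layers, and a real encoder such as BERT would produce the token vectors):

```python
import numpy as np

def semantic_from_sequence(seq_rep: np.ndarray) -> np.ndarray:
    """Collapse an m x n sequence representation into a 1 x n semantic
    representation by mean pooling over the token axis."""
    return seq_rep.mean(axis=0, keepdims=True)

# A toy 4-token sequence representation with 3-dimensional embeddings.
seq = np.array([[1.0, 0.0, 2.0],
                [3.0, 0.0, 0.0],
                [1.0, 4.0, 2.0],
                [3.0, 0.0, 0.0]])
sem = semantic_from_sequence(seq)   # 1 x n, more abstract than m x n
```

This illustrates why the semantic representation has a lower vector dimension (1 x n) than the sequence representation (m x n), as stated above.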
The text to be processed is subjected to sequence coding and semantic coding respectively to obtain sequence characterization and semantic characterization of the text to be processed, and data support is provided for generating targeted noise disturbance and noise adding to obtain desensitized semantic characterization subsequently.
Step 106: generate corresponding noise perturbation by using a pre-trained desensitization model according to the sequence representation, wherein the desensitization model is obtained by performing adversarial training of sensitive information identification according to the sample semantic representation of a sample text and the sample desensitized semantic representation of the sample text.
The desensitization model is a neural network model with a noise perturbation generation function. The desensitization model generates noise perturbation specifically for the corresponding text words according to the sequence representation, and therefore has the characteristic of high pertinence. The desensitization model may be an MLP model, a CNN model, an RNN model, an LSTM model, a Transformer model, a BERT model, a RoBERTa model, an ALBERT model, a ChineseBERT model, the generator of a generative adversarial model, or the like, which is not limited herein.
The noise perturbation is a noise vector representation obtained by performing high-dimensional vector processing on the sequence representation. It has high correspondence with the semantic representation and can accurately correspond to the sensitive information in the semantic representation. The vector dimension of the noise perturbation is consistent with, or at least approximate to, that of the semantic representation.
The sensitive information is the information in the semantic representation that corresponds to the sensitive words in the text to be identified after semantic coding. Sensitive information identification means identifying sensitive information according to the semantic representation and the desensitized semantic representation. For example, a sample text contains sensitive words corresponding to the sensitive information "education background" and "occupation". The sample desensitized semantic representation is obtained by adding the corresponding noise perturbation generated by the desensitization model to the sample semantic representation of the sample text. Through adversarial training in combination with the sample semantic representation, the sensitive information "education background" and "occupation" can no longer be identified from the sample desensitized semantic representation, or only irrelevant information such as "chronology" or "plant" is identified, and it can thereby be determined that the noise perturbation generated by the desensitization model effectively masks the sensitive information in the semantic representation.
Generating the corresponding noise perturbation by using the pre-trained desensitization model according to the sequence representation specifically means inputting the sequence representation into the pre-trained desensitization model and generating the corresponding noise perturbation based on the sensitive information in the sequence representation.
Illustratively, the sequence representation Feature_Sequence is input into a pre-trained CNN model, and a corresponding noise perturbation Feature_Noise is generated based on the sensitive information in the sequence representation Feature_Sequence.
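A hypothetical single-layer generator can sketch the shape of this mapping (the weights `W` and `b` are assumed placeholders; the real desensitization model is one of the trained neural networks listed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_perturbation(seq_rep, W, b):
    """Toy generator sketch: a dense layer with tanh applied per token,
    then mean pooling so the noise matches the 1 x n semantic shape."""
    hidden = np.tanh(seq_rep @ W + b)       # per-token transform
    return hidden.mean(axis=0, keepdims=True)

seq_rep = rng.normal(size=(5, 8))           # m=5 tokens, n=8 dimensions
W = rng.normal(size=(8, 8)) * 0.1           # assumed untrained weights
b = np.zeros(8)
noise = generate_perturbation(seq_rep, W, b)
```

The output has the same 1 x n shape as the semantic representation, consistent with the dimension requirement stated above.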
The corresponding noise perturbation is generated by using the pre-trained desensitization model according to the sequence representation, the desensitization model being obtained through adversarial training of sensitive information identification according to the sample semantic representation and the sample desensitized semantic representation of the sample text. Generating noise perturbation targeted at the text to be processed with a desensitization model obtained through such adversarial training lays a foundation for the subsequent highly fair desensitized semantic representation.
Step 108: perform noise addition processing on the semantic representation by using the noise perturbation to obtain the desensitized semantic representation.
The desensitized semantic representation is a vector representation obtained by adding the noise perturbation to the semantic representation. The noise perturbation masks the sensitive information in the semantic representation.
Performing noise addition processing on the semantic representation by using the noise perturbation to obtain the desensitized semantic representation specifically means adding the noise perturbation to the semantic representation position by position, according to the vector positions between the semantic representation and the noise perturbation. The addition may be direct addition or weighted addition, which is not limited herein. For example, the text to be processed is "At / Mandeller's / luxury / Bay / bar / , / Saly / provided / me / with / premium / private / service"; its sequence representation is {Token1, Token2, ..., Token14}, its semantic representation is {T1, T2, ..., T14}, and its noise perturbation is {0, N2, 0, N4, 0, 0, 0, N8, 0, 0, 0, N12, N13, 0}. The desensitized semantic representation obtained after noise addition is {T1, T2+N2, T3, T4+N4, T5, T6, T7, T8+N8, T9, T10, T11, T12+N12, T13+N13, T14}; the noise perturbation effectively masks the sensitive information of "Mandeller", "luxury", "Saly", "premium", and "private".
Illustratively, the noise perturbation Feature_Noise is added to the semantic representation Feature_Semantic according to the vector positions between the two, resulting in the desensitized semantic representation Feature_FairSemantic.
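The position-wise addition above can be sketched directly (zeros in the noise vector mark positions without sensitive information; the `weight` parameter is an assumed knob for the weighted-addition variant mentioned in the text):

```python
import numpy as np

def add_noise(semantic, noise, weight=1.0):
    """Add the generated perturbation to the semantic representation,
    either directly (weight=1.0) or weighted, position by position."""
    return semantic + weight * noise

semantic = np.array([[0.5, -0.2, 0.9, 0.1]])
noise    = np.array([[0.0,  0.3, -0.4, 0.0]])  # zeros: no sensitive info
desensitized = add_noise(semantic, noise)
```

Only the positions carrying sensitive information change, which is what lets the perturbation mask them without disturbing the rest of the representation.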
Noise addition processing is performed on the semantic representation by using the noise perturbation to obtain the desensitized semantic representation. Using the noise perturbation generated for the text to be processed to add noise to the semantic representation effectively masks the sensitive information in it, yields a highly fair desensitized semantic representation, and lays a foundation for performing the subsequent highly fair text processing and obtaining a highly fair target processing result.
Step 110: and carrying out text processing according to the desensitization semantic representation to obtain a target processing result.
Text processing performs the corresponding vector representation processing according to the semantic representation and can achieve a target processing task. Text processing is implemented with a pre-trained neural network model, which may be a pre-trained text decision model or another pre-trained data decision model, such as a voice decision model or a video decision model, which is not limited herein. For example, if the target processing task is a text classification task, the text decision model is a text classification model; if the target processing task is an emotion recognition task, the text decision model is an emotion recognition model. The neural network model may be an MLP model, a CNN model, an RNN model, an LSTM model, a Transformer model, a BERT model, a RoBERTa model, an ALBERT model, a ChineseBERT model, or the like.
According to the desensitization semantic representation, text processing is carried out to obtain a target processing result, and specifically, according to the desensitization semantic representation, text processing is carried out by utilizing a pre-trained neural network model to obtain the target processing result.
Illustratively, text processing is performed according to the desensitized semantic representation Feature_FairSemantic by using a pre-trained Transformer model to obtain the target processing result ResultToTxt.
In the embodiments of this specification, a text to be processed is obtained; sequence coding and semantic coding are respectively performed on it to obtain its sequence representation and semantic representation; a pre-trained desensitization model, obtained through adversarial training of sensitive information identification on the sample semantic representation and the sample desensitized semantic representation of a sample text, generates the corresponding noise perturbation according to the sequence representation; the noise perturbation is used to add noise to the semantic representation to obtain the desensitized semantic representation; and text processing is performed according to the desensitized semantic representation to obtain the target processing result. The desensitization model obtained through adversarial training of sensitive information identification generates noise perturbation targeted at the text to be processed; adding this noise to the semantic representation effectively masks the sensitive information and yields a highly fair desensitized semantic representation, from which text processing yields a highly fair target processing result. Since the desensitization processing is related to the text to be processed rather than to the specific text processing, it has good universality, reduces processing cost, and improves processing efficiency.
Optionally, the desensitization model is the generator of a generative adversarial model.
The generative adversarial model includes a generator and a discriminator. The structure of the generator includes, but is not limited to, an MLP model structure, a CNN model structure, an RNN model structure, and a Transformer model structure; it takes the sequence representation as input and generates noise perturbation targeted at the semantic representation of the text to be processed. The structure of the discriminator includes, but is not limited to, an MLP model structure, a CNN model structure, an RNN model structure, and a Transformer model structure; it performs discrimination by identifying sensitive information from the real semantic representation and the noised semantic representation respectively and predicting the category of the sensitive information obtained.
The pre-training method for the generative adversarial model fixes one party while training the other, and trains the two parties alternately. Therefore, compared with other neural network models, it has the characteristics of completing training with a small number of samples, high stability (avoiding gradient explosion), high accuracy (resisting overfitting and underfitting), high efficiency, and the like.
Optionally, step 110 includes the following specific steps:
and according to the desensitization semantic representation, performing text processing by using a pre-trained text decision model to obtain a target processing result.
The text decision model is a neural network model that implements the target text processing task according to the semantic representation. It includes a semantic classification module, which semantically classifies the semantic representation to obtain the corresponding semantic category; the corresponding target text processing task is then carried out based on the semantic category to obtain the target processing result. Depending on the target text processing task it performs, the text decision model includes, but is not limited to, a text classification model, an emotion recognition model, a text generation model, a text search model, and a text content security recognition model.
According to the desensitization semantic representation, text processing is carried out by utilizing a pre-trained text decision model to obtain a target processing result, specifically, the pre-trained text decision model is utilized to carry out semantic classification on the desensitization semantic representation to obtain a corresponding semantic category, and text processing is carried out based on the semantic category to obtain the target processing result.
Illustratively, the pre-trained Transformer model is used to semantically classify the desensitized semantic representation Feature_FairSemantic to obtain the corresponding semantic category CategoryOfTxt, and text processing is performed based on the semantic category CategoryOfTxt to obtain the target processing result ResultToTxt.
Text processing is performed according to the desensitized semantic representation by using the pre-trained text decision model to obtain the target processing result. Performing text processing with a targeted pre-trained text decision model improves the accuracy and efficiency of text processing and reduces processing cost.
Optionally, the method further comprises the following specific steps before step 106:
acquiring a sample text set, wherein the sample text set comprises a plurality of sample texts;
extracting a first sample text from a sample text set, wherein the first sample text is any one of a plurality of sample texts;
performing sequence coding and semantic coding on the first sample text respectively to obtain a sample sequence representation and a sample semantic representation;
generating corresponding sample noise perturbation by using the generator of the generative adversarial model according to the sample sequence representation;
performing noise addition processing on the sample semantic representation by using the sample noise perturbation to obtain a sample desensitized semantic representation;
calculating a discrimination loss value of the sensitive information by using the discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitized semantic representation;
and adjusting model parameters of the generator and the discriminator according to the discrimination loss value, returning to the step of extracting the first sample text from the sample text set, and obtaining the trained generator when a preset training end condition is met.
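The pre-training steps above can be sketched as a toy loop. This is a hypothetical numpy stand-in: the encoder, generator, and discriminator are untrained stubs, the gradient updates are omitted, and the real models are the neural networks named in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(text_vec):
    """Stub encoder: returns (sample sequence rep, sample semantic rep)."""
    return text_vec, text_vec.mean(axis=0, keepdims=True)

def generate_noise(seq, theta_g):
    """Stub generator: per-token transform, then pooling to 1 x n."""
    return np.tanh(seq @ theta_g).mean(axis=0, keepdims=True)

def discriminate(sem, theta_d):
    """Stub discriminator: probability that sensitive info is recoverable."""
    p = 1.0 / (1.0 + np.exp(-(sem @ theta_d).item()))
    return float(np.clip(p, 1e-6, 1 - 1e-6))

theta_g = rng.normal(size=(4, 4)) * 0.1
theta_d = rng.normal(size=4) * 0.1
losses = []
for step in range(3):                       # a few toy iterations
    sample = rng.normal(size=(6, 4))        # one encoded sample text
    seq, sem = encode(sample)
    noise = generate_noise(seq, theta_g)    # targeted sample noise
    desens = sem + noise                    # sample desensitized rep
    p_real = discriminate(sem, theta_d)
    p_fake = discriminate(desens, theta_d)
    # The discriminator should separate the two representations;
    # the generator is trained to prevent that separation.
    loss = -(np.log(p_real) + np.log(1.0 - p_fake))
    losses.append(loss)
    # (alternating gradient updates of theta_g / theta_d omitted)
```

The loop mirrors the listed steps: encode, generate noise, add noise, compute the discrimination loss, then (in the real method) alternately adjust the generator and the discriminator until the training end condition is met.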
At present, the following technical schemes address the problems of deviation and bias in the semantic coding of natural language models: (1) adjusting the sample text distribution through data rebalancing techniques to eliminate deviation and bias; (2) designing regularization for a specific model and a specific model bias to mitigate the deviation and bias introduced in model training. However, the above methods have the following problems. Regarding (1): the identification accuracy directly influences the de-biasing effect, and when the identification accuracy is insufficient, the de-biasing effect is insufficient. Regarding (2): model regularization can only resolve a specific model bias and has neither transferability nor universality.
Compared with the current methods, this scheme neither needs to identify the sensitive information in the sample texts in advance nor needs large-scale sample texts, i.e., the sample text set does not need to be expanded through data enhancement; targeted sample noise perturbation can be generated from a small-scale set of sample texts, improving both training effect and training efficiency.
The sample text set is the set of sample texts used to pre-train the generator of the generative adversarial model and includes a plurality of sample texts. The sample text set contains no label data for the sample texts, so the pre-training of the generative adversarial model is unsupervised training. A sample text contains sensitive words, so the semantic representation obtained by semantic coding contains sensitive information; the sample text may or may not be a natural language text corresponding to the target text processing task. The sample text may be a natural language history text stored in a local database, or a natural language text stored in a remote database (the remote database may be an open-source database), or may be generated by a text generation model with a text generation function, which is not limited herein.
The discrimination loss value of the sensitive information is the loss value between the pieces of sensitive information predicted from the different semantic representations input to the discriminator.
The preset training ending condition is a preset training ending judgment condition, may be a preset training iteration number threshold, may be a preset discrimination loss value threshold, and may be a preset convergence condition for generating an countermeasure model, which is not limited herein.
Performing sequence coding and semantic coding on the first sample text respectively to obtain the sample sequence representation and the sample semantic representation specifically means using a pre-trained natural language model to perform the two encodings on the first sample text.
Generating the corresponding sample noise perturbation by using the generator of the generative adversarial model according to the sample sequence representation specifically means inputting the sample sequence representation into the generator and generating the corresponding sample noise perturbation based on the sensitive information in the sample sequence representation.
Performing noise addition on the sample semantic representation by using the sample noise perturbation to obtain the sample desensitized semantic representation specifically means adding the sample noise perturbation to the sample semantic representation according to the vector positions between the two.
Calculating the discrimination loss value of the sensitive information by using the discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitized semantic representation specifically means performing sensitive information identification with the discriminator on both representations and calculating the discrimination loss value of the sensitive information according to the identification results.
Adjusting the model parameters of the generator and the discriminator according to the discrimination loss value specifically means alternately adjusting them: one party's model parameters are fixed while the other party's are adjusted, and whenever either party's model parameters are adjusted, a gradient update method is used.
Illustratively, a sample text set is obtained and a first sample text Text_1 is extracted from it. The first sample text Text_1 is respectively sequence-coded and semantically coded using a pre-trained BERT model to obtain a sample sequence representation Feature_SampleSequence1 and a sample semantic representation Feature_SampleSemantic1. The sample sequence representation Feature_SampleSequence1 is input into the generator of the generative adversarial model, and a corresponding sample noise perturbation Feature_SampleNoise1 is generated based on the sensitive information in it. According to the vector positions between the sample semantic representation Feature_SampleSemantic1 and the sample noise perturbation Feature_SampleNoise1, the sample noise perturbation is added to the sample semantic representation to obtain a sample desensitized semantic representation Feature_SampleFairSemantic1. Sensitive information identification is performed with the discriminator of the generative adversarial model according to Feature_SampleSemantic1 and Feature_SampleFairSemantic1, and a discrimination Loss value Loss of the sensitive information is calculated according to the identification results. The model parameters of the generator and the discriminator are alternately adjusted according to the discrimination Loss value Loss, execution returns to the step of extracting the first sample text Text_1 from the sample text set, and the trained generator is obtained when the preset convergence condition of the generative adversarial model is met.
A sample text set including a plurality of sample texts is obtained; a first sample text, which is any one of the plurality of sample texts, is extracted from the set; sequence coding and semantic coding are respectively performed on it to obtain a sample sequence representation and a sample semantic representation; the generator of the generative adversarial model generates the corresponding sample noise perturbation according to the sample sequence representation; the sample noise perturbation is used to add noise to the sample semantic representation to obtain a sample desensitized semantic representation; the discriminator of the generative adversarial model calculates a discrimination loss value of the sensitive information according to the sample semantic representation and the sample desensitized semantic representation; model parameters of the generator and the discriminator are adjusted according to the discrimination loss value; execution returns to the step of extracting the first sample text from the sample text set; and the trained generator is obtained when the preset training end condition is met.
According to the sample sequence representation, the generator of the generative adversarial model generates noise perturbation targeted at the sample text; adding this noise to the sample semantic representation masks the sensitive information in it and yields the sample desensitized semantic representation. Then, according to the sample semantic representation and the sample desensitized semantic representation, the discriminator of the generative adversarial model calculates a discrimination loss value of the sensitive information, with which the generator and the discriminator are adversarially trained. The result is a generator that can produce highly targeted noise perturbation, so that subsequent text processing can obtain a highly fair target processing result, with good universality.
Optionally, calculating the discrimination loss value of the sensitive information by using the discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitized semantic representation includes the following specific steps:
predicting corresponding first sensitive information by using the discriminator of the generative adversarial model according to the sample semantic representation;
predicting corresponding second sensitive information by using the discriminator according to the sample desensitized semantic representation;
and calculating to obtain a discrimination loss value of the sensitive information according to the first sensitive information and the second sensitive information.
Sensitive information identification with the discriminator of the generative adversarial model means predicting the sensitive information corresponding to a semantic representation. Because of the pre-training mode of the generative adversarial network, the discriminator has high identification accuracy, which ensures that the trained generator de-biases accurately.
According to the first sensitive information and the second sensitive information, the discrimination loss value of the sensitive information is calculated, and the discrimination loss value of the sensitive information is calculated by utilizing a preset loss value calculation algorithm according to the first sensitive information and the second sensitive information. The preset loss value calculation algorithm includes, but is not limited to, a cross entropy loss value calculation algorithm, a CTC loss value calculation algorithm, a cosine similarity value calculation algorithm, and an L1 loss value calculation algorithm.
Illustratively, according to the sample semantic representation Feature_SampleSemantic1, corresponding first sensitive information is predicted by using the discriminator of the generative adversarial model; according to the sample desensitized semantic representation Feature_SampleFairSemantic1, corresponding second sensitive information is predicted by using the discriminator; and according to the first sensitive information and the second sensitive information, the discrimination Loss value Loss of the sensitive information is calculated with a cross-entropy loss value calculation algorithm.
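As an illustration of the cross-entropy option named above, with assumed toy predictions (the sensitive-information categories and probability values are hypothetical, not from the text):

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy over predicted sensitive-information categories,
    one of the loss algorithms the text lists."""
    probs = np.clip(probs, 1e-9, 1.0)
    return float(-np.sum(labels * np.log(probs)) / len(labels))

# First prediction (plain semantic rep) vs. second (desensitized rep).
first_pred  = np.array([0.9, 0.05, 0.05])   # confidently recovers category
second_pred = np.array([0.34, 0.33, 0.33])  # near-uniform after noising
label       = np.array([1.0, 0.0, 0.0])     # true sensitive category
loss_first  = cross_entropy(first_pred, label)
loss_second = cross_entropy(second_pred, label)
```

A well-trained generator drives the second loss up: after noising, the discriminator can no longer confidently recover the sensitive category, which is exactly the training signal described here.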
Corresponding first sensitive information is predicted by using the discriminator of the generative adversarial model according to the sample semantic representation; corresponding second sensitive information is predicted by using the discriminator according to the sample desensitized semantic representation; and the discrimination loss value of the sensitive information is calculated according to the two. This improves the accuracy of the discrimination loss value, which in turn improves the accuracy of adjusting the model parameters of the generator and the discriminator and the noise perturbation generation accuracy of the trained generator.
Optionally, adjusting the model parameters of the generator and the discriminator according to the discrimination loss value includes the following specific step:

alternately adjusting the model parameters of the generator and the discriminator according to the discrimination loss value and a preset adversarial training strategy.
The preset adversarial training strategy is a preset fine-tuning strategy for alternate adjustment. For example, the preset adversarial training strategy is: adjust the model parameters of the discriminator once for every K adjustments of the model parameters of the generator. Generally, the higher the frequency of adjusting the generator's model parameters, the faster the pre-training converges; the lower the frequency of adjusting the generator's model parameters, i.e., the higher the frequency of adjusting the discriminator's, the higher the discriminator's accuracy and thus the higher the noise perturbation generation accuracy of the trained generator. The specific setting is based on the actual text processing requirements.

Illustratively, the model parameters of the generator and the discriminator are alternately adjusted based on the discrimination Loss value Loss according to the preset adversarial training strategy (the discriminator's model parameters are adjusted once for every K adjustments of the generator's model parameters).
And according to the discrimination loss value and a preset countermeasure training strategy, alternately adjusting model parameters of the generator and the discriminator. The training effect and the demand adaptation degree of the pre-training of the generator are improved.
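The "K generator updates, then one discriminator update" schedule described above can be sketched as a simple plan generator; the function name and return format are illustrative, not from the patent.

```python
def alternating_schedule(total_steps, k):
    """Plan which model to update at each pre-training step:
    K generator updates followed by one discriminator update, repeated."""
    return ["generator" if step % (k + 1) != k else "discriminator"
            for step in range(total_steps)]
```

For example, `alternating_schedule(8, 3)` yields three generator updates, a discriminator update, and then the pattern repeats, matching the K=3 strategy.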
Optionally, before returning to the step of extracting the first sample text from the sample text set, the method further comprises the following specific steps:
According to the sample semantic representation and the sample desensitization semantic representation, calculating to obtain a semantic loss value by utilizing a pre-trained text decision model;
model parameters of the generator and the text decision model are adjusted according to the semantic loss value.
Without considering text processing, adding noise to the semantic representation through noise disturbance alone may mask part of the semantic information in the representation, so that the semantic representation used for text processing has high fairness but insufficient accuracy. Therefore, on the basis of pre-training the generator of the generative adversarial model, a text decision model for text processing is jointly pre-trained, ensuring that the fairness constraint does not impair the accuracy of text processing. By analogy, if the text to be processed is regarded as a segment of speech, the noise disturbance performs noise-cancellation on the sensitive words in the speech: although the sensitive words are effectively masked, some semantic content of the speech is lost, so the semantic content must be complemented on top of the noise-cancellation for the processed speech to be understood correctly.
In the pre-training process of the generator, the text decision model can be regarded as a discriminator of semantic recognition accuracy: the discriminator of the generative adversarial model ensures that the sensitive information differs strongly between the desensitization semantic representation and the semantic representation, while the text decision model ensures that the semantic information remains highly consistent between the two. A balance between them is achieved by setting a reasonable pre-training strategy.
A semantic loss value is calculated from the sample semantic representation and the sample desensitization semantic representation using the pre-trained text decision model. Specifically, semantic classification is performed on each representation using the pre-trained text decision model, and the semantic loss value is calculated from the classification results.
The model parameters of the generator and the text decision model are adjusted according to the semantic loss value; specifically, they are adjusted alternately. That is, the model parameters of one model are fixed while those of the other are adjusted, and whichever model is being adjusted, its parameters are updated by a gradient update method.
Illustratively, according to the sample semantic representation feature_sample_Semantic1 and the sample desensitization semantic representation feature_sample_FairSemantic1, semantic classification is performed using a pre-trained Transformer model, a semantic Loss value Loss' is calculated from the classification results, and the model parameters of the generator and the text decision model are alternately adjusted according to the semantic Loss value Loss'.
According to the sample semantic representation and the sample desensitization semantic representation, calculating to obtain a semantic loss value by utilizing a pre-trained text decision model, and adjusting model parameters of a generator and the text decision model according to the semantic loss value. On the basis of obtaining a generator capable of generating highly targeted noise disturbance, not only is high fairness of target processing results obtained by subsequent text processing ensured, but also high accuracy of the target processing results is ensured.
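The "fix one model, gradient-update the other" step can be illustrated with a toy stand-in for the semantic loss. This is a sketch only: the squared norm of the generated noise substitutes for the text decision model's disagreement between the original and desensitized representations (since desensitized = semantic + noise), and the names are hypothetical.

```python
import numpy as np

def gradient_step(params, grad, lr=0.1):
    """Single gradient-descent update, as used when one model's
    parameters are fixed and the other's are adjusted."""
    return params - lr * grad(params)

def semantic_loss(noise):
    """Toy semantic loss: large noise means the desensitized
    representation drifts far from the semantic representation."""
    return float(np.sum(noise ** 2))

noise = np.array([0.8, -0.3, 0.2])
before = semantic_loss(noise)
# With the decision model fixed, repeatedly adjust the generator's output
# (the noise) by gradient descent; the gradient of sum(n^2) is 2n.
for _ in range(50):
    noise = gradient_step(noise, lambda n: 2.0 * n)
after = semantic_loss(noise)
```

In the real training loop the roles then swap: the generator's parameters are fixed and the text decision model is updated, per the alternating strategy.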
Optionally, according to the sample semantic representation and the sample desensitization semantic representation, calculating to obtain a semantic loss value by using a pre-trained text decision model, wherein the method comprises the following specific steps of:
according to the sample semantic representation, semantic classification is carried out by utilizing a pre-trained text decision model, and a corresponding first semantic category is obtained;
according to the sample desensitization semantic representation, semantic classification is carried out by utilizing a text decision model, and a corresponding second semantic category is obtained;
and calculating according to the first semantic category and the second semantic category to obtain a semantic discrimination loss value.
Semantic classification using the text decision model predicts the corresponding semantic category from a semantic representation. Owing to the pre-training in the generative adversarial manner, the text decision model attains high classification accuracy, so the trained generator can accurately preserve semantic information.
According to the first semantic category and the second semantic category, calculating to obtain a semantic discrimination loss value, wherein the semantic discrimination loss value is calculated according to the first semantic category and the second semantic category by using a preset loss value calculation algorithm. The preset loss value calculation algorithm includes, but is not limited to, a cross entropy loss value calculation algorithm, a CTC loss value calculation algorithm, a cosine similarity value calculation algorithm, and an L1 loss value calculation algorithm.
Illustratively, according to the sample semantic representation feature_sample_Semantic1, semantic classification is performed using the pre-trained Transformer model to obtain a corresponding first semantic category; according to the sample desensitization semantic representation feature_sample_FairSemantic1, semantic classification is performed using the pre-trained Transformer model to obtain a corresponding second semantic category; and a semantic discrimination Loss value Loss' is calculated from the first semantic category and the second semantic category using a cross-entropy loss value calculation algorithm.
According to the sample semantic representation, semantic classification is carried out by utilizing a pre-trained text decision model to obtain a corresponding first semantic category, according to the sample desensitization semantic representation, semantic classification is carried out by utilizing the text decision model to obtain a corresponding second semantic category, and according to the first semantic category and the second semantic category, a semantic discrimination loss value is obtained through calculation. The accuracy of the semantic loss value is improved, the accuracy of adjusting model parameters of the generator and the text decision model is further improved, and the accuracy of retaining semantic information by noise disturbance generated by the generator obtained through training is improved.
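A few of the candidate loss algorithms named above (cross-entropy, cosine similarity, L1) can be sketched over two category distributions; the example distributions `first` and `second` are hypothetical.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy between a target distribution p and a prediction q."""
    return float(-np.sum(p * np.log(q + 1e-12)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the two vectors are aligned."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1_loss(a, b):
    return float(np.sum(np.abs(a - b)))

first = np.array([0.7, 0.2, 0.1])   # first semantic category distribution
second = np.array([0.6, 0.3, 0.1])  # second semantic category distribution
```

Any of these can serve as the semantic discrimination loss; a smaller value indicates the desensitized representation preserved the original semantic category better.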
Optionally, adjusting the model parameters of the generator and the text decision model according to the semantic loss value includes the following specific steps:
alternately adjusting the model parameters of the generator and the text decision model according to the semantic loss value and a preset adversarial training strategy.
The preset adversarial training strategy is a preset fine-tuning strategy for alternate adjustment. For example, the preset adversarial training strategy is: adjust the model parameters of the text decision model once for every M adjustments of the model parameters of the generator. Generally, the higher the frequency at which the generator's model parameters are adjusted, the faster the pre-training converges; the lower the frequency at which the generator's model parameters are adjusted, i.e., the higher the frequency at which the text decision model's model parameters are adjusted, the higher the accuracy of the text decision model, and thus the better the noise disturbance of the trained generator preserves semantic information. The specific setting is chosen based on the actual text processing requirement.
Illustratively, based on the semantic Loss value Loss', the model parameters of the generator and the text decision model are alternately adjusted according to the preset adversarial training strategy (the model parameters of the text decision model are adjusted once for every M adjustments of the model parameters of the generator).
The model parameters of the generator and the text decision model are alternately adjusted according to the semantic loss value and the preset adversarial training strategy, which improves the training effect of the generator's pre-training and its adaptation to requirements.
Referring to fig. 2, fig. 2 shows a flowchart of a text classification method according to an embodiment of the present disclosure, including the following specific steps:
step 202: acquiring a text to be classified;
step 204: respectively carrying out sequence coding and semantic coding on the text to be classified to obtain sequence characterization and semantic characterization of the text to be classified;
step 206: generating a corresponding noise disturbance according to the sequence characterization by using a pre-trained desensitization model, wherein the desensitization model is obtained by adversarial training of sensitive information recognition according to the sample semantic characterization of a sample text and the sample desensitization semantic characterization of the sample text;
step 208: noise disturbance is utilized to carry out noise adding processing on the semantic representation to obtain desensitized semantic representation;
step 210: and carrying out text classification according to the desensitization semantic representation to obtain a text classification result.
The embodiment of the specification is applied to a client or a server of an application with a text classification function.
The text to be classified is a natural language text which needs to be classified, and the text to be classified can be a natural language text which is directly obtained or can be a natural language text which is obtained by carrying out text recognition on other modal data such as voice data, video data and the like. Text to be classified is a word sequence consisting of a plurality of words.
Text classification performs classification of the corresponding vector characterization according to the semantic characterization; it can realize a target text classification task and is implemented using a pre-trained neural network model, where the neural network model may be a pre-trained text classification model or another pre-trained data classification model, such as an MLP model, a CNN model, an RNN model, an LSTM model, a Transformer model, a BERT model, a RoBERTa model, an ALBERT model, a Chinese BERT model, or the like.
The specific manner of steps 204 to 208 is described in detail in the embodiment of fig. 1, and will not be described here again.
In the embodiment of the specification, a text to be classified is obtained; sequence coding and semantic coding are respectively performed on the text to be classified to obtain a sequence representation and a semantic representation of the text to be classified; a corresponding noise disturbance is generated according to the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training of sensitive information recognition according to the sample semantic representation of a sample text and the sample desensitization semantic representation of the sample text; noise-adding processing is performed on the semantic representation using the noise disturbance to obtain a desensitization semantic representation; and text classification is performed according to the desensitization semantic representation to obtain a text classification result. A noise disturbance targeted at the text to be classified is generated using the desensitization model obtained through adversarial training of sensitive information recognition, noise-adding processing is performed on the semantic representation using the noise disturbance, and the sensitive information in the semantic representation is effectively masked to obtain a high-fairness desensitization semantic representation; text classification is then performed according to the high-fairness desensitization semantic representation to obtain a high-fairness classification result, which reduces classification cost and improves classification efficiency.
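Steps 202 to 210 can be sketched as a short pipeline. Every function here is a hypothetical placeholder (random vectors stand in for the pre-trained encoders, generator, and classifier); the sketch only shows how the representations flow between the steps.

```python
import numpy as np

rng = np.random.default_rng(1)

def sequence_encode(text):   # placeholder for token-level (sequence) encoding
    return rng.normal(size=8)

def semantic_encode(text):   # placeholder for a pre-trained semantic encoder
    return rng.normal(size=8)

def desensitize(sequence_rep):  # placeholder for the trained generator
    return rng.normal(scale=0.1, size=sequence_rep.shape)

def classify(rep, W):
    return int(np.argmax(W @ rep))

W = rng.normal(size=(3, 8))  # toy 3-class text classifier

text = "text to be classified"
seq_rep = sequence_encode(text)   # step 204: sequence characterization
sem_rep = semantic_encode(text)   # step 204: semantic characterization
noise = desensitize(seq_rep)      # step 206: generate noise disturbance
fair_rep = sem_rep + noise        # step 208: noise-adding processing
label = classify(fair_rep, W)     # step 210: classify the desensitized representation
```

The key point of the pipeline is that the classifier in step 210 never sees the raw semantic representation, only the desensitized one.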
Referring to fig. 3, fig. 3 shows a flowchart of an emotion recognition method according to an embodiment of the present disclosure, including the following specific steps:
step 302: acquiring a text to be identified;
step 304: respectively carrying out sequence coding and semantic coding on the text to be identified to obtain sequence characterization and semantic characterization of the text to be identified;
step 306: generating a corresponding noise disturbance according to the sequence characterization by using a pre-trained desensitization model, wherein the desensitization model is obtained by adversarial training of sensitive information recognition according to the sample semantic characterization of a sample text and the sample desensitization semantic characterization of the sample text;
step 308: noise disturbance is utilized to carry out noise adding processing on the semantic representation to obtain desensitized semantic representation;
step 310: and carrying out emotion recognition according to the desensitization semantic representation to obtain an emotion recognition result.
The embodiment of the specification is applied to a client or a server of an application with emotion recognition function.
The text to be recognized is a natural language text which needs emotion recognition, and the text to be recognized can be a natural language text which is obtained by directly obtaining the text or can be a natural language text which is obtained by carrying out text recognition on other modal data such as voice data, video data and the like. The text to be recognized is a word sequence consisting of a plurality of words.
Emotion recognition performs recognition on the corresponding vector characterization according to the semantic characterization; it can realize a target recognition task and is implemented using a pre-trained neural network model, where the neural network model may be a pre-trained text emotion recognition model or another pre-trained data emotion recognition model, such as an MLP model, a CNN model, an RNN model, an LSTM model, a Transformer model, a BERT model, a RoBERTa model, an ALBERT model, a Chinese BERT model, or the like.
The specific manner of steps 304 to 308 is described in detail in the embodiment of fig. 1, and will not be described here again.
In the embodiment of the specification, a text to be identified is obtained; sequence coding and semantic coding are respectively performed on the text to be identified to obtain a sequence representation and a semantic representation of the text to be identified; a corresponding noise disturbance is generated according to the sequence representation using a pre-trained desensitization model, where the desensitization model is obtained by adversarial training of sensitive information recognition according to the sample semantic representation of a sample text and the sample desensitization semantic representation of the sample text; noise-adding processing is performed on the semantic representation using the noise disturbance to obtain a desensitization semantic representation; and emotion recognition is performed according to the desensitization semantic representation to obtain an emotion recognition result. A noise disturbance targeted at the text to be identified is generated using the desensitization model obtained through adversarial training of sensitive information recognition, noise-adding processing is performed on the semantic representation using the noise disturbance, and the sensitive information in the semantic representation is effectively masked to obtain a high-fairness desensitization semantic representation; emotion recognition is then performed according to the high-fairness desensitization semantic representation to obtain a high-fairness recognition result, which reduces recognition cost and improves recognition efficiency.
Referring to fig. 4, fig. 4 shows a flowchart of a data processing method for text processing, where the method is applied to cloud-side equipment, and includes the following specific steps:
step 402: acquiring a sample text set, wherein the sample text set comprises a plurality of sample texts;
step 404: extracting a first sample text from a sample text set, wherein the first sample text is any one of a plurality of sample texts;
step 406: performing sequence coding and semantic coding on the first sample text to obtain a sample sequence characterization and a sample semantic characterization;
step 408: generating a corresponding sample noise disturbance according to the sample sequence characterization by using the generator of a generative adversarial model;
step 410: carrying out noise adding processing on the sample semantic representation by using sample noise disturbance to obtain a sample desensitization semantic representation;
step 412: calculating a discrimination loss value of sensitive information according to the sample semantic characterization and the sample desensitization semantic characterization by using the discriminator of the generative adversarial model;
step 414: according to the discrimination loss value, model parameters of the generator and the discriminator are adjusted, the step of extracting a first sample text from the sample text set is returned to be executed, and under the condition that the preset training ending condition is met, the generator with the training completed is obtained;
Step 416: the model parameters of the generator are sent to the end-side device.
The cloud-side device is a cloud device on the network that provides a model training function and is a virtual device. The end-side device is a terminal device running an application that provides a text processing function and is a physical device. The cloud-side device and the end-side device are connected through a network transmission channel for data transmission.
The specific manner of steps 402 to 416 is already described in detail in the embodiment of fig. 1, and is not repeated here.
In the embodiment of the specification, a sample text set is obtained, where the sample text set includes a plurality of sample texts; a first sample text, which is any one of the plurality of sample texts, is extracted from the sample text set; sequence coding and semantic coding are respectively performed on the first sample text to obtain a sample sequence characterization and a sample semantic characterization; a corresponding sample noise disturbance is generated according to the sample sequence characterization using the generator of a generative adversarial model; noise-adding processing is performed on the sample semantic characterization using the sample noise disturbance to obtain a sample desensitization semantic characterization; a discrimination loss value of sensitive information is calculated according to the sample semantic characterization and the sample desensitization semantic characterization using the discriminator of the generative adversarial model; the model parameters of the generator and the discriminator are adjusted according to the discrimination loss value, and the step of extracting the first sample text from the sample text set is executed again; the trained generator is obtained when a preset training end condition is met, and the model parameters of the generator are sent to the end-side device.
According to the sample sequence characterization, the generator of the generative adversarial model generates a noise disturbance targeted at the sample text; noise-adding processing is performed on the sample semantic characterization using the noise disturbance to mask the sensitive information therein and obtain the sample desensitization semantic characterization; a discrimination loss value of the sensitive information is then calculated according to the sample semantic characterization and the sample desensitization semantic characterization using the discriminator of the generative adversarial model, and the discrimination loss value is used to adversarially train the generator and the discriminator, yielding a generator that produces highly targeted noise disturbances, so that subsequent text processing can obtain high-fairness processing results. The whole pre-training process runs on the cloud-side device, which saves model training cost for the end-side device and improves model training efficiency.
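Step 416 (sending the generator's model parameters to the end-side device) implies some serialization over the network transmission channel. The patent does not specify a format; a minimal sketch using JSON over bytes, with hypothetical parameter names, is shown below.

```python
import json

def serialize_params(params):
    """Pack generator parameters for transmission to the end-side device."""
    return json.dumps(params).encode("utf-8")

def deserialize_params(blob):
    """Unpack parameters received on the end-side device."""
    return json.loads(blob.decode("utf-8"))

# Hypothetical generator parameters: a small weight matrix and bias.
generator_params = {"W": [[0.1, -0.2], [0.3, 0.0]], "b": [0.05, -0.01]}
blob = serialize_params(generator_params)       # cloud side: pack and send
restored = deserialize_params(blob)             # end side: receive and unpack
```

In practice a framework-native format (e.g. a serialized state dict) would likely be used instead; the round trip above only illustrates that the end-side device reconstructs the same parameters the cloud-side device trained.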
Fig. 5 is a flow chart of a pre-training method for the generative adversarial model in a text processing method according to an embodiment of the present disclosure.
As shown in fig. 5, in the pre-training process of the generator of the generative adversarial model, the first sample text is input into the pre-trained BERT model, and sequence coding and semantic coding are performed respectively to obtain a sample sequence representation {Token1, Token2, …, TokenN} and a sample semantic representation {T1, T2, …, TN}; according to the sample sequence representation, a corresponding sample noise disturbance is generated by the generator, and the sample noise disturbance is used to perform noise-adding processing on the sample semantic representation to obtain a sample desensitization semantic representation.
Fairness pre-training phase: and according to the sample semantic representation, carrying out sensitive information prediction by using a discriminator to obtain corresponding first sensitive information, according to the sample desensitization semantic representation, carrying out sensitive information prediction by using the discriminator to obtain corresponding second sensitive information, calculating to obtain a discrimination loss value of the sensitive information according to the first sensitive information and the second sensitive information, and alternately adjusting model parameters of a generator and the discriminator according to the discrimination loss value.
Semantic pre-training phase: according to the sample semantic representation, carrying out semantic classification by using a pre-trained text decision model to obtain a corresponding first semantic category, according to the sample desensitization semantic representation, carrying out semantic classification by using the pre-trained text decision model to obtain a corresponding second semantic category, calculating to obtain a semantic loss value according to the first semantic category and the second semantic category, and alternately adjusting model parameters of the generator and the pre-trained text decision model according to the semantic loss value.
Note that the solid line represents a high-fairness (desensitized) data stream, and the broken line represents a data stream with bias and deviation.
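The two stages in fig. 5 (fairness pre-training, then semantic pre-training) can be sketched as a loop skeleton. Everything here is schematic: the encoding, loss computation, and parameter updates are elided, and the function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def pretrain(sample_texts, epochs=3):
    """Skeleton of the two-stage pre-training in fig. 5: each epoch runs
    the fairness stage (generator vs. discriminator) and then the
    semantic stage (generator vs. text decision model)."""
    log = []
    for _ in range(epochs):
        for stage in ("fairness", "semantic"):
            # extract a sample text (any one of the sample text set)
            text = sample_texts[rng.integers(len(sample_texts))]
            # ... encode `text` with BERT, generate the sample noise
            # disturbance, perform noise-adding, compute this stage's loss,
            # and alternately update the two models involved (omitted)
            log.append(stage)
    return log

log = pretrain(["sample one", "sample two"])
```

The schedule alternates stages so that neither the fairness constraint nor the semantic-consistency constraint dominates, matching the balance described above.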
FIG. 6 shows a schematic diagram of semantic characterization and desensitization semantic characterization in a text processing method according to one embodiment of the present description.
As shown in fig. 6, for the text to be identified "in/mandeller/luxury/bay/bar, sally/for/i/provide/premium/private/service", among the biased semantic representations of the text, "mandeller", "luxury", "sally", "premium" and "private" are word vector representations containing sensitive information (shown filled in black in the figure), "bar", "provide" and "service" are word vector representations possibly containing sensitive information (shown filled with black diagonal lines), and "in", "bay", "yes" and "me" are word vector representations containing no sensitive information (shown unfilled). Using the biased semantic representation for text processing results in bias and deviation in the processing results. With the text processing method of the embodiment of fig. 1, after the corresponding noise disturbance is generated, noise-adding processing is performed on the biased semantic representation to obtain the unbiased, i.e. desensitized, semantic representation of the text to be identified, in which "mandeller" and "premium" are word vector representations possibly containing sensitive information (shown filled with black diagonal lines) and the other word vectors contain no sensitive information (shown unfilled).
The application of the text processing method provided in the present specification to occupational discrimination in intelligent question answering is further described below with reference to fig. 7. Fig. 7 is a flowchart of a processing procedure of a text processing method applied to occupational discrimination in intelligent question answering according to an embodiment of the present disclosure, and specifically includes the following steps.
Step 702: acquiring a sample text set;
wherein the sample text set includes a plurality of sample texts for intelligent questions and answers, any of the sample texts containing occupational-sensitive words.
Step 704: extracting a first sample text from a sample text set;
wherein the first sample text is any one of a plurality of sample texts.
Step 706: performing sequence coding and semantic coding on the first sample text by using a pre-trained BERT model to obtain a sample sequence characterization and a sample semantic characterization;
step 708: generating a corresponding sample noise disturbance according to the sample sequence characterization by using the generator of a generative adversarial model;
step 710: carrying out noise adding processing on the sample semantic representation by using sample noise disturbance to obtain a sample desensitization semantic representation;
step 712: calculating a discrimination loss value of sensitive information according to the sample semantic characterization and the sample desensitization semantic characterization by using the discriminator of the generative adversarial model;
Step 714: alternately adjusting the model parameters of the generator and the discriminator according to the discrimination loss value and a preset adversarial training strategy;
step 716: according to the sample semantic representation and the sample desensitization semantic representation, calculating to obtain a semantic loss value by utilizing a pre-trained intelligent question-answer model;
step 718: according to the semantic loss value, model parameters of the generator and the intelligent question-answering model are adjusted;
step 720: returning to the step of extracting the first sample text from the sample text set, and determining that the trained generator is a desensitization model under the condition that the preset convergence condition is met;
step 722: receiving a problem text sent by a front end;
wherein the question text may contain occupational sensitive words.
Step 724: respectively carrying out sequence coding and semantic coding on the problem text to obtain sequence characterization and semantic characterization of the problem text;
step 726: generating corresponding noise disturbance by using a desensitization model according to the sequence characterization;
step 728: noise disturbance is utilized to carry out noise adding processing on the semantic representation to obtain desensitized semantic representation;
step 730: generating a reply text aiming at the question text by utilizing an intelligent question-answering model according to the desensitization semantic representation;
Step 732: the reply text is sent to the front end.
In the embodiment of the specification, the noise disturbance aiming at the problem text is generated by using the desensitization model obtained through the countermeasure training of the sensitive information identification, the noise disturbance is used for carrying out noise adding processing on the semantic representation, the sensitive information in the semantic representation is effectively covered, the high-fairness desensitization semantic representation is obtained, then the intelligent reply is carried out according to the high-fairness desensitization semantic representation, the high-fairness reply text is obtained, the intelligent reply cost is reduced, the intelligent reply efficiency is improved, and the user experience is improved.
It should be noted that, the information and data such as the text to be processed, the text to be classified, the text to be identified, the natural language model, the desensitization model, the generation countermeasure network, the text decision model, the sample set, the sample text and the like in the embodiment of the method are all information and data authorized by the user or fully authorized by all parties, and the collection, the use and the processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a text processing device, and fig. 8 shows a schematic structural diagram of a text processing device provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
A first obtaining module 802 configured to obtain text to be processed;
the first encoding module 804 is configured to perform sequence encoding and semantic encoding on the text to be processed respectively, so as to obtain sequence characterization and semantic characterization of the text to be processed;
a first generation module 806 configured to generate a corresponding noise disturbance according to the sequence characterization using a pre-trained desensitization model, wherein the desensitization model is obtained by adversarial training of sensitive information recognition according to the sample semantic characterization of a sample text and the sample desensitization semantic characterization of the sample text;
a first denoising module 808 configured to perform denoising processing on the semantic representation using noise disturbance to obtain a desensitized semantic representation;
and the processing module 810 is configured to perform text processing according to the desensitization semantic representation to obtain a target processing result.
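As a rough sketch of the flow through modules 802 to 810, the following toy pipeline encodes a text twice, derives a noise disturbance from the sequence representation, and adds it to the semantic representation. The random linear encoders, the tanh generator, the vocabulary size, and the mean pooling are illustrative assumptions, not the patent's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 16, 4
W_seq = rng.normal(size=(VOCAB, DIM))  # stand-in sequence encoder
W_sem = rng.normal(size=(VOCAB, DIM))  # stand-in semantic encoder
W_gen = rng.normal(size=(DIM, DIM))    # stand-in desensitization generator

def desensitize(token_ids):
    """Encode the text twice, generate a noise disturbance from the
    sequence representation, and add it to the semantic representation."""
    one_hot = np.eye(VOCAB)[token_ids]              # (T, VOCAB)
    seq_repr = one_hot @ W_seq                      # sequence representation
    sem_repr = (one_hot @ W_sem).mean(axis=0)       # pooled semantic representation
    noise = np.tanh(seq_repr @ W_gen).mean(axis=0)  # generator output
    return sem_repr + noise                         # desensitized representation

repr_out = desensitize([1, 3, 5])
print(repr_out.shape)  # (4,)
```

The desensitized vector, not the raw semantic representation, is what the downstream text decision model would consume.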
Optionally, the desensitization model is a generator of a generative adversarial model.
Optionally, the processing module 810 is further configured to:
and according to the desensitization semantic representation, performing text processing by using a pre-trained text decision model to obtain a target processing result.
Optionally, the apparatus further comprises:
a first pre-training module configured to obtain a sample text set, wherein the sample text set comprises a plurality of sample texts; extract a first sample text from the sample text set, wherein the first sample text is any one of the plurality of sample texts; perform sequence encoding and semantic encoding on the first sample text to obtain a sample sequence characterization and a sample semantic characterization; generate a corresponding sample noise disturbance according to the sample sequence characterization using a generator of a generative adversarial model; perform noise-adding processing on the sample semantic characterization using the sample noise disturbance to obtain a sample desensitization semantic characterization; calculate a discrimination loss value of sensitive information using a discriminator of the generative adversarial model according to the sample semantic characterization and the sample desensitization semantic characterization; and adjust model parameters of the generator and the discriminator according to the discrimination loss value, return to the step of extracting the first sample text from the sample text set, and obtain the trained generator when a preset training end condition is satisfied.
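The adversarial pre-training above can be illustrated on a deliberately tiny one-dimensional problem: a scalar generator shrinks the representation while a logistic discriminator tries to recover a sensitive bit from it, and the two are updated alternately on the same binary cross-entropy loss. Every modeling choice here (scalar parameters, the (1 − g)·x perturbation, the learning rate, the clamp) is an assumption made for illustration.

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

g = 0.0          # generator parameter: scales the representation toward zero
d, b = 0.5, 0.0  # discriminator: logistic regression on the representation
lr = 0.05

for step in range(2000):
    s = random.randint(0, 1)          # sensitive attribute of the sample text
    x = s + random.gauss(0.0, 0.3)    # toy "sample semantic representation"
    r = (1.0 - g) * x                 # noise-added (desensitized) representation
    p = sigmoid(d * r + b)            # discriminator's guess at s
    grad_logit = p - s                # d(BCE)/d(logit)
    if step % 2 == 0:
        # discriminator step: minimize the discrimination loss
        d -= lr * grad_logit * r
        b -= lr * grad_logit
    else:
        # generator step: maximize the same loss (adversarial training)
        g += lr * grad_logit * d * (-x)
        g = min(max(g, 0.0), 1.0)     # keep the toy perturbation bounded

# As g approaches 1, the representation carries no information about s.
print(round(g, 3))
```

The alternation between the two branches of the `if` is the "preset adversarial training strategy" in miniature: the discriminator descends the loss, the generator ascends it.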
Optionally, the first pre-training module is further configured to:
predicting corresponding first sensitive information according to the sample semantic characterization using the discriminator of the generative adversarial model; predicting corresponding second sensitive information according to the sample desensitization semantic characterization using the discriminator; and calculating the discrimination loss value of the sensitive information according to the first sensitive information and the second sensitive information.
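The patent does not specify how the two predictions are combined into the discrimination loss, so the following is one plausible assumption: score both the first and the second sensitive-information predictions against the true label with binary cross-entropy and sum them.

```python
import math

def bce(p, y):
    """Binary cross-entropy of a single prediction p against label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def discrimination_loss(p_first, p_second, label):
    """Hypothetical combination: sum of the discriminator's losses on
    the original semantic representation (first sensitive information)
    and the desensitized one (second sensitive information). The
    discriminator drives both terms down; the generator drives the
    second term up."""
    return bce(p_first, label) + bce(p_second, label)

# Confident prediction on the original text, near-chance prediction
# after desensitization: the loss is dominated by the second term.
loss = discrimination_loss(0.9, 0.55, 1)
print(round(loss, 4))  # 0.7032
```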
Optionally, the first pre-training module is further configured to:
alternately adjusting model parameters of the generator and the discriminator according to the discrimination loss value and a preset adversarial training strategy.
Optionally, the apparatus further comprises:
the second pre-training module is configured to calculate a semantic loss value by utilizing a pre-trained text decision model according to the sample semantic representation and the sample desensitization semantic representation; model parameters of the generator and the text decision model are adjusted according to the semantic loss value.
Optionally, the second pre-training module is further configured to:
performing semantic classification according to the sample semantic representation using a pre-trained text decision model to obtain a corresponding first semantic category; performing semantic classification according to the sample desensitization semantic representation using the text decision model to obtain a corresponding second semantic category; and calculating a semantic discrimination loss value according to the first semantic category and the second semantic category.
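One way the first and second semantic categories could be compared is a consistency loss over the two predicted category distributions; the softmax inputs and the cross-entropy form below are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_discrimination_loss(logits_original, logits_desensitized):
    """Cross-entropy of the desensitized prediction against the original
    one: small when the two semantic categories agree, i.e. when
    desensitization has preserved the task-relevant semantics."""
    p = softmax(logits_original)
    q = softmax(logits_desensitized)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

same = semantic_discrimination_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
shifted = semantic_discrimination_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
print(same < shifted)  # True
```

Minimizing this loss while maximizing the discrimination loss pushes the generator toward noise that hides sensitive information without changing the predicted category.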
Optionally, the second pre-training module is further configured to:
alternately adjusting model parameters of the generator and the text decision model according to the semantic loss value and a preset adversarial training strategy.
In the embodiments of this specification, a text to be processed is acquired; sequence encoding and semantic encoding are performed on the text to be processed to obtain a sequence representation and a semantic representation of the text to be processed; a corresponding noise disturbance is generated according to the sequence representation using a pre-trained desensitization model, the desensitization model being obtained through adversarial training for sensitive information recognition based on the sample semantic representation of a sample text and the sample desensitization semantic representation of the sample text; noise-adding processing is performed on the semantic representation using the noise disturbance to obtain a desensitized semantic representation; and text processing is performed according to the desensitized semantic representation to obtain a target processing result. Because the noise disturbance for the text to be processed is generated by the desensitization model obtained through adversarial training for sensitive information recognition, adding this noise to the semantic representation effectively masks the sensitive information in it and yields a desensitized semantic representation with high fairness; text processing based on this representation then produces a high-fairness target processing result. The desensitization depends only on the text to be processed and not on the specific text processing task, so it has good universality, reduces processing cost, and improves processing efficiency.
The above is an exemplary solution of the text processing apparatus of this embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept; for details of the text processing apparatus that are not described in detail, reference may be made to the description of the text processing method.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a text classification device, and fig. 9 shows a schematic structural diagram of a text classification device provided in one embodiment of the present disclosure. As shown in fig. 9, the apparatus includes:
a second obtaining module 902 configured to obtain text to be classified;
the second encoding module 904 is configured to perform sequence encoding and semantic encoding on the text to be classified respectively to obtain sequence characterization and semantic characterization of the text to be classified;
a second generating module 906 configured to generate a corresponding noise disturbance according to the sequence characterization using a pre-trained desensitization model, wherein the desensitization model is obtained through adversarial training for sensitive information recognition based on the sample semantic characterization of a sample text and the sample desensitization semantic characterization of the sample text;
A second noise adding module 908 configured to perform noise adding processing on the semantic representation using noise disturbance to obtain a desensitized semantic representation;
the classification module 910 is configured to perform text classification according to the desensitization semantic representation, so as to obtain a text classification result.
In the embodiments of this specification, a text to be classified is acquired; sequence encoding and semantic encoding are performed on the text to be classified to obtain a sequence representation and a semantic representation of the text to be classified; a corresponding noise disturbance is generated according to the sequence representation using a pre-trained desensitization model, the desensitization model being obtained through adversarial training for sensitive information recognition based on the sample semantic representation of a sample text and the sample desensitization semantic representation of the sample text; noise-adding processing is performed on the semantic representation using the noise disturbance to obtain a desensitized semantic representation; and text classification is performed according to the desensitized semantic representation to obtain a text classification result. Because the noise disturbance for the text to be classified is generated by the desensitization model obtained through adversarial training for sensitive information recognition, adding this noise to the semantic representation effectively masks the sensitive information in it and yields a desensitized semantic representation with high fairness; text classification based on this representation then produces a high-fairness classification result, reducing classification cost and improving classification efficiency.
The above is an exemplary solution of the text classification apparatus of this embodiment. It should be noted that the technical solution of the text classification apparatus and the technical solution of the text classification method belong to the same concept; for details of the text classification apparatus that are not described in detail, reference may be made to the description of the text classification method.
Corresponding to the method embodiment, the present disclosure further provides an emotion recognition device embodiment, and fig. 10 shows a schematic structural diagram of an emotion recognition device provided in one embodiment of the present disclosure. As shown in fig. 10, the apparatus includes:
a third obtaining module 1002 configured to obtain a text to be recognized;
the third encoding module 1004 is configured to perform sequence encoding and semantic encoding on the text to be identified respectively to obtain sequence characterization and semantic characterization of the text to be identified;
a third generating module 1006 configured to generate a corresponding noise disturbance according to the sequence characterization using a pre-trained desensitization model, wherein the desensitization model is obtained through adversarial training for sensitive information recognition based on the sample semantic characterization of a sample text and the sample desensitization semantic characterization of the sample text;
A third noise adding module 1008 configured to perform noise adding processing on the semantic representation using the noise disturbance to obtain a desensitized semantic representation;
and the recognition module 1010 is configured to perform emotion recognition according to the desensitization semantic representation to obtain an emotion recognition result.
In the embodiments of this specification, a text to be recognized is acquired; sequence encoding and semantic encoding are performed on the text to be recognized to obtain a sequence representation and a semantic representation of the text to be recognized; a corresponding noise disturbance is generated according to the sequence representation using a pre-trained desensitization model, the desensitization model being obtained through adversarial training for sensitive information recognition based on the sample semantic representation of a sample text and the sample desensitization semantic representation of the sample text; noise-adding processing is performed on the semantic representation using the noise disturbance to obtain a desensitized semantic representation; and emotion recognition is performed according to the desensitized semantic representation to obtain an emotion recognition result. Because the noise disturbance for the text to be recognized is generated by the desensitization model obtained through adversarial training for sensitive information recognition, adding this noise to the semantic representation effectively masks the sensitive information in it and yields a desensitized semantic representation with high fairness; emotion recognition based on this representation then produces a high-fairness recognition result, reducing recognition cost and improving recognition efficiency.
The above is an exemplary solution of the emotion recognition apparatus of this embodiment. It should be noted that the technical solution of the emotion recognition apparatus and the technical solution of the emotion recognition method belong to the same concept; for details of the emotion recognition apparatus that are not described in detail, reference may be made to the description of the emotion recognition method.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a data processing apparatus for text processing, and fig. 11 is a schematic structural diagram of a data processing apparatus for text processing according to one embodiment of the present disclosure. As shown in fig. 11, the apparatus is applied to cloud-side equipment, and the apparatus includes:
a fourth obtaining module 1102 configured to obtain a sample text set, wherein the sample text set comprises a plurality of sample texts;
an extraction module 1104 configured to extract a first sample text from a set of sample texts, wherein the first sample text is any one of a plurality of sample texts;
a fourth encoding module 1106 configured to perform sequence encoding and semantic encoding on the first sample text to obtain a sample sequence representation and a sample semantic representation;
a fourth generation module 1108 configured to generate a corresponding sample noise disturbance according to the sample sequence characterization using a generator of a generative adversarial model;
A fourth noise adding module 1110 configured to perform noise adding processing on the sample semantic representation by using sample noise disturbance to obtain a sample desensitization semantic representation;
a calculation module 1112 configured to calculate a discrimination loss value of sensitive information using a discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitization semantic representation;
a training module 1114 configured to adjust model parameters of the generator and the discriminator according to the discrimination loss value, return to the step of extracting the first sample text from the sample text set, and obtain the trained generator when a preset training end condition is satisfied;
a transmitting module 1116 configured to transmit the model parameters of the generator to the end-side device.
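A minimal sketch of what transmitting the trained generator's model parameters to the end-side device might look like; the JSON transport and the parameter names (`W_gen`, `b_gen`) are hypothetical, not specified by the patent.

```python
import json

def export_generator_params(params: dict) -> bytes:
    """Cloud-side: serialize the generator's parameters (e.g. weight
    matrices as nested lists) for transmission to the end-side device."""
    return json.dumps(params, sort_keys=True).encode("utf-8")

def load_generator_params(blob: bytes) -> dict:
    """End-side: restore the parameters and rebuild the generator."""
    return json.loads(blob.decode("utf-8"))

params = {"W_gen": [[0.12, -0.5], [0.33, 0.08]], "b_gen": [0.0, 0.1]}
restored = load_generator_params(export_generator_params(params))
print(restored == params)  # True
```

The end-side device thus receives only the generator, never the sample texts or the discriminator, which is what allows pre-training to stay entirely on the cloud side.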
In the embodiments of this specification, a sample text set comprising a plurality of sample texts is obtained; a first sample text, which is any one of the plurality of sample texts, is extracted from the sample text set; sequence encoding and semantic encoding are performed on the first sample text to obtain a sample sequence representation and a sample semantic representation; a corresponding sample noise disturbance is generated according to the sample sequence representation using a generator of a generative adversarial model; noise-adding processing is performed on the sample semantic representation using the sample noise disturbance to obtain a sample desensitization semantic representation; a discrimination loss value of sensitive information is calculated using a discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitization semantic representation; model parameters of the generator and the discriminator are adjusted according to the discrimination loss value, and the step of extracting the first sample text from the sample text set is performed again; the trained generator is obtained when a preset training end condition is satisfied, and the model parameters of the generator are sent to the end-side device.
According to the sample sequence representation, the generator of the generative adversarial model generates a noise disturbance for the sample text; adding this noise to the sample semantic representation masks the sensitive information in it and yields the sample desensitization semantic representation. The discriminator of the generative adversarial model then calculates a discrimination loss value of the sensitive information from the sample semantic representation and the sample desensitization semantic representation, and this loss value is used to adversarially train the generator and the discriminator, producing a generator that generates highly targeted noise disturbances so that subsequent text processing can obtain high-fairness target processing results. Because the entire pre-training process runs on the cloud-side device, model training cost is saved for the end-side device and model training efficiency is improved.
The above is an exemplary solution of the data processing apparatus for text processing of this embodiment. It should be noted that the technical solution of the data processing apparatus for text processing and the technical solution of the data processing method for text processing belong to the same concept; for details of the apparatus that are not described in detail, reference may be made to the description of the data processing method for text processing.
FIG. 12 illustrates a block diagram of a computing device provided in one embodiment of the present description. The components of computing device 1200 include, but are not limited to, memory 1210 and processor 1220. Processor 1220 is coupled to memory 1210 by bus 1230 and database 1250 is used to store data.
The computing device 1200 also includes an access device 1240 that enables the computing device 1200 to communicate via one or more networks 1260. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 1240 may include one or more of any type of wired or wireless network interface, such as a network interface card (NIC), an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.
In one embodiment of the present description, the above components of computing device 1200, as well as other components not shown in fig. 12, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 12 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1200 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet, personal digital assistant, laptop, notebook, or netbook), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch or smart glasses), another type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC). Computing device 1200 may also be a mobile or stationary server.
Wherein the processor 1220 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the data processing method described above. The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solutions of the text processing method, the text classification method, the emotion recognition method, and the data processing method for text processing belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solutions of the text processing method, the text classification method, the emotion recognition method, or the data processing method for text processing.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text processing method, the text classification method, the emotion recognition method, or the data processing method of text processing described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the text processing method, the text classification method, the emotion recognition method, and the technical solution of the data processing method for text processing belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the text processing method, the text classification method, the emotion recognition method, or the data processing method for text processing.
An embodiment of the present disclosure further provides a computer program, wherein the computer program when executed in a computer causes the computer to perform the steps of the text processing method, the text classification method, the emotion recognition method, or the data processing method for text processing described above.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the text processing method, the text classification method, the emotion recognition method, and the technical solution of the data processing method for text processing belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the text processing method, the text classification method, the emotion recognition method, or the data processing method for text processing.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of this specification disclosed above are merely intended to help clarify this specification. The alternative embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and use the invention well. This specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A text processing method, comprising:
acquiring a text to be processed;
respectively carrying out sequence coding and semantic coding on the text to be processed to obtain sequence characterization and semantic characterization of the text to be processed;
generating a corresponding noise disturbance according to the sequence characterization by using a pre-trained desensitization model, wherein the desensitization model is obtained through adversarial training for sensitive information recognition based on a sample semantic characterization of a sample text and a sample desensitization semantic characterization of the sample text;
carrying out noise adding processing on the semantic representation by utilizing the noise disturbance to obtain a desensitized semantic representation;
and carrying out text processing according to the desensitization semantic representation to obtain a target processing result.
2. The method of claim 1, wherein the desensitization model is a generator of a generative adversarial model.
3. The method according to claim 1 or 2, wherein the text processing according to the desensitization semantic representation to obtain a target processing result comprises:
and according to the desensitization semantic representation, performing text processing by using a pre-trained text decision model to obtain a target processing result.
4. The method of claim 2, further comprising, prior to said generating a corresponding noise disturbance from said sequence characterization using a pre-trained desensitization model:
Obtaining a sample text set, wherein the sample text set comprises a plurality of sample texts;
extracting a first sample text from the sample text set, wherein the first sample text is any one of the plurality of sample texts;
performing sequence coding and semantic coding on the first sample text to obtain a sample sequence characterization and a sample semantic characterization;
generating a corresponding sample noise disturbance according to the sample sequence characterization by using a generator of a generative adversarial model;
carrying out noise adding processing on the sample semantic representation by utilizing the sample noise disturbance to obtain a sample desensitization semantic representation;
calculating a discrimination loss value of the sensitive information by using a discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitization semantic representation;
and adjusting model parameters of the generator and the discriminator according to the discrimination loss value, returning to the step of extracting the first sample text from the sample text set, and obtaining the trained generator when a preset training end condition is satisfied.
5. The method according to claim 4, wherein the calculating, according to the sample semantic representation and the sample desensitization semantic representation, a discrimination loss value of the sensitive information by using the discriminator of the generative adversarial model comprises:
predicting corresponding first sensitive information according to the sample semantic representation by using the discriminator of the generative adversarial model;
predicting corresponding second sensitive information according to the sample desensitization semantic representation by using the discriminator;
and calculating to obtain a discrimination loss value of the sensitive information according to the first sensitive information and the second sensitive information.
6. The method of claim 4, wherein the adjusting model parameters of the generator and the discriminator according to the discrimination loss value comprises:
alternately adjusting model parameters of the generator and the discriminator according to the discrimination loss value and a preset adversarial training strategy.
7. The method of claim 4, further comprising, prior to the returning to perform the step of extracting the first sample text from the set of sample texts:
according to the sample semantic representation and the sample desensitization semantic representation, calculating to obtain a semantic loss value by using a pre-trained text decision model;
and adjusting model parameters of the generator and the text decision model according to the semantic loss value.
8. The method of claim 7, the calculating a semantic loss value using a pre-trained text decision model from the sample semantic representation and the sample desensitization semantic representation, comprising:
According to the sample semantic representation, semantic classification is carried out by utilizing a pre-trained text decision model, and a corresponding first semantic category is obtained;
according to the sample desensitization semantic representation, semantic classification is carried out by utilizing the text decision model, and a corresponding second semantic category is obtained;
and calculating according to the first semantic category and the second semantic category to obtain a semantic discrimination loss value.
9. The method of claim 7, the adjusting model parameters of the generator and the text decision model according to the semantic loss value, comprising:
alternately adjusting model parameters of the generator and the text decision model according to the semantic loss value and a preset adversarial training strategy.
10. A text classification method, comprising:
acquiring a text to be classified;
performing sequence encoding and semantic encoding on the text to be classified, respectively, to obtain a sequence representation and a semantic representation of the text to be classified;
generating a corresponding noise perturbation using a pre-trained desensitization model according to the sequence representation, wherein the desensitization model is obtained through adversarial training for sensitive-information discrimination based on a sample semantic representation of a sample text and a sample desensitized semantic representation of the sample text;
performing noise-adding processing on the semantic representation using the noise perturbation to obtain a desensitized semantic representation;
and performing text classification according to the desensitized semantic representation to obtain a text classification result.
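The inference pipeline of claim 10 can be sketched as a composition of the claimed steps. This is an illustrative skeleton only: the encoders, desensitization model, and classifier are passed in as opaque callables (all names hypothetical), and the noise-adding step is shown as element-wise addition, which the patent text implies but does not mandate.

```python
def classify_desensitized(text, seq_encode, sem_encode, desens_model, classifier):
    """Run the claimed pipeline: encode, generate noise from the
    sequence representation, add it to the semantic representation,
    then classify the desensitized representation."""
    seq_repr = seq_encode(text)      # sequence representation
    sem_repr = sem_encode(text)      # semantic representation
    noise = desens_model(seq_repr)   # per-text noise perturbation
    desens_repr = [s + n for s, n in zip(sem_repr, noise)]  # noise-adding
    return classifier(desens_repr)   # result from desensitized repr only
```

Note that the classifier never sees the raw semantic representation, which is the point of the scheme: downstream tasks operate on the desensitized vector. Claim 11 (emotion recognition) follows the identical pipeline with a different final head.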
11. An emotion recognition method, comprising:
acquiring a text to be recognized;
performing sequence encoding and semantic encoding on the text to be recognized, respectively, to obtain a sequence representation and a semantic representation of the text to be recognized;
generating a corresponding noise perturbation using a pre-trained desensitization model according to the sequence representation, wherein the desensitization model is obtained through adversarial training for sensitive-information discrimination based on a sample semantic representation of a sample text and a sample desensitized semantic representation of the sample text;
performing noise-adding processing on the semantic representation using the noise perturbation to obtain a desensitized semantic representation;
and performing emotion recognition according to the desensitized semantic representation to obtain an emotion recognition result.
12. A data processing method for text processing, applied to a cloud-side device and comprising:
acquiring a sample text set, wherein the sample text set comprises a plurality of sample texts;
extracting a first sample text from the sample text set, wherein the first sample text is any one of the plurality of sample texts;
performing sequence encoding and semantic encoding on the first sample text to obtain a sample sequence representation and a sample semantic representation;
generating a corresponding sample noise perturbation using a generator of a generative adversarial model according to the sample sequence representation;
performing noise-adding processing on the sample semantic representation using the sample noise perturbation to obtain a sample desensitized semantic representation;
calculating a discrimination loss value for sensitive information using a discriminator of the generative adversarial model according to the sample semantic representation and the sample desensitized semantic representation;
adjusting model parameters of the generator and the discriminator according to the discrimination loss value, returning to perform the step of extracting the first sample text from the sample text set, and obtaining the trained generator when a preset training-end condition is met;
and sending the model parameters of the generator to an end-side device.
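The final step of claim 12 hands the trained generator's parameters from the cloud-side device to an end-side device. The patent does not specify a serialization or transport format; the sketch below uses JSON purely for illustration, with hypothetical function names and a list of floats standing in for the generator's parameters (the transport layer itself is not shown).

```python
import json

def export_generator_params(params):
    """Cloud-side step: after the training-end condition is met,
    serialize the trained generator's parameters into a payload
    that can be sent to an end-side device."""
    return json.dumps({"model": "desensitization_generator",
                       "params": params})

def import_generator_params(payload):
    """End-side counterpart: restore the generator parameters
    from the received payload."""
    return json.loads(payload)["params"]
```

A round trip through export and import recovers the same parameter values, which is the only property the claimed hand-off requires.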
13. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the text processing method of any one of claims 1 to 9, the text classification method of claim 10, the emotion recognition method of claim 11, or the data processing method for text processing of claim 12.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the text processing method of any one of claims 1 to 9, the text classification method of claim 10, the emotion recognition method of claim 11, or the data processing method for text processing of claim 12.
CN202310147898.4A 2023-02-10 2023-02-10 Text processing method, text classification method and emotion recognition method Pending CN116384405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310147898.4A CN116384405A (en) 2023-02-10 2023-02-10 Text processing method, text classification method and emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310147898.4A CN116384405A (en) 2023-02-10 2023-02-10 Text processing method, text classification method and emotion recognition method

Publications (1)

Publication Number Publication Date
CN116384405A true CN116384405A (en) 2023-07-04

Family

ID=86979630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310147898.4A Pending CN116384405A (en) 2023-02-10 2023-02-10 Text processing method, text classification method and emotion recognition method

Country Status (1)

Country Link
CN (1) CN116384405A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579339A (en) * 2023-07-12 2023-08-11 阿里巴巴(中国)有限公司 Task execution method and optimization task execution method
CN116579339B (en) * 2023-07-12 2023-11-14 阿里巴巴(中国)有限公司 Task execution method and optimization task execution method

Similar Documents

Publication Publication Date Title
CN110046221B (en) Machine dialogue method, device, computer equipment and storage medium
WO2020228376A1 (en) Text processing method and model training method and apparatus
CN109977201B (en) Machine chat method and device with emotion, computer equipment and storage medium
CN111897941A (en) Dialog generation method, network training method, device, storage medium and equipment
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN108932342A A semantic matching method, model learning method, and server
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113239169B (en) Answer generation method, device, equipment and storage medium based on artificial intelligence
CN110879938A (en) Text emotion classification method, device, equipment and storage medium
CN114780831A (en) Sequence recommendation method and system based on Transformer
CN116881428A (en) Language model training method and device
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN116384405A (en) Text processing method, text classification method and emotion recognition method
CN117291185A (en) Task processing method, entity identification method and task processing data processing method
CN117436480A (en) Large model under Mindspore frame and recommendation method
CN117093864A (en) Text generation model training method and device
KR20240034804A (en) Evaluating output sequences using an autoregressive language model neural network
CN112749530B (en) Text encoding method, apparatus, device and computer readable storage medium
CN114254622A (en) Intention identification method and device
CN109918486B (en) Corpus construction method and device for intelligent customer service, computer equipment and storage medium
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
CN114692610A (en) Keyword determination method and device
CN117272937B (en) Text coding model training method, device, equipment and storage medium
CN117521674B (en) Method, device, computer equipment and storage medium for generating countermeasure information
CN117711001B (en) Image processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination