CN112800465A - Method and device for processing text data to be labeled, electronic equipment and medium - Google Patents

Info

Publication number
CN112800465A
Authority
CN
China
Prior art keywords
text data
defined keywords
machine learning
learning model
labeled
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110176118.XA
Other languages
Chinese (zh)
Inventor
张晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202110176118.XA
Publication of CN112800465A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a method, a device, electronic equipment and a medium for processing text data to be labeled, wherein the method comprises the following steps: acquiring a machine learning model for identifying user-defined keywords in text data; identifying each user-defined keyword in the text data to be labeled based on the machine learning model; encrypting each identified user-defined keyword to obtain a ciphertext corresponding to each user-defined keyword; and replacing each user-defined keyword in the text data to be labeled with its corresponding ciphertext to obtain desensitized text data to be labeled.

Description

Method and device for processing text data to be labeled, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing text data to be labeled, an electronic device, and a computer-readable storage medium.
Background
In text data labeling, the workload is so large that the work is often outsourced to third-party data labeling organizations; however, the data is then exposed to a risk of privacy disclosure while it is being labeled at those organizations.
In the related art, techniques for the data labeling process mainly focus on assisting manual work to improve labeling efficiency, and rarely address data desensitization during labeling.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a new technical solution for processing text data to be labeled.
According to a first aspect of the present disclosure, there is provided a method for processing text data to be labeled, including:
acquiring a machine learning model for identifying user-defined keywords in text data;
identifying each user-defined keyword in the text data to be labeled based on the machine learning model;
encrypting each identified user-defined keyword to obtain a ciphertext corresponding to each user-defined keyword;
and replacing each user-defined keyword in the text data to be labeled with its corresponding ciphertext to obtain desensitized text data to be labeled.
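The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `stub_identify` is a hypothetical stand-in for the trained machine learning model (it only finds 11-digit numbers), and `encrypt` assumes a truncated SHA-256 digest as the ciphertext.

```python
import hashlib
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def stub_identify(text: str) -> List[Span]:
    """Hypothetical stand-in for the trained model: finds 11-digit numbers."""
    return [m.span() for m in re.finditer(r"(?<!\d)\d{11}(?!\d)", text)]


def encrypt(keyword: str) -> str:
    """Assumed cipher: the first 10 hex characters of a SHA-256 digest."""
    return hashlib.sha256(keyword.encode("utf-8")).hexdigest()[:10]


def desensitize(text: str,
                identify: Callable[[str], List[Span]] = stub_identify) -> str:
    """Identify keywords, encrypt each one, and substitute the ciphertexts."""
    spans = sorted(identify(text))
    # Replace from right to left so that earlier offsets remain valid.
    for start, end in reversed(spans):
        text = text[:start] + encrypt(text[start:end]) + text[end:]
    return text
```

Passing a different `identify` callable, for example a wrapper around a trained sequence-labeling model, leaves the encryption and replacement steps unchanged.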
Optionally, the user-defined keyword relates to at least one of an entity name, an entity relationship, a mobile phone number, an account name, and an account password.
Optionally, the obtaining a machine learning model for identifying a custom keyword in text data includes:
acquiring a training text data set according to a set acquisition rule;
training the machine learning model by using the training text data set based on a deep learning algorithm;
wherein the set acquisition rule satisfies any one or more of the following items:
the total number of characters included in the training text data set exceeds a first set number;
the total number of words included in the training text data set exceeds a second set number;
the total number of user-defined keywords of each type included in the training text data set exceeds a third set number.
Optionally, after the training text data set is used to train the machine learning model based on the deep learning algorithm, the method further includes:
obtaining, according to the machine learning model, each user-defined keyword in each piece of verification text data in a verification text data set as a predicted user-defined keyword;
comparing each predicted user-defined keyword in each piece of verification text data with the corresponding actual user-defined keyword to obtain an evaluation index value of the machine learning model;
and in the case that the evaluation index value is greater than or equal to an evaluation index threshold, identifying each user-defined keyword in the text data to be labeled based on the machine learning model.
Optionally, the method further comprises:
and in the case that the evaluation index value is smaller than the evaluation index threshold, retraining the machine learning model after adjusting at least one of the number of training iterations of the machine learning model and the values of the hyper-parameters of the machine learning model.
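The verification step can be illustrated with a short sketch. The source does not name the evaluation index in this passage; the F1 score over exactly matching keywords is assumed here, and the names `f1_score` and `model_is_acceptable` are hypothetical.

```python
from typing import List, Set, Tuple

Keyword = Tuple[int, int, str]  # (start offset, end offset, keyword type)


def f1_score(predicted: List[Set[Keyword]], actual: List[Set[Keyword]]) -> float:
    """F1 over all verification texts; a prediction counts only on exact match."""
    true_pos = sum(len(p & a) for p, a in zip(predicted, actual))
    n_pred = sum(len(p) for p in predicted)
    n_actual = sum(len(a) for a in actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_actual
    return 2 * precision * recall / (precision + recall)


def model_is_acceptable(score: float, threshold: float) -> bool:
    """True: use the model for identification. False: adjust and retrain."""
    return score >= threshold
```

With three predicted keywords of which two match the two actual keywords, precision is 2/3, recall is 1, and the F1 score is 0.8.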
Optionally, the deep learning algorithm is a BERT-CRF algorithm.
Optionally, after the identifying each user-defined keyword in the text data to be labeled, the method further includes:
acquiring the total number of identified user-defined keywords;
and in the case that the total number of identified user-defined keywords is greater than zero, encrypting each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the method further comprises:
and in the case that the total number of identified user-defined keywords is zero, first inputting the text data to be labeled into a confidential environment, re-identifying each user-defined keyword in the text data to be labeled in the confidential environment, and then encrypting each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the encrypting each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword includes:
encrypting each identified user-defined keyword based on a preset encryption algorithm to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the preset encryption algorithm includes any one of a random perturbation encryption algorithm and a hash algorithm.
Optionally, before the encrypting each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword, the method further includes:
storing, in a security device, a mapping relation between each user-defined keyword in the text data to be labeled and its position information in the text data to be labeled.
Optionally, after the obtaining of the desensitized text data to be labeled, the method further includes:
labeling the desensitized text data to be labeled.
Optionally, after the labeling of the desensitized text data to be labeled, the method further includes:
acquiring the mapping relation from the security device;
traversing the labeled text data according to the mapping relation, and replacing each ciphertext with its corresponding user-defined keyword to obtain decrypted, labeled text data.
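The store-then-restore flow can be sketched as below. Several things are assumptions made for illustration: an in-memory dictionary stands in for the security device, the mapping is keyed by ciphertext rather than by position (positions may shift once labels are added), and `toy_encrypt` is a hypothetical cipher based on a truncated SHA-256 digest.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def toy_encrypt(keyword: str) -> str:
    """Assumed cipher: the first 10 hex characters of a SHA-256 digest."""
    return hashlib.sha256(keyword.encode("utf-8")).hexdigest()[:10]


def desensitize_with_mapping(
    text: str, spans: List[Span], encrypt: Callable[[str], str]
) -> Tuple[str, Dict[str, str]]:
    """Replace each span with its ciphertext and record ciphertext -> keyword."""
    mapping: Dict[str, str] = {}
    # Work from right to left so that earlier offsets remain valid.
    for start, end in sorted(spans, reverse=True):
        keyword = text[start:end]
        cipher = encrypt(keyword)
        mapping[cipher] = keyword
        text = text[:start] + cipher + text[end:]
    return text, mapping


def restore(labeled_text: str, mapping: Dict[str, str]) -> str:
    """Replace every ciphertext in the labeled text with its original keyword."""
    for cipher, keyword in mapping.items():
        labeled_text = labeled_text.replace(cipher, keyword)
    return labeled_text
```

Because the labeling organization only ever sees the ciphertexts, the mapping must stay on the trusted side; restoration happens after the labeled text is returned.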
According to a second aspect of the present disclosure, there is also provided a processing apparatus for text data to be annotated, including:
an acquisition module, configured to acquire a machine learning model for identifying user-defined keywords in text data;
an identification module, configured to identify each user-defined keyword in the text data to be labeled based on the machine learning model;
an encryption module, configured to encrypt each identified user-defined keyword to obtain a ciphertext corresponding to each user-defined keyword;
and a replacing module, configured to replace each user-defined keyword in the text data to be labeled with its corresponding ciphertext to obtain desensitized text data to be labeled.
Optionally, the user-defined keyword relates to at least one of an entity name, an entity relationship, a mobile phone number, an account name, and an account password.
Optionally, the obtaining module is specifically configured to:
acquiring a training text data set according to a set acquisition rule;
training the machine learning model by using the training text data set based on a deep learning algorithm;
wherein the set acquisition rule satisfies any one or more of the following items:
the total number of characters included in the training text data set exceeds a first set number;
the total number of words included in the training text data set exceeds a second set number;
the total number of user-defined keywords of each type included in the training text data set exceeds a third set number.
Optionally, the apparatus further comprises a verification module configured to:
obtaining, according to the machine learning model, each user-defined keyword in each piece of verification text data in a verification text data set as a predicted user-defined keyword;
comparing each predicted user-defined keyword in each piece of verification text data with the corresponding actual user-defined keyword to obtain an evaluation index value of the machine learning model;
and in the case that the evaluation index value is greater than or equal to an evaluation index threshold, causing the identification module to identify each user-defined keyword in the text data to be labeled based on the machine learning model.
Optionally, the verification module is further configured to:
and in the case that the evaluation index value is smaller than the evaluation index threshold, retraining the machine learning model after adjusting at least one of the number of training iterations of the machine learning model and the values of the hyper-parameters of the machine learning model.
Optionally, the deep learning algorithm is a BERT-CRF algorithm.
Optionally, the obtaining module is further configured to:
acquiring the total number of identified user-defined keywords;
and in the case that the total number of identified user-defined keywords is greater than zero, causing the encryption module to encrypt each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the obtaining module is further configured to:
in the case that the total number of identified user-defined keywords is zero, first inputting the text data to be labeled into a confidential environment, so that after each user-defined keyword in the text data to be labeled is re-identified in the confidential environment, the encryption module encrypts each identified user-defined keyword to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the encryption module is specifically configured to:
encrypting each identified user-defined keyword based on a preset encryption algorithm to obtain the ciphertext corresponding to each user-defined keyword.
Optionally, the preset encryption algorithm includes any one of a random perturbation encryption algorithm and a hash algorithm.
Optionally, the apparatus further comprises a storage module configured to:
storing, in a security device, a mapping relation between each user-defined keyword in the text data to be labeled and its position information in the text data to be labeled.
Optionally, the apparatus further comprises a labeling module configured to:
and labeling the desensitized text data to be labeled.
Optionally, the apparatus further comprises a decryption module configured to:
acquiring the mapping relation from the security device;
traversing the labeled text data according to the mapping relation, and replacing each ciphertext with its corresponding user-defined keyword to obtain decrypted, labeled text data.
According to a third aspect of the present disclosure, there is also provided an electronic device comprising at least one computing device and at least one storage device, wherein the at least one storage device is configured to store instructions for controlling the at least one computing device to perform the method according to the above first aspect; alternatively, the electronic device implements the apparatus according to the above second aspect through the computing device and the storage device.
According to a fourth aspect of the present disclosure, there is also provided a computer readable storage medium, wherein a computer program is stored thereon, which when executed by a processor, implements the method as described above in the first aspect.
One advantageous effect of the present disclosure is that, according to the method, apparatus, electronic device, and medium of the embodiments of the present disclosure, each user-defined keyword in the text data to be labeled is identified by a machine learning model trained to identify user-defined keywords in text data, which can improve the accuracy of the extracted user-defined keywords. Meanwhile, after the user-defined keywords in the text data to be labeled are identified, each of them is encrypted to obtain a corresponding ciphertext, and each user-defined keyword in the text data to be labeled is replaced by its ciphertext to obtain desensitized text data to be labeled. The storage space of the desensitized text data to be labeled is far smaller than that of the original text data to be labeled, so that the data volume can be reduced, the key information can be desensitized, and the security of information transmission is ensured.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a hardware architecture diagram of an electronic device according to an embodiment of the disclosure;
FIG. 2 is a flow chart of a method for processing text data to be labeled according to an embodiment of the disclosure;
FIG. 3 is a flow chart of a method for processing text data to be labeled according to another embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for processing text data to be annotated according to an example of the present disclosure;
FIG. 5 is a functional block diagram of a processing device for text data to be annotated according to an embodiment of the present disclosure;
FIG. 6 is a functional block diagram of an electronic device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
The method of the embodiments of the present disclosure may be implemented by at least one electronic device, i.e., the apparatus 5000 for implementing the method may be disposed on the at least one electronic device. Fig. 1 shows a hardware structure of an arbitrary electronic device. The electronic device shown in fig. 1 may be a portable computer, a desktop computer, a workstation, a server, or the like, or may be any other device having a computing device such as a processor and a storage device such as a memory, and is not limited herein.
As shown in fig. 1, the electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 is adapted to execute computer programs. The computer program may be written in an instruction set of an architecture such as x86, Arm, RISC, MIPS, SSE, etc. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1400 is capable of wired or wireless communication, for example, and may specifically include Wi-Fi communication, Bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. The electronic device 1000 may output voice information through the speaker 1700 and collect voice information through the microphone 1800.
The electronic device shown in fig. 1 is merely illustrative and is in no way meant to limit the invention, its application, or uses. In the embodiment of the present disclosure, the memory 1200 of the electronic device 1000 is used for storing instructions, which are used for controlling the processor 1100 to operate so as to execute the processing method of the text data to be labeled according to the embodiment of the present disclosure. The skilled person can design the instructions according to the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
In one embodiment, an electronic device is provided that includes at least one computing device and at least one storage device for storing instructions for controlling the at least one computing device to perform a method according to any embodiment of the present disclosure.
The apparatus may include at least one electronic device 1000 as shown in fig. 1 to provide at least one computing device, such as a processor, and at least one storage device, such as a memory, without limitation.
< method examples >
Fig. 2 is a flowchart illustrating a method for processing text data to be annotated, which is executed by the electronic device 1000 according to an embodiment of the present disclosure, and as shown in fig. 2, the method may include the following steps S2100 to S2400:
Step S2100, acquiring a machine learning model for identifying user-defined keywords in text data.
In this embodiment, the minimum granularity at which the text data is processed is not a single character but a word; the words in the text data may include user-defined keywords and other words.
The user-defined keyword can be a sensitive word predefined according to actual needs and actual application scenes. The custom keyword may relate to at least one of an entity name, an entity relationship, a cell phone number, an account name, and an account password, for example.
The above entity names may be a person name (person), an organization name (organization), a place name (location), and other entities identified by names.
The machine learning model may be a neural network model, such as but not limited to a BP (back propagation) neural network model or a convolutional neural network model; the machine learning model may also be a logistic regression model, and is not specifically limited herein.
In this embodiment, the obtaining of the machine learning model for identifying the user-defined keyword in the text data in step S2100 may further include steps S2110 to S2120:
Step S2110, acquiring a training text data set according to a set acquisition rule.
In step S2110, the set acquisition rule may satisfy any one or more of the following items: the total number of characters included in the training text data set exceeds a first set number; the total number of words included in the training text data set exceeds a second set number; the total number of user-defined keywords of each type included in the training text data set exceeds a third set number.
The first set quantity, the second set quantity and the third set quantity may be values set according to an actual application scenario and an actual requirement, and the embodiment is not limited herein.
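The acquisition rule can be illustrated with a small check. This sketch reads the first two conditions as a character count and a word count respectively, which is an assumption about the intended distinction; the function name, the input layout, and the threshold parameters are hypothetical.

```python
from typing import Dict, List


def acquisition_checks(
    samples: List[List[str]],        # each sample: a list of segmented words
    keyword_types: List[List[str]],  # each sample: types of its labeled keywords
    first_n: int,
    second_n: int,
    third_n: int,
) -> Dict[str, bool]:
    """Evaluate each of the three conditions; the rule may require any or all."""
    char_total = sum(len(word) for sample in samples for word in sample)
    word_total = sum(len(sample) for sample in samples)
    per_type: Dict[str, int] = {}
    for sample in keyword_types:
        for kw_type in sample:
            per_type[kw_type] = per_type.get(kw_type, 0) + 1
    return {
        "characters": char_total > first_n,
        "words": word_total > second_n,
        # Every keyword type present must exceed the third set number.
        "keywords_per_type": bool(per_type)
        and all(count > third_n for count in per_type.values()),
    }
```

Returning the three results separately lets the caller decide which combination of conditions the set acquisition rule actually requires.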
Step S2120, training the machine learning model by using the training text data set based on a deep learning algorithm.
The deep learning algorithm may be a BERT-CRF algorithm.
Specifically, training data sets corresponding to a plurality of application scenarios may be stored in advance in the electronic device that executes the embodiment of the present disclosure, where the training data may be, for example, image data, text data, or voice data. According to the application scenario entered through a provided application scenario setting entry, the training data set matching that application scenario is obtained for machine learning training, so that the resulting machine learning model is applicable to the entered application scenario. After the machine learning model corresponding to the application scenario is obtained, the final machine learning model may be supplied to the application item matching that application scenario, so that the sample information in the application item is processed by using the final machine learning model.
The more training data there is, the more accurate the training result; however, once the training data reaches a certain amount, the accuracy of the training result increases more and more slowly until it levels off. The amount of training data required can therefore be determined by weighing the accuracy of the training result against the data processing cost.
Taking user-defined keyword identification as the application scenario, the training data is text data in which each user-defined keyword has been labeled; the machine learning model may be trained on this training data based on a deep learning algorithm, and the trained model is used to identify each user-defined keyword in text data.
After obtaining the machine learning model for identifying the custom keywords in the text data, entering:
Step S2200, identifying each user-defined keyword in the text data to be labeled based on the machine learning model.
In this embodiment, any piece of text data to be labeled may be input into the machine learning model to obtain each user-defined keyword in that text data. For example, the identified user-defined keywords include, but are not limited to, names, relatives, company names, account numbers, mobile phone numbers, and the like. Because a pre-trained machine learning model for identifying user-defined keywords in text data is used to identify each user-defined keyword in the text data to be labeled, the accuracy of the extracted user-defined keywords can be improved.
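How a sequence-labeling model's output becomes a list of keywords can be sketched as follows. The source does not specify the tag scheme; the sketch assumes the model emits one BIO tag per token (`B-TYPE` begins a keyword, `I-TYPE` continues it, `O` is outside), a common convention for models of the BERT-CRF kind. The function name `decode_bio` is hypothetical.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str], sep: str = " ") -> List[Tuple[str, str]]:
    """Group tokens tagged B-X / I-X into (keyword text, keyword type) pairs."""
    keywords: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # An I- tag extends the open keyword only if the type matches.
        if tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
            continue
        # Anything else closes the open keyword, if there is one.
        if current:
            keywords.append((sep.join(current), current_type))
            current, current_type = [], None
        if tag.startswith("B-"):
            current, current_type = [token], tag[2:]
        # "O" tags and stray I- tags without a preceding B- are ignored.
    if current:
        keywords.append((sep.join(current), current_type))
    return keywords
```

For text without spaces between tokens, such as Chinese, `sep=""` joins the tokens directly.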
In one example, after each user-defined keyword in the text data to be labeled is identified based on the machine learning model, the following step of encrypting each identified user-defined keyword to obtain the corresponding ciphertext may be performed directly.
In one example, after each user-defined keyword in the text data to be labeled is identified, the total number of identified user-defined keywords may be obtained, and the following step of encrypting each identified user-defined keyword to obtain the corresponding ciphertext is performed only when that total number is greater than zero.
In this example, performing the encryption step only when the total number of user-defined keywords in the text data to be labeled is greater than zero avoids running the subsequent encryption step when no user-defined keyword has been identified in the text data to be labeled.
In one example, after each user-defined keyword in the text data to be labeled is identified, the total number of identified user-defined keywords is obtained first. In the case that the total number is zero, the text data to be labeled is first input into a confidential environment so that each user-defined keyword in it is re-identified there, and the following step of encrypting each identified user-defined keyword to obtain the corresponding ciphertext is then performed.
The confidential environment refers to an operating environment in which text data is handled under supervision and security control, such as but not limited to an operating room whose space is isolated by an access control system, which is monitored by cameras, and whose computer network is isolated.
In this example, when the total number of user-defined keywords identified in the text data to be labeled is zero, the text data to be labeled may be returned to the confidential environment, where the user-defined keywords are extracted manually before the following step of encrypting each identified user-defined keyword to obtain the corresponding ciphertext is performed. That is, when the machine learning model identifies no user-defined keyword, the user-defined keywords in the text data to be labeled can be extracted manually and then encrypted, which makes information transmission reliable.
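The branch on the total number of identified keywords can be sketched as below. Both callables are hypothetical placeholders: `identify` stands for the machine learning model and `manual_extract` for manual extraction carried out inside the confidential environment.

```python
from typing import Callable, List


def keywords_to_encrypt(
    text: str,
    identify: Callable[[str], List[str]],
    manual_extract: Callable[[str], List[str]],
) -> List[str]:
    """Return the keywords that the encryption step should receive."""
    keywords = identify(text)
    if len(keywords) > 0:
        # The model found at least one keyword: proceed straight to encryption.
        return keywords
    # The model found nothing: fall back to extraction in the confidential
    # environment before anything leaves the trusted side.
    return manual_extract(text)
```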
After identifying the respective defined keywords in the text data to be labeled based on the machine learning model, entering:
step S2300, encrypting the identified respective defined keywords respectively to obtain respective ciphertexts corresponding to the respective defined keywords respectively.
In this embodiment, the identified respective defined keywords may be encrypted based on a preset encryption algorithm, so as to obtain respective ciphertexts corresponding to the respective defined keywords. According to the embodiment, the user-defined keywords are encrypted, so that the data volume can be reduced, the key information can be desensitized, and the information transmission safety is guaranteed.
The preset encryption algorithm may include any one of a random perturbation encryption algorithm and a hash algorithm.
The hash algorithm encrypts a custom keyword by using a cryptographic hash function. The cryptographic hash function may be any function capable of implementing hash processing, such as, but not limited to, the MD5 message-digest algorithm, the Digital Signature Algorithm (DSA), and the PBKDF2 algorithm. The identified custom keywords may all be encrypted with the same cryptographic hash function, or different custom keywords may be encrypted with different cryptographic hash functions.
The ciphertext may be a string of letters, for example dccbsfg, or a combination of letters and digits, for example f4dcc3b5aa, which is not limited herein.
For example, if the identified custom keyword is a name, the name may be hashed with a cryptographic hash function, and the corresponding ciphertext may be dccbsfg.
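As a minimal sketch of the hash-based option described above, the following Python snippet derives a short ciphertext from a custom keyword with MD5 from the standard library. The function name `encrypt_keyword`, the truncation to 10 hexadecimal characters, and the sample name are assumptions made for illustration; the disclosure does not specify a ciphertext length or an implementation.

```python
import hashlib


def encrypt_keyword(keyword: str, length: int = 10) -> str:
    """Return the first `length` hex characters of the keyword's MD5 digest.

    Hypothetical helper: the truncation length is an illustrative choice,
    not something the disclosure prescribes.
    """
    digest = hashlib.md5(keyword.encode("utf-8")).hexdigest()
    return digest[:length]


# The same keyword always yields the same ciphertext, so repeated
# occurrences in a text are replaced consistently.
cipher = encrypt_keyword("Zhang San")
```

Because a hash is one-way, recovering the original keyword later depends on a separately stored mapping rather than on the ciphertext itself.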
According to the method of this embodiment, the custom keywords in the text data to be labeled are identified based on a machine learning model for identifying custom keywords in text data, which can improve the accuracy of the extracted custom keywords. After the custom keywords in the text data to be labeled are identified, each custom keyword is encrypted to obtain its corresponding ciphertext, and the custom keywords in the text data to be labeled are replaced with the corresponding ciphertexts to obtain the desensitized text data to be labeled. The storage space of the desensitized text data to be labeled is far smaller than that of the original text data to be labeled, so the data volume can be reduced, key information can be desensitized, and the security of information transmission is ensured.
In one embodiment, after the machine learning model is trained by using the training text data set based on the deep learning algorithm, the processing method of the text data to be labeled of the present disclosure may further include the following steps S3100 to S3400:
Step S3100, obtaining, according to the machine learning model, the custom keywords in each piece of verification text data in the verification text data set as the predicted custom keywords.
In step S3100, a small amount of data may be selected and its custom keywords labeled manually; the manually labeled custom keywords serve as the actual custom keywords of the verification text data.
Step S3200, comparing each predicted custom keyword in each piece of verification text data with the corresponding actual custom keyword to obtain an evaluation index value of the machine learning model.
In step S3200, the custom keywords that the machine learning model obtains for each piece of verification text data serve as the predicted custom keywords. The predicted custom keywords of each piece of verification text data are compared with its actual custom keywords to obtain the evaluation index value of the machine learning model.
The evaluation index value of the machine learning model is used to judge the quality of the machine learning model. The evaluation index value includes at least one of Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R², AUC, KS, Recall, Precision, Accuracy, F1, and Logloss.
Step S3300, in the case that the evaluation index value is smaller than the evaluation index threshold, retraining the machine learning model by adjusting at least one of the number of training iterations of the machine learning model and the values of the hyper-parameters in the machine learning model.
In step S3300, an evaluation index threshold may be set, and whether the machine learning model is valid is determined according to this threshold. For example, the machine learning model may be determined to be invalid when the evaluation index value is smaller than the evaluation index threshold, and valid when the evaluation index value is greater than or equal to the evaluation index threshold.
When the machine learning model is judged to be invalid according to the evaluation index value, the machine learning model can be retrained by adjusting at least one of the number of training iterations and the values of the hyper-parameters, so that the identified custom keywords become more accurate.
Step S3400, in the case that the evaluation index value is greater than or equal to the evaluation index threshold, identifying the custom keywords in the text data to be labeled based on the machine learning model.
According to this embodiment, after the machine learning model is obtained, verification text data is selected, and the actual custom keywords of each piece of verification text data are compared with the corresponding predicted custom keywords to obtain the evaluation index value of the machine learning model. Only when the machine learning model is judged to be valid according to the evaluation index value is the step of identifying the custom keywords in the text data to be labeled based on the machine learning model executed, which ensures the accuracy of the identified custom keywords. When the machine learning model is judged to be invalid, it is retrained by adjusting at least one of the number of training iterations and the values of the hyper-parameters, so that the identified custom keywords become increasingly accurate.
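The verification gate in steps S3100 to S3400 can be sketched as follows, assuming F1 is the chosen evaluation index and that predictions and manual labels are available as one set of keywords per piece of verification text. The function names, the sample data, and the threshold of 0.8 are hypothetical; the disclosure leaves the metric and the threshold open.

```python
from typing import List, Set


def f1_score(predicted: List[Set[str]], actual: List[Set[str]]) -> float:
    """Micro-averaged F1 over all verification texts."""
    true_positives = sum(len(p & a) for p, a in zip(predicted, actual))
    total_predicted = sum(len(p) for p in predicted)
    total_actual = sum(len(a) for a in actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / total_predicted
    recall = true_positives / total_actual
    return 2 * precision * recall / (precision + recall)


def model_is_valid(index_value: float, threshold: float) -> bool:
    """S3400 applies when the value reaches the threshold; otherwise S3300 (retrain)."""
    return index_value >= threshold


# Hypothetical verification set: the model misses one keyword in the second text.
predicted = [{"Zhang San", "13800000000"}, {"Li Si"}]
actual = [{"Zhang San", "13800000000"}, {"Li Si", "acct_01"}]

value = f1_score(predicted, actual)
use_model = model_is_valid(value, threshold=0.8)
```

Here precision is 3/3 and recall is 3/4, so the F1 value is 6/7 and the model passes the assumed threshold; a lower value would instead trigger retraining with adjusted training iterations or hyper-parameters.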
In one embodiment, before each identified custom keyword is encrypted to obtain its corresponding ciphertext, the processing method of the text data to be labeled of the present disclosure may further include:
storing, in a security device, the mapping relationship between each custom keyword in the text data to be labeled and its position information in the text data to be labeled.
The security device refers to a dedicated computing and storage device for storing confidential information, such as a network-isolated computer protected by account and password verification.
In this embodiment, the electronic device 1000 may establish a mapping relationship between each custom keyword in the text data to be labeled and the position information of that keyword in the text data to be labeled, and store the mapping relationship in the security device to avoid information leakage. The mapping relationship may also serve as a decryption file with which the electronic device 1000 recovers the custom keywords.
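One way to picture the mapping relationship is a dictionary from each custom keyword to the character offsets at which it occurs. The sketch below assumes character offsets as the position information and JSON as the serialization format; both are illustrative choices, and writing the result to the security device is outside the scope of the snippet.

```python
import json
from typing import Dict, List


def build_mapping(text: str, keywords: List[str]) -> Dict[str, List[int]]:
    """Map each keyword to the start offsets of its non-overlapping occurrences."""
    mapping: Dict[str, List[int]] = {}
    for keyword in keywords:
        if not keyword:  # skip empty strings, which would match everywhere
            continue
        positions: List[int] = []
        start = text.find(keyword)
        while start != -1:
            positions.append(start)
            start = text.find(keyword, start + len(keyword))
        mapping[keyword] = positions
    return mapping


text = "Zhang San called 13800000000. Zhang San left."
mapping = build_mapping(text, ["Zhang San", "13800000000"])

# Serialized form that could be handed to the security device (assumption).
serialized = json.dumps(mapping, ensure_ascii=False)
```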
In one embodiment, after the desensitized text data to be labeled is obtained, the processing method of the text data to be labeled may further include: labeling the desensitized text data to be labeled.
In one example, the desensitized text data to be labeled may be labeled directly by the party that encrypted the text data to be labeled.
In another example, after the encrypting party encrypts the text data to be labeled, the desensitized text data to be labeled is entrusted to a data annotation agency, which labels the desensitized text data.
According to the method of this embodiment, only the desensitized text data to be labeled is provided to the data annotation agency, so key information is desensitized and the security of information transmission is ensured.
In one embodiment, after the desensitized text data to be labeled is labeled, the processing method of the text data to be labeled further includes: obtaining the mapping relationship from the security device; and traversing the labeled text data according to the mapping relationship and replacing each ciphertext with the corresponding custom keyword, so as to obtain the decrypted, labeled text data.
According to this embodiment, after the desensitized text data to be labeled is labeled, the labeled text data can be restored by means of the mapping relationship, stored in the security device, between each custom keyword in the text data to be labeled and its position information in the text data to be labeled.
< example >
Next, another example of the processing method of text data to be labeled is described. In this example, as shown in fig. 4, the processing method of text data to be labeled may include the following steps:
Step S4010, obtaining a machine learning model for identifying custom keywords in text data.
Step S4020, identifying the custom keywords in the text data to be labeled based on the machine learning model.
Step S4030, obtaining the total number of identified custom keywords; if the total number is greater than zero, executing step S4050 below; otherwise, executing step S4040 below.
Step S4040, inputting the text data to be labeled into the confidential environment and, after the custom keywords in the text data to be labeled are re-identified in the confidential environment, continuing with step S4050 below.
In step S4040, when the machine learning model identifies no custom keyword in the text data to be labeled, the text data to be labeled may be returned to the confidential environment, so that the custom keywords in the text data to be labeled are extracted manually in the confidential environment.
Step S4050, storing, in the security device, the mapping relationship between each custom keyword in the text data to be labeled and its position information in the text data to be labeled.
Step S4060, encrypting each identified custom keyword based on the preset encryption algorithm to obtain the ciphertext corresponding to each custom keyword.
Step S4070, replacing the custom keywords in the text data to be labeled with the corresponding ciphertexts to obtain the desensitized text data to be labeled.
Step S4080, labeling the desensitized text data to be labeled.
Step S4090, obtaining the mapping relationship from the security device; traversing the labeled text data according to the mapping relationship and replacing each ciphertext with the corresponding custom keyword, so as to obtain the decrypted, labeled text data.
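The flow of steps S4020 to S4090 can be sketched end to end as below. Everything named here is an assumption made for illustration: `recognize` is a regular-expression stand-in for the trained model, `manual_extract` represents manual extraction in the confidential environment, the ciphertext is a truncated MD5 digest, and a ciphertext-to-keyword table stands in for the mapping kept on the security device (the disclosure describes that mapping in terms of keywords and their positions).

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple


def recognize(text: str) -> List[str]:
    """Stand-in for the model: treats 11-digit numbers as custom keywords."""
    return sorted(set(re.findall(r"\b\d{11}\b", text)))


def substitute(text: str, table: Dict[str, str]) -> str:
    """Replace every key of `table` found in `text` in one pass, longest key first."""
    if not table:
        return text
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


def desensitize(
    text: str, manual_extract: Callable[[str], List[str]]
) -> Tuple[str, Dict[str, str]]:
    keywords = recognize(text)                    # S4020
    if len(keywords) == 0:                        # S4030
        keywords = manual_extract(text)           # S4040
    keyword_to_cipher = {                         # S4060
        kw: hashlib.md5(kw.encode("utf-8")).hexdigest()[:10]
        for kw in keywords
        if kw
    }
    # Stand-in for the record kept on the security device (S4050).
    cipher_to_keyword = {c: kw for kw, c in keyword_to_cipher.items()}
    return substitute(text, keyword_to_cipher), cipher_to_keyword  # S4070


def restore(labeled_text: str, cipher_to_keyword: Dict[str, str]) -> str:
    """S4090: put the original keywords back into the labeled text."""
    return substitute(labeled_text, cipher_to_keyword)


text = "Call 13800000000 now"
masked, table = desensitize(text, manual_extract=lambda t: [])
labeled = masked + " [label: contact]"            # S4080, done by an annotator
restored = restore(labeled, table)
```

Replacing in a single regular-expression pass avoids a later keyword accidentally matching inside a ciphertext that was inserted earlier.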
< apparatus embodiment >
In this embodiment, a processing apparatus 5000 for text data to be labeled is further provided. As shown in fig. 5, the processing apparatus 5000 includes an obtaining module 5100, an identifying module 5200, an encryption module 5300, and a replacing module 5400, and is configured to implement the processing method of text data to be labeled provided in this embodiment. Each module of the processing apparatus 5000 may be implemented by software or by hardware, which is not limited herein.
The obtaining module 5100 is configured to obtain a machine learning model for identifying custom keywords in text data.
The identifying module 5200 is configured to identify, based on the machine learning model, the custom keywords in the text data to be labeled.
The encryption module 5300 is configured to encrypt each identified custom keyword to obtain the ciphertext corresponding to each custom keyword.
The replacing module 5400 is configured to replace the custom keywords in the text data to be labeled with the corresponding ciphertexts to obtain the desensitized text data to be labeled.
In one embodiment, the custom keyword relates to at least one of an entity name, an entity relationship, a cell phone number, an account name, and an account password.
In one embodiment, the obtaining module 5100 is specifically configured to: acquiring a training text data set according to a set acquisition rule; and training the machine learning model by utilizing the training text data set based on a deep learning algorithm.
The set acquisition rule satisfies any one or more of the following items:
the total number of characters included in the training text data set exceeds a first set number;
the total number of words included in the training text data set exceeds a second set number;
the total number of each type of self-defined keywords included in the training text data set exceeds a third set number.
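A check of the set acquisition rule might look like the sketch below. It assumes the three conditions concern total characters, total words, and the count of each keyword type, that words are separated by whitespace, and that keyword types are supplied as one label per labeled keyword; the function name, thresholds, and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def check_acquisition_rule(
    texts: List[str],
    keyword_types: List[str],
    min_chars: int,
    min_words: int,
    min_per_type: int,
) -> Dict[str, bool]:
    """Evaluate each condition of the rule separately."""
    total_chars = sum(len(t) for t in texts)
    total_words = sum(len(t.split()) for t in texts)  # whitespace tokenization (assumption)
    type_counts = Counter(keyword_types)
    return {
        "chars": total_chars > min_chars,
        "words": total_words > min_words,
        "per_type": bool(type_counts)
        and all(count > min_per_type for count in type_counts.values()),
    }


result = check_acquisition_rule(
    texts=["Zhang San called", "Li Si left"],
    keyword_types=["name", "name", "phone"],
    min_chars=20,
    min_words=10,
    min_per_type=1,
)
# The rule is met when any one or more of the conditions holds.
rule_satisfied = any(result.values())
```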
In one embodiment, the apparatus 5000 further comprises a verification module (not shown in the figures) configured to: obtain, according to the machine learning model, the custom keywords in each piece of verification text data in the verification text data set as the predicted custom keywords; compare each predicted custom keyword in each piece of verification text data with the corresponding actual custom keyword to obtain an evaluation index value of the machine learning model; and, in the case that the evaluation index value is greater than or equal to an evaluation index threshold, have the identifying module identify the custom keywords in the text data to be labeled based on the machine learning model.
In one embodiment, the verification module is further configured to: and in the case that the evaluation index value is smaller than the evaluation index threshold value, retraining the machine learning model by adjusting at least one of the training times of the machine learning model and the value of the hyper-parameter in the machine learning model.
In one embodiment, the deep learning algorithm is a BERT-CRF algorithm.
In one embodiment, the obtaining module 5100 is further configured to: obtain the total number of identified custom keywords; and, in the case that the total number of identified custom keywords is greater than zero, have the encryption module encrypt each identified custom keyword to obtain the ciphertext corresponding to each custom keyword.
In one embodiment, the obtaining module 5100 is further configured to: in the case that the total number of identified custom keywords is zero, first input the text data to be labeled into the confidential environment, so that after the custom keywords in the text data to be labeled are re-identified in the confidential environment, the encryption module encrypts each identified custom keyword to obtain the ciphertext corresponding to each custom keyword.
In one embodiment, the encryption module 5300 is specifically configured to: encrypt each identified custom keyword based on a preset encryption algorithm to obtain the ciphertext corresponding to each custom keyword.
In one embodiment, the preset encryption algorithm includes any one of a random perturbation encryption algorithm and a hash algorithm.
In one embodiment, the apparatus 5000 further comprises a storage module (not shown in the figures) configured to: store, in the security device, the mapping relationship between each custom keyword in the text data to be labeled and its position information in the text data to be labeled.
In one embodiment, the apparatus 5000 further comprises a labeling module (not shown in the figures) configured to: label the desensitized text data to be labeled.
In one embodiment, the apparatus 5000 further comprises a decryption module (not shown in the figures) configured to: obtain the mapping relationship from the security device; traverse the labeled text data according to the mapping relationship and replace each ciphertext with the corresponding custom keyword, so as to obtain the decrypted, labeled text data.
< device embodiment >
Corresponding to the above method embodiments, in this embodiment, an electronic device is further provided, as shown in fig. 6, which may include a processing apparatus 5000 for text data to be annotated according to any embodiment of the present disclosure, and is configured to implement the method for processing text data to be annotated according to any embodiment of the present disclosure.
As shown in fig. 7, the electronic device 6000 may further include a processor 6200 and a memory 6100, where the memory 6100 is configured to store executable instructions; the processor 6200 is configured to operate the electronic device according to control of the instruction to execute a processing method of text data to be annotated according to any embodiment of the present disclosure.
The various modules of the apparatus 5000 above may be implemented by the processor 6200 executing the instructions to perform a method according to any embodiment of the disclosure.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A processing method of text data to be labeled, comprising:
obtaining a machine learning model for identifying custom keywords in text data;
identifying the custom keywords in the text data to be labeled based on the machine learning model;
encrypting each identified custom keyword to obtain the ciphertext corresponding to each custom keyword;
and replacing the custom keywords in the text data to be labeled with the corresponding ciphertexts to obtain the desensitized text data to be labeled.
2. The method of claim 1, wherein the custom keyword relates to at least one of an entity name, an entity relationship, a cell phone number, an account name, and an account password.
3. The method of claim 1, wherein the obtaining a machine learning model for identifying custom keywords in textual data comprises:
acquiring a training text data set according to a set acquisition rule;
training the machine learning model by using the training text data set based on a deep learning algorithm;
wherein the set acquisition rule satisfies any one or more of the following items:
the total number of characters included in the training text data set exceeds a first set number;
the total number of words included in the training text data set exceeds a second set number;
the total number of each type of self-defined keywords included in the training text data set exceeds a third set number.
4. The method of claim 3, wherein after the training of the machine learning model using the training text data set based on the deep learning algorithm, the method further comprises:
obtaining, according to the machine learning model, the custom keywords in each piece of verification text data in the verification text data set as the predicted custom keywords;
comparing each predicted custom keyword in each piece of verification text data with the corresponding actual custom keyword to obtain an evaluation index value of the machine learning model;
and, in the case that the evaluation index value is greater than or equal to the evaluation index threshold, identifying the custom keywords in the text data to be labeled based on the machine learning model.
5. The method of claim 4, wherein the method further comprises:
and in the case that the evaluation index value is smaller than the evaluation index threshold value, retraining the machine learning model by adjusting at least one of the training times of the machine learning model and the value of the hyper-parameter in the machine learning model.
6. The method of claim 3, wherein the deep learning algorithm is a BERT-CRF algorithm.
7. The method of claim 1, wherein after the identifying of the custom keywords in the text data to be labeled, the method further comprises:
obtaining the total number of identified custom keywords;
and, in the case that the total number of identified custom keywords is greater than zero, encrypting each identified custom keyword to obtain the ciphertext corresponding to each custom keyword.
8. A processing apparatus for text data to be labeled, comprising:
an obtaining module, configured to obtain a machine learning model for identifying custom keywords in text data;
an identifying module, configured to identify the custom keywords in the text data to be labeled based on the machine learning model;
an encryption module, configured to encrypt each identified custom keyword to obtain the ciphertext corresponding to each custom keyword;
and a replacing module, configured to replace the custom keywords in the text data to be labeled with the corresponding ciphertexts to obtain the desensitized text data to be labeled.
9. An electronic device comprising at least one computing device and at least one storage device, wherein the at least one storage device is configured to store instructions for controlling the at least one computing device to perform the method of any of claims 1 to 7; alternatively, the electronic device implements the apparatus of claim 8 through the computing device and the storage device.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110176118.XA 2021-02-09 2021-02-09 Method and device for processing text data to be labeled, electronic equipment and medium Pending CN112800465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176118.XA CN112800465A (en) 2021-02-09 2021-02-09 Method and device for processing text data to be labeled, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN112800465A true CN112800465A (en) 2021-05-14

Family

ID=75814935


Country Status (1)

Country Link
CN (1) CN112800465A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100299589A1 (en) * 2009-05-19 2010-11-25 Studio Ousia Inc. Keyword display method and keyword display system
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN109214008A (en) * 2018-09-28 2019-01-15 珠海中科先进技术研究院有限公司 A kind of sentiment analysis method and system based on keyword extraction
CN110633577A (en) * 2019-08-22 2019-12-31 阿里巴巴集团控股有限公司 Text desensitization method and device
CN111275122A (en) * 2020-02-03 2020-06-12 腾讯医疗健康(深圳)有限公司 Label labeling method, device, equipment and readable storage medium
CN111797430A (en) * 2020-06-30 2020-10-20 平安国际智慧城市科技股份有限公司 Data verification method, device, server and storage medium
CN111814465A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Information extraction method and device based on machine learning, computer equipment and medium
CN112001174A (en) * 2020-08-10 2020-11-27 深圳中兴网信科技有限公司 Text desensitization method, apparatus, electronic device and computer-readable storage medium


Similar Documents

Publication Title
US9331856B1 (en) Systems and methods for validating digital signatures
EP3418950A1 (en) Data exchange method, data exchange device and computing device
CN110489985B (en) Data processing method and device, computer readable storage medium and electronic equipment
TWI679553B (en) Method, system and intelligent equipment for checking tickets based on user interface
US10554641B2 (en) Second factor authorization via a hardware token device
CN107528830B (en) Account login method, system and storage medium
US11082425B2 (en) Pressure-based authentication
CN110933117B (en) Derivation and verification method, device and equipment of digital identity information
US9736122B2 (en) Bluesalt security
US11178022B2 (en) Evidence mining for compliance management
Radhika et al. Toeplitz matrices whose elements are the coefficients of functions with bounded boundary rotation
US20150310206A1 (en) Password management
CN112287376A (en) Method and device for processing private data
US9928378B2 (en) Sensitive data obfuscation in output files
CN110545542A (en) Main control key downloading method and device based on asymmetric encryption algorithm and computer equipment
CN109740359B (en) Method, apparatus and storage medium for data desensitization
CN112800465A (en) Method and device for processing text data to be labeled, electronic equipment and medium
CN112182509A (en) Method, device and equipment for detecting abnormity of compliance data
CN110990848A (en) Sensitive word encryption method and device based on hive data warehouse and storage medium
US11088923B2 (en) Multi-stage authorization
US20140089025A1 (en) Authenticating a response to a change request
US20200089896A1 (en) Encrypted log aggregation
US10838915B2 (en) Data-centric approach to analysis
CN111026800A (en) Data export method and device, electronic equipment and storage medium
KR101467123B1 (en) Monitoring of enterprise information leakage in smart phones

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination