CN113297378A - Text data labeling method and system, electronic equipment and storage medium - Google Patents

Text data labeling method and system, electronic equipment and storage medium Download PDF

Info

Publication number
CN113297378A
CN113297378A (application CN202110568451.5A)
Authority
CN
China
Prior art keywords
data
sampling
label
text
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110568451.5A
Other languages
Chinese (zh)
Inventor
张振
张寒杉
许冬冬
蒋宏飞
宋旸
田晓飞
赵慧娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd filed Critical Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202110568451.5A priority Critical patent/CN113297378A/en
Publication of CN113297378A publication Critical patent/CN113297378A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text data labeling method and system, electronic equipment and a storage medium are provided. The text data labeling method comprises the following steps: performing text preprocessing on a text to be labeled; computing statistics over the data set and extracting various features of it; judging whether labeled data exist and, if so, training a model and making predictions; and selecting a sampling strategy and extracting data for labeling. Based on active learning, the invention selects the most representative and informative data for manual labeling, removes the restrictions on the label set and seed data, improves the efficiency of manual labeling, and effectively lowers the barrier to use.

Description

Text data labeling method and system, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a text data labeling method and system, electronic equipment and a storage medium.
Background
With the development of networks and artificial intelligence, demand for data annotation services keeps growing. Data labeling has evolved from purely manual labeling at the very beginning to machine labeling driven by partial manual labeling and active learning. Existing intelligent data labeling platforms on the market are usually built on active learning, sampling unlabeled data by uncertainty or diversity to improve labeling efficiency. Because the sampling methods are too simple, many constraints have to be imposed on the user, such as requiring that the label set be known, that each label have a certain amount of seed data, and that the data volume of each label be relatively balanced. For a complex data set, however (unknown label set, no seed data, unbalanced categories), a simple sampling strategy cannot meet the requirements of use, while a complex sampling strategy requires the user to have enough expertise to know at which stage of labeling to switch from strategy A to strategy B, or how to combine multiple sampling strategies, which makes the barrier to use too high.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method and system for labeling text data, an electronic device and a storage medium, so as to at least partially solve at least one of the above technical problems.
In order to achieve the above object, as a first aspect of the present invention, a text data annotation platform sampling strategy recommendation method is provided, including the following steps:
taking a text to be marked as a current data set;
extracting selected features of the data set based on the current data set;
and judging whether the current data set has labeled data or not, selecting a sampling strategy according to the judgment result, and labeling the extracted data of the current data set by using the sampling strategy.
As a second aspect of the present invention, there is also provided a text data annotation system, including:
the feature statistics extraction module is used for taking the text to be labeled as a current data set and extracting selected features of the data set;
the judging module is used for judging whether the current data set has labeled data or not;
and the strategy sampling module is used for selecting a sampling strategy and using the sampling strategy to extract data and label the current data set according to the judgment result of the judgment module.
As a third aspect of the present invention, there is also provided an electronic device comprising a processor and a memory, the memory storing a computer-executable program, the processor executing the text data annotation platform sampling strategy recommendation method as described above when the computer-executable program is executed by the processor.
As a fourth aspect of the present invention, there is also provided a computer-readable medium storing a computer-executable program which, when executed, implements the text data annotation platform sampling policy recommendation method as described above.
Based on the technical scheme, compared with the prior art, the text data labeling method and the text data labeling system have at least one of the following beneficial effects:
based on text clustering, self-learning, active learning and other artificial intelligence technologies, the invention can select the most representative and informative data for manual labeling even on complex data sets; it accounts for both the expansion of historical labels and the discovery of new labels during the labeling process, removes the restrictions on the label set and seed data, improves the efficiency of manual labeling and effectively lowers the barrier to use;
the invention is particularly suitable for complex data sets: the adopted sampling strategies not only reduce the amount of manual labeling, but also let the labeling platform adapt to various scenarios, even zero-seed-data scenarios, reducing the dependence on professional technicians and keeping the barrier to use low.
Drawings
FIG. 1 is a block flow diagram of a text data labeling platform sampling strategy recommendation method of the present invention;
FIG. 2 is a block diagram of a text data annotation system of the present invention;
FIG. 3 is a schematic diagram of the electronic device of the present invention;
FIG. 4 is a schematic illustration of a storage medium of the present invention;
fig. 5 is a block flow diagram of a text data annotation platform sampling policy recommendation method according to embodiment 1 of the present invention.
Detailed Description
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus, a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these phrases are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or", "and/or" is intended to include all combinations of any one or more of the listed items.
Some technical terms in this specification have the following meanings:
clustering
The process of dividing a collection of physical or abstract objects into classes of similar objects is called clustering. Traditional cluster analysis methods mainly include: partitioning methods (e.g., the k-means algorithm), hierarchical methods, density-based methods, grid-based methods, and model-based methods, among others.
Clustering clusters
A set of samples generated by the clustering. Samples within the same cluster are similar to each other and different from samples in other clusters.
Uncertainty (uncertainty_i)
In machine learning problems, models often need to learn representations from a large amount of training data. One major challenge is that manual data labels are often semantically ambiguous. Some papers therefore use uncertainty to describe and correct labeling errors, so that the model down-weights uncertain samples and focuses on learning from more reliably labeled ones.
Label positive example
A positive example is an instance that appropriately exemplifies a concept; an inappropriate instance is called a counterexample or negative example. For instance, elephants, lions, tigers, cats, dogs and whales are all positive examples of the concept "mammal". A label positive example is an instance that genuinely belongs to the given label.
Recall ratio (Recall)
Recall is a measure of coverage: of the data that are truly positive examples, how many are predicted correctly. recall = TP / (TP + FN) = TP / P, where TP is the number of true positives, FN the number of false negatives, and P the total number of positive examples; recall is therefore the same as sensitivity.
Sentence vector (sentence embedding)
A fixed-length vector representation of a variable-length sentence, used to serve downstream NLP tasks. For word vectors, after training each word corresponds to one vector and the quality of the embedding can be judged intuitively. For sentence embeddings, however, there is no ground truth for evaluation: a sentence embedding can only be fed into a downstream task and its quality assessed by the performance of that task.
Cosine similarity
Cosine similarity measures the difference between two individuals using the cosine of the angle between their vectors in a vector space (its complement is often called the cosine distance). The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are.
Diversity sampling
Diversity sampling ensures sample diversity: based on the two dimensions of clusters and effective words, it preferentially selects data whose cluster has a low labeling rate or which contains effective words not yet covered.
Uncertainty sampling
Sort the unlabeled data in descending order of uncertainty and preferentially select the samples with larger uncertainty.
Probabilistic hierarchical sampling
For each label, stratify the data by prediction score and randomly draw a certain amount of data from each layer.
Minority class sampling
For labels with little labeled data, based on the model's predicted labels and scores, sort the data in descending order by score and stratify them, with each layer's width growing exponentially; then draw a certain amount of data from each layer.
Tail-sweep sampling
Based on sentence-vector cosine similarity, extract from the unlabeled data those items whose similarity to the labeled data is below a threshold, while ensuring that the similarity between the extracted items themselves also stays below the threshold.
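The exponential stratification used by minority-class sampling can be sketched as follows. This is a minimal illustration, not the patent's implementation; the starting width of 2 and the doubling factor are illustrative choices.

```python
def minority_layers(scores, base_width=2):
    """Sort item indices by predicted score (descending) and cut them into
    layers whose widths grow exponentially: 2, 4, 8, ...
    base_width and the doubling factor are illustrative assumptions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    layers, start, width = [], 0, base_width
    while start < len(order):
        layers.append(order[start:start + width])
        start += width
        width *= 2          # each layer twice as wide as the previous
    return layers

# 14 items with scores 0/14 .. 13/14; layer widths come out as 2, 4, 8
layers = minority_layers([i / 14 for i in range(14)])
```

The narrow top layers keep the highest-scoring candidates under close scrutiny, while the widening lower layers sample the long tail cheaply.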
For active-learning-based automatic labeling there is already a large body of work, but it either needs a certain amount of seed data or relies on the expertise of the person doing the sampling. The invention provides a text data labeling method for complex data sets: the labeling platform's candidate set of sampling strategies is expanded to diversity sampling, uncertainty sampling, probability stratified sampling, minority-class sampling and/or tail-sweep sampling; during labeling, the data labeling status is tracked, features are constructed, and an optimal sampling strategy or combination of strategies is then recommended based on those features to extract data for manual labeling. While reducing the amount of manual labeling, this widens the platform's range of application and lowers its barrier to use.
As shown in fig. 1, the method for recommending a sampling strategy of a text data labeling platform of the present invention includes the following steps:
taking a text to be marked as a current data set;
extracting selected features of the data set based on the current data set;
and judging whether the current data set has labeled data or not, selecting a sampling strategy according to the judgment result, and labeling the extracted data of the current data set by using the sampling strategy.
The sampling strategy recommendation method may further include:
judging whether the labeled data set has reached the target coverage rate; if so, labeling of the text is complete; if not, repeating the steps of extracting the selected features of the data set, judging whether labeled data exist, selecting a sampling strategy according to the judgment result, and labeling the extracted data, until the target coverage rate is reached. The purpose of this step is to iterate the sampling/labeling procedure several times so that the data set as a whole is well labeled.
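The iterate-until-coverage loop described above can be sketched as follows. This is a minimal illustration: the `oracle` callback stands in for the human annotator, and random sampling stands in for the strategy recommendation that the real platform performs; none of these names come from the patent.

```python
import random

def label_dataset(texts, oracle, target_coverage=0.9, batch_size=2, seed=0):
    """Iteratively draw unlabeled items and ask the oracle (annotator) to
    label them, until the labeled fraction reaches target_coverage."""
    rng = random.Random(seed)
    labels = {}                                   # index -> label
    while len(labels) / len(texts) < target_coverage:
        unlabeled = [i for i in range(len(texts)) if i not in labels]
        # stand-in for strategy recommendation: a real system would pick
        # diversity/uncertainty/stratified sampling from dataset features
        batch = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        for i in batch:
            labels[i] = oracle(texts[i])          # manual labeling step
    return labels

texts = ["cat", "dog", "car", "bus", "cow", "van"]
labels = label_dataset(
    texts,
    oracle=lambda t: "animal" if t in {"cat", "dog", "cow"} else "vehicle",
)
```

With six items, batches of two, and a 0.9 target, the loop runs three rounds and labels everything.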
After a sampling strategy has been selected and labeling performed, the sampling strategy recommendation method may also summarize the data set and post-process the strategy before judging whether the target coverage rate has been reached. Summarizing ensures the correctness of the data and accumulates the data labeled over multiple rounds into the final coverage metric. Strategy post-processing means that some sampling strategies require additional computation after data extraction: for example, the minority-class sampling strategy requires a recall estimate (already part of its own steps) and possibly other related parameters; such extra computations are only needed in specific situations and can therefore be placed in the strategy post-processing stage.
The text to be labeled may contain a certain amount of pre-labeled data, or may also be completely unlabeled data without seeds.
Wherein, the method further comprises a text preprocessing step, for example, the method may comprise:
hierarchically clustering the text data based on sentence vectors, and recording the cluster center (C_1, C_2, ..., C_k) to which each piece of data belongs;
segmenting the data with a word-segmentation tool, counting the document frequency of each word, and recording the words whose document frequency exceeds a first threshold (for example, 2) as the effective word set.
The clustering is carried out to ensure that words with the same or similar meanings can be put together for processing, thereby reducing the labor cost of repeated labeling.
Wherein the text preprocessing step may further include, for example, one or more of: preliminary screening of the text, removal of invalid texts, word segmentation, and stop-word removal. Whether these are needed depends on the state of the original text data; the purpose of preprocessing is to ensure that the retained data are valid, which reduces interference and bias in later sampling and labeling.
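The document-frequency side of this preprocessing can be sketched as follows. This is an illustrative sketch only: whitespace tokenization stands in for a real word-segmentation tool, and the threshold of 2 follows the example value above.

```python
from collections import Counter

def preprocess(texts, df_threshold=2):
    """Tokenize each text, count document frequency (number of documents a
    word appears in), and keep words whose DF exceeds the threshold as the
    'effective word' set."""
    tokenized = [set(t.lower().split()) for t in texts]
    df = Counter(w for doc in tokenized for w in doc)
    effective = {w for w, c in df.items() if c > df_threshold}
    return tokenized, effective

texts = [
    "the cat sat", "the dog ran", "the cat ran",
    "a bus stopped", "the bus ran",
]
# "the" appears in 4 documents and "ran" in 3, so both clear the threshold
tokenized, effective = preprocess(texts, df_threshold=2)
```

The effective word set is later used by diversity sampling to check whether frequent vocabulary has been covered by the labeled data.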
Wherein the selected features of the data set may be chosen from one or more of: the total number of data items N; the number of labeled items N_label; the labeling coverage ratio_label = N_label / N; the number of manually labeled items N_manual; the number of machine-expanded labels N_model; the number of labels M_tag; the number of labeled items per label N_tag; the uncertainty of each data item uncertainty_i; the number of data items whose uncertainty exceeds a threshold, N_uncertain; the set of historical sampling strategies; and the labeling status. Which feature parameters need to be computed is determined by the sampling strategies that may be applied to the input text data, and conversely the available feature parameters determine which sampling strategies can be used.
Wherein the uncertainty is defined as the entropy of the model's predicted label distribution:
uncertainty = -Σ_i P_θ(tag_i | x) · log P_θ(tag_i | x)
where x is the text, tag_i is the i-th label, and P_θ(tag_i | x) is the probability with which the model predicts text x as tag_i.
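An entropy-based uncertainty of this kind can be computed as follows; this is a generic sketch of prediction entropy, not code from the patent.

```python
import math

def uncertainty(probs):
    """Prediction entropy of a label distribution P(tag_i | x):
    -sum_i p_i * log(p_i). Higher entropy means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = uncertainty([0.97, 0.02, 0.01])   # sharply peaked distribution
unsure = uncertainty([0.34, 0.33, 0.33])      # near-uniform distribution
```

A near-uniform distribution yields entropy close to log of the label count, while a peaked distribution yields entropy close to zero, which is what uncertainty sampling exploits when ranking unlabeled data.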
The step of judging whether the labeled data exists or not, selecting a sampling strategy according to the judgment result and labeling comprises the following logic branches:
if the marked data does not exist, directly selecting a sampling strategy, and marking the extracted data of the current data set by using the sampling strategy;
if the labeled data exists, training a model based on the labeled data and the label, predicting the current data set by adopting the model, then selecting a sampling strategy, and labeling the extracted data of the current data set by using the selected sampling strategy;
wherein the step of training and predicting a model based on the labeled data and labels comprises:
a text classifier model is trained based on the labeled data and labels and used to predict the unlabeled data, recording for each data item the predicted label tag_i and score score_i.
Wherein the text classifier model is selected from, for example, LSTM, TextCNN or BERT models.
Wherein the selected sampling strategy satisfies the following conditions:
the selected strategy complies with the policy disabling, policy recommendation and policy mutual-exclusion rules;
the one or more strategies with the highest scores are preferentially selected.
Wherein policy disabling means:
if N_label = N, that is, all data are already labeled, all sampling strategies are disabled;
if N_label ≤ 1 or M_tag ≤ 1, probability stratified sampling, minority-class sampling and uncertainty sampling are disabled.
Wherein policy recommendation means:
if the labeling coverage ratio_label is below a threshold, probability stratified sampling gains a point;
if the number of data items with high uncertainty exceeds a threshold, uncertainty sampling gains a point;
if the cluster centers or the effective word list are not yet fully covered, diversity sampling gains a point;
if the smallest number of labeled items for any single label, min over all labels of N_tag, is below a threshold, minority-class sampling gains a point;
if the labeling coverage ratio_label is above a threshold, tail-sweep sampling gains a point.
Wherein policy mutual exclusion means:
the minority-class sampling strategy cannot coexist with other strategies;
if minority-class sampling was run in the previous round and the estimated recall rate exceeds the threshold, it is not selected in the current round.
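A rule engine of this disable/score/mutual-exclusion shape can be sketched as follows. This is a simplified illustration under assumed thresholds (0.5, 10, 5, 0.8); the feature names and the way minority-class exclusion is resolved are choices of this sketch, not fixed by the patent.

```python
def recommend(features):
    """Score candidate sampling strategies with disabling, recommendation
    (scoring) and mutual-exclusion rules; thresholds are illustrative."""
    f = features
    scores = {"diversity": 0, "uncertainty": 0, "stratified": 0,
              "minority": 0, "tail_sweep": 0}
    # disabling rules
    if f["n_label"] == f["n_total"]:
        return []                                  # everything labeled
    if f["n_label"] <= 1 or f["m_tag"] <= 1:
        for s in ("stratified", "minority", "uncertainty"):
            scores.pop(s)
    # recommendation (scoring) rules
    coverage = f["n_label"] / f["n_total"]
    if "stratified" in scores and coverage < 0.5:
        scores["stratified"] += 1
    if "uncertainty" in scores and f["n_uncertain"] > 10:
        scores["uncertainty"] += 1
    if not f["clusters_covered"]:
        scores["diversity"] += 1
    if "minority" in scores and f["min_tag_count"] < 5:
        scores["minority"] += 1
    if "tail_sweep" in scores and coverage > 0.8:
        scores["tail_sweep"] += 1
    best = max(scores.values())
    chosen = [s for s, v in scores.items() if v == best]
    # mutual exclusion: minority-class sampling cannot coexist with others
    if "minority" in chosen and len(chosen) > 1:
        chosen.remove("minority")
    return chosen

chosen = recommend({"n_total": 100, "n_label": 10, "m_tag": 3,
                    "n_uncertain": 20, "clusters_covered": False,
                    "min_tag_count": 10})
```

With low coverage, many uncertain items and uncovered clusters, the three matching strategies tie and are all recommended; with no labeled data at all, only diversity sampling survives the disabling rules.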
In the step of selecting the sampling strategy, the sampling strategy is chosen from one or more of the diversity, uncertainty, probability stratified, minority-class and tail-sweep sampling strategies;
wherein the diversity sampling strategy is: based on at least the two dimensions of clusters and effective words, preferentially select data whose cluster has a low labeling rate or which contains effective words not yet covered;
wherein the uncertainty sampling strategy is: sort the data in descending order of the uncertainty_i of the unlabeled data and preferentially select high-uncertainty data;
wherein the probability stratified sampling strategy is: based on the predicted labels and scores of the unlabeled data, for each label stratify the unlabeled data predicted as that label by the prediction score score_i, and randomly draw a certain amount of data from each layer;
for the probability stratified sampling strategy, after the current round of manual labeling is finished, count the score layers in which each label's accuracy exceeds a second threshold (for example, 0.8) according to the manual labeling result, assign the label to the data in those layers, and add them to the labeled set as machine-expanded data;
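The per-label stratified draw can be sketched as follows. This is an illustrative sketch under assumed parameters (five equal-width score layers over [0, 1], two items per layer); the patent does not fix these values.

```python
import random

def stratified_sample(predictions, n_layers=5, per_layer=2, seed=0):
    """For one label: bucket unlabeled items by predicted score into
    equal-width layers over [0, 1], then draw up to per_layer items at
    random from each layer."""
    rng = random.Random(seed)
    layers = [[] for _ in range(n_layers)]
    for idx, score in predictions:                    # (item_id, score)
        layer = min(int(score * n_layers), n_layers - 1)
        layers[layer].append(idx)
    picked = []
    for bucket in layers:
        picked += rng.sample(bucket, min(per_layer, len(bucket)))
    return picked

scores = [0.05, 0.1, 0.3, 0.35, 0.5, 0.55, 0.7, 0.9, 0.95, 0.99]
sample = stratified_sample(list(enumerate(scores)))
```

Sampling across all score layers (rather than only the most confident ones) is what lets the later accuracy check identify entire layers reliable enough for machine label expansion.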
wherein the minority-class sampling strategy is: for labels with little labeled data, based on the model's predicted labels and scores, sort the data in descending order by score and stratify them, with each layer's width growing exponentially, drawing a certain amount of data from each layer;
for the minority-class sampling strategy, after the current round of manual labeling is finished, estimate the concentration of label positive examples in each layer from the manual labeling result, thereby obtaining a recall estimate for the label;
wherein the tail-sweep sampling strategy is: based on sentence-vector cosine similarity, extract from the unlabeled data those items whose similarity to the labeled data is below a third threshold (for example, below 0.15), while ensuring that the similarity between the extracted items themselves is also below the third threshold (for example, below 0.15).
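The tail-sweep selection can be sketched as a greedy filter over sentence vectors; the vectors and the 0.15 threshold below are illustrative, and the greedy order-dependent pass is one simple way to enforce both dissimilarity conditions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tail_sweep(unlabeled, labeled, threshold=0.15):
    """Greedily pick unlabeled sentence vectors whose similarity to every
    labeled vector is below the threshold, keeping the picked vectors
    mutually dissimilar as well."""
    picked = []
    for vec in unlabeled:
        far_from_labeled = all(cosine(vec, l) < threshold for l in labeled)
        far_from_picked = all(cosine(vec, p) < threshold for p in picked)
        if far_from_labeled and far_from_picked:
            picked.append(vec)
    return picked

# the near-duplicate of the labeled vector (1, 0, 0) is rejected
picked = tail_sweep(
    [(0.0, 1.0, 0.0), (0.99, 0.1, 0.0), (0.0, 0.0, 1.0)],
    [(1.0, 0.0, 0.0)],
)
```

This targets the "tail" of the data distribution: regions far from everything already labeled, which the other strategies tend to miss.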
As shown in fig. 2, the present invention also discloses a text data annotation system, which includes:
the feature statistics extraction module is used for taking the text to be labeled as a current data set and extracting selected features of the data set;
the judging module is used for judging whether the current data set has labeled data or not;
and the strategy sampling module is used for selecting a sampling strategy and using the sampling strategy to extract data and label the current data set according to the judgment result of the judgment module.
The text data labeling system also comprises a target coverage detection module for detecting whether the labeled text produced by the strategy sampling module has reached the target coverage rate; if not, the feature statistics extraction module, the machine learning module and the strategy sampling module are invoked again to continue labeling the remaining unlabeled text.
The text to be labeled may contain a certain amount of pre-labeled data, or may also be completely unlabeled data without seeds.
The text data annotation system may further include a preprocessing module, configured to perform text preprocessing on the current data set, for example, including the following operations:
hierarchically clustering the text data based on sentence vectors, and recording the cluster center (C_1, C_2, ..., C_k) to which each piece of data belongs;
segmenting the data with a word-segmentation tool, counting the document frequency of each word, and recording the words whose document frequency exceeds a first threshold (for example, 2) as the effective word set.
Wherein the text preprocessing operation of the preprocessing module may further include, for example, one or more of: preliminary screening of the text, removal of invalid texts, word segmentation, and stop-word removal.
Wherein the selected features of the data set are chosen from one or more of: the total number of data items N; the number of labeled items N_label; the labeling coverage ratio_label = N_label / N; the number of manually labeled items N_manual; the number of machine-expanded labels N_model; the number of labels M_tag; the number of labeled items per label N_tag; the uncertainty of each data item uncertainty_i; the number of data items whose uncertainty exceeds a threshold, N_uncertain; the set of historical sampling strategies; and the labeling status.
Wherein the uncertainty is defined as the entropy of the model's predicted label distribution:
uncertainty = -Σ_i P_θ(tag_i | x) · log P_θ(tag_i | x)
where x is the text, tag_i is the i-th label, and P_θ(tag_i | x) is the probability with which the model predicts text x as tag_i.
The text data labeling system may further include a machine learning module for performing operations of training and predicting a model based on labeled data and labels, for example, the operations include the following steps:
training a text classifier model based on the labeled data and labels, predicting the unlabeled data, and recording for each data item the predicted label tag_i and score score_i.
Wherein the text classifier model is selected from, for example, LSTM, TextCNN or BERT models.
Wherein the selected sampling strategy satisfies the following conditions:
the selected strategy complies with the policy disabling, policy recommendation and policy mutual-exclusion rules;
the one or more strategies with the highest scores are preferentially selected.
Wherein policy disabling means:
if N_label = N, that is, all data are already labeled, all sampling strategies are disabled;
if N_label ≤ 1 or M_tag ≤ 1, probability stratified sampling, minority-class sampling and uncertainty sampling are disabled.
Wherein policy recommendation means:
if the labeling coverage ratio_label is below a threshold, probability stratified sampling gains a point;
if the number of data items with high uncertainty exceeds a threshold, uncertainty sampling gains a point;
if the cluster centers or the effective word list are not yet fully covered, diversity sampling gains a point;
if the smallest number of labeled items for any single label, min over all labels of N_tag, is below a threshold, minority-class sampling gains a point;
if the labeling coverage ratio_label is above a threshold, tail-sweep sampling gains a point.
Wherein policy mutual exclusion means:
the minority-class sampling strategy cannot coexist with other strategies;
if minority-class sampling was run in the previous round and the estimated recall rate exceeds the threshold, it is not selected in the current round.
The various thresholds in the policy recommendation and policy mutual-exclusion rules are set according to practical experience and can be fine-tuned for different application scenarios.
In the step of selecting the sampling strategy, the sampling strategy is selected from one or more of diversity sampling, uncertainty sampling, probability hierarchical sampling, minority class sampling and tail-sweeping sampling strategy;
wherein the diversity sampling strategy is as follows: preferentially selecting data which is low in labeling rate of the cluster or contains the effective words which are not covered on the basis of at least two dimensions of the cluster and the effective words;
wherein the uncertainty sampling strategy is as follows: according to the uncertainty of the unlabeled dataiCarrying out reverse order arrangement on the data, and preferentially selecting the data with high uncertainty;
wherein, the probability hierarchical sampling strategy is as follows: based on the predicted label and the score of the unlabeled data, for each label, the unlabeled data predicted as the label is according to the prediction scoreiLayering and randomly extracting a certain amount (for example, 5 or 10) of data from each layer; after the manual labeling of the current round is finished, counting the fractional layer with the accuracy rate of each label higher than a second threshold (for example, 0.8) according to the manual labeling result, and locating the fractional layer in the stepThe label is printed on the data of the layer, and the data is used as machine label expanding data to be added into the labeled set;
wherein the minority class sampling strategy is: for labels with little labeled data, based on the model's predicted labels and scores, sorting the data in descending order of score and stratifying it, with the width of each stratum growing exponentially, and extracting a certain amount of data from each stratum;
for the minority class sampling strategy, after the current round of manual labeling is finished, the concentration of positive examples of the label in each stratum is estimated from the manual labeling results, yielding an estimate of the label's recall rate;
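The exponentially widening strata and the recall estimate can be sketched as follows; the base width, growth factor and function names are assumptions:

```python
# Illustrative minority class sampling: rank data by the model's score for
# a rare label (descending), cut into strata whose widths double each time,
# then estimate recall from per-stratum positive-example concentrations.

def minority_strata(scored_ids, base_width=10, growth=2):
    """scored_ids: (data_id, score) pairs; returns list of strata (id lists)."""
    ranked = [i for i, _ in sorted(scored_ids, key=lambda t: t[1], reverse=True)]
    strata, start, width = [], 0, base_width
    while start < len(ranked):
        strata.append(ranked[start:start + width])
        start += width
        width *= growth          # exponentially growing stratum width
    return strata

def estimate_recall(strata_sizes, positive_rates, top=1):
    """Estimated recall if only the top `top` strata were taken as the
    minority class: positives there / estimated positives overall."""
    est = [s * r for s, r in zip(strata_sizes, positive_rates)]
    total = sum(est)
    return sum(est[:top]) / total if total else 0.0
```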
wherein the tail-sweeping sampling strategy is: according to the cosine similarity of sentence vectors, extracting from the unlabeled data those items whose similarity to the labeled data is less than a third threshold (for example, less than 0.15), while also ensuring that the similarity between the extracted items themselves is less than the third threshold (for example, less than 0.15).
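A greedy sketch of tail-sweeping with the 0.15 threshold from the example above; the greedy pass and plain-list vector layout are assumptions:

```python
import math

# Illustrative tail-sweeping: pick unlabeled vectors whose cosine similarity
# to every labeled vector -- and to every vector already picked -- is below
# the threshold, so the batch covers the "long tail" of the data.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tail_sweep(unlabeled_vecs, labeled_vecs, threshold=0.15):
    picked = []
    for vec in unlabeled_vecs:
        anchors = labeled_vecs + picked
        if all(cosine(vec, a) < threshold for a in anchors):
            picked.append(vec)
    return picked
```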
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Example 1
As shown in fig. 5, the intelligent text annotation method of this embodiment includes the following steps:
step 1, taking a text to be marked as a current data set, and performing text preprocessing on the current data set;
step 2, counting various features required by a recommendation strategy based on the current data labeling condition;
step 3, judging whether labeled data exists; if so, training a BERT-based text classifier model on the labeled data and labels, and predicting over the current data set with the text classifier model;
step 4, recommending a sampling strategy, wherein the sampling strategy is screened according to the following conditions:
the rules of strategy disabling, strategy voting and strategy mutual exclusion are satisfied;
one or more strategies with higher scores are preferentially selected;
step 5, extracting data according to a sampling strategy;
step 6, manually labeling the extracted data;
step 7, data summarization and strategy post-processing;
step 8, judging whether the data labels of the current data set reach the target coverage rate; if not, jumping to step 2; if so, the text data labeling method is complete.
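The eight steps above can be sketched as a driver loop. Every concrete operation is injected through a `steps` object, since the patent does not specify the component interfaces; all names here are hypothetical:

```python
# Hypothetical driver for the round-based labeling loop of steps 1-8.

def label_dataset(data, steps, target_coverage=0.95, max_rounds=10):
    """Run labeling rounds until the target coverage is reached."""
    for _ in range(max_rounds):
        stats = steps.extract_features(data)          # step 2: feature statistics
        if stats["n_labeled"] > 0:                    # step 3: train + predict
            steps.train_and_predict(data)
        strategies = steps.recommend(stats)           # step 4: recommend strategies
        batch = steps.draw(data, strategies)          # step 5: extract data
        steps.manual_label(batch)                     # step 6: manual labeling
        steps.postprocess(data)                       # step 7: summarize, expand
        if steps.coverage(data) >= target_coverage:   # step 8: check coverage
            break
    return data
```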
The method's code was deployed on a company platform and used for actual labeling. The data came from a VOC questionnaire, with a data volume of 278,566 items and originally unknown labels.
Five rounds of labeling were performed; after each round the cumulative number of labels was 46, 75, 76, 87 and 91 respectively (that is, 46, 29, 1, 11 and 4 new labels appeared in each round).
The sampling strategies adopted in the rounds were, respectively: diversity sampling; probability stratified + diversity + uncertainty sampling; minority class sampling; probability stratified + diversity sampling; and diversity + tail-sweeping sampling. The number of manually labeled items per round was 400, 2226, 1289, 2521 and 1542, and in the end 251,277 data items were covered by manual and machine labels.
This shows that the method performs well in labeling scenarios with no seed data and unknown labels, extending the applicability of conventional machine labeling.
The invention also discloses an electronic device. Fig. 3 is a schematic structural diagram of the electronic device of the invention. As shown in fig. 3, the electronic device includes a processor and a memory, the memory being used for storing a computer-executable program; when the computer-executable program is executed by the processor, the processor executes the text data labeling method as described above.
The electronic device of the present invention is embodied in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for data exchange between the electronic device and external devices. The I/O interface may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, or a processor or local bus using any of a variety of bus architectures.
It should be understood that the electronic device shown in fig. 3 is only one example of the present invention, and elements or components not shown in the above example may be further included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method.
The invention also discloses a storage medium. FIG. 4 is a schematic diagram of the storage medium of the invention. As shown in fig. 4, the storage medium stores a computer-executable program which, when executed, implements the text data labeling method as described above. The storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical forms, or any suitable combination thereof. A storage medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Python, Java or C++, as well as conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system, and other electronic devices, such as communication electronic devices, entertainment electronic devices, learning electronic devices, etc., including at least a portion of the system or components described above. The invention can also be implemented by computer software executing the method of the invention, e.g. by control software executed by a microprocessor of a client, an electronic control unit, a client, a server, etc. It should be noted that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, but may also be implemented in a distributed manner by hardware entities without specific details, for example, some method steps executed by the computer program may be executed at the locomotive end, and another part may be executed in the mobile terminal or the smart helmet, etc. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not to be considered as limited to the specific embodiments thereof, but is to be understood as being modified in all respects, all changes and equivalents that come within the spirit and scope of the invention.

Claims (10)

1. A text data labeling method is characterized by comprising the following steps:
taking a text to be marked as a current data set;
extracting selected features of the data set based on the current data set;
and judging whether the current data set has labeled data or not, selecting a sampling strategy according to the judgment result, and labeling the extracted data of the current data set by using the sampling strategy.
2. The method of claim 1, further comprising:
judging whether the labeled data set reaches a target coverage rate; if so, the labeling of the text to be labeled is complete; if not, repeating the steps of extracting selected features of the data set, judging whether the current data set has labeled data, selecting a sampling strategy according to the judgment result, and labeling the extracted data of the current data set with the sampling strategy, until the target coverage rate is reached.
3. The method of claim 1,
the text to be marked comprises a certain amount of pre-marked data or completely unmarked data without seeds;
preferably, the method further comprises a text preprocessing step, the text preprocessing step comprising:
hierarchically clustering the text data based on sentence vectors, and recording the cluster center (C_1, C_2, …, C_k) to which each piece of data belongs;
Segmenting data, counting document frequency of words, and recording the words with the document frequency larger than a first threshold value as an effective word set;
preferably, the text preprocessing step further includes: one or more steps of primary screening the text information, removing invalid texts, segmenting words of the text information and removing stop words are carried out on the text information;
preferably, the selected features of the data set are selected from one or more of: the total number of data N, the number of labeled items N_label, the annotation coverage
Figure FDA0003081517140000011
the number of manually labeled items N_manual, the number of machine-expanded labels N_model, the number of tags M_tag, the number of labeled items N_tag for each tag, the uncertainty_i of each piece of data, the number of data items N_uncertain whose uncertainty is greater than a threshold, the historical sampling strategy set, and the labeling situation;
wherein the uncertainty is defined as
Figure FDA0003081517140000021
where x is the text, tag_i is the i-th label, and P_θ(tag_i|x) is the probability with which the model predicts text x as tag_i.
4. The method according to any one of claims 1 to 3,
judging whether the labeled data exists, selecting a sampling strategy according to the judgment result and labeling the sampling strategy, wherein the steps comprise:
if the marked data does not exist, directly selecting a sampling strategy, and marking the extracted data of the current data set by using the sampling strategy;
if the labeled data exists, training a model based on the labeled data and the label, predicting the current data set by adopting the model, then selecting a sampling strategy, and labeling the extracted data of the current data set by using the selected sampling strategy;
preferably, the step of training a model and predicting based on the labeled data and labels comprises:
training a text classifier model based on the labeled data and labels, predicting over the unlabeled data, and recording the predicted label tag_i and score_i for each piece of data;
Preferably, the text classifier model is selected from the LSTM, TextCNN, or BERT models.
5. The method of claim 4,
the selected sampling strategy satisfies the following conditions:
it complies with the strategy disabling, strategy voting and strategy mutual exclusion rules;
one or more strategies with higher scores are preferentially selected.
6. The method of claim 5,
the policy disabling means:
if N_label is 0, all sampling strategies are disabled;
if N_label ≤ 1 or M_tag ≤ 1, probability stratified sampling, minority class sampling and uncertainty sampling are disabled;
optionally, the policy recommendation refers to:
if the annotation coverage ratio_label is less than a threshold, probability stratified sampling gains a point;
if the number of data items with high uncertainty is greater than a threshold, uncertainty sampling gains a point;
if the cluster centers or the effective word list are not fully covered, diversity sampling gains a point;
if the minimum amount of labeled data for a single label
Figure FDA0003081517140000031
is less than a threshold, minority class sampling gains a point;
if the annotation coverage ratio_label is greater than a threshold, tail-sweeping sampling gains a point;
optionally, the policy mutual exclusion refers to:
the minority sampling strategy cannot coexist with other strategies;
if minority class sampling was performed in the previous round and its estimated recall rate is greater than a threshold, it is not selected in the current round.
7. The method of claim 6,
in the step of selecting the sampling strategy, the sampling strategy is selected from one or more of the diversity sampling, uncertainty sampling, probability stratified sampling, minority class sampling and tail-sweeping sampling strategies;
preferably, the diversity sampling strategy is: based on at least the two dimensions of cluster and effective word, preferentially selecting data from clusters with a low labeling rate, or data containing effective words that are not yet covered;
preferably, the uncertainty sampling strategy is: sorting the unlabeled data in descending order of their uncertainty_i, and preferentially selecting data with high uncertainty;
preferably, the probability stratified sampling strategy is: based on the predicted label and score of each piece of unlabeled data, for each label, stratifying the unlabeled data predicted as that label by its prediction score_i, and randomly extracting a certain amount of data from each stratum; after the current round of manual labeling is finished, counting, from the manual labeling results, the score strata in which the label's accuracy is higher than a second threshold, applying the label to the data in those strata, and adding that data to the labeled set as machine-expanded labels;
preferably, the minority class sampling strategy is: for labels with little labeled data, based on the model's predicted labels and scores, sorting the data in descending order of score and stratifying it, with the width of each stratum growing exponentially, and extracting a certain amount of data from each stratum;
preferably, for the minority class sampling strategy, after the current round of manual labeling is finished, the concentration of positive examples of the label in each stratum is estimated from the manual labeling results, yielding an estimate of the label's recall rate;
preferably, the tail-sweeping sampling strategy is: according to the cosine similarity of sentence vectors, extracting from the unlabeled data those items whose similarity to the labeled data is less than a third threshold, while also ensuring that the similarity between the extracted items themselves is less than the third threshold.
8. A text data annotation system, comprising:
the feature statistics extraction module is used for taking the text to be labeled as a current data set and extracting selected features of the data set;
the judging module is used for judging whether the current data set has labeled data or not;
and the strategy sampling module is used for selecting a sampling strategy and using the sampling strategy to extract data and label the current data set according to the judgment result of the judgment module.
9. An electronic device comprising a processor and a memory, the memory for storing a computer-executable program, characterized in that:
the computer executable program, when executed by the processor, performs the text data annotation method of any one of claims 1-7.
10. A computer-readable medium storing a computer-executable program, wherein the computer-executable program, when executed, implements the text data annotation method of any one of claims 1-7.
CN202110568451.5A 2021-05-24 2021-05-24 Text data labeling method and system, electronic equipment and storage medium Pending CN113297378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110568451.5A CN113297378A (en) 2021-05-24 2021-05-24 Text data labeling method and system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110568451.5A CN113297378A (en) 2021-05-24 2021-05-24 Text data labeling method and system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113297378A true CN113297378A (en) 2021-08-24

Family

ID=77324640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110568451.5A Pending CN113297378A (en) 2021-05-24 2021-05-24 Text data labeling method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297378A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357990A (en) * 2022-03-18 2022-04-15 北京创新乐知网络技术有限公司 Text data labeling method and device, electronic equipment and storage medium
CN114357990B (en) * 2022-03-18 2022-05-31 北京创新乐知网络技术有限公司 Text data labeling method and device, electronic equipment and storage medium


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination