CN111914061A - Radius-based uncertainty sampling method and system for text classification active learning - Google Patents



Publication number
CN111914061A
CN111914061A
Authority
CN
China
Prior art keywords
data
category
radius
prediction
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010669244.4A
Other languages
Chinese (zh)
Other versions
CN111914061B (en)
Inventor
朱其立
沈李斌
廖千姿
顾钰仪
赵迎功
吴海华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Leyan Information Technology Co ltd
Original Assignee
Shanghai Leyan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Leyan Information Technology Co ltd filed Critical Shanghai Leyan Information Technology Co ltd
Priority to CN202010669244.4A priority Critical patent/CN111914061B/en
Publication of CN111914061A publication Critical patent/CN111914061A/en
Application granted granted Critical
Publication of CN111914061B publication Critical patent/CN111914061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods


Abstract

The invention discloses a radius-based uncertainty sampling method and system for text classification active learning, applied to multi-class short texts, which weaken the adverse effect of noise in such scenes on the result, have high universality, and are applicable to any deep model with a hidden layer. The technical scheme is as follows: a text classifier scores the information entropy of each data point of the unlabeled data and gives the unlabeled data their corresponding prediction categories; the radius of each prediction category is calculated separately; and the information entropy scores of the unlabeled data are combined with the radii of their prediction categories to obtain a comprehensive score.

Description

Radius-based uncertainty sampling method and system for text classification active learning
Technical Field
The invention relates to a sampling method and system, and in particular to a radius-based uncertainty sampling method and system for text classification active learning.
Background
With the prevalence of e-commerce and online communication, multi-class short texts from many application fields, such as instant messaging, online chat logs, bulletin board system titles, Internet news comments, Twitter, etc., pervade people's daily lives. Therefore, in many scenarios such as topic recommendation and e-commerce chat robots, processing short text has become very important. However, because short text is non-standard by nature, manual work is required to deal with spelling errors, non-standard terms, and noise. Furthermore, since most short text datasets are very unevenly distributed, repeatedly labeling data of the same class wastes a significant amount of annotation effort.
Active learning can now be used to deal with the short text classification problem.
The framework of active learning is shown in fig. 1.
Given a dataset Z = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional feature vector and y_i ∈ {0, 1, ..., K}. To describe active learning, the dataset is divided into a labeled set and an unlabeled set. f is the classifier.
The general active learning algorithm mainly comprises the following steps:
a. the algorithm starts, at t = 0, from a small labeled dataset L_t ⊂ Z and a large unlabeled dataset U_t = Z \ L_t;
b. train the classifier f_t on L_t;
c. determine the data point x* ∈ U_t to be labeled in the next iteration according to the sampling method;
d. obtain the label y* for x* by manual annotation;
e. increment t and repeat steps b to e until the classifier achieves the desired model accuracy or the number of iterations reaches a preset limit.
In the process shown in fig. 1, the text classifier learns from the labeled data, evaluates the unlabeled data, selects the most valuable data for manual labeling, adds it to the labeled data, and repeats these steps until the iteration count reaches the upper limit or the model accuracy reaches the standard.
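The loop described above can be sketched in Python as follows; the `train` and `score` callables, the dataset names, and the argmax-based selection are illustrative assumptions, not details from the patent:

```python
import numpy as np

def active_learning_loop(Z_x, Z_y, initial_idx, train, score, budget):
    """Generic pool-based active learning loop (steps a-e above).

    Z_x: feature matrix of the full dataset; Z_y: oracle labels,
    queried lazily to stand in for manual annotation.  `train` fits
    a classifier on the labeled pool; `score` ranks unlabeled points
    by the sampling criterion.  All names are illustrative.
    """
    labeled = set(initial_idx)                     # L_t
    unlabeled = set(range(len(Z_x))) - labeled     # U_t = Z \ L_t
    for _ in range(budget):
        idx = sorted(labeled)
        clf = train(Z_x[idx], Z_y[idx])            # step b: train f_t on L_t
        pool = sorted(unlabeled)
        scores = score(clf, Z_x[pool])             # step c: sampling method
        x_star = pool[int(np.argmax(scores))]      # most valuable point x*
        labeled.add(x_star)                        # step d: oracle gives y*
        unlabeled.remove(x_star)                   # step e: t <- t + 1
    idx = sorted(labeled)
    return train(Z_x[idx], Z_y[idx])
```

Any scoring rule, including the radius-based one developed below, can be plugged in as `score`.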
However, when traditional active learning methods are applied to multi-class short text datasets, their performance turns out to be poor, with little difference from random sampling. Experiments show that as the number of categories in a dataset increases, the performance of existing sampling methods declines, and no such method has been applied in industry.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention aims to solve the above problems and provides a radius-based uncertainty sampling method and system for text classification active learning which are applied to multi-class short texts, weaken the adverse effect of noise in such scenes on the result, have high universality, and are applicable to any deep model with a hidden layer.
The technical scheme of the invention is as follows: the invention discloses a radius-based uncertainty sampling method for text classification active learning, which comprises processing unlabeled data and computing a comprehensive score based on the processing result, wherein:
the processing process of the unlabeled data comprises the following steps:
scoring the information entropy of each data point of the unlabeled data through a text classifier, and giving the unlabeled data their prediction categories;
respectively calculating the radius of each prediction category;
the comprehensive scoring process comprises the following steps:
and combining the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the process of calculating the radius of a prediction category further comprises:
obtaining the center of the category by averaging the data points in the prediction category;
calculating the cosine similarity of each data point in the prediction category to the center;
and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the center c(category y) of a prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of its predicted category y.
The invention also discloses a radius-based uncertainty sampling system for text classification active learning, which comprises an unlabeled data processing module and a comprehensive scoring module, wherein:
the unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data through a text classifier, give the unlabeled data their corresponding prediction categories, and calculate the radius of each prediction category respectively;
the comprehensive scoring module is configured to combine the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the unlabeled data processing module scores the information entropy H(x_i) of a data point of the unlabeled data as:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the configuration of the unlabeled data processing module further includes: obtaining the center of the category by averaging the data points in the prediction category; calculating the cosine similarity of each data point in the prediction category to the center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
In accordance with an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the unlabeled data processing module is configured with the center c(category y) of a prediction category as:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, in the configuration of the comprehensive scoring module, the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of the predicted category y.
Compared with the prior art, the invention has the following beneficial effects. The radius-based uncertainty sampling method relaxes the similarity-weight condition, taking the radius of the whole category as the weight instead of a cosine similarity computed for each point. In terms of application field and design framework, traditional active learning applied to text classification focuses more on theory and easily ignores the noisier scenes found in industry. In terms of model universality, the optimization scheme of the invention is not limited to a particular downstream text classifier, and the sampling method of the invention can be used with any deep model that has a hidden layer.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
Fig. 1 illustrates a conventional active learning framework.
FIG. 2 is a flowchart illustrating an embodiment of the radius-based uncertainty sampling method of text classification active learning of the present invention.
Fig. 3 shows a flow chart of the radius calculation step in the embodiment shown in fig. 2.
FIG. 4 illustrates a schematic diagram of an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
FIG. 2 illustrates a flow of an embodiment of the present invention of a radius-based uncertainty sampling method of text classification active learning. Referring to fig. 2, the method for sampling uncertainty based on radius for active learning of text classification in this embodiment includes processing unlabeled data and performing comprehensive scoring on the processing result based on the unlabeled data.
Firstly, the information entropy of each data point of the unlabeled data is scored through a text classifier, and the prediction category of the unlabeled data is given.
Each data point x_i constituting the data x to be annotated is given an information entropy score H(x_i) as follows:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
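A minimal sketch of this entropy score, assuming the classifier exposes an (n, k) array of predicted class probabilities (the function and parameter names are illustrative):

```python
import numpy as np

def entropy_score(probs, eps=1e-12):
    """H(x_i) = -sum_j P(y_i = j | x_i) * log P(y_i = j | x_i).

    `probs` is an (n, k) array of predicted class probabilities,
    one row per data point; eps guards against log(0)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)
```

A uniform prediction over k classes yields the maximum score log k, while a confident one-hot prediction scores near zero, so higher values mark more uncertain points.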
The text classifier gives the prediction category of the data x to be labeled.
The text classifier of the present embodiment may use deep learning models such as FastText, BERT, CNN, and LSTM using Attention mechanism.
The FastText model is based on the Word2Vec model, using the entire text sequence to make its prediction rather than a context window, and is many orders of magnitude faster than a traditional deep network.
BERT (Bidirectional Encoder Representations from Transformers) implements the encoder portion of a bidirectional Transformer. The model is pre-trained on Wikipedia and BookCorpus with two methods, Masked LM and Next Sentence Prediction, which capture word-level and sentence-level representations respectively. Its main innovation is to learn a pre-trained language model and improve performance on downstream tasks through fine-tuning.
A Convolutional Neural Network (CNN) can be applied to text classification in much the same way as to image classification. This embodiment uses the most basic CNN model, which has five layers: the first is an embedding layer that converts a sentence into a two-dimensional matrix; the second is a convolutional layer; the third is a pooling layer; the fourth is a fully-connected layer; and the last is a softmax layer.
The attention mechanism removes the traditional encoder-decoder structure's dependence on a fixed-length internal representation during encoding and decoding. In implementation, the attention mechanism keeps the intermediate outputs of an LSTM (Long Short-Term Memory) encoder over the input sequence and trains the model to selectively attend to them when producing the output sequence. The invention uses the hidden-layer output before the model's softmax layer as the data representation.
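The data representation used throughout, i.e. the hidden-layer output taken just before the softmax layer, can be illustrated with a toy one-hidden-layer network. This is a stand-in sketch, not the architecture of FastText, BERT, the CNN, or the attention LSTM, and all names and weights are illustrative:

```python
import numpy as np

def hidden_representation(x, W1, b1):
    # h(data): the hidden-layer output used as the data representation,
    # taken just before the softmax layer.
    return np.tanh(x @ W1 + b1)

def softmax_head(h, W2, b2):
    # Classifier head: predicted class probabilities from h(data),
    # which also feed the information entropy score H(x_i).
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

In a real deep model the same idea applies: `hidden_representation` corresponds to the penultimate layer's output, and `softmax_head` to the final classification layer.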
The processing for unlabeled data is partially as follows.
First, the text classifier gives corresponding prediction classes to unlabeled data.
Then, for each prediction category 1-k, the radius of the prediction category is calculated separately.
As shown in fig. 3, taking the calculation of the prediction type y as an example, the process of calculating the radius r (type y) of the prediction type y is as follows.
Firstly, the center of category y is obtained by averaging the data points in the predicted category y:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where |category y| represents the number of data points belonging to the prediction category y. In this equation, each data i in the prediction category y, i.e., data 1 through data x in the leftmost dashed box in fig. 3, actually enters the calculation through its hidden-layer vector h(data i).
Then, the cosine similarity of each data point i in the prediction category y and the center c (category y) is calculated,
Figure BDA0002581619460000072
the cosine similarity is used to measure the representativeness of a data point, i.e. the distance from the data point i to the center, and h (data i) is a representative vector of the data i output by the model hidden layer. Cosine similarity is a measurement mode of distance, d (data i) on the left side of the formula refers to cosine similarity between a data point i and a center c, and distance between the data i and the c is abbreviated as d.
And finally, the maximum cosine similarity value d(data i) in the prediction category y is selected as the radius r(category y) of the prediction category y:

r(category y) = max_{data i ∈ category y} d(data i)
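The three steps above (center by averaging, cosine similarity to the center, maximum similarity as radius) can be sketched as follows; the function name and array layout are illustrative:

```python
import numpy as np

def class_radius(H, labels):
    """Radius r(category y) for each predicted category.

    H: (n, d) array of hidden-layer vectors h(data i);
    labels: predicted category of each point.  Per the steps above:
    average the members of a category to get its center c, compute
    the cosine similarity d(data i) of each member to c, and take
    the maximum as the radius of that category."""
    radii = {}
    for y in np.unique(labels):
        members = H[labels == y]
        c = members.mean(axis=0)                       # center c(category y)
        num = members @ c
        den = np.linalg.norm(members, axis=1) * np.linalg.norm(c)
        d = num / den                                  # cosine similarity d(data i)
        radii[y] = float(d.max())                      # r = max similarity
    return radii
```

With the radii in hand, each unlabeled point inherits the radius of its predicted category when the comprehensive score is formed.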
A comprehensive score is then obtained from the processing results of the unlabeled data (the information entropy score H(x) of the unlabeled data x and the radius r of its prediction category). Specifically, after the radius r of each prediction category has been calculated, the information entropy score H(x) and the radius r(category y) of the prediction category y of the unlabeled data are combined to obtain the comprehensive score:

V(x) = H(x) · r(category y)
V(x) is the sampling criterion for active learning: V(x) is calculated for each point, all the values are sorted, and the best points are selected for labeling. This is the general outline of an active learning sampling method, and the method provided by the invention is one realization of V(x).
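The combination step can be sketched as follows. The exact formula for V(x) appears only as an image in the source; since the text describes the radius of the category as the weight on the entropy score, a multiplicative form V(x) = H(x) * r(category y) is assumed in this sketch, and all function names are illustrative:

```python
import numpy as np

def composite_scores(entropy, pred_labels, radii):
    # V(x): entropy score H(x) weighted by the radius of the
    # predicted category.  The multiplicative form is an assumption;
    # the source gives the formula only as an image.
    r = np.array([radii[y] for y in pred_labels])
    return np.asarray(entropy, dtype=float) * r

def select_to_label(entropy, pred_labels, radii, n):
    # Rank all unlabeled points by V(x) and pick the top n to label.
    v = composite_scores(entropy, pred_labels, radii)
    return np.argsort(-v)[:n]
```

Ranking all unlabeled points by V(x) and labeling the top-scoring ones realizes the sampling criterion described above.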
FIG. 4 illustrates the principles of an embodiment of the present invention of a radius-based uncertainty sampling system for text classification active learning. Referring to fig. 4, the system of the present embodiment includes: the system comprises an unlabeled data processing module and a comprehensive grading module.
The unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data through the text classifier, give the unlabeled data their corresponding prediction categories, and calculate the radius of each prediction category respectively. In this configuration, the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
In the configuration of the unlabeled data processing module, calculating the radius of a prediction category further includes: obtaining the center of the category by averaging the data points in the prediction category; calculating the cosine similarity of each data point in the prediction category to the center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
In the above configuration, the center c(category y) of the prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
the comprehensive scoring module is configured to combine the information entropy score of the unlabeled data and the radius of the prediction category of the unlabeled data to obtain a comprehensive score.
The comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x, and r(category y) is the radius of the predicted category y.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A radius-based uncertainty sampling method for active learning of text classification is characterized by comprising the steps of processing unlabeled data and comprehensively scoring the processing results based on the unlabeled data, wherein:
the processing process of the unlabeled data comprises the following steps:
scoring the information entropy of each data point of the unlabeled data through a text classifier, and giving the unlabeled data a prediction category;
respectively calculating the radius of each prediction category;
the comprehensive scoring process comprises the following steps:
and combining the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
2. The method of claim 1, wherein the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
3. The method of claim 1, wherein the step of calculating the radius of the prediction class further comprises:
obtaining the center of the category by averaging the data points in the prediction category;
calculating the cosine similarity of each data point in the prediction category to the center;
and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
4. The method of claim 3, wherein the center c(category y) of the prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i; the cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

and the radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
5. The method of claim 1, wherein the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of the predicted category y.
6. A radius-based uncertainty sampling system for text classification active learning, the system comprising an unlabeled data processing module and a comprehensive scoring module, wherein:
the unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data with a text classifier, assign each unlabeled data point a corresponding prediction category, and calculate the radius of each prediction category separately;
the comprehensive scoring module is configured to combine the information entropy score of each unlabeled data point with the radius of its prediction category to obtain a comprehensive score.
7. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the information entropy score of a data point x_i of the unlabeled data is H(x_i):

H(x_i) = -\sum_{j=1}^{k} P(y_i = j \mid x_i) \log P(y_i = j \mid x_i)

where P(y_i = j \mid x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k denotes the number of candidate labels for x_i.
8. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the unlabeled data processing module is further configured to calculate the radius of a prediction category by: averaging the data points in the prediction category to obtain the center of the category; calculating the cosine similarity between each data point in the prediction category and that center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
9. The radius-based uncertainty sampling system for text classification active learning according to claim 8, wherein the unlabeled data processing module is configured to compute the center c(y) of a prediction category y as:

c(y) = \frac{1}{|D_y|} \sum_{x_i \in D_y} x_i

where y is a prediction category and D_y is the set of data points predicted into category y; the cosine similarity d(x_i) between each data point x_i in category y and the center c(y) as:

d(x_i) = \cos(x_i, c(y)) = \frac{x_i \cdot c(y)}{\lVert x_i \rVert \, \lVert c(y) \rVert}

and the radius r(y) of prediction category y as:

r(y) = \max_{x_i \in D_y} d(x_i)
10. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the comprehensive scoring module is configured to calculate the composite score V(x) as:

V(x) = H(x) \cdot r(y)

where H(x) is the information entropy score of the unlabeled data point x and r(y) is the radius of its predicted category y.
CN202010669244.4A 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning Active CN111914061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010669244.4A CN111914061B (en) 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning

Publications (2)

Publication Number Publication Date
CN111914061A true CN111914061A (en) 2020-11-10
CN111914061B CN111914061B (en) 2021-04-16

Family

ID=73227058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669244.4A Active CN111914061B (en) 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning

Country Status (1)

Country Link
CN (1) CN111914061B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130304824A1 (en) * 2010-11-24 2013-11-14 Crambo S.A. Communication system and method involving the creation of virtual spaces
US9311599B1 (en) * 2011-07-08 2016-04-12 Integral Ad Science, Inc. Methods, systems, and media for identifying errors in predictive models using annotators
US20160307113A1 (en) * 2015-04-20 2016-10-20 Xerox Corporation Large-scale batch active learning using locality sensitive hashing
CN107273912A (en) * 2017-05-10 2017-10-20 重庆邮电大学 A kind of Active Learning Method based on three decision theories
CN108846424A (en) * 2018-05-30 2018-11-20 华东理工大学 A kind of fuzzy multi-core classifier of cost-sensitive
CN108875816A (en) * 2018-06-05 2018-11-23 南京邮电大学 Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN109656808A (en) * 2018-11-07 2019-04-19 江苏工程职业技术学院 A kind of Software Defects Predict Methods based on hybrid active learning strategies
CN110059752A (en) * 2019-04-19 2019-07-26 江苏工程职业技术学院 A kind of statistical learning querying method based on comentropy Sampling Estimation
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN110188197A (en) * 2019-05-13 2019-08-30 北京一览群智数据科技有限责任公司 It is a kind of for marking the Active Learning Method and device of platform
CN110348481A (en) * 2019-06-05 2019-10-18 华东理工大学 One kind being based on the gravitational network inbreak detection method of neighbour's sample
US20200151518A1 (en) * 2018-09-17 2020-05-14 Purdue Research Foundation Regularized multi-metric active learning system for image classification
US20200202210A1 (en) * 2018-12-24 2020-06-25 Nokia Solutions And Networks Oy Systems and methods for training a neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHICONG QIU et al.: "A Maximum Entropy Framework for Semisupervised and Active Learning With Unknown and Label-Scarce Classes", IEEE Transactions on Neural Networks and Learning Systems *
WANG Zhenyu: "Research on Uncertainty-Based Active Learning Algorithms", China Master's Theses Full-Text Database, Information Science and Technology *
HU Feng et al.: "Active Learning Method Based on Neighborhood Rough Sets", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434736A (en) * 2020-11-24 2021-03-02 成都潜在人工智能科技有限公司 Deep active learning text classification method based on pre-training model
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
CN115357718A (en) * 2022-10-20 2022-11-18 佛山科学技术学院 Method, system, device and storage medium for discovering repeated materials of theme integration service
CN115357718B (en) * 2022-10-20 2023-01-24 佛山科学技术学院 Method, system, device and storage medium for discovering repeated materials of theme integration service

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
WO2021203581A1 (en) Key information extraction method based on fine annotation text, and apparatus and storage medium
CN111914061B (en) Radius-based uncertainty sampling method and system for text classification active learning
CN111046941B (en) Target comment detection method and device, electronic equipment and storage medium
CN111008274B (en) Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN110795525B (en) Text structuring method, text structuring device, electronic equipment and computer readable storage medium
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN111160031A (en) Social media named entity identification method based on affix perception
CN111914062B (en) Long text question-answer pair generation system based on keywords
US11120268B2 (en) Automatically evaluating caption quality of rich media using context learning
CN115495555A (en) Document retrieval method and system based on deep learning
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
CN116151132A (en) Intelligent code completion method, system and storage medium for programming learning scene
CN114647715A (en) Entity recognition method based on pre-training language model
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN114781376A (en) News text abstract generation method based on deep learning
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113901781B (en) Similar case matching method integrating segment coding and affine mechanism
CN113836941B (en) Contract navigation method and device
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200050 19th floor, Unicom building, 1033 Changning Road, Changning District, Shanghai

Applicant after: Shanghai Leyan Technology Co.,Ltd.

Address before: 200050 16th / 18th / 19th floor, Unicom building, 1033 Changning Road, Changning District, Shanghai

Applicant before: SHANGHAI LEYAN INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant