CN111914061A - Radius-based uncertainty sampling method and system for text classification active learning - Google Patents



Publication number
CN111914061A
CN111914061A
Authority
CN
China
Prior art keywords
data
category
radius
prediction
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010669244.4A
Other languages
Chinese (zh)
Other versions
CN111914061B (en)
Inventor
朱其立
沈李斌
廖千姿
顾钰仪
赵迎功
吴海华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Leyan Information Technology Co ltd
Original Assignee
Shanghai Leyan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Leyan Information Technology Co ltd filed Critical Shanghai Leyan Information Technology Co ltd
Priority to CN202010669244.4A priority Critical patent/CN111914061B/en
Publication of CN111914061A publication Critical patent/CN111914061A/en
Application granted granted Critical
Publication of CN111914061B publication Critical patent/CN111914061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods


Abstract

The invention discloses a radius-based uncertainty sampling method and system for text classification active learning, applied to multi-class short texts, which weaken the adverse effect of noise in such scenes on the result, have high universality, and are applicable to any deep model with a hidden layer. The technical scheme is as follows: a text classifier scores the information entropy of each data point of the unlabeled data and gives the unlabeled data their corresponding prediction categories; the radius of each prediction category is calculated separately; and the information entropy scores of the unlabeled data are combined with the radii of their prediction categories to obtain a comprehensive score.

Description

Radius-based uncertainty sampling method and system for text classification active learning
Technical Field
The invention relates to a sampling method and system, and in particular to a radius-based uncertainty sampling method and system for text classification active learning.
Background
With the prevalence of e-commerce and online communication, multi-class short texts from many application fields, such as instant messaging, online chat logs, bulletin board system titles, Internet news comments, Twitter, etc., pervade people's daily lives. Therefore, in many scenarios such as topic recommendation and e-commerce chat robots, processing short text has become very important. However, because short text is non-standard by nature, manual work is required to deal with spelling errors, non-standard terms, and noise. Furthermore, since most short text datasets are very unevenly distributed, repeatedly labeling data of the same class wastes a significant amount of annotation effort.
Active learning can now be used to deal with the short text classification problem.
The framework of active learning is shown in fig. 1.
Given a dataset Z = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional feature vector and y_i ∈ {0, 1, ..., K}. To describe active learning, the dataset is divided into a labeled set and an unlabeled set. f is the classifier.
The general active learning algorithm mainly comprises the following steps:
a. the algorithm starts, at t = 0, from a small labeled dataset L_t ⊂ Z and a large unlabeled dataset U_t = Z \ L_t;
b. train the classifier f_t on L_t;
c. determine the data point x* ∈ U_t to be labeled in the next iteration according to the sampling method;
d. obtain the label y* for x* by manual annotation;
e. increment t and repeat steps b to e until the classifier achieves the desired model accuracy or the number of iterations reaches a preset limit.
In the process shown in fig. 1, the text classifier learns from the labeled data, evaluates the unlabeled data, selects the most valuable data for manual labeling, adds it to the labeled data, and repeats these steps until the iteration count reaches the upper limit or the model accuracy reaches the standard.
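The loop described above can be sketched in Python as follows; the `train` and `score` callables, the dataset names, and the argmax-based selection are illustrative assumptions, not details from the patent:

```python
import numpy as np

def active_learning_loop(Z_x, Z_y, initial_idx, train, score, budget):
    """Generic pool-based active learning loop (steps a-e above).

    Z_x: feature matrix of the full dataset; Z_y: oracle labels,
    queried lazily to stand in for manual annotation.  `train` fits
    a classifier on the labeled pool; `score` ranks unlabeled points
    by the sampling criterion.  All names are illustrative.
    """
    labeled = set(initial_idx)                     # L_t
    unlabeled = set(range(len(Z_x))) - labeled     # U_t = Z \ L_t
    for _ in range(budget):
        idx = sorted(labeled)
        clf = train(Z_x[idx], Z_y[idx])            # step b: train f_t on L_t
        pool = sorted(unlabeled)
        scores = score(clf, Z_x[pool])             # step c: sampling method
        x_star = pool[int(np.argmax(scores))]      # most valuable point x*
        labeled.add(x_star)                        # step d: oracle gives y*
        unlabeled.remove(x_star)                   # step e: t <- t + 1
    idx = sorted(labeled)
    return train(Z_x[idx], Z_y[idx])
```

Any scoring rule, including the radius-based one developed below, can be plugged in as `score`.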
However, when traditional active learning methods are applied to multi-class short text datasets, their performance turns out to be poor, with little difference from random sampling. Experiments show that as the number of categories in a dataset increases, the performance of existing sampling methods declines, and no such method has been applied in industry.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention aims to solve the above problems and provides a radius-based uncertainty sampling method and system for text classification active learning which are applied to multi-class short texts, weaken the adverse effect of noise in such scenes on the result, have high universality, and are applicable to any deep model with a hidden layer.
The technical scheme of the invention is as follows: the invention discloses a radius-based uncertainty sampling method for text classification active learning, which comprises processing unlabeled data and computing a comprehensive score based on the processing result, wherein:
the processing process of the unlabeled data comprises the following steps:
scoring the information entropy of each data point of the unlabeled data through a text classifier, and giving the unlabeled data their prediction categories;
respectively calculating the radius of each prediction category;
the comprehensive scoring process comprises the following steps:
and combining the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the process of calculating the radius of a prediction category further comprises:
obtaining the center of the category by averaging the data points in the prediction category;
calculating the cosine similarity of each data point in the prediction category to the center;
and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the center c(category y) of a prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
According to an embodiment of the radius-based uncertainty sampling method for text classification active learning of the present invention, the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of its predicted category y.
The invention also discloses a radius-based uncertainty sampling system for text classification active learning, which comprises an unlabeled data processing module and a comprehensive scoring module, wherein:
the unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data through a text classifier, give the unlabeled data their corresponding prediction categories, and calculate the radius of each prediction category respectively;
the comprehensive scoring module is configured to combine the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the unlabeled data processing module scores the information entropy H(x_i) of a data point of the unlabeled data as:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the configuration of the unlabeled data processing module further includes: obtaining the center of the category by averaging the data points in the prediction category; calculating the cosine similarity of each data point in the prediction category to the center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
In accordance with an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, the unlabeled data processing module is configured with the center c(category y) of a prediction category as:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
According to an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention, in the configuration of the comprehensive scoring module, the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of the predicted category y.
Compared with the prior art, the invention has the following beneficial effects. The radius-based uncertainty sampling method relaxes the similarity-weight condition, taking the radius of the whole category as the weight instead of a cosine similarity computed for each point. In terms of application field and design framework, traditional active learning applied to text classification focuses more on theory and easily ignores the noisier scenes found in industry. In terms of model universality, the optimization scheme of the invention is not limited to a particular downstream text classifier, and the sampling method of the invention can be used with any deep model that has a hidden layer.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
Fig. 1 illustrates a conventional active learning framework.
FIG. 2 is a flowchart illustrating an embodiment of the radius-based uncertainty sampling method of text classification active learning of the present invention.
Fig. 3 shows a flow chart of the radius calculation step in the embodiment shown in fig. 2.
FIG. 4 illustrates a schematic diagram of an embodiment of the radius-based uncertainty sampling system for text classification active learning of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
FIG. 2 illustrates a flow of an embodiment of the present invention of a radius-based uncertainty sampling method of text classification active learning. Referring to fig. 2, the method for sampling uncertainty based on radius for active learning of text classification in this embodiment includes processing unlabeled data and performing comprehensive scoring on the processing result based on the unlabeled data.
Firstly, the information entropy of each data point of the unlabeled data is scored through a text classifier, and the prediction category of the unlabeled data is given.
Each data point x_i constituting the data x to be annotated is given an information entropy score H(x_i) as follows:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
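A minimal sketch of this entropy score, assuming the classifier exposes an (n, k) array of predicted class probabilities (the function and parameter names are illustrative):

```python
import numpy as np

def entropy_score(probs, eps=1e-12):
    """H(x_i) = -sum_j P(y_i = j | x_i) * log P(y_i = j | x_i).

    `probs` is an (n, k) array of predicted class probabilities,
    one row per data point; eps guards against log(0)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)
```

A uniform prediction over k classes yields the maximum score log k, while a confident one-hot prediction scores near zero, so higher values mark more uncertain points.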
The text classifier gives the prediction category of the data x to be labeled.
The text classifier of the present embodiment may use deep learning models such as FastText, BERT, CNN, and LSTM using Attention mechanism.
The FastText model is based on the Word2Vec model, using the entire text sequence to make its prediction rather than a context window, and is many orders of magnitude faster than a traditional deep network.
BERT (Bidirectional Encoder Representations from Transformers) implements the encoder portion of a bidirectional Transformer. The model is pre-trained on Wikipedia and BookCorpus with two methods, Masked LM and Next Sentence Prediction, which capture word-level and sentence-level representations respectively. Its main innovation is to learn a pre-trained language model and improve performance on downstream tasks through fine-tuning.
A Convolutional Neural Network (CNN) can be applied to text classification in much the same way as to image classification. This embodiment uses the most basic CNN model, which has five layers: the first is an embedding layer that converts a sentence into a two-dimensional matrix; the second is a convolutional layer; the third is a pooling layer; the fourth is a fully-connected layer; and the last is a softmax layer.
The attention mechanism removes the traditional encoder-decoder structure's dependence on a fixed-length internal representation during encoding and decoding. In implementation, the attention mechanism keeps the intermediate outputs of an LSTM (Long Short-Term Memory) encoder over the input sequence and trains the model to selectively attend to them when producing the output sequence. The invention uses the hidden-layer output before the model's softmax layer as the data representation.
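The data representation used throughout, i.e. the hidden-layer output taken just before the softmax layer, can be illustrated with a toy one-hidden-layer network. This is a stand-in sketch, not the architecture of FastText, BERT, the CNN, or the attention LSTM, and all names and weights are illustrative:

```python
import numpy as np

def hidden_representation(x, W1, b1):
    # h(data): the hidden-layer output used as the data representation,
    # taken just before the softmax layer.
    return np.tanh(x @ W1 + b1)

def softmax_head(h, W2, b2):
    # Classifier head: predicted class probabilities from h(data),
    # which also feed the information entropy score H(x_i).
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

In a real deep model the same idea applies: `hidden_representation` corresponds to the penultimate layer's output, and `softmax_head` to the final classification layer.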
The processing for unlabeled data is partially as follows.
First, the text classifier gives corresponding prediction classes to unlabeled data.
Then, for each prediction category 1-k, the radius of the prediction category is calculated separately.
As shown in fig. 3, taking the calculation of the prediction type y as an example, the process of calculating the radius r (type y) of the prediction type y is as follows.
Firstly, the center of category y is obtained by averaging the data points in the predicted category y:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where |category y| represents the number of data points belonging to the prediction category y. In this equation, each data i in the prediction category y, i.e., data 1 through data x in the leftmost dashed box in fig. 3, actually enters the calculation through its hidden-layer vector h(data i).
Then, the cosine similarity of each data point i in the prediction category y and the center c (category y) is calculated,
Figure BDA0002581619460000072
the cosine similarity is used to measure the representativeness of a data point, i.e. the distance from the data point i to the center, and h (data i) is a representative vector of the data i output by the model hidden layer. Cosine similarity is a measurement mode of distance, d (data i) on the left side of the formula refers to cosine similarity between a data point i and a center c, and distance between the data i and the c is abbreviated as d.
And finally, the maximum cosine similarity value d(data i) in the prediction category y is selected as the radius r(category y) of the prediction category y:

r(category y) = max_{data i ∈ category y} d(data i)
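The three steps above (center by averaging, cosine similarity to the center, maximum similarity as radius) can be sketched as follows; the function name and array layout are illustrative:

```python
import numpy as np

def class_radius(H, labels):
    """Radius r(category y) for each predicted category.

    H: (n, d) array of hidden-layer vectors h(data i);
    labels: predicted category of each point.  Per the steps above:
    average the members of a category to get its center c, compute
    the cosine similarity d(data i) of each member to c, and take
    the maximum as the radius of that category."""
    radii = {}
    for y in np.unique(labels):
        members = H[labels == y]
        c = members.mean(axis=0)                       # center c(category y)
        num = members @ c
        den = np.linalg.norm(members, axis=1) * np.linalg.norm(c)
        d = num / den                                  # cosine similarity d(data i)
        radii[y] = float(d.max())                      # r = max similarity
    return radii
```

With the radii in hand, each unlabeled point inherits the radius of its predicted category when the comprehensive score is formed.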
A comprehensive score is then obtained from the processing results of the unlabeled data (the information entropy score H(x) of the unlabeled data x and the radius r of its prediction category). Specifically, after the radius r of each prediction category has been calculated, the information entropy score H(x) and the radius r(category y) of the prediction category y of the unlabeled data are combined to obtain the comprehensive score:

V(x) = H(x) · r(category y)
V(x) is the sampling criterion for active learning: V(x) is calculated for each point, all the values are sorted, and the best points are selected for labeling. This is the general outline of an active learning sampling method, and the method provided by the invention is one realization of V(x).
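The combination step can be sketched as follows. The exact formula for V(x) appears only as an image in the source; since the text describes the radius of the category as the weight on the entropy score, a multiplicative form V(x) = H(x) * r(category y) is assumed in this sketch, and all function names are illustrative:

```python
import numpy as np

def composite_scores(entropy, pred_labels, radii):
    # V(x): entropy score H(x) weighted by the radius of the
    # predicted category.  The multiplicative form is an assumption;
    # the source gives the formula only as an image.
    r = np.array([radii[y] for y in pred_labels])
    return np.asarray(entropy, dtype=float) * r

def select_to_label(entropy, pred_labels, radii, n):
    # Rank all unlabeled points by V(x) and pick the top n to label.
    v = composite_scores(entropy, pred_labels, radii)
    return np.argsort(-v)[:n]
```

Ranking all unlabeled points by V(x) and labeling the top-scoring ones realizes the sampling criterion described above.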
FIG. 4 illustrates the principles of an embodiment of the present invention of a radius-based uncertainty sampling system for text classification active learning. Referring to fig. 4, the system of the present embodiment includes: the system comprises an unlabeled data processing module and a comprehensive grading module.
The unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data through the text classifier, give the unlabeled data their corresponding prediction categories, and calculate the radius of each prediction category respectively. In this configuration, the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
In the configuration of the unlabeled data processing module, calculating the radius of a prediction category further includes: obtaining the center of the category by averaging the data points in the prediction category; calculating the cosine similarity of each data point in the prediction category to the center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
In the above configuration, the center c(category y) of the prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category. The cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

The radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
the comprehensive scoring module is configured to combine the information entropy score of the unlabeled data and the radius of the prediction category of the unlabeled data to obtain a comprehensive score.
The comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x, and r(category y) is the radius of the predicted category y.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A radius-based uncertainty sampling method for active learning of text classification is characterized by comprising the steps of processing unlabeled data and comprehensively scoring the processing results based on the unlabeled data, wherein:
the processing process of the unlabeled data comprises the following steps:
scoring the information entropy of each data point of the unlabeled data through a text classifier, and giving the unlabeled data a prediction category;
respectively calculating the radius of each prediction category;
the comprehensive scoring process comprises the following steps:
and combining the information entropy scores of the unlabeled data with the radii of the prediction categories of the unlabeled data to obtain a comprehensive score.
2. The method of claim 1, wherein the information entropy score H(x_i) of a data point of the unlabeled data is:

H(x_i) = - Σ_{j=1}^{k} P(y_i = j | x_i) · log P(y_i = j | x_i)

where P(y_i = j | x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k represents the number of possible labels of data point x_i.
3. The method of claim 1, wherein the step of calculating the radius of the prediction class further comprises:
obtaining the center of the category by averaging the data points in the prediction category;
calculating the cosine similarity of each data point in the prediction category to the center;
and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
4. The method of claim 3, wherein the center c(category y) of the prediction category is:

c(category y) = (1 / |category y|) · Σ_{data i ∈ category y} h(data i)

where y is a prediction category and h(data i) is the hidden-layer representation vector of data i; the cosine similarity d(data i) of each data point i in the prediction category y to the center c(category y) is:

d(data i) = (h(data i) · c(category y)) / (‖h(data i)‖ ‖c(category y)‖)

and the radius r(category y) of the prediction category y is:

r(category y) = max_{data i ∈ category y} d(data i)
5. The method of claim 1, wherein the comprehensive score V(x) is calculated as:

V(x) = H(x) · r(category y)

where H(x) is the information entropy score of the unlabeled data x and r(category y) is the radius of the predicted category y.
6. A radius-based uncertainty sampling system for text classification active learning, the system comprising an unlabeled data processing module and a comprehensive scoring module, wherein:
the unlabeled data processing module is configured to score the information entropy of each data point of the unlabeled data with a text classifier, assign each unlabeled data point a corresponding prediction category, and calculate the radius of each prediction category separately;
the comprehensive scoring module is configured to combine the information entropy score of each unlabeled data point with the radius of its prediction category to obtain a comprehensive score.
7. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the information entropy score of a data point x_i of the unlabeled data is H(x_i):

H(x_i) = -\sum_{j=1}^{k} P(y_i = j \mid x_i) \log P(y_i = j \mid x_i)

where P(y_i = j \mid x_i) is the probability with which the text classifier predicts the label of data point x_i to be j, and the parameter k denotes the number of candidate labels for x_i.
8. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the unlabeled data processing module is further configured to calculate the radius of a prediction category by: averaging the data points in the prediction category to obtain the center of the category; calculating the cosine similarity between each data point in the prediction category and that center; and selecting the maximum cosine similarity value in the prediction category as the radius of the prediction category.
9. The radius-based uncertainty sampling system for text classification active learning according to claim 8, wherein the unlabeled data processing module is configured to compute the center c(y) of a prediction category y as:

c(y) = \frac{1}{|D_y|} \sum_{x_i \in D_y} x_i

where y is a prediction category and D_y is the set of data points predicted into category y; the cosine similarity d(x_i) between each data point x_i in category y and the center c(y) as:

d(x_i) = \cos(x_i, c(y)) = \frac{x_i \cdot c(y)}{\lVert x_i \rVert \, \lVert c(y) \rVert}

and the radius r(y) of prediction category y as:

r(y) = \max_{x_i \in D_y} d(x_i)
10. The radius-based uncertainty sampling system for text classification active learning according to claim 6, wherein the comprehensive scoring module is configured to calculate the composite score V(x) as:

V(x) = H(x) \cdot r(y)

where H(x) is the information entropy score of the unlabeled data point x and r(y) is the radius of its predicted category y.
CN202010669244.4A 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning Active CN111914061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010669244.4A CN111914061B (en) 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning

Publications (2)

Publication Number Publication Date
CN111914061A true CN111914061A (en) 2020-11-10
CN111914061B CN111914061B (en) 2021-04-16

Family

ID=73227058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669244.4A Active CN111914061B (en) 2020-07-13 2020-07-13 Radius-based uncertainty sampling method and system for text classification active learning

Country Status (1)

Country Link
CN (1) CN111914061B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130304824A1 (en) * 2010-11-24 2013-11-14 Crambo S.A. Communication system and method involving the creation of virtual spaces
US9311599B1 (en) * 2011-07-08 2016-04-12 Integral Ad Science, Inc. Methods, systems, and media for identifying errors in predictive models using annotators
US20160307113A1 (en) * 2015-04-20 2016-10-20 Xerox Corporation Large-scale batch active learning using locality sensitive hashing
CN107273912A (en) * 2017-05-10 2017-10-20 重庆邮电大学 A kind of Active Learning Method based on three decision theories
CN108846424A (en) * 2018-05-30 2018-11-20 华东理工大学 A kind of fuzzy multi-core classifier of cost-sensitive
CN108875816A (en) * 2018-06-05 2018-11-23 南京邮电大学 Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN109656808A (en) * 2018-11-07 2019-04-19 江苏工程职业技术学院 A kind of Software Defects Predict Methods based on hybrid active learning strategies
CN110059752A (en) * 2019-04-19 2019-07-26 江苏工程职业技术学院 A kind of statistical learning querying method based on comentropy Sampling Estimation
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN110188197A (en) * 2019-05-13 2019-08-30 北京一览群智数据科技有限责任公司 It is a kind of for marking the Active Learning Method and device of platform
CN110348481A (en) * 2019-06-05 2019-10-18 华东理工大学 One kind being based on the gravitational network inbreak detection method of neighbour's sample
US20200151518A1 (en) * 2018-09-17 2020-05-14 Purdue Research Foundation Regularized multi-metric active learning system for image classification
US20200202210A1 (en) * 2018-12-24 2020-06-25 Nokia Solutions And Networks Oy Systems and methods for training a neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHICONG QIU et al.: "A Maximum Entropy Framework for Semisupervised and Active Learning With Unknown and Label-Scarce Classes", IEEE Transactions on Neural Networks and Learning Systems *
WANG Zhenyu: "Research on Uncertainty-Based Active Learning Algorithms", China Master's Theses Full-Text Database, Information Science and Technology *
HU Feng et al.: "Active Learning Method Based on Neighborhood Rough Sets", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434736A (en) * 2020-11-24 2021-03-02 成都潜在人工智能科技有限公司 Deep active learning text classification method based on pre-training model
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
CN115357718A (en) * 2022-10-20 2022-11-18 佛山科学技术学院 Method, system, device and storage medium for discovering repeated materials of theme integration service
CN115357718B (en) * 2022-10-20 2023-01-24 佛山科学技术学院 Method, system, device and storage medium for discovering repeated materials of theme integration service

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
WO2021203581A1 (en) Key information extraction method based on fine annotation text, and apparatus and storage medium
CN111914061B (en) Radius-based uncertainty sampling method and system for text classification active learning
CN111046941B (en) Target comment detection method and device, electronic equipment and storage medium
CN111008274B (en) Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN110795525B (en) Text structuring method, text structuring device, electronic equipment and computer readable storage medium
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN111160031A (en) Social media named entity identification method based on affix perception
CN111914062B (en) Long text question-answer pair generation system based on keywords
US11120268B2 (en) Automatically evaluating caption quality of rich media using context learning
CN115495555A (en) Document retrieval method and system based on deep learning
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
CN116151132A (en) Intelligent code completion method, system and storage medium for programming learning scene
CN114647715A (en) Entity recognition method based on pre-training language model
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN114781376A (en) News text abstract generation method based on deep learning
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113901781B (en) Similar case matching method integrating segment coding and affine mechanism
CN113836941B (en) Contract navigation method and device
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200050 19th floor, Unicom building, 1033 Changning Road, Changning District, Shanghai

Applicant after: Shanghai Leyan Technology Co.,Ltd.

Address before: 200050 16th / 18th / 19th floor, Unicom building, 1033 Changning Road, Changning District, Shanghai

Applicant before: SHANGHAI LEYAN INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant