CN109241397A - Method and apparatus for cleaning data

Info

Publication number: CN109241397A
Application number: CN201810721515.9A
Authority: CN
Inventors: 徐兴
Original assignee: Sichuan Feixun Information Technology Co Ltd
Current assignee: Phicomm Shanghai Co Ltd
Priority date: 2018-07-06
Filing date: 2018-07-06
Publication date: 2019-01-18

Abstract

The application discloses a method and apparatus for cleaning data. During data cleaning, the data that can be judged with high probability to be correct or wrong are selected first, and the intermediate data that are harder to confirm are then screened again to pick out positive and negative samples. This greatly reduces manual effort, and the positive and negative samples selected in this way are highly accurate. By combining transfer learning with automatic threshold setting, the data can be cleaned quickly and reliably.

Description

Method and apparatus for cleaning data

Technical field

This application relates to the field of computer technology, and in particular to a method and apparatus for cleaning data.

Background

With the development of computer science and technology, deep learning is being applied ever more widely in daily life. Data is to deep learning what a power source is to a machine: without data, even the best deep learning model is useless. Web crawlers are an important way of obtaining data, but data crawled from the web can contain a large number of errors, which creates a heavy workload for the people who clean it.

Convolutional neural networks are now widely used for image classification, which presupposes a large amount of data. Images obtained by web crawlers need further cleaning. Commonly used cleaning methods include:

(1) Manual cleaning

Manual cleaning is the most common method in current data cleaning. It relies on human inspection to remove wrong images from a large amount of data.

Its main drawbacks are high labor cost and low speed.

(2) Removing duplicate and similar images by MD5 deduplication or an image similarity algorithm

Deduplication and similar-image removal algorithms can get rid of repeated data and data that differ only slightly.

The main drawback of this method is that it can only remove duplicate or similar images and cannot truly complete the data cleaning.

(3) Cleaning based on repeated deep learning training iterations

This method first trains a preliminary convolutional neural network directly on the low-quality classified image data and then uses that network to classify the data itself. Images whose pseudo-probability of belonging to their class, as predicted by the model, is below a certain level are washed out, as are image categories containing fewer than a certain number of images. The process is repeated until the recognition rate for every image category reaches a preset standard.

This method has a limited range of application, for example where each class in a dataset contains only a small amount of wrong data and there is almost no interference between the wrong data and the rest of the data. If wrong data make up the majority of a class, or if wrong and correct data interfere strongly with each other, the cleaning result is badly affected.

Therefore, how to clean data obtained by web crawlers automatically, correctly and quickly has become a technical problem that needs to be solved.

Summary of the invention

Aspects of the application provide a method and apparatus for cleaning data that can automatically, correctly and quickly clean data obtained by web crawlers.

A first aspect of the application provides a method for cleaning data, comprising:

cleaning multi-class data to obtain correct data and wrong data;

training on the correct data to obtain a first training model;

deduplicating a given set of data to be cleaned and removing similar data using a similarity threshold, to obtain first remaining data to be cleaned;

selecting, according to a specified rule, at least one positive sample, at least one negative sample and second remaining data to be cleaned from the first remaining data to be cleaned;

performing transfer learning on the at least one positive sample and the at least one negative sample using the first training model, to obtain a second training model;

determining a first threshold and a second threshold according to the second training model, wherein the first threshold and the second threshold are set on the confidence with which data are judged to be positive or negative samples, the first threshold is less than the second threshold, the first threshold is calculated from a preset precision for negative samples, and the second threshold is calculated from a preset precision for positive samples;

dividing the second remaining data to be cleaned into three classes (positive samples, data to be manually cleaned, and negative samples) using the second training model, the first threshold and the second threshold, wherein data in the second remaining data to be cleaned whose confidence is greater than the second threshold are judged to be positive-sample data, data whose confidence is less than the first threshold are judged to be negative-sample data, and data whose confidence lies between the first threshold and the second threshold are judged to be data to be manually cleaned.

Optionally, determining the first threshold and the second threshold according to the second training model comprises:

dividing the at least one positive sample and the at least one negative sample into a training set and a validation set in a certain proportion;

determining the first threshold and the second threshold according to statistics of the second training model's results for positive and negative samples on the validation set and the respective preset precisions for positive and negative samples.

Optionally, the training set comprises a predetermined first proportion of the positive samples and a predetermined second proportion of the negative samples, and the validation set comprises a predetermined third proportion of the positive samples and a predetermined fourth proportion of the negative samples, wherein the first proportion is greater than the third proportion and the second proportion is greater than the fourth proportion.

Optionally, the first proportion and the third proportion sum to 100%, and the second proportion and the fourth proportion sum to 100%.

Optionally, the first proportion and the second proportion are both 90%, and the third proportion and the fourth proportion are both 10%.
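The 90%/10% division described above can be sketched as a per-class split, so that the positive and the negative samples are each divided in the stated proportion. This is only an illustrative sketch: the function name `split_samples`, the fixed random seed and the file names are assumptions and do not come from the application.

```python
import random

def split_samples(samples, train_ratio=0.9, seed=0):
    """Shuffle one class of samples and split it into (training, validation) parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample pools; the names and counts are purely illustrative.
positives = [f"pos_{i}.jpg" for i in range(200)]
negatives = [f"neg_{i}.jpg" for i in range(500)]

# Split each class separately so both keep the 90% / 10% proportion.
pos_train, pos_val = split_samples(positives)
neg_train, neg_val = split_samples(negatives)

train_set = pos_train + neg_train
val_set = pos_val + neg_val
```

Splitting per class rather than over the pooled samples keeps the first/third and second/fourth proportions independent of each other, which is how the optional features above are worded.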

A second aspect of the application provides an apparatus for cleaning data, comprising:

a first cleaning module, configured to clean multi-class data to obtain correct data and wrong data;

a first training module, configured to train on the correct data to obtain a first training model;

a deduplication module, configured to deduplicate a given set of data to be cleaned and remove similar data using a similarity threshold, to obtain first remaining data to be cleaned;

a selection module, configured to select, according to a specified rule, at least one positive sample, at least one negative sample and second remaining data to be cleaned from the first remaining data to be cleaned;

a learning module, configured to perform transfer learning on the at least one positive sample and the at least one negative sample using the first training model, to obtain a second training model;

a determining module, configured to determine a first threshold and a second threshold according to the second training model, wherein the first threshold and the second threshold are set on the confidence with which data are judged to be positive or negative samples, the first threshold is less than the second threshold, the first threshold is calculated from a preset precision for negative samples, and the second threshold is calculated from a preset precision for positive samples;

a second cleaning module, configured to divide the second remaining data to be cleaned into three classes (positive samples, data to be manually cleaned, and negative samples) using the second training model, the first threshold and the second threshold, wherein data in the second remaining data to be cleaned whose confidence is greater than the second threshold are judged to be positive-sample data, data whose confidence is less than the first threshold are judged to be negative-sample data, and data whose confidence lies between the first threshold and the second threshold are judged to be data to be manually cleaned.

Optionally, the determining module specifically comprises:

a division unit, configured to divide the at least one positive sample and the at least one negative sample into a training set and a validation set in a certain proportion;

a determination unit, configured to determine the first threshold and the second threshold according to statistics of the second training model's results for positive and negative samples on the validation set and the respective preset precisions for positive and negative samples.

With the method and apparatus described above, during data cleaning the data that are most likely correct (i.e. positive samples) and most likely wrong (i.e. negative samples) are selected first, and the intermediate data that are harder to confirm are screened again to pick out further positive and negative samples. This greatly reduces manual effort, and the positive and negative samples selected in this way are highly accurate. By combining transfer learning with automatic threshold setting, the data can be cleaned quickly and reliably.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a method for cleaning data according to one embodiment of the application;

Fig. 2 is a schematic flowchart of another method for cleaning data according to another embodiment of the application;

Fig. 3 is a schematic structural diagram of an apparatus for cleaning data according to another embodiment of the application.

Detailed description

To make the purposes, technical solutions and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, rather than all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the application.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it. The terms "system" and "network" are often used interchangeably herein.

Fig. 1 is a schematic flowchart of a method for cleaning data according to one embodiment of the application. The method may be executed by a processor or chip of various devices, for example a processor or chip of a computer, a mobile phone, a personal digital assistant (PDA) and the like.

Step 101, clean multi-class data to obtain correct data and wrong data.

For example, the acquired multi-class data (such as multiple classes of pictures) are first deduplicated (i.e. repeated data are removed) and then filtered for similarity using a similarity threshold (i.e. near-duplicate data whose similarity is greater than or equal to the threshold are removed). Deduplication can be performed, for example, with Message Digest Algorithm 5 (MD5) checksums, and near-duplicate data can be removed by grey-level histogram similarity matching. The data are then cleaned manually via the device to obtain correct data and wrong data.

Data cleaning here means picking out the wrong data in the raw data. The data picked out are the wrong data; what remains after cleaning is either correct data together with uncertain data, or correct data only.

Step 102, train on the correct data to obtain a first training model.

Step 103, deduplicate a given set of data to be cleaned and remove similar data using a similarity threshold, to obtain first remaining data to be cleaned.

For example, repeated data are removed using MD5 checksums, and near-duplicate data are removed by grey-level histogram similarity matching.
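A minimal sketch of this two-stage filtering follows. It assumes each image is available both as raw bytes (for the MD5 checksum) and as a flat sequence of 8-bit grey values (for the histogram); here one `bytes` object plays both roles. The application does not say how histogram similarity is measured, so histogram intersection is used purely as an assumed stand-in, and all function names are illustrative.

```python
import hashlib

def md5_dedup(blobs):
    """Keep the first item for each distinct MD5 digest of the raw bytes."""
    seen, unique = set(), []
    for blob in blobs:
        digest = hashlib.md5(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

def gray_histogram(pixels, bins=256):
    """Normalised histogram of 8-bit grey values (pixels must be non-empty)."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    total = float(len(pixels))
    return [c / total for c in counts]

def histogram_similarity(a, b):
    """Histogram intersection in [0, 1]; 1.0 means identical histograms (assumed measure)."""
    return sum(min(x, y) for x, y in zip(a, b))

def remove_near_duplicates(images, threshold=0.9):
    """Keep an image only if its similarity to every kept image is below the threshold."""
    kept, kept_hists = [], []
    for img in images:
        hist = gray_histogram(img)
        if all(histogram_similarity(hist, h) < threshold for h in kept_hists):
            kept.append(img)
            kept_hists.append(hist)
    return kept

# Tiny synthetic "images": flat sequences of grey values.
img_a = bytes([10] * 50 + [200] * 50)
img_b = bytes([10] * 50 + [200] * 50)   # exact duplicate of img_a
img_c = bytes([10] * 48 + [200] * 52)   # near-duplicate of img_a
img_d = bytes([90] * 100)               # clearly different

unique = md5_dedup([img_a, img_b, img_c, img_d])
remaining = remove_near_duplicates(unique)
```

In this toy run the exact copy is dropped by the checksum stage and the near-duplicate by the histogram stage, leaving `img_a` and `img_d`. The pairwise comparison is quadratic in the number of kept images, which is acceptable for a sketch but would need indexing or batching on a real crawl.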

Step 104, select, according to a specified rule, at least one positive sample, at least one negative sample and second remaining data to be cleaned from the first remaining data to be cleaned.

For example, the at least one positive sample, the at least one negative sample and the second remaining data to be cleaned are selected from the first remaining data to be cleaned according to a similarity threshold.

Step 105, perform transfer learning on the at least one positive sample and the at least one negative sample using the first training model, to obtain a second training model.

For example, transfer learning is performed on the at least one positive sample and the at least one negative sample using the first training model for a predetermined time (such as 5 minutes) to obtain the second training model.
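The idea of step 105 can be illustrated with a deliberately small, framework-free sketch: a frozen feature extractor stands in for the first training model, and only a new binary head is fitted on the selected positive and negative samples until a time budget runs out. Everything here is an assumption made for illustration (the toy `frozen_features` function, the logistic head, the learning rate and the 5-second budget); the application itself only states that transfer learning is done from the first training model for a predetermined time.

```python
import math
import time

def frozen_features(x):
    """Hypothetical stand-in for the first training model's feature extractor (never updated)."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, budget_seconds=5.0, max_epochs=500, lr=0.5):
    """Fit a logistic head on frozen features until the time budget or epoch cap is reached."""
    feats = [frozen_features(x) for x in samples]
    w, b = [0.0] * len(feats[0]), 0.0
    deadline = time.monotonic() + budget_seconds
    for _ in range(max_epochs):
        if time.monotonic() >= deadline:
            break
        for f, y in zip(feats, labels):
            # Gradient of the log-loss for one sample.
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def confidence(model, x):
    """Confidence that x is a positive sample, according to the fitted head."""
    w, b = model
    return sigmoid(sum(wi * fi for wi, fi in zip(w, frozen_features(x))) + b)

# Toy positive and negative samples (two-dimensional inputs instead of images).
positives = [(0.9, 0.8), (0.8, 0.9), (1.0, 0.7)]
negatives = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3)]
second_model = train_head(positives + negatives, [1] * 3 + [0] * 3)
```

In practice the frozen part would be the convolutional layers of the first training model and the head a new two-class output layer in a deep learning framework; the sketch only shows the division of labor between reused features and the newly trained classifier, and how a wall-clock budget can bound the training.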

Step 106, determine a first threshold and a second threshold according to the second training model.

The second training model is a binary classification model. The first threshold and the second threshold are set on the confidence with which data are judged to be positive or negative samples. The first threshold is less than the second threshold; the first threshold is calculated from the preset precision for negative samples, and the second threshold is calculated from the preset precision for positive samples.

For example, the first threshold and the second threshold are values between 0 and 1. The higher a threshold is set, the higher the precision of the positive class and the lower the precision of the negative class.

For example, four cases can arise with the second training model. If an instance is a positive sample and is predicted as positive, it is a true positive (TP); if it is a negative sample but is predicted as positive, it is a false positive (FP). Correspondingly, if an instance is a negative sample and is predicted as negative, it is a true negative (TN); if it is a positive sample but is predicted as negative, it is a false negative (FN).

TP: the number of positive samples correctly detected;

FN: the number of positive samples that were missed (misses);

FP: the number of negative samples wrongly judged to be positive (false alarms);

TN: the number of negative samples correctly judged to be negative;

The relationships between them are:

ALL = TP + FP + TN + FN: total number of samples

Actual Positive = P = TP + FN: number of actual positive samples

Actual Negative = N = FP + TN: number of actual negative samples

Predict True = T = TP + FP: number of samples predicted as positive

Predict False = F = TN + FN: number of samples predicted as negative
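These counts and their relationships can be checked on a small example. The sketch below assumes that a sample is predicted positive when its confidence is at least a single cut-off of 0.5; that cut-off, the function name and the toy data are illustrative assumptions, not values from the application.

```python
def confusion_counts(labels, confidences, threshold=0.5):
    """Count TP, FP, TN, FN when confidence >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for y, c in zip(labels, confidences):
        predicted_positive = c >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Toy ground-truth labels (1 = positive) and model confidences.
labels      = [1,    1,    1,    0,    0,    0,    1,    0]
confidences = [0.95, 0.80, 0.40, 0.10, 0.60, 0.20, 0.99, 0.05]

tp, fp, tn, fn = confusion_counts(labels, confidences)

total = tp + fp + tn + fn          # ALL
actual_positive = tp + fn          # P
actual_negative = fp + tn          # N
predicted_true = tp + fp           # T
predicted_false = tn + fn          # F
```

Every sample falls into exactly one of the four cells, so the total equals the number of samples, and P + N and T + F both equal that same total.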

The relationships between them are summarized in Table 1:

Table 1

After a threshold is set (threshold A > 0.7), the fewer negative samples (FP) contained in the samples selected as positive (TP + FP) the better, which is equivalent to requiring the precision TP/(TP+FP) to be as large as possible. Likewise, after another threshold is set (threshold C < 0.3), the fewer positive samples (FN) contained in the samples selected as negative the better, which is equivalent to requiring the negative-sample precision TN/(TN+FN) to be as large as possible.

Following receiver operating characteristic (ROC) curve theory, two precisions are defined:

Precision of positive samples (positive-precision): TP/(TP+FP).

Precision of negative samples (negative-precision): TN/(TN+FN).
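The two quantities can be computed directly from validation labels and confidences, using one threshold for the positive side and a separate one for the negative side. The sketch below assumes labels of 1 for positive and 0 for negative samples; the thresholds 0.7 and 0.3 follow the example values in the text, while the data and function names are made up for illustration.

```python
def positive_precision(labels, confidences, threshold):
    """TP / (TP + FP) over the samples whose confidence exceeds the threshold."""
    selected = [y for y, c in zip(labels, confidences) if c > threshold]
    return sum(selected) / len(selected) if selected else None

def negative_precision(labels, confidences, threshold):
    """TN / (TN + FN) over the samples whose confidence is below the threshold."""
    selected = [y for y, c in zip(labels, confidences) if c < threshold]
    return selected.count(0) / len(selected) if selected else None

# Toy validation data: 1 = positive sample, 0 = negative sample.
labels      = [1,    1,    1,    0,    1,    0,    0,    0,    1,    0]
confidences = [0.99, 0.95, 0.90, 0.85, 0.75, 0.40, 0.25, 0.15, 0.10, 0.05]

pos_prec = positive_precision(labels, confidences, threshold=0.7)  # 4 of 5 selected are positive
neg_prec = negative_precision(labels, confidences, threshold=0.3)  # 3 of 4 selected are negative
```

Both functions return `None` when no sample is selected, since the ratio is undefined in that case; samples between the two thresholds contribute to neither precision.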

When cleaning data, the positive-sample precision (positive-precision) and the negative-sample precision (negative-precision) need not reach 100%. However, the positive-sample data cleaned out should contain as few negative samples as possible, while the requirement on the negative samples cleaned out can be somewhat looser provided the data volume is sufficient. For example, positive samples generally require a relatively high precision, 98% or above, while the precision of negative samples can be 95% or above (when there is plenty of data a somewhat larger value can be chosen, as long as enough positive-sample data remain).

When the trained binary classification model (i.e. the second training model) is validated, 20 thresholds can be set between 0.9 and 1.0 and the corresponding positive-precision values output; the threshold whose positive-precision is closest to 0.98 is selected as the threshold for picking out reliable positive samples (i.e. the second threshold). Similarly, 20 thresholds are set between 0.1 and 0.3 and the corresponding negative-precision values output; the threshold whose negative-precision is closest to 0.95 is selected as the threshold for picking out reliable negative samples (i.e. the first threshold). The thresholds for picking out positive and negative samples are thus set automatically.
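The automatic threshold setting described above can be sketched as a sweep over evenly spaced candidates. Several details are assumptions: the application does not say how the 20 thresholds are spaced, how ties are broken, or what to do with a candidate that selects no samples, so the sketch uses even spacing, keeps the first of equally good candidates and skips empty selections. The validation data are synthetic.

```python
def candidate_thresholds(low, high, count=20):
    """Evenly spaced candidate thresholds from low to high, inclusive (assumed spacing)."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]

def pick_threshold(labels, confidences, candidates, target, positive=True):
    """Return the candidate whose precision on the validation data is closest to target."""
    best, best_gap = None, None
    for t in candidates:
        if positive:
            chosen = [y for y, c in zip(labels, confidences) if c > t]
            hits = sum(1 for y in chosen if y == 1)
        else:
            chosen = [y for y, c in zip(labels, confidences) if c < t]
            hits = sum(1 for y in chosen if y == 0)
        if not chosen:
            continue  # precision is undefined when nothing is selected
        gap = abs(hits / len(chosen) - target)
        if best_gap is None or gap < best_gap:
            best, best_gap = t, gap
    return best

# Synthetic validation results: confidences 0.01 .. 0.99, with one mislabeled-looking
# negative at confidence 0.93 so that the positive sweep has something to avoid.
val_confidences = [i / 100 for i in range(1, 100)]
val_labels = [0 if i == 93 else int(i >= 50) for i in range(1, 100)]

second_threshold = pick_threshold(val_labels, val_confidences,
                                  candidate_thresholds(0.9, 1.0), target=0.98, positive=True)
first_threshold = pick_threshold(val_labels, val_confidences,
                                 candidate_thresholds(0.1, 0.3), target=0.95, positive=False)
```

Because the two sweeps run over disjoint ranges, the resulting first threshold is always below the second threshold, which matches the ordering required in the text.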

Step 107, divide the second remaining data to be cleaned into three classes (positive samples, data to be manually cleaned, and negative samples) using the second training model, the first threshold and the second threshold.

For example, data in the second remaining data to be cleaned whose confidence is greater than the second threshold are judged to be positive-sample data (for example, True-class data), data whose confidence is less than the first threshold are judged to be negative-sample data (for example, False-class data), and data whose confidence lies between the first threshold and the second threshold are judged to be data to be manually cleaned (for example, Check-class data).
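The three-way division can be written as a simple comparison against the two thresholds. In the sketch below the class names True, Check and False follow the text, while the function name, the file names, the confidences and the threshold values are illustrative assumptions.

```python
def partition(confidences, first_threshold, second_threshold):
    """Split item names into True / Check / False classes by model confidence."""
    result = {"True": [], "Check": [], "False": []}
    for name, conf in confidences.items():
        if conf > second_threshold:
            result["True"].append(name)    # reliable positive sample
        elif conf < first_threshold:
            result["False"].append(name)   # reliable negative sample
        else:
            result["Check"].append(name)   # left for manual cleaning
    return result

# Hypothetical confidences produced by the second training model.
scores = {"a.jpg": 0.99, "b.jpg": 0.62, "c.jpg": 0.04, "d.jpg": 0.97, "e.jpg": 0.20}
classes = partition(scores, first_threshold=0.2, second_threshold=0.95)
```

A confidence exactly equal to either threshold is sent to the Check class here, since the text only assigns data strictly above the second threshold or strictly below the first threshold to the automatic classes.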

The data to be manually cleaned are then cleaned manually via the device.

Therefore, with the method described above, during data cleaning the data that are most likely correct (i.e. positive samples) and most likely wrong (i.e. negative samples) are selected first, and the intermediate data that are harder to confirm are screened again to pick out further positive and negative samples. This greatly reduces manual effort, and the positive and negative samples selected in this way are highly accurate. By combining transfer learning with automatic threshold setting, the data can be cleaned quickly and reliably.

To better illustrate the embodiment, picture cleaning is taken as an example below, using pictures of dishes. Fig. 2 is a schematic flowchart of another method for cleaning data according to another embodiment of the application.

Step 211, obtain data for multiple kinds of dishes with a web crawler.

For example, image data for 193 kinds of dishes are obtained with a web crawler. Dish pictures are pictures of various dishes, such as tomato scrambled eggs or green pepper scrambled eggs, but non-dish pictures may be mixed in among the pictures of the 193 dishes.

Step 212, remove duplicate images using MD5 checksums.

Step 213, remove near-duplicate images by grey-level histogram similarity matching.

For example, if the grey-level histogram similarity of two images is greater than 90%, one of the two images is removed.

After step 213 has been executed, steps 214 and 216 can each be executed.

Step 214, obtain cleaned data by manual screening; steps 215 and 217 can then each be executed.

Step 215, train a model with a deep learning network.

Step 216, pick out 100 non-dish images, then execute step 225.

For example, non-dish images can be images of people, landscapes and so on.

Step 217, pick out 380 dish images as negative samples, then execute step 225.

For example, images whose similarity is below the threshold are picked out from among the tomato scrambled eggs images and are called negative samples.

Step 221, obtain data for a particular dish with a web crawler.

For example, all images of tomato scrambled eggs are obtained, but pictures of green pepper scrambled eggs or of non-dishes may be mixed in among them; these green pepper scrambled eggs pictures and non-dish pictures are the wrong data.

Step 222, remove duplicate images using MD5 checksums.

Step 223, remove near-duplicate images by grey-level histogram similarity matching.

After step 223 has been executed, steps 224 and 226 can each be executed.

Step 224, pick out 20 negative sample images, then execute step 225.

Step 226, pick out 200 positive sample images.

Step 227, perform transfer training and set the thresholds automatically.

For example, the first threshold and the second threshold are set automatically on the basis of ROC curve statistics. For example, optimal threshold values are calculated from the ROC curve and used for the binary classification. Two thresholds are picked out, threshold A (i.e. the second threshold) and threshold C (i.e. the first threshold): images whose confidence is greater than threshold A are positive samples, and images whose confidence is less than threshold C are negative samples.

Steps 228-229, clean the remaining images using the thresholds set in step 227.

Fig. 3 is a schematic structural diagram of an apparatus for cleaning data according to another embodiment of the application. The apparatus may be a processor or chip of various devices, for example a processor or chip of a computer, a mobile phone, a personal digital assistant (PDA) and the like. The apparatus comprises: a first cleaning module 311, a first training module 312, a deduplication module 313, a selection module 314, a learning module 315, a determining module 316 and a second cleaning module 317, which communicate with one another, for example over a bus. In this embodiment, the determining module 316 may further comprise a division unit and a determination unit.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus or the like. The bus may be divided into an address bus, a data bus, a control bus and so on.

In this embodiment, the division into the modules and units above is merely a logical division, and the application does not limit the division logic. Each module or unit may be implemented by at least one circuit or at least one chip, or several may be integrated and implemented together by at least one circuit or at least one chip; the circuit or chip may, for example, be implemented by a processor.

In another embodiment of the application, the processor may be a central processing unit (CPU), or another general-purpose control processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component and so on. The general-purpose control processor may be a microcontrol processor or any conventional control processor, such as a single-chip microcomputer.

The first cleaning module 311 is configured to clean multi-class data to obtain correct data and wrong data.

The first training module 312 is configured to train on the correct data to obtain a first training model.

The deduplication module 313 is configured to deduplicate a given set of data to be cleaned and remove similar data using a similarity threshold, to obtain first remaining data to be cleaned.

The selection module 314 is configured to select, according to a specified rule, at least one positive sample, at least one negative sample and second remaining data to be cleaned from the first remaining data to be cleaned.

The learning module 315 is configured to perform transfer learning on the at least one positive sample and the at least one negative sample using the first training model, to obtain a second training model.

The determining module 316 is configured to determine a first threshold and a second threshold according to the second training model, wherein the first threshold and the second threshold are set on the confidence with which data are judged to be positive or negative samples, the first threshold is less than the second threshold, the first threshold is calculated from the preset precision for negative samples, and the second threshold is calculated from the preset precision for positive samples.

The second cleaning module 317 is configured to divide the second remaining data to be cleaned into three classes (positive samples, data to be manually cleaned, and negative samples) using the second training model, the first threshold and the second threshold, wherein data in the second remaining data to be cleaned whose confidence is greater than the second threshold are judged to be positive-sample data, data whose confidence is less than the first threshold are judged to be negative-sample data, and data whose confidence lies between the first threshold and the second threshold are judged to be data to be manually cleaned.

The determining module 316 specifically comprises: a division unit, configured to divide the at least one positive sample and the at least one negative sample into a training set and a validation set in a certain proportion; and a determination unit, configured to determine the first threshold and the second threshold according to statistics of the second training model's results for positive and negative samples on the validation set and the respective preset precisions for positive and negative samples.

The training set comprises a predetermined first proportion of the positive samples and a predetermined second proportion of the negative samples, and the validation set comprises a predetermined third proportion of the positive samples and a predetermined fourth proportion of the negative samples, wherein the first proportion is greater than the third proportion and the second proportion is greater than the fourth proportion.

For example, the first proportion and the third proportion sum to 100%, and the second proportion and the fourth proportion sum to 100%.

For example, the first proportion and the second proportion are both 90%, and the third proportion and the fourth proportion are both 10%.

For the specific functions and implementation of the modules and units above, reference may be made to the processes described in the embodiments of Figs. 1 and 2, which are not repeated here.

In summary, with the method and apparatus described above, during data cleaning the data that are most likely correct (i.e. positive samples) and most likely wrong (i.e. negative samples) are selected first, and the intermediate data that are harder to confirm are screened again to pick out further positive and negative samples. This greatly reduces manual effort, and the positive and negative samples selected in this way are highly accurate. By combining transfer learning with automatic threshold setting, the data can be cleaned quickly and reliably.

Another embodiment of the application further provides a computer-readable medium, which may be a computer-readable signal medium or a computer-readable storage medium. A processor in a computer reads computer-readable program code stored in the computer-readable medium, so that the processor can perform the functions or actions specified in each step, or combination of steps, of the flowchart of Fig. 1, and so that an apparatus is produced that implements the functions or actions specified in each block, or combination of blocks, of the block diagram.

The computer-readable medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, device or apparatus, a memory, or any suitable combination of the above. The memory is used to store program code or instructions, the program code includes computer operating instructions, and the processor is used to execute the program code or instructions stored in the memory.

The memory may include volatile memory, for example random access memory (RAM), which may include static RAM or dynamic RAM. The memory may also include non-volatile memory, such as read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. The memory may also be external flash memory, at least one magnetic disk storage or a buffer.

The processor may be a CPU, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component and so on.

The apparatus embodiments described above are merely exemplary, wherein described, unit can as illustrated by the separation member It is physically separated with being or may not be, component shown as a unit may or may not be physics list Member, it can it is in one place, or may be distributed over multiple network units.It can be selected according to the actual needs In some or all of the modules achieve the purpose of the solution of this embodiment.Those of ordinary skill in the art are not paying creativeness Labour in the case where, it can understand and implement.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like), a processor, or a chip to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the application.

Claims

1. A method for cleaning data, characterized by comprising:

cleaning multi-class data to obtain correct data and erroneous data;

first deduplicating given data to be cleaned and screening it by similarity against a similarity threshold, to obtain first remaining data to be cleaned;

selecting, according to a specified rule, at least one positive sample, at least one negative sample, and second remaining data to be cleaned from the first remaining data to be cleaned;

performing transfer learning with the first training model on the at least one positive sample and the at least one negative sample, to obtain a second training model;

determining a first threshold and a second threshold according to the second training model, wherein the first threshold and the second threshold are set on the confidence with which data is judged to be a positive or negative sample, the first threshold is less than the second threshold, the first threshold is calculated according to a preset accuracy for negative samples, and the second threshold is calculated according to a preset accuracy for positive samples;

dividing the second remaining data to be cleaned, using the second training model, the first threshold, and the second threshold, into three categories: positive samples, data awaiting manual cleaning, and negative samples, wherein, in the second remaining data to be cleaned, data whose confidence is greater than the second threshold is judged to be positive-sample data, data whose confidence is less than the first threshold is judged to be negative-sample data, and data whose confidence lies between the first threshold and the second threshold is judged to be data awaiting manual cleaning.
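By way of illustration only, the three-way division recited in claim 1 might be sketched in Python as follows. The function name `partition_by_confidence` and the group labels are hypothetical and do not appear in the application. The claim does not state how a confidence exactly equal to a threshold is handled; this sketch assumes such data is sent to manual cleaning.

```python
from typing import Dict, List, Sequence


def partition_by_confidence(
    confidences: Sequence[float],
    first_threshold: float,
    second_threshold: float,
) -> Dict[str, List[int]]:
    """Group item indices into positive / manual / negative by confidence.

    Hypothetical helper: `confidences` holds the second model's confidence
    that each remaining item is a positive sample.
    """
    # Claim 1 requires the first threshold to be less than the second.
    if not first_threshold < second_threshold:
        raise ValueError("first_threshold must be less than second_threshold")

    groups: Dict[str, List[int]] = {"positive": [], "manual": [], "negative": []}
    for index, confidence in enumerate(confidences):
        if confidence > second_threshold:
            groups["positive"].append(index)
        elif confidence < first_threshold:
            groups["negative"].append(index)
        else:
            # Assumption: values equal to either threshold go to manual cleaning.
            groups["manual"].append(index)
    return groups
```

Only the items in the `manual` group would then need human review, which is where the reduction in manual effort described in the abstract comes from.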

2. The method according to claim 1, characterized in that determining the first threshold and the second threshold according to the second training model comprises:

dividing the at least one positive sample and the at least one negative sample into a training set and a validation set according to a given proportion;

determining the first threshold and the second threshold according to statistics of the validation results of the second training model for positive and negative samples on the validation set, and the respective preset accuracies for positive and negative samples of the second training model.
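The claim does not spell out which statistic is computed on the validation set, so the following Python sketch shows only one possible reading. It assumes that the second threshold is the lowest validation confidence above which the fraction of true positives meets the preset positive accuracy, and that the first threshold is the highest validation confidence below which the fraction of true negatives meets the preset negative accuracy. The function name `choose_thresholds` and its parameters are hypothetical.

```python
from typing import Sequence, Tuple


def choose_thresholds(
    confidences: Sequence[float],
    labels: Sequence[int],
    positive_accuracy: float,
    negative_accuracy: float,
) -> Tuple[float, float]:
    """Return (first_threshold, second_threshold) from validation results.

    `labels` uses 1 for a positive sample and 0 for a negative sample.
    This selection rule is an assumed interpretation, not the patent's text.
    """
    if not confidences or len(confidences) != len(labels):
        raise ValueError("confidences and labels must be non-empty and aligned")

    pairs = list(zip(confidences, labels))
    candidates = sorted(set(confidences))

    # Fallback: nothing lies strictly above the maximum confidence,
    # so no item would be accepted automatically.
    second = candidates[-1]
    for t in candidates:  # lowest qualifying candidate wins
        above = [y for c, y in pairs if c > t]
        if above and sum(above) / len(above) >= positive_accuracy:
            second = t
            break

    # Fallback: nothing lies strictly below the minimum confidence,
    # so no item would be rejected automatically.
    first = candidates[0]
    for t in reversed(candidates):  # highest qualifying candidate wins
        below = [y for c, y in pairs if c < t]
        if below and below.count(0) / len(below) >= negative_accuracy:
            first = t
            break

    if not first < second:
        raise ValueError("preset accuracies are too loose to separate the thresholds")
    return first, second
```

Raising the preset accuracies widens the band between the two thresholds, so more data is left for manual cleaning; lowering them narrows the band at the cost of more automatic misjudgments.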

3. The method according to claim 2, characterized in that the training set includes a predetermined first proportion of the positive samples and a predetermined second proportion of the negative samples, and the validation set includes a predetermined third proportion of the positive samples and a predetermined fourth proportion of the negative samples, wherein the first proportion is greater than the third proportion, and the second proportion is greater than the fourth proportion.

4. The method according to claim 3, characterized in that the sum of the first proportion and the third proportion is 100%, and the sum of the second proportion and the fourth proportion is 100%.

5. The method according to claim 3 or 4, characterized in that the first proportion and the second proportion are 90%, and the third proportion and the fourth proportion are 10%.
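As a minimal sketch of the division recited in claims 3 to 5, the positive and negative samples could each be split 90%/10% in Python as shown below. The function name `split_samples`, the shuffling step, and the fixed seed are assumptions made for illustration; the claims specify only the proportions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T],
    train_ratio: float = 0.9,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Split one class of samples into (training, validation) parts.

    The two parts together cover every sample (claim 4), and the training
    share must exceed the validation share (claim 3).
    """
    if not 0.5 < train_ratio < 1.0:
        raise ValueError("train_ratio must be greater than 0.5 and less than 1")
    shuffled = list(samples)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Positive and negative samples are split separately, so each class keeps
# the same 90%/10% proportion in the training and validation sets.
positives = list(range(100))
negatives = list(range(100, 150))
train_pos, val_pos = split_samples(positives)
train_neg, val_neg = split_samples(negatives)
```

Splitting each class separately, rather than splitting the pooled samples, keeps the first and second proportions independent, as the claims describe them.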

6. A device for cleaning data, characterized by comprising:

a deduplication module, configured to first deduplicate given data to be cleaned and screen it by similarity against a similarity threshold, to obtain first remaining data to be cleaned;

a selection module, configured to select, according to a specified rule, at least one positive sample, at least one negative sample, and second remaining data to be cleaned from the first remaining data to be cleaned;

a learning module, configured to perform transfer learning with the first training model on the at least one positive sample and the at least one negative sample, to obtain a second training model;

a determining module, configured to determine a first threshold and a second threshold according to the second training model, wherein the first threshold and the second threshold are set on the confidence with which data is judged to be a positive or negative sample, the first threshold is less than the second threshold, the first threshold is calculated according to a preset accuracy for negative samples, and the second threshold is calculated according to a preset accuracy for positive samples;

a second cleaning module, configured to divide the second remaining data to be cleaned, using the second training model, the first threshold, and the second threshold, into three categories: positive samples, data awaiting manual cleaning, and negative samples, wherein, in the second remaining data to be cleaned, data whose confidence is greater than the second threshold is judged to be positive-sample data, data whose confidence is less than the first threshold is judged to be negative-sample data, and data whose confidence lies between the first threshold and the second threshold is judged to be data awaiting manual cleaning.

7. The device according to claim 6, characterized in that the determining module specifically comprises:

a division unit, configured to divide the at least one positive sample and the at least one negative sample into a training set and a validation set according to a given proportion;

a determination unit, configured to determine the first threshold and the second threshold according to statistics of the validation results of the second training model for positive and negative samples on the validation set, and the respective preset accuracies for positive and negative samples of the second training model.

8. The device according to claim 7, characterized in that the training set includes a predetermined first proportion of the positive samples and a predetermined second proportion of the negative samples, and the validation set includes a predetermined third proportion of the positive samples and a predetermined fourth proportion of the negative samples, wherein the first proportion is greater than the third proportion, and the second proportion is greater than the fourth proportion.

9. The device according to claim 8, characterized in that the sum of the first proportion and the third proportion is 100%, and the sum of the second proportion and the fourth proportion is 100%.

10. The device according to claim 8 or 9, characterized in that the first proportion and the second proportion are 90%, and the third proportion and the fourth proportion are 10%.