CN108595497A - Data screening method, apparatus and terminal - Google Patents

Publication number
CN108595497A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810220055.1A
Other languages
Chinese (zh)
Other versions
CN108595497B (en
Inventor
张志伟
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201810220055.1A priority Critical patent/CN108595497B/en
Publication of CN108595497A publication Critical patent/CN108595497A/en
Application granted granted Critical
Publication of CN108595497B publication Critical patent/CN108595497B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

An embodiment of the present invention provides a data screening method, apparatus, and terminal. The data screening method includes: extracting multiple noisy data items from data to be screened as sample data; transforming each sample data item to obtain the transformed data of that sample data item; performing label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and determining the target label and target label probability of each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain a target database. With the data screening scheme provided by the embodiment of the present invention, the user does not need to manually label and screen the data to be screened one item at a time. Screening is performed automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.

Description

Data screening method, apparatus and terminal
Technical field
The present invention relates to the technical field of noisy data screening, and in particular to a data screening method, apparatus, and terminal.
Background
Recently, deep learning has achieved breakthrough progress in content understanding fields such as natural language processing and text translation. However, these advances depend heavily on the scale of the training data, so data is the most important bottleneck in applying these technologies to actual production environments.
Take current data classification tasks as an example: each label generally requires data on the order of thousands of items. The traditional approach trains the model with fully supervised data, that is, it first obtains enough labeled data and then uses that labeled data to train the model. However, obtaining large-scale labeled data from internet data by manual labeling has the following shortcomings:
First, data on the order of thousands may seem small, but the amount of data that has to be labeled is very large. In general, only one usable training item is found in roughly every 10 to 20 labeled items, which means that the labeling labor cost for each label rises sharply.
Second, label systems are generally very large, so manually labeling data for every label consumes a great deal of human resources. Moreover, data is generated continuously every day in the internet environment, so it is practically impossible to label all of it manually, and labeling is difficult.
Summary of the invention
Embodiments of the present invention provide a data screening method, apparatus, and terminal, to solve the problem in the prior art that labeling the data generated daily in the internet environment before screening it is difficult and incurs a high labor cost.
According to one aspect of the present invention, a data screening method is provided, the method including: extracting multiple noisy data items from data to be screened as sample data; transforming each sample data item to obtain the transformed data of each sample data item; performing label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and determining the target label and target label probability of each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Optionally, the step of screening the sample data according to the target label and target label probability of each sample data item to obtain a target database includes: grouping the sample data by target label, where each group corresponds to one target label; sorting the sample data within each group by target label probability, where sample data ranked earlier have larger target label probabilities; and selecting a preset number of the top-ranked sample data items from each group to generate the target database.
Optionally, the step of performing label prediction on each sample data item and each transformed data item with the pre-trained image classification model, and determining the target label and target label probability of each sample data item, includes: performing label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and, for each sample data item, determining the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Optionally, the step of determining the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data includes: for each label, computing the weighted average of the probabilities that the sample data item and its transformed data assign to that label, to obtain the weighted average probability of the label; determining the maximum among the weighted average probabilities of the labels; taking the label corresponding to the maximum weighted average probability as the target label of the sample data item; and taking the maximum weighted average probability as the target label probability of the sample data item.
Optionally, the step of transforming each sample data item to obtain the transformed data of each sample data item includes: transforming each sample data item according to a preset transformation mode to obtain its transformed data, where the preset transformation mode includes at least one of: rotation, translation, and shearing.
According to another aspect of the present invention, a data screening apparatus is provided, the apparatus including: an extraction module, configured to extract multiple noisy data items from data to be screened as sample data; a transformation module, configured to transform each sample data item to obtain the transformed data of each sample data item; a determining module, configured to perform label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and to determine the target label and target label probability of each sample data item; and a screening module, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Optionally, the screening module includes: a grouping submodule, configured to group the sample data by target label, where each group corresponds to one target label; a sorting submodule, configured to sort the sample data within each group by target label probability, where sample data ranked earlier have larger target label probabilities; and a generating submodule, configured to select a preset number of the top-ranked sample data items from each group to generate the target database.
Optionally, the determining module includes: a recognition submodule, configured to perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and a label determining submodule, configured to determine, for each sample data item, the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Optionally, the label determining submodule is specifically configured to: for each label, compute the weighted average of the probabilities that the sample data item and its transformed data assign to that label, to obtain the weighted average probability of the label; determine the maximum among the weighted average probabilities of the labels; take the label corresponding to the maximum weighted average probability as the target label of the sample data item; and take the maximum weighted average probability as the target label probability of the sample data item.
Optionally, the transformation module is specifically configured to: transform each sample data item according to a preset transformation mode to obtain its transformed data, where the preset transformation mode includes at least one of: rotation, translation, and shearing.
According to a further aspect of the present invention, a terminal is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any one of the data screening methods described in the present invention.
According to another aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the data screening methods described in the present invention.
Compared with the prior art, the present invention has the following advantages:
The data screening scheme provided by the embodiment of the present invention performs data screening periodically. At each screening, sample data are extracted from the data to be screened, that is, the data generated by users in the interval between two screenings; each sample data item is transformed for data augmentation; the target label and target label probability of each sample data item are determined from the augmented data and the sample data; and the sample data are screened according to those target labels and target label probabilities to obtain a target database. With this scheme, the user does not need to manually label and screen the data to be screened one item at a time. Screening is performed automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present invention are easier to understand, specific embodiments of the present invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flowchart of the steps of a data screening method according to Embodiment one of the present invention;
Fig. 2 is a flowchart of the steps of a data screening method according to Embodiment two of the present invention;
Fig. 3 is a structural block diagram of a data screening apparatus according to Embodiment three of the present invention;
Fig. 4 is a structural block diagram of a terminal according to Embodiment four of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of a data screening method according to Embodiment one of the present invention is shown.
The data screening method of the embodiment of the present invention may include the following steps:
Step 101: Extract multiple noisy data items from the data to be screened as sample data.
The data screening approach provided by the embodiment of the present invention is suitable for screening the large-scale noisy data generated by users' historical operations; the noisy data may be images. For example, different users upload images to a platform, and the server periodically screens the images generated by users at a preset time interval; the images generated by user operations within the preset time interval are the data to be screened. The preset time interval may be one day, two days, 12 hours, and so on, and is not specifically limited in the embodiment of the present invention. The embodiment of the present invention describes a single data screening pass; in a specific implementation, the flow described in the embodiment can be executed at each data screening.
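The selection of data to be screened by time interval can be illustrated with a minimal sketch. It assumes, purely for illustration, that each upload is recorded as a (file name, upload time) pair; the function name `select_data_to_screen` and the file names are hypothetical and do not come from the patent.

```python
from datetime import datetime, timedelta


def select_data_to_screen(uploads, window_start, window_end):
    """Return the items uploaded in the interval (window_start, window_end]."""
    return [item for item, uploaded_at in uploads
            if window_start < uploaded_at <= window_end]


# Preset time interval; the text also mentions two days or 12 hours.
interval = timedelta(days=1)
now = datetime(2018, 3, 16, 0, 0)

# Hypothetical upload log: (file name, upload time).
uploads = [
    ("a.jpg", datetime(2018, 3, 14, 23, 0)),
    ("b.jpg", datetime(2018, 3, 15, 8, 30)),
    ("c.jpg", datetime(2018, 3, 15, 21, 45)),
]

# Only uploads made since the previous screening are data to be screened.
pending = select_data_to_screen(uploads, now - interval, now)
```

A half-open interval is used here so that an item uploaded exactly at a screening boundary is picked up by one screening pass only.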
A trained image classification model exists in advance in the embodiment of the present invention. The image classification model contains multiple labels and the training data corresponding to each label. When a data screening operation is executed, label prediction is performed on the data with this pre-trained image classification model.
The number of noisy data items extracted from the data to be screened can be adjusted by those skilled in the art according to actual needs. For example, noisy data on the order of ten million or one hundred million items may be extracted as sample data. The noisy data may be extracted from the data to be screened at random.
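Random extraction of sample data could look like the following sketch. It uses Python's standard `random.sample` to draw without replacement; the function name `extract_samples`, the seed parameter, and the file names are illustrative assumptions, and the sample size is kept tiny compared with the magnitudes mentioned in the text.

```python
import random


def extract_samples(data_to_screen, num_samples, seed=None):
    """Randomly draw up to `num_samples` items without replacement."""
    rng = random.Random(seed)
    k = min(num_samples, len(data_to_screen))
    return rng.sample(data_to_screen, k)


# Hypothetical data to be screened: 1000 image file names.
pending = [f"img_{n:04d}.jpg" for n in range(1000)]
samples = extract_samples(pending, num_samples=100, seed=0)
```

Capping `k` at the size of the input avoids the `ValueError` that `random.sample` raises when asked for more items than are available.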
Step 102: Transform each sample data item to obtain the transformed data of each sample data item.
The transformation applied to the sample data may include, but is not limited to, any of rotation, translation, and shearing.
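The three transformations named above can be sketched on a toy image represented as a list of pixel rows. This is only an illustration under stated assumptions: the function names are hypothetical, rotation is limited to 90 degrees, and "shearing" is interpreted as a horizontal shear, which the text does not confirm. A real implementation would more likely use an image library.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate(image, dx, dy, fill=0):
    """Shift the image dx pixels right and dy pixels down, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def shear_x(image, fill=0):
    """Horizontal shear: shift row y to the right by y pixels."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nx = x + y
            if nx < w:
                out[y][nx] = image[y][x]
    return out


sample = [[1, 2],
          [3, 4]]
# One sample data item yields one or more transformed data items.
transformed = [rotate90(sample), translate(sample, dx=1, dy=0), shear_x(sample)]
```

Each transformed item keeps the shape of the original, so the same classification model can be applied to the sample and to its transformed data.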
Step 103: Perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model, and determine the target label and target label probability of each sample data item.
Each sample data item and its transformed data are separately input into the pre-trained image classification model for label prediction, which yields a label recognition result for each input data item. For the specific way in which a trained image classification model predicts data labels, refer to the related art; this is not specifically limited in the embodiment of the present invention.
The label recognition result of each data item includes at least one label and the probability of each label. The higher the probability of a label, the more likely the data belongs to the category indicated by that label.
When determining the target label and target label probability of a sample data item, a single final label, the target label, can be determined by voting over the label recognition results of the sample data item and of its transformed data.
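This embodiment does not fix the voting rule. As one possible reading, the sketch below uses hard majority voting: each of the sample data item and its transformed data contributes the single label it is most likely to belong to, and the most frequent label wins. The function name `majority_vote` and the use of the vote share as a stand-in for the target label probability are assumptions made for illustration; Embodiment two describes a probability-based alternative.

```python
from collections import Counter


def majority_vote(predicted_labels):
    """Return (label, vote share) for the most frequent predicted label."""
    counts = Counter(predicted_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predicted_labels)


# Top label predicted for the sample data item and for two transformed items.
votes = ["cat", "dog", "cat"]
target_label, vote_share = majority_vote(votes)
```

With an odd number of votes, which Embodiment two prefers, a two-way tie between labels cannot occur, although ties among three or more labels remain possible.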
Step 104: Screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
When screening the sample data, the sample data can first be grouped by target label; a preset number of data items are then selected from each group, and the selected sample data constitute the target database.
After this data screening, only the sample data in the target database are retained; the sample data that were screened out, and the data in the data to be screened that were not extracted as sample data, are discarded. The sample data in the target database can then be used to expand the image classification model.
The preset number can be set by those skilled in the art according to actual needs and is not specifically limited in the embodiment of the present invention. The smaller the preset number, the more sample data are screened out and the fewer are retained; correspondingly, the precision of the sample data in the target database is higher.
The data screening method provided by the embodiment of the present invention performs data screening periodically. At each screening, sample data are extracted from the data to be screened, that is, the data generated by users in the interval between two screenings; each sample data item is transformed for data augmentation; the target label and target label probability of each sample data item are determined from the augmented data and the sample data; and the sample data are screened according to those target labels and target label probabilities to obtain a target database. With this method, the user does not need to manually label and screen the data to be screened one item at a time. Screening is performed automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.
Embodiment two
Referring to Fig. 2, a flowchart of the steps of a data screening method according to Embodiment two of the present invention is shown.
The data screening method of the embodiment of the present invention may specifically include the following steps:
Step 201: Extract multiple noisy data items from the data to be screened as sample data.
During historical operations, users can upload noisy data such as images to the platform in real time, and the background server that manages the platform can periodically screen the noisy data generated during users' historical operations. The screening period can be set by those skilled in the art according to actual needs. The noisy data generated by user operations in the interval between two adjacent screenings are the data to be screened.
In a specific implementation, multiple noisy data items can be extracted at random from the data to be screened as sample data, and the number of sample data items extracted can be on the order of ten million or one hundred million. For example, the amount of noisy data that users generate on the platform each day is in the tens of [...]; because database capacity is limited, several hundred million or several tens of millions of noisy data items need to be extracted as sample data, and the remaining noisy data that were not extracted are discarded.
The extracted sample data may form a database, which can be denoted $DB_{noise}$.
Step 202: Transform each sample data item to obtain the transformed data of each sample data item.
After the sample data are transformed, each sample data item corresponds to one or more transformed data items.
A sample data item can be denoted $sample_i^{ori}$, and its transformed data can be denoted $sample_i^{trans}$.
Preferably, for each sample data item, the total number of data items made up of the sample data item and its corresponding transformed data is odd.
Step 203: Perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain the label recognition result of each sample data item and each transformed data item.
Before the data screening flow is executed, the image classification model needs to be trained in advance. The trained image classification model contains multiple labels and the training data corresponding to each label; the training data are clean data. For the specific way of training the image classification model on the training data, refer to the related art; this is not specifically limited in the embodiment of the present invention. Training the image classification model is essentially the continual updating of the model parameters until the image classification model converges to a preset standard.
For example, for the parameter $\theta$ of the image classification model, stochastic gradient descent may be used to compute the gradient $\nabla_\theta L(\theta)$ of the loss function $L(\theta)$. This gradient is used to continually update the parameters of the image classification model; the value of $\theta$ is updated according to the gradient as $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate, which controls the magnitude of each update to $\theta$.
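The parameter update can be made concrete with a toy sketch of the standard gradient descent rule, in which each parameter moves against its gradient by a step scaled by the learning rate. The loss function $L(\theta) = (\theta - 3)^2$, the function name `sgd_step`, and the learning rate value are assumptions chosen for illustration; the example uses the exact gradient of this toy loss, whereas stochastic gradient descent would estimate the gradient from randomly chosen training samples.

```python
def sgd_step(theta, grad, learning_rate):
    """Apply one update: theta <- theta - learning_rate * grad, element-wise."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]


# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = [0.0]
for _ in range(100):
    grad = [2.0 * (theta[0] - 3.0)]
    theta = sgd_step(theta, grad, learning_rate=0.1)
```

After these iterations `theta[0]` is very close to 3, the minimizer of the toy loss. A larger learning rate takes bigger steps and may overshoot, while a smaller one converges more slowly.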
A label recognition result includes the labels corresponding to the data and the probability corresponding to each label. Sample data and transformed data are collectively referred to as data; when data are input into the image classification model for label prediction, the model outputs the label recognition result corresponding to the input data.
The image classification model can perform label prediction on the input data as follows:
First, determine the feature map of the input data;
Second, apply dimensionality reduction to the feature map to obtain an intermediate feature map;
Third, apply average pooling to the intermediate feature map to obtain the feature vector corresponding to the intermediate feature map. The feature vector contains multiple points, each of which corresponds to one label and one probability; the labels with non-zero probability are output as the valid labels of the data, together with the probability of each valid label.
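The final pooling step and the selection of valid labels can be sketched as follows. The sketch assumes, for illustration only, that the intermediate feature map has one 2-D channel per label and that the pooled values are already probabilities; neither assumption is stated in the text, and the function names and label names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each 2-D map into one value, giving one point per label."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def valid_labels(feature_vector, label_names):
    """Keep only the labels whose probability is non-zero."""
    return {name: p for name, p in zip(label_names, feature_vector) if p > 0}


# Hypothetical intermediate feature map: one 2x2 channel per label.
intermediate = [
    [[0.8, 0.6], [0.6, 0.8]],  # channel for "cat"
    [[0.0, 0.0], [0.0, 0.0]],  # channel for "car"
    [[0.2, 0.4], [0.4, 0.2]],  # channel for "dog"
]
labels = ["cat", "car", "dog"]

feature_vector = global_average_pool(intermediate)
result = valid_labels(feature_vector, labels)  # label recognition result
```

Here "car" is dropped from the label recognition result because its pooled probability is zero, while "cat" and "dog" are output with their probabilities.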
Step 204: For each sample data item, determine the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
After label recognition by the image classification model, each sample data item corresponds to at least one label. In this step, the unique target label and target label probability of each sample data item are finally determined by voting. A preferred way of determining the target label and target label probability of a sample data item by voting is as follows:
First, for each label of each sample data item, compute the weighted average of the probabilities that the sample data item and its transformed data assign to that label, to obtain the weighted average probability of the label;
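The weighted average probability of a single label of a single sample data item can be calculated by a formula that does not survive in this text; a form consistent with the symbol definitions that accompany it is:
$$\bar{p}_i^{\,j} = \frac{1}{\#sample_i} \sum_{x \in \{sample_i^{ori}\} \cup sample_i^{trans}} p^{j}(x), \quad j \in S$$
Here $i$ identifies the sample data item, $j$ identifies the label, and $\bar{p}_i^{\,j}$ is the weighted average probability of label $j$ for sample data item $i$; $p^{j}(x)$ denotes the probability that the image classification model assigns to label $j$ for data item $x$. In this formula, the probabilities of label $j$ for the sample data item and for each of its transformed data items are averaged to obtain the weighted average probability of the label. $\#sample_i$ is the total number of data items, that is, $sample_i^{ori}$ together with its $sample_i^{trans}$, and $S$ is the set of labels contained in the image classification model.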
Second, determine the maximum among the weighted average probabilities of the labels;
Finally, take the label corresponding to the maximum weighted average probability as the target label of the sample data item, and take the maximum weighted average probability as the target label probability of the sample data item.
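The three steps above can be sketched as a soft voting routine. Each label recognition result is represented as a dictionary from label to probability; the function name `soft_vote` and the `weights` parameter are hypothetical. Because the text does not state the weights, the sketch assumes equal weights by default, in which case the weighted average is a plain mean over the sample data item and its transformed data.

```python
def soft_vote(original_probs, transformed_probs_list, weights=None):
    """Return (target_label, target_label_probability) by soft voting.

    Each argument holds label -> probability dictionaries: one for the
    sample data item and one per transformed data item.
    """
    all_probs = [original_probs] + list(transformed_probs_list)
    if weights is None:
        weights = [1.0] * len(all_probs)  # assumption: equal weights
    total_weight = sum(weights)
    labels = sorted(set().union(*all_probs))

    # Step 1: weighted average probability of every label.
    averaged = {
        label: sum(w * p.get(label, 0.0)
                   for w, p in zip(weights, all_probs)) / total_weight
        for label in labels
    }
    # Steps 2 and 3: the label with the maximum average is the target label.
    target_label = max(labels, key=lambda label: averaged[label])
    return target_label, averaged[target_label]


original = {"cat": 0.6, "dog": 0.4}
transformed = [{"cat": 0.5, "dog": 0.5}, {"cat": 0.2, "dog": 0.7}]
target_label, target_prob = soft_vote(original, transformed)
```

In this example the sample data item alone favors "cat", but averaging over the transformed data makes "dog" the target label. This illustrates how the transformed data can change the outcome of the vote.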
Repeating this procedure determines the target label and target label probability of every sample data item. Once they are determined, the sample data are screened according to the target label and target label probability of each sample data item to obtain the target database. The specific screening process is described in Step 205 and the steps that follow it.
Step 205: Group the sample data by target label.
Each group corresponds to one target label and contains at least one sample data item. The transformed data corresponding to the sample data are discarded directly and are not added to any group.
Step 206: Sort the sample data within each group by target label probability.
Sample data ranked earlier have larger target label probabilities.
Step 207: Select a preset number of the top-ranked sample data items from each group to obtain the target database.
The preset number can be set by those skilled in the art according to actual needs and is not specifically limited in the embodiment of the present invention.
In this step, the sample data in each group are sorted by target label probability, and the top-k sample data items of each group are selected to form the target database. Only the sample data in the target database are retained; the sample data that were screened out, and the noisy data in the data to be screened that were not extracted as sample data, are discarded. The sample data in the target database can then be used to expand the training of the image classification model.
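Grouping, sorting, and top-k selection can be sketched together as follows. The function name `screen_top_k`, the tuple layout (sample identifier, target label, target label probability), and the dictionary used to stand in for the target database are illustrative assumptions.

```python
from collections import defaultdict


def screen_top_k(samples, k):
    """Group samples by target label and keep the k most probable per group.

    `samples` is an iterable of (sample_id, target_label, target_probability).
    """
    groups = defaultdict(list)
    for sample_id, label, prob in samples:
        groups[label].append((sample_id, prob))  # one group per target label

    target_db = {}
    for label, members in groups.items():
        # Larger target label probabilities are ranked first.
        members.sort(key=lambda member: member[1], reverse=True)
        target_db[label] = members[:k]  # keep the top-k, drop the rest
    return target_db


samples = [
    ("img1", "cat", 0.90),
    ("img2", "cat", 0.40),
    ("img3", "dog", 0.70),
    ("img4", "cat", 0.80),
    ("img5", "dog", 0.95),
]
target_db = screen_top_k(samples, k=2)
```

With `k=2`, the lowest-ranked "cat" item is screened out. Choosing a smaller `k` keeps fewer items per label but retains only those the model is most confident about.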
In addition to the beneficial effects of the data screening method shown in Embodiment one, the data screening method provided by this embodiment determines the target label and target label probability of the sample data by soft voting over the probability of each label, which can improve the accuracy of the target labels of the sample data.
Embodiment three
Referring to Fig. 3, a structural block diagram of a data screening apparatus according to Embodiment three of the present invention is shown.
The data screening apparatus of the embodiment of the present invention may include: an extraction module 301, configured to extract multiple noisy data items from data to be screened as sample data; a transformation module 302, configured to transform each sample data item to obtain the transformed data of each sample data item; a determining module 303, configured to perform label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and to determine the target label and target label probability of each sample data item; and a screening module 304, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Preferably, the screening module 304 may include: a grouping submodule 3041, configured to group the sample data by target label, where each group corresponds to one target label; a sorting submodule 3042, configured to sort the sample data within each group by target label probability, where sample data ranked earlier have larger target label probabilities; and a generating submodule 3043, configured to select a preset number of the top-ranked sample data items from each group to generate the target database.
Preferably, the determining module 303 may include: a recognition submodule 3031, configured to perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and a label determining submodule 3032, configured to determine, for each sample data item, the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Preferably, the label determining submodule 3032 is specifically configured to: for each label, compute the weighted average of the probabilities that the sample data item and its transformed data assign to that label, to obtain the weighted average probability of the label; determine the maximum among the weighted average probabilities of the labels; take the label corresponding to the maximum weighted average probability as the target label of the sample data item; and take the maximum weighted average probability as the target label probability of the sample data item.
Preferably, the transformation module 302 is specifically configured to: transform each sample data item according to a preset transformation mode to obtain its transformed data, where the preset transformation mode includes at least one of: rotation, translation, and shearing.
The data screening apparatus of the embodiment of the present invention is used to implement the corresponding data screening methods of Embodiment one and Embodiment two above, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Embodiment Four
Referring to Fig. 4, a structural block diagram of a terminal for screening data according to Embodiment Four of the present invention is shown.
The terminal of this embodiment of the present invention may include: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any data screening method described in the present invention.
Fig. 4 is a block diagram of a data screening terminal 600 according to an exemplary embodiment. For example, the terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 4, the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 602 may include one or more processors 620 that execute instructions to perform all or part of the steps of the methods described above. In addition, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation of the terminal 600. Examples of such data include instructions for any application or method operated on the terminal 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 606 supplies power to the various components of the terminal 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the terminal 600 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the terminal 600. For example, the sensor component 614 can detect the open/closed state of the terminal 600 and the relative positioning of components, such as the display and keypad of the terminal 600. The sensor component 614 can also detect a change in position of the terminal 600 or of one of its components, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and a change in temperature of the terminal 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices. The terminal 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the data screening method. Specifically, the data screening method includes: extracting multiple noisy data items from the data to be screened as sample data; performing transformation processing on each sample data item to obtain the transformed data of each sample data item; performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining the target label and target label probability of each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain a target database.
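The four steps of the method can be wired together as in the sketch below. It is an illustration under stated assumptions, not the patented implementation: the classifier is passed in as a callable `classify` that returns a label-to-probability dictionary (standing in for the pre-trained image classification model), sample extraction is done by uniform random sampling, the averaging uses equal weights, and all function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def screen_data(data, classify, transforms, sample_size, top_n, seed=0):
    """Run extraction, transformation, label prediction, and screening.

    data: list of items to be screened.
    classify: callable mapping an item to {label: probability}.
    transforms: list of callables, each producing a transformed copy.
    Returns {target_label: [kept items, most confident first]}.
    """
    # Step 1: extract sample data (random sampling is an assumption).
    rng = random.Random(seed)
    samples = rng.sample(data, min(sample_size, len(data)))

    scored = defaultdict(list)
    for sample in samples:
        # Step 2: build the transformed copies of this sample.
        views = [sample] + [transform(sample) for transform in transforms]

        # Step 3: predict on every view and average per label
        # (equal weights, which is one simple choice of weighting).
        results = [classify(view) for view in views]
        labels = set().union(*results)
        averaged = {
            label: sum(r.get(label, 0.0) for r in results) / len(results)
            for label in labels
        }
        target_label = max(averaged, key=averaged.get)
        scored[target_label].append((averaged[target_label], sample))

    # Step 4: per target label, keep the top_n most confident samples.
    return {
        label: [
            sample
            for _, sample in sorted(items, key=lambda x: x[0], reverse=True)[:top_n]
        ]
        for label, items in scored.items()
    }
```

Passing the classifier and transformations as parameters keeps the sketch independent of any particular model or image library; a real deployment would substitute the trained model's inference call and actual image transformations.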
Preferably, the step of screening the sample data according to the target label and target label probability of each sample data item to obtain a target database includes: grouping the sample data by target label, where each group corresponds to one target label; sorting the sample data within the same group by target label probability, where sample data with a larger target label probability is ranked earlier; and selecting the preset number of top-ranked sample data from each group to generate the target database.
Preferably, the step of performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining the target label and target label probability of each sample data item, includes: performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and, for each sample data item, determining the target label and target label probability of that item according to its label recognition result and the label recognition results of its transformed data.
Preferably, the step of determining the target label and target label probability of the sample data item according to its label recognition result and the label recognition results of its transformed data includes: for each label, computing a weighted average of the probabilities of that label for the sample data item and for its transformed data, obtaining the weighted average probability of the label; determining the maximum among the weighted average probabilities of all labels; determining the label corresponding to the maximum weighted average probability as the target label of the sample data item; and determining the maximum weighted average probability as the target label probability of the sample data item.
Preferably, the step of transforming each sample data item to obtain the transformed data of each sample data item includes: transforming each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, where the preset transformation mode includes at least one of: rotation, translation, and shearing.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 604 including instructions, where the instructions can be executed by the processor 620 of the terminal 600 to complete the data screening method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like. When the instructions in the storage medium are executed by the processor of the terminal, the terminal is enabled to perform the steps of any data screening method described in the present invention.
The terminal provided by this embodiment of the present invention performs data screening periodically. At each screening, it extracts sample data from the data generated by users in the interval between two screenings, that is, the data to be screened; transforms each sample data item to perform data augmentation; determines the target label and target label probability of each sample data item from the augmented data and the sample data; and screens the sample data according to the target label and target label probability of each sample data item to obtain a target database. With the data screening scheme provided by this embodiment, users do not need to manually label and screen the data to be screened item by item. Screening is carried out automatically by a computer program, which is simple to operate and takes little time, so it both saves human resources and improves data screening efficiency.
Because the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The data screening scheme provided herein is not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct a system embodying the scheme of the present invention is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the present invention.
In the description provided here, numerous specific details are set forth. It should be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the present invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple submodules or subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the data screening scheme according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.

Claims (12)

1. A data screening method, characterized in that the method comprises:
extracting multiple noisy data items from data to be screened as sample data;
performing transformation processing on each sample data item to obtain transformed data of each sample data item;
performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining a target label and a target label probability of each sample data item; and
screening the sample data according to the target label and target label probability of each sample data item to obtain a target database.
2. The method according to claim 1, characterized in that the step of screening the sample data according to the target label and target label probability of each sample data item to obtain a target database comprises:
grouping the sample data by target label, wherein each group corresponds to one target label;
sorting the sample data within the same group by target label probability, wherein sample data with a larger target label probability is ranked earlier; and
selecting the preset number of top-ranked sample data from each group to generate the target database.
3. The method according to claim 1, characterized in that the step of performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining the target label and target label probability of each sample data item, comprises:
performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, wherein a label recognition result comprises the labels corresponding to the data and the probability corresponding to each label; and
for each sample data item, determining the target label and target label probability of the sample data item according to its label recognition result and the label recognition results of its transformed data.
4. The method according to claim 3, characterized in that the step of determining the target label and target label probability of the sample data item according to its label recognition result and the label recognition results of its transformed data comprises:
for each label, computing a weighted average of the probabilities of the label for the sample data item and for its transformed data, obtaining the weighted average probability of the label;
determining the maximum among the weighted average probabilities of all labels; and
determining the label corresponding to the maximum weighted average probability as the target label of the sample data item, and determining the maximum weighted average probability as the target label probability of the sample data item.
5. The method according to claim 1, characterized in that the step of transforming each sample data item to obtain the transformed data of each sample data item comprises:
transforming each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, wherein the preset transformation mode comprises at least one of: rotation, translation, and shearing.
6. A data screening apparatus, characterized in that the apparatus comprises:
an extraction module, configured to extract multiple noisy data items from data to be screened as sample data;
a conversion module, configured to perform transformation processing on each sample data item to obtain transformed data of each sample data item;
a determining module, configured to perform label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and to determine a target label and a target label probability of each sample data item; and
a screening module, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
7. The apparatus according to claim 6, characterized in that the screening module comprises:
a grouping submodule, configured to group the sample data by target label, wherein each group corresponds to one target label;
a sorting submodule, configured to sort the sample data within the same group by target label probability, wherein sample data with a larger target label probability is ranked earlier; and
a generation submodule, configured to select the preset number of top-ranked sample data from each group to generate the target database.
8. The apparatus according to claim 6, characterized in that the determining module comprises:
a recognition submodule, configured to perform label prediction on each sample data item and each transformed data item using a pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, wherein a label recognition result comprises the labels corresponding to the data and the probability corresponding to each label; and
a label determination submodule, configured to determine, for each sample data item, the target label and target label probability of the sample data item according to its label recognition result and the label recognition results of its transformed data.
9. The apparatus according to claim 8, characterized in that the label determination submodule is specifically configured to:
for each label, compute a weighted average of the probabilities of the label for the sample data item and for its transformed data, obtaining the weighted average probability of the label; determine the maximum among the weighted average probabilities of all labels; determine the label corresponding to the maximum weighted average probability as the target label of the sample data item; and determine the maximum weighted average probability as the target label probability of the sample data item.
10. The apparatus according to claim 6, characterized in that the conversion module is specifically configured to:
transform each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, wherein the preset transformation mode comprises at least one of: rotation, translation, and shearing.
11. A terminal, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data screening method according to any one of claims 1 to 5.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the data screening method according to any one of claims 1 to 5.
CN201810220055.1A 2018-03-16 2018-03-16 Data screening method, apparatus and terminal Active CN108595497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810220055.1A CN108595497B (en) 2018-03-16 2018-03-16 Data screening method, apparatus and terminal

Publications (2)

Publication Number Publication Date
CN108595497A true CN108595497A (en) 2018-09-28
CN108595497B CN108595497B (en) 2019-09-27

Family

ID=63626547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810220055.1A Active CN108595497B (en) 2018-03-16 2018-03-16 Data screening method, apparatus and terminal

Country Status (1)

Country Link
CN (1) CN108595497B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005509978A (en) * 2001-11-16 2005-04-14 チェン,ユアン,ヤン Ambiguous neural network with supervised and unsupervised cluster analysis
US7512273B2 (en) * 2004-10-21 2009-03-31 Microsoft Corporation Digital ink labeling
CN102880875A (en) * 2012-10-12 2013-01-16 西安电子科技大学 Semi-supervised learning face recognition method based on low-rank representation (LRR) graph
CN104463202A (en) * 2014-11-28 2015-03-25 苏州大学 Multi-class image semi-supervised classifying method and system
US9911033B1 (en) * 2016-09-05 2018-03-06 International Business Machines Corporation Semi-supervised price tag detection
CN106650721A (en) * 2016-12-28 2017-05-10 吴晓军 Industrial character identification method based on convolution neural network
CN106960219A (en) * 2017-03-10 2017-07-18 百度在线网络技术(北京)有限公司 Image identification method and device, computer equipment and computer-readable medium
CN107526785A (en) * 2017-07-31 2017-12-29 广州市香港科大霍英东研究院 File classification method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
方向: "伪标签:教你玩转无标签数据的半监督学习方法", 《阿里云》 *
汪忠国 等: "稀疏混合图随机跳跃WEB对象多标签半监督分类", 《计算机科学与探索》 *
追梦飞阳: "数据增强在卷积神经网络中的应用", 《CSDN》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544150A (en) * 2018-10-09 2019-03-29 阿里巴巴集团控股有限公司 A kind of method of generating classification model and device calculate equipment and storage medium
CN109598307A (en) * 2018-12-06 2019-04-09 北京达佳互联信息技术有限公司 Data screening method, apparatus, server and storage medium
CN109657710A (en) * 2018-12-06 2019-04-19 北京达佳互联信息技术有限公司 Data screening method, apparatus, server and storage medium
CN109657710B (en) * 2018-12-06 2022-01-21 北京达佳互联信息技术有限公司 Data screening method and device, server and storage medium
CN110147850B (en) * 2019-05-27 2021-12-07 北京达佳互联信息技术有限公司 Image recognition method, device, equipment and storage medium
CN110147850A (en) * 2019-05-27 2019-08-20 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the storage medium of image recognition
CN110348993A (en) * 2019-06-28 2019-10-18 北京淇瑀信息科技有限公司 Wind is discussed and select model workers determination method, determining device and the electronic equipment of type label
CN110348993B (en) * 2019-06-28 2023-12-22 北京淇瑀信息科技有限公司 Determination method and determination device for label for wind assessment model and electronic equipment
CN110807767A (en) * 2019-10-24 2020-02-18 北京旷视科技有限公司 Target image screening method and target image screening device
WO2021139274A1 (en) * 2020-06-09 2021-07-15 平安科技(深圳)有限公司 Document classification method and apparatus based on deep learning model, and computer device
CN113139628A (en) * 2021-06-22 2021-07-20 腾讯科技(深圳)有限公司 Sample image identification method, device and equipment and readable storage medium
CN113139628B (en) * 2021-06-22 2021-09-17 腾讯科技(深圳)有限公司 Sample image identification method, device and equipment and readable storage medium
CN113837670A (en) * 2021-11-26 2021-12-24 北京芯盾时代科技有限公司 Risk recognition model training method and device

Also Published As

Publication number Publication date
CN108595497B (en) 2019-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant