CN108595497A - Data screening method, apparatus and terminal
- Publication number: CN108595497A
- Application number: CN201810220055.1A
- Authority: CN (China)
- Prior art keywords: sample data, data, target labels, probability, label
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
Embodiments of the present invention provide a data screening method, apparatus and terminal. The data screening method includes: extracting multiple noise data items from data to be screened as sample data; transforming each sample data item to obtain transformed data for each sample data item; performing label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and determining a target label and a target label probability for each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain a target database. With the data screening scheme provided by embodiments of the present invention, users need not manually label and screen the data to be screened item by item. Screening is performed automatically by a computer program, which is simple to operate and fast, saving human resources and improving data screening efficiency.
Description
Technical field
The present invention relates to the technical field of noise data screening, and in particular to a data screening method, apparatus and terminal.
Background art
Recently, deep learning has made breakthrough progress in content understanding fields such as natural language processing and text translation. However, this progress depends heavily on the scale of the training data, so data is the most important bottleneck in applying these technologies to real production environments.
Take current data classification tasks as an example: each label generally requires annotated data on the order of thousands of items. Conventional methods train the model with fully supervised data; that is, enough annotated data must first be obtained, and the model is then trained on that annotated data. However, obtaining large-scale annotated data from internet data by manual annotation has the following shortcomings:
First, data on the order of thousands of items may seem small, but the amount of data that has to be annotated is very large. In general, only one item of training data is obtained from roughly every 10 to 20 annotated items, so the annotation labor cost for each label rises sharply.
Second, label systems are generally very large, so manually annotating every label consumes a large amount of human resources. Moreover, data is generated continuously every day in the internet environment, so it is almost impossible to annotate all of it manually, and annotation is difficult.
Summary of the invention
Embodiments of the present invention provide a data screening method, apparatus and terminal, to solve the prior-art problem that screening the data generated daily in the internet environment only after annotating it is difficult and incurs a high labor cost.
According to one aspect of the present invention, a data screening method is provided, the method including: extracting multiple noise data items from data to be screened as sample data; transforming each sample data item to obtain transformed data for each sample data item; performing label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and determining a target label and a target label probability for each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Optionally, the step of screening the sample data according to the target label and target label probability of each sample data item to obtain a target database includes: grouping the sample data by target label, where each group corresponds to one target label; sorting the sample data within each group by target label probability, where sample data with a larger target label probability rank higher; and selecting a preset quantity of the top-ranked sample data in each group to generate the target database.
Optionally, the step of performing label prediction on each sample data item and each transformed data item with the pre-trained image classification model, and determining the target label and target label probability of each sample data item, includes: performing label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label; and, for each sample data item, determining the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Optionally, the step of determining the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data includes: for each label, taking a weighted average of the probabilities of that label for the sample data item and for its transformed data to obtain the weighted average probability of the label; determining the maximum among the weighted average probabilities of the labels; determining the label corresponding to the maximum weighted average probability as the target label of the sample data item; and determining the maximum weighted average probability as the target label probability of the sample data item.
Optionally, the step of transforming each sample data item to obtain transformed data for each sample data item includes: transforming each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, where the preset transformation mode includes at least one of: rotation, translation and shearing.
According to another aspect of the present invention, a data screening apparatus is provided, the apparatus including: an extraction module, configured to extract multiple noise data items from data to be screened as sample data; a transformation module, configured to transform each sample data item to obtain transformed data for each sample data item; a determining module, configured to perform label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and to determine a target label and a target label probability for each sample data item; and a screening module, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Optionally, the screening module includes: a grouping submodule, configured to group the sample data by target label, where each group corresponds to one target label; a sorting submodule, configured to sort the sample data within each group by target label probability, where sample data with a larger target label probability rank higher; and a generation submodule, configured to select a preset quantity of the top-ranked sample data in each group to generate the target database.
Optionally, the determining module includes: a recognition submodule, configured to perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label; and a label determination submodule, configured to determine, for each sample data item, the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Optionally, the label determination submodule is specifically configured to: for each label, take a weighted average of the probabilities of that label for the sample data item and for its transformed data to obtain the weighted average probability of the label; determine the maximum among the weighted average probabilities of the labels; determine the label corresponding to the maximum weighted average probability as the target label of the sample data item; and determine the maximum weighted average probability as the target label probability of the sample data item.
Optionally, the transformation module is specifically configured to: transform each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, where the preset transformation mode includes at least one of: rotation, translation and shearing.
According to a further aspect of the present invention, a terminal is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any of the data screening methods described in the present invention.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the data screening methods described in the present invention.
Compared with the prior art, the present invention has the following advantages:
The data screening scheme provided by embodiments of the present invention performs data screening periodically. At each screening, sample data are extracted from the data generated by users in the interval between two screenings, that is, from the data to be screened; each sample data item is transformed for data augmentation; the target label and target label probability of each sample data item are determined from the augmented data and the sample data; and the sample data are screened according to the target label and target label probability of each sample data item to obtain a target database. With this scheme, users need not manually label and screen the data to be screened item by item. Screening is performed automatically by a computer program, which is simple to operate and fast, saving human resources and improving data screening efficiency.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Fig. 1 is a flowchart of the steps of a data screening method according to Embodiment one of the present invention;
Fig. 2 is a flowchart of the steps of a data screening method according to Embodiment two of the present invention;
Fig. 3 is a structural block diagram of a data screening apparatus according to Embodiment three of the present invention;
Fig. 4 is a structural block diagram of a terminal according to Embodiment four of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of a data screening method according to Embodiment one of the present invention is shown.
The data screening method of this embodiment may include the following steps:
Step 101: Extract multiple noise data items from the data to be screened as sample data.
The data screening approach provided by this embodiment is suitable for screening the large-scale noise data generated by users' historical operations; the noise data may be images. For example, different users upload images to a platform, and the server periodically screens the user-generated images at a preset time interval; the images generated by user operations within the preset time interval are the data to be screened. The preset time interval may be one day, two days, 12 hours and so on, and is not specifically limited in this embodiment. This embodiment describes a single data screening pass; in a specific implementation, the flow described in this embodiment may be executed for each data screening.
In this embodiment, a trained image classification model exists in advance; the image classification model contains multiple labels and the training data corresponding to each label. When the data screening operation is performed, label prediction is carried out on the data with the image classification model pre-trained by the management server.
The number of noise data items extracted from the data to be screened may be adjusted by those skilled in the art according to actual needs. For example, noise data on the order of tens of millions or hundreds of millions of items may be extracted as sample data. The noise data may be extracted from the data to be screened at random.
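The random extraction in step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `extract_samples`, the `seed` parameter and the file names are hypothetical, and the data to be screened is assumed to fit in a Python list.

```python
import random


def extract_samples(data_to_screen, sample_count, seed=None):
    """Randomly pick up to `sample_count` items, without replacement, as sample data.

    `data_to_screen` is assumed to be a list; `seed` is a hypothetical
    parameter that only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    # Cap the request so that a small input does not raise ValueError.
    k = min(sample_count, len(data_to_screen))
    return rng.sample(data_to_screen, k)


# Hypothetical identifiers for images uploaded during one screening interval.
uploaded = [f"img_{n:04d}.jpg" for n in range(1000)]
samples = extract_samples(uploaded, sample_count=100, seed=0)
```

Items that are not returned correspond to the data that the method discards. At the scale of tens of millions of items mentioned in the text, a streaming technique such as reservoir sampling would probably be a better fit than an in-memory list.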
Step 102: Transform each sample data item to obtain transformed data for each sample data item.
The transformation applied to the sample data may include, but is not limited to, any of rotation, translation and shearing.
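The three transformations named in step 102 can be illustrated on a tiny image stored as a list of pixel rows. The sketch below is only one possible reading: it assumes "shearing" means a horizontal shear (the translated term could also mean cropping), and the function names, the 90-degree rotation angle and the zero fill value are all hypothetical choices.

```python
def rotate90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate(image, dx, fill=0):
    """Shift every row dx pixels to the right (to the left if dx < 0), padding with `fill`."""
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * min(dx, width) + list(row[:max(width - dx, 0)]))
        else:
            shifted.append(list(row[min(-dx, width):]) + [fill] * min(-dx, width))
    return shifted


def shear(image, fill=0):
    """Horizontal shear: row r moves right by r pixels on a widened canvas."""
    height = len(image)
    return [[fill] * r + list(row) + [fill] * (height - 1 - r)
            for r, row in enumerate(image)]


def transform_sample(image):
    """Return the transformed data for one sample data item."""
    return [rotate90(image), translate(image, 1), shear(image)]


original = [[1, 2],
            [3, 4]]
transformed = transform_sample(original)
```

In practice an image library would perform these operations with interpolation; the pure-Python version only shows that each sample data item yields several transformed copies while the original stays unchanged.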
Step 103: Perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model, and determine the target label and target label probability of each sample data item.
Each sample data item and its transformed data are separately input into the pre-trained image classification model for label prediction, which yields a label recognition result for each input data item. For the specific way of predicting data labels with a trained image classification model, reference may be made to the related art; it is not specifically limited in this embodiment.
The label recognition result of each data item includes at least one label and the probability of each label. The higher the probability of a label, the more likely it is that the data item belongs to the data category indicated by that label.
When determining the target label and target label probability of a sample data item, a single final label, namely the target label, may be determined by voting over the label recognition results of the sample data item and of its transformed data.
Step 104: Screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
When screening the sample data, the sample data may be grouped by target label, and a preset quantity of data is then selected from each group; the selected sample data constitute the target database.
After this data screening, only the sample data in the target database are retained; the sample data that were screened out, and the data in the data to be screened that were not extracted as sample data, are discarded. The sample data in the target database can then be used for expanded training of the image classification model.
The preset quantity may be set by those skilled in the art according to actual needs and is not specifically limited in this embodiment. The smaller the preset quantity, the more sample data are screened out and the fewer are retained, and correspondingly the higher the precision of the sample data in the target database.
The data screening method provided by this embodiment performs data screening periodically. At each screening, sample data are extracted from the data generated by users in the interval between two screenings, that is, from the data to be screened; each sample data item is transformed for data augmentation; the target label and target label probability of each sample data item are determined from the augmented data and the sample data; and the sample data are screened according to the target label and target label probability of each sample data item to obtain a target database. With this method, users need not manually label and screen the data to be screened item by item. Screening is performed automatically by a computer program, which is simple to operate and fast, saving human resources and improving data screening efficiency.
Embodiment two
Referring to Fig. 2, a flowchart of the steps of a data screening method according to Embodiment two of the present invention is shown.
The data screening method of this embodiment may specifically include the following steps:
Step 201: Extract multiple noise data items from the data to be screened as sample data.
During historical operations, users may upload noise data such as images to the platform in real time, and the backend server that manages the platform may periodically screen the noise data generated during the users' historical operations. The screening period may be set by those skilled in the art according to actual needs. The noise data generated by user operations in the interval between two adjacent screenings are the data to be screened.
In a specific implementation, multiple noise data items may be extracted at random from the data to be screened as sample data; the quantity of extracted sample data may be on the order of tens of millions or hundreds of millions. For example, the quantity of noise data that users generate on the platform each day exceeds the limited capacity of the database, so several hundred million or several tens of millions of noise data items are extracted as sample data, and the remaining noise data that were not extracted are discarded.
The extracted sample data may constitute a database, which may be denoted $DB_{noise}$.
Step 202: Transform each sample data item to obtain transformed data for each sample data item.
After the sample data are transformed, each sample data item corresponds to one or more transformed data items.
A sample data item may be denoted $sample_i^{ori}$, and its transformed data may be denoted $sample_i^{trans}$.
Preferably, for a given sample data item, the total number of items made up of the sample data item and its corresponding transformed data is odd.
Step 203: Perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item.
Before the data screening flow is executed, the image classification model needs to be trained in advance. The trained image classification model contains multiple labels and the training data corresponding to each label; the training data are clean data. For the specific way of training the image classification model on the training data, reference may be made to the related art; it is not specifically limited in this embodiment. Training the image classification model is essentially the continual updating of the model parameters until the image classification model converges to a preset standard.
For example, for the parameters $\theta$ of the image classification model, stochastic gradient descent may be used to compute the gradient $\nabla_\theta L(\theta)$ of the loss function $L(\theta)$. This gradient is used to continually update the parameters of the image classification model; the value of $\theta$ may be updated according to the gradient as $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate, which controls the magnitude of each update to $\theta$.
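As a worked illustration of a gradient-descent parameter update with learning rate η, the sketch below applies the standard rule θ ← θ − η·∇L(θ) to a toy quadratic loss. The loss, the function names and the value η = 0.1 are assumptions chosen for the example; a real image classification model would estimate the gradient from mini-batches of training data, which is what makes the descent stochastic.

```python
def loss_grad(theta):
    """Gradient of the toy loss L(theta) = sum((t - 3)^2), which is 2 * (t - 3)."""
    return [2.0 * (t - 3.0) for t in theta]


def sgd_step(theta, grad, eta):
    """One update theta <- theta - eta * grad, applied element by element."""
    return [t - eta * g for t, g in zip(theta, grad)]


theta = [0.0, 10.0]          # arbitrary starting parameters
for _ in range(100):         # repeat until (approximately) converged
    theta = sgd_step(theta, loss_grad(theta), eta=0.1)
```

With η = 0.1, each step shrinks the distance to the minimiser (3, 3) by a factor of 0.8, so the loop converges quickly. For this particular loss, a learning rate above 1.0 would make the updates overshoot and diverge, which is the sense in which η controls the update magnitude.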
A label recognition result includes the labels corresponding to the data item and the probability corresponding to each label. Sample data and transformed data are collectively referred to as data; when data are input into the image classification model for label prediction, the model outputs the label recognition result corresponding to the input data.
The image classification model may perform label prediction on the input data as follows:
First, the feature map of the input data is determined;
Second, dimension reduction is applied to the feature map to obtain an intermediate feature map;
Third, average pooling is applied to the intermediate feature map to obtain the feature vector corresponding to the intermediate feature map. The feature vector contains multiple points, each of which corresponds to one label and one probability; the labels with non-zero probability are output as the effective labels of the data, together with the probability corresponding to each effective label.
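The last of these steps, average pooling followed by selection of the effective labels, can be sketched as below. The sketch assumes that the intermediate feature map has one channel per label and that its values already lie in [0, 1], so that each pooled value can be read as a probability; the source does not state this, and the label names and function names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average each channel (a 2-D list) of the feature map down to a single value."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]


def effective_labels(probabilities, label_names):
    """Keep only the labels whose probability is non-zero."""
    return {name: p for name, p in zip(label_names, probabilities) if p > 0}


# Hypothetical intermediate feature map: three channels of 2 x 2 values.
intermediate = [
    [[1.0, 0.5], [0.5, 1.0]],      # channel assumed to correspond to "cat"
    [[0.0, 0.0], [0.0, 0.0]],      # channel assumed to correspond to "dog"
    [[0.25, 0.25], [0.25, 0.25]],  # channel assumed to correspond to "bird"
]
feature_vector = global_average_pool(intermediate)
result = effective_labels(feature_vector, ["cat", "dog", "bird"])
```

Here `result` plays the role of the label recognition result: the zero-probability label is dropped, and each remaining label is paired with its probability.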
Step 204: For each sample data item, determine the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
After label recognition by the image classification model, each sample data item corresponds to at least one label; in this step, a unique target label and target label probability must finally be determined for each sample data item by voting. A preferred way of determining the target label and target label probability of a sample data item by voting is as follows:
First, for each label of each sample data item, a weighted average is taken of the probabilities of that label for the sample data item and for its transformed data, giving the weighted average probability of the label.
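The weighted average probability of a single label of a single sample data item can be calculated by the following formula:
$$\bar{p}_i^{\,j} = \frac{1}{\#sample_i} \sum_{x \in sample_i^{ori} \cup sample_i^{trans}} p^{j}(x), \qquad j \in S$$
where $i$ is the sample data identifier, $j$ is the label identifier, $\bar{p}_i^{\,j}$ is the weighted average probability of label $j$ for sample data item $i$, and $p^{j}(x)$ is the probability of label $j$ in the label recognition result of data item $x$. In this formula, the probabilities of label $j$ for the sample data item and for each of its transformed data items are averaged, which gives the weighted average probability of the label. $\#sample_i$ is the total number of items in $sample_i^{ori}$ and $sample_i^{trans}$, and $S$ is the set of labels contained in the image classification model.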
Second, the maximum among the weighted average probabilities of the labels is determined;
Finally, the label corresponding to the maximum weighted average probability is determined as the target label of the sample data item, and the maximum weighted average probability is determined as the target label probability of the sample data item.
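The voting procedure above can be sketched in a few lines. The sketch assumes equal weights for the original and each transformed data item, treats a label missing from a recognition result as having probability 0, and breaks ties alphabetically; none of these choices is fixed by the source, and the function name `soft_vote` and the example labels are hypothetical.

```python
def soft_vote(recognition_results):
    """Determine (target_label, target_label_probability) for one sample data item.

    `recognition_results` holds one {label: probability} dict for the original
    item and one for each of its transformed data items (at least one dict).
    """
    labels = set().union(*recognition_results)
    count = len(recognition_results)
    # Equal-weight average; a label absent from a result counts as probability 0.
    averaged = {label: sum(r.get(label, 0.0) for r in recognition_results) / count
                for label in labels}
    # Iterating over sorted labels makes ties resolve deterministically.
    target = max(sorted(averaged), key=averaged.get)
    return target, averaged[target]


results_for_sample = [
    {"cat": 0.5, "dog": 0.5},    # original sample data item
    {"cat": 0.75, "dog": 0.25},  # e.g. a rotated copy
    {"cat": 1.0},                # e.g. a translated copy
]
target_label, target_probability = soft_vote(results_for_sample)
```

For this input the averaged probabilities are 0.75 for "cat" and 0.25 for "dog", so "cat" becomes the target label. Averaging probabilities rather than counting per-item winners lets a confident prediction outweigh several marginal ones.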
This procedure is repeated to determine the target label and target label probability of every sample data item. Once they are determined, the sample data are screened according to the target label and target label probability of each sample data item to obtain the target database. The specific screening process is described in steps 205 to 207.
Step 205: Group the sample data by target label.
Each group corresponds to one target label and contains at least one sample data item. The transformed data corresponding to the sample data are discarded directly and are not added to the groups.
Step 206: Sort the sample data within each group by target label probability.
Sample data with a larger target label probability rank higher.
Step 207: Select a preset quantity of the top-ranked sample data in each group to obtain the target database.
The preset quantity may be set by those skilled in the art according to actual needs and is not specifically limited in this embodiment.
In this step, the sample data within each group are sorted by target label probability, and the top k sample data items in each group are selected to constitute the target database. Only the sample data in the target database are retained; the sample data that were screened out, and the noise data in the data to be screened that were not extracted as sample data, are discarded. The sample data in the target database can then be used for expanded training of the image classification model.
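Steps 205 to 207 amount to a group, sort and top-k operation, sketched below. The tuple layout, the function name `screen_samples` and the example values are assumptions made for illustration; the target database is represented simply as a dict from target label to retained samples.

```python
from collections import defaultdict


def screen_samples(labelled_samples, preset_quantity):
    """Group by target label, sort by target label probability, keep the top k per group.

    `labelled_samples` is an iterable of (sample_id, target_label, probability).
    """
    groups = defaultdict(list)
    for sample_id, label, probability in labelled_samples:
        groups[label].append((sample_id, probability))       # step 205: grouping
    target_database = {}
    for label, members in groups.items():
        members.sort(key=lambda m: m[1], reverse=True)        # step 206: highest first
        target_database[label] = members[:preset_quantity]    # step 207: top k
    return target_database


labelled = [
    ("a", "cat", 0.9),
    ("b", "cat", 0.4),
    ("c", "cat", 0.7),
    ("d", "dog", 0.6),
]
target_db = screen_samples(labelled, preset_quantity=2)
```

With a preset quantity of 2, sample "b" is screened out of the "cat" group because it has the lowest target label probability, while the "dog" group keeps its only member. Lowering the preset quantity keeps fewer but more confidently labelled samples, which matches the precision trade-off described for the preset quantity.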
In addition to the beneficial effects of the data screening method shown in Embodiment one, the data screening method provided by this embodiment determines the target label and target label probability of the sample data by soft voting based on the probability of each label, which can improve the accuracy of the target labels of the sample data.
Embodiment three
Referring to Fig. 3, a structural block diagram of a data screening apparatus according to Embodiment three of the present invention is shown.
The data screening apparatus of this embodiment may include: an extraction module 301, configured to extract multiple noise data items from data to be screened as sample data; a transformation module 302, configured to transform each sample data item to obtain transformed data for each sample data item; a determining module 303, configured to perform label prediction on each sample data item and each transformed data item with a pre-trained image classification model, and to determine the target label and target label probability of each sample data item; and a screening module 304, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database.
Preferably, the screening module 304 may include: a grouping submodule 3041, configured to group the sample data by target label, where each group corresponds to one target label; a sorting submodule 3042, configured to sort the sample data within each group by target label probability, where sample data with a larger target label probability rank higher; and a generation submodule 3043, configured to select a preset quantity of the top-ranked sample data in each group to generate the target database.
Preferably, the determining module 303 may include: a recognition submodule 3031, configured to perform label prediction on each sample data item and each transformed data item with the pre-trained image classification model to obtain a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label; and a label determination submodule 3032, configured to determine, for each sample data item, the target label and target label probability of the sample data item according to the label recognition result of the sample data item and the label recognition results of its transformed data.
Preferably, the label determination submodule 3032 is specifically configured to: for each label, take a weighted average of the probabilities of that label for the sample data item and for its transformed data to obtain the weighted average probability of the label; determine the maximum among the weighted average probabilities of the labels; determine the label corresponding to the maximum weighted average probability as the target label of the sample data item; and determine the maximum weighted average probability as the target label probability of the sample data item.
Preferably, the transformation module 302 is specifically configured to: transform each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, where the preset transformation mode includes at least one of: rotation, translation and shearing.
The data screening apparatus of this embodiment is used to implement the corresponding data screening methods of Embodiment one and Embodiment two above, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Example IV
With reference to Fig. 4, a kind of structure diagram of terminal for garbled data of the embodiment of the present invention four is shown.
The terminal of the embodiment of the present invention may include:Memory, processor and storage are on a memory and can be in processor
The computer program of upper operation realizes any one heretofore described data screening when computer program is executed by processor
The step of method.
Fig. 4 is a kind of block diagram of data screening terminal 600 shown according to an exemplary embodiment.For example, terminal 600 can
To be mobile phone, computer, digital broadcast terminal, messaging devices, game console, tablet device, Medical Devices are good for
Body equipment, personal digital assistant etc..
Referring to Fig. 4, the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 602 may include one or more processors 620 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation of the terminal 600. Examples of such data include instructions for any application or method operated on the terminal 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power supply component 606 provides power to the various components of the terminal 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC); when the terminal 600 is in an operation mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the terminal 600. For example, the sensor component 614 can detect the on/off state of the terminal 600 and the relative positioning of components, for example the display and keypad of the terminal 600; the sensor component 614 can also detect a change in position of the terminal 600 or of a component of the terminal 600, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and a change in temperature of the terminal 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices. The terminal 600 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for executing the data screening method. Specifically, the data screening method includes: extracting multiple noise data from the data to be screened as sample data; performing transformation processing on each sample data to obtain transformed data of each sample data; performing label prediction on each sample data and each transformed data by means of a pre-trained image classification model, and determining the target label and target label probability of each sample data; and screening the sample data according to the target label and target label probability of each sample data to obtain a target database.
Preferably, the step of screening the sample data according to the target label and target label probability of each sample data to obtain the target database includes: grouping the sample data according to target label, wherein each group corresponds to one target label; sorting the sample data within the same group according to target label probability, wherein sample data ranked earlier has a larger target label probability value; and selecting a preset number of the top-ranked sample data in each group to generate the target database.
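The grouping, sorting and top-N selection described above can be sketched as follows. This is a minimal illustration only: the tuple layout `(sample id, target label, target label probability)`, the function name `screen` and the parameter `per_label` (standing in for the preset number) are assumptions, not part of the embodiment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (sample id, target label, target label probability)
Scored = Tuple[str, str, float]


def screen(samples: List[Scored], per_label: int) -> Dict[str, List[Scored]]:
    """Group by target label, sort each group by probability (descending),
    and keep the first `per_label` entries of every group."""
    groups: Dict[str, List[Scored]] = defaultdict(list)
    for sample in samples:
        groups[sample[1]].append(sample)

    target_database: Dict[str, List[Scored]] = {}
    for label, members in groups.items():
        ranked = sorted(members, key=lambda s: s[2], reverse=True)
        target_database[label] = ranked[:per_label]
    return target_database
```

Keeping a fixed number per group, rather than applying one global probability threshold, gives every target label the same representation in the resulting target database.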
Preferably, the step of performing label prediction on each sample data and each transformed data by means of the pre-trained image classification model and determining the target label and target label probability of each sample data includes: performing label prediction on each sample data and each transformed data by means of the pre-trained image classification model, to obtain a label recognition result for each sample data and for each transformed data respectively, wherein a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and, for each sample data, determining the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of the transformed data of the sample data.
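To make the shape of a label recognition result concrete, the sketch below runs a classifier over one sample and its transformed copies and collects one label-to-probability mapping per input. The classifier `toy_model` is a deliberately trivial stand-in invented for this example (it is not the pre-trained image classification model of the embodiment), and the names `LabelProbs` and `predict_all` are likewise hypothetical.

```python
from typing import Callable, Dict, List, Sequence

LabelProbs = Dict[str, float]  # label -> probability for one piece of data


def predict_all(
    model: Callable[[Sequence[int]], LabelProbs],
    sample: Sequence[int],
    transformed: List[Sequence[int]],
) -> List[LabelProbs]:
    """Return the recognition result of the sample first,
    followed by the results of its transformed copies."""
    return [model(sample)] + [model(t) for t in transformed]


def toy_model(data: Sequence[int]) -> LabelProbs:
    """Stand-in classifier: brighter pixel data leans towards the label 'day'."""
    mean = sum(data) / len(data)
    p_day = min(max(mean / 255.0, 0.0), 1.0)
    return {"day": p_day, "night": 1.0 - p_day}
```

The list returned by `predict_all` is the input that the subsequent step combines into a single target label and target label probability.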
Preferably, the step of determining the target label and the target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of the transformed data of the sample data includes: for each label, computing a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label; determining the maximum value among the weighted average probabilities of the labels; determining the label corresponding to the maximum weighted average probability as the target label of the sample data; and determining the maximum weighted average probability as the target label probability of the sample data.
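A minimal sketch of this weighted-average step is given below. This passage does not state how the weights are chosen, so the `weights` parameter (defaulting to equal weights) is an assumption, as are the function name `target_label` and the requirement that every recognition result carries the same set of labels.

```python
from typing import Dict, List, Optional, Tuple


def target_label(
    results: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Tuple[str, float]:
    """Combine per-input label probabilities into (target label, target label probability).

    results[0] is the recognition result of the sample itself; the remaining
    entries belong to its transformed copies.
    """
    if weights is None:
        weights = [1.0] * len(results)  # assumed default: plain average
    total = sum(weights)

    # Weighted average probability of each label across all inputs.
    averaged = {
        label: sum(w * r[label] for w, r in zip(weights, results)) / total
        for label in results[0]
    }

    # The label with the largest weighted average becomes the target label.
    best = max(averaged, key=averaged.get)
    return best, averaged[best]
```

For instance, weighting the original sample twice as heavily as each of two transformed copies corresponds to `weights=[2, 1, 1]`.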
Preferably, the step of transforming each sample data to obtain the transformed data of each sample data includes: transforming each sample data according to a preset transformation mode to obtain the transformed data of each sample data; wherein the preset transformation mode includes at least one of: rotation, translation and shearing.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 604 including instructions, where the instructions can be executed by the processor 620 of the terminal 600 to complete the above data screening method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. When the instructions in the storage medium are executed by the processor of the terminal, the terminal is enabled to execute the steps of any one of the data screening methods described in the present invention.
The terminal provided by this embodiment of the present invention performs data screening periodically. At each screening, it extracts sample data from the data generated by users in the interval between two screenings, that is, the data to be screened; transforms each sample data to perform data augmentation; determines the target label and target label probability of each sample data from the augmented data and the sample data; and screens the sample data according to the target label and target label probability of each sample data to obtain a target database. With the data screening scheme provided by this embodiment of the present invention, the user does not need to manually mark and screen the data to be screened one by one; data screening can be performed automatically by a computer program, which is simple to operate and takes little time, saving human resources and improving data screening efficiency.
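The notion of "data generated in the interval between two screenings" can be illustrated with a simple time-window filter. The record layout, the function name and the choice of a half-open interval (excluding the previous screening time, including the current one) are assumptions made for this sketch; the embodiment only states that screening is periodic.

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical record layout: (data id, time at which the user generated it)
Record = Tuple[str, datetime]


def data_to_be_screened(
    records: List[Record], previous: datetime, current: datetime
) -> List[Record]:
    """Return the records generated after the previous screening
    and no later than the current screening."""
    return [r for r in records if previous < r[1] <= current]
```

Using a half-open window ensures that a record created exactly at a screening time is picked up by exactly one screening run.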
As for the device embodiments, since they are basically similar to the method embodiments, the description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The data screening scheme provided herein is not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct a system embodying the scheme of the present invention is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented using various programming languages, and that the above description of a specific language is made to disclose the best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, features of the invention are sometimes grouped together into a single embodiment, figure or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the data screening scheme according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words may be interpreted as names.
Claims (12)
1. A data screening method, characterized in that the method comprises:
extracting multiple noise data from data to be screened as sample data;
performing transformation processing on each sample data to obtain transformed data of each sample data;
performing label prediction on each sample data and each transformed data by means of a pre-trained image classification model, and determining a target label and a target label probability of each sample data;
screening the sample data according to the target label and target label probability of each sample data to obtain a target database.
2. The method according to claim 1, characterized in that the step of screening the sample data according to the target label and target label probability of each sample data to obtain the target database comprises:
grouping the sample data according to target label, wherein each group corresponds to one target label;
sorting the sample data within the same group according to target label probability, wherein sample data ranked earlier has a larger target label probability value;
selecting a preset number of the top-ranked sample data in each group to generate the target database.
3. The method according to claim 1, characterized in that the step of performing label prediction on each sample data and each transformed data by means of the pre-trained image classification model and determining the target label and target label probability of each sample data comprises:
performing label prediction on each sample data and each transformed data by means of the pre-trained image classification model, to obtain a label recognition result for each sample data and for each transformed data respectively, wherein a label recognition result includes the labels corresponding to the data and the probability corresponding to each label;
for each sample data, determining the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of the transformed data of the sample data.
4. The method according to claim 3, characterized in that the step of determining the target label and the target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of the transformed data of the sample data comprises:
for each label, computing a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label;
determining the maximum value among the weighted average probabilities of the labels;
determining the label corresponding to the maximum weighted average probability as the target label of the sample data, and determining the maximum weighted average probability as the target label probability of the sample data.
5. The method according to claim 1, characterized in that the step of transforming each sample data to obtain the transformed data of each sample data comprises:
transforming each sample data according to a preset transformation mode to obtain the transformed data of each sample data, wherein the preset transformation mode includes at least one of: rotation, translation and shearing.
6. A data screening device, characterized in that the device comprises:
an extraction module, configured to extract multiple noise data from data to be screened as sample data;
a conversion module, configured to perform transformation processing on each sample data to obtain transformed data of each sample data;
a determining module, configured to perform label prediction on each sample data and each transformed data by means of a pre-trained image classification model, and to determine a target label and a target label probability of each sample data;
a screening module, configured to screen the sample data according to the target label and target label probability of each sample data to obtain a target database.
7. The device according to claim 6, characterized in that the screening module comprises:
a grouping submodule, configured to group the sample data according to target label, wherein each group corresponds to one target label;
a sorting submodule, configured to sort the sample data within the same group according to target label probability, wherein sample data ranked earlier has a larger target label probability value;
a generating submodule, configured to select a preset number of the top-ranked sample data in each group to generate the target database.
8. The device according to claim 6, characterized in that the determining module comprises:
a recognition submodule, configured to perform label prediction on each sample data and each transformed data by means of the pre-trained image classification model, to obtain a label recognition result for each sample data and for each transformed data respectively, wherein a label recognition result includes the labels corresponding to the data and the probability corresponding to each label;
a label determination submodule, configured to determine, for each sample data, the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of the transformed data of the sample data.
9. The device according to claim 8, characterized in that the label determination submodule is specifically configured to:
for each label, compute a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label; determine the maximum value among the weighted average probabilities of the labels; determine the label corresponding to the maximum weighted average probability as the target label of the sample data; and determine the maximum weighted average probability as the target label probability of the sample data.
10. The device according to claim 6, characterized in that the conversion module is specifically configured to:
transform each sample data according to a preset transformation mode to obtain the transformed data of each sample data, wherein the preset transformation mode includes at least one of: rotation, translation and shearing.
11. A terminal, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data screening method according to any one of claims 1 to 5.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the data screening method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810220055.1A CN108595497B (en) | 2018-03-16 | 2018-03-16 | Data screening method, apparatus and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108595497A (en) | 2018-09-28 |
CN108595497B (en) | 2019-09-27 |
Family
ID=63626547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810220055.1A Active CN108595497B (en) | 2018-03-16 | 2018-03-16 | Data screening method, apparatus and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108595497B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005509978A (en) * | 2001-11-16 | 2005-04-14 | チェン,ユアン,ヤン | Ambiguous neural network with supervised and unsupervised cluster analysis |
US7512273B2 (en) * | 2004-10-21 | 2009-03-31 | Microsoft Corporation | Digital ink labeling |
CN102880875A (en) * | 2012-10-12 | 2013-01-16 | 西安电子科技大学 | Semi-supervised learning face recognition method based on low-rank representation (LRR) graph |
CN104463202A (en) * | 2014-11-28 | 2015-03-25 | 苏州大学 | Multi-class image semi-supervised classifying method and system |
US9911033B1 (en) * | 2016-09-05 | 2018-03-06 | International Business Machines Corporation | Semi-supervised price tag detection |
CN106650721A (en) * | 2016-12-28 | 2017-05-10 | 吴晓军 | Industrial character identification method based on convolution neural network |
CN106960219A (en) * | 2017-03-10 | 2017-07-18 | 百度在线网络技术(北京)有限公司 | Image identification method and device, computer equipment and computer-readable medium |
CN107526785A (en) * | 2017-07-31 | 2017-12-29 | 广州市香港科大霍英东研究院 | File classification method and device |
Non-Patent Citations (3)
Title |
---|
方向: "伪标签:教你玩转无标签数据的半监督学习方法", 《阿里云》 * |
汪忠国 等: "稀疏混合图随机跳跃WEB对象多标签半监督分类", 《计算机科学与探索》 * |
追梦飞阳: "数据增强在卷积神经网络中的应用", 《CSDN》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109544150A (en) * | 2018-10-09 | 2019-03-29 | 阿里巴巴集团控股有限公司 | A kind of method of generating classification model and device calculate equipment and storage medium |
CN109598307A (en) * | 2018-12-06 | 2019-04-09 | 北京达佳互联信息技术有限公司 | Data screening method, apparatus, server and storage medium |
CN109657710A (en) * | 2018-12-06 | 2019-04-19 | 北京达佳互联信息技术有限公司 | Data screening method, apparatus, server and storage medium |
CN109657710B (en) * | 2018-12-06 | 2022-01-21 | 北京达佳互联信息技术有限公司 | Data screening method and device, server and storage medium |
CN110147850B (en) * | 2019-05-27 | 2021-12-07 | 北京达佳互联信息技术有限公司 | Image recognition method, device, equipment and storage medium |
CN110147850A (en) * | 2019-05-27 | 2019-08-20 | 北京达佳互联信息技术有限公司 | Method, apparatus, equipment and the storage medium of image recognition |
CN110348993A (en) * | 2019-06-28 | 2019-10-18 | 北京淇瑀信息科技有限公司 | Wind is discussed and select model workers determination method, determining device and the electronic equipment of type label |
CN110348993B (en) * | 2019-06-28 | 2023-12-22 | 北京淇瑀信息科技有限公司 | Determination method and determination device for label for wind assessment model and electronic equipment |
CN110807767A (en) * | 2019-10-24 | 2020-02-18 | 北京旷视科技有限公司 | Target image screening method and target image screening device |
WO2021139274A1 (en) * | 2020-06-09 | 2021-07-15 | 平安科技(深圳)有限公司 | Document classification method and apparatus based on deep learning model, and computer device |
CN113139628A (en) * | 2021-06-22 | 2021-07-20 | 腾讯科技(深圳)有限公司 | Sample image identification method, device and equipment and readable storage medium |
CN113139628B (en) * | 2021-06-22 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Sample image identification method, device and equipment and readable storage medium |
CN113837670A (en) * | 2021-11-26 | 2021-12-24 | 北京芯盾时代科技有限公司 | Risk recognition model training method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108595497B (en) | 2019-09-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |