CN108595497B - Data screening method, apparatus and terminal - Google Patents

Data screening method, apparatus and terminal

Info

Publication number
CN108595497B
Authority
CN
China
Prior art keywords
data
sample data
target labels
probability
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810220055.1A
Other languages
Chinese (zh)
Other versions
CN108595497A (en
Inventor
张志伟
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201810220055.1A
Publication of CN108595497A
Application granted
Publication of CN108595497B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

An embodiment of the invention provides a data screening method, apparatus and terminal. The data screening method includes: extracting multiple pieces of noise data from data to be screened as sample data; transforming each piece of sample data to obtain transformed data for each piece of sample data; performing label prediction on each piece of sample data and each piece of transformed data using a pre-trained image classification model, and determining the target label and target label probability of each piece of sample data; and screening the sample data according to the target label and target label probability of each piece of sample data to obtain a target database. With the data screening scheme provided by the embodiment, users do not need to label and screen the data to be screened manually one by one; screening is carried out automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.

Description

Data screening method, apparatus and terminal
Technical field
The present invention relates to the technical field of noise data screening, and in particular to a data screening method, apparatus and terminal.
Background
Recently, deep learning has achieved breakthrough progress in content understanding fields such as natural language processing and text translation. However, these advances depend heavily on the scale of the training data, so data is the most important bottleneck in applying these technologies to real production environments.
Taking a current data classification task as an example, the amount of data required for each label is generally on the order of thousands. The traditional method trains the model with fully supervised data: enough labeled data must first be obtained, and the model is then trained on that labeled data. However, obtaining large-scale labeled data from Internet data by manual labeling has the following shortcomings:
First, data on the order of thousands seems small, but the amount of data that has to be labeled is very large. In general, only about one training sample is obtained from every 10-20 labeled items, which means the human cost of labeling for each label rises sharply.
Second, a typical label system is itself very large, so labeling every label manually consumes a great deal of human resources. Moreover, data is generated continuously every day in the Internet environment, so it is practically impossible to label all of it manually, and labeling is difficult.
Summary of the invention
Embodiments of the present invention provide a data screening method, apparatus and terminal, to solve the prior-art problem that labeling the data generated daily in the Internet environment before screening it is difficult and incurs a high human cost.
According to one aspect of the present invention, a data screening method is provided, the method comprising: extracting multiple pieces of noise data from data to be screened as sample data; transforming each piece of sample data to obtain transformed data for each piece of sample data; performing label prediction on each piece of sample data and each piece of transformed data using a pre-trained image classification model, and determining the target label and target label probability of each piece of sample data; and screening the sample data according to the target label and target label probability of each piece of sample data to obtain a target database.
Optionally, the step of screening the sample data according to the target label and target label probability of each piece of sample data to obtain a target database comprises: grouping the sample data by target label, where each group corresponds to one target label; sorting the sample data within each group by target label probability, where sample data ranked earlier has a larger target label probability; and selecting the top-ranked preset quantity of sample data from each group to generate the target database.
Optionally, the step of performing label prediction on each piece of sample data and each piece of transformed data using a pre-trained image classification model, and determining the target label and target label probability of each piece of sample data, comprises: performing label prediction on each piece of sample data and each piece of transformed data using the pre-trained image classification model to obtain a label recognition result for each piece of sample data and each piece of transformed data, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and, for each piece of sample data, determining the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of its transformed data.
Optionally, the step of determining the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of its transformed data comprises: for each label, computing a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label; determining the maximum among the weighted average probabilities of the labels; determining the label corresponding to the maximum weighted average probability as the target label of the sample data; and determining the maximum weighted average probability as the target label probability of the sample data.
Optionally, the step of transforming each piece of sample data to obtain transformed data for each piece of sample data comprises: transforming each piece of sample data according to a preset transformation method to obtain the transformed data of each piece of sample data, where the preset transformation method includes at least one of: rotation, translation and shearing.
According to another aspect of the present invention, a data screening device is provided, the device comprising: an extraction module configured to extract multiple pieces of noise data from data to be screened as sample data; a transformation module configured to transform each piece of sample data to obtain transformed data for each piece of sample data; a determining module configured to perform label prediction on each piece of sample data and each piece of transformed data using a pre-trained image classification model, and to determine the target label and target label probability of each piece of sample data; and a screening module configured to screen the sample data according to the target label and target label probability of each piece of sample data to obtain a target database.
Optionally, the screening module includes: a grouping submodule configured to group the sample data by target label, where each group corresponds to one target label; a sorting submodule configured to sort the sample data within each group by target label probability, where sample data ranked earlier has a larger target label probability; and a generation submodule configured to select the top-ranked preset quantity of sample data from each group to generate the target database.
Optionally, the determining module includes: a recognition submodule configured to perform label prediction on each piece of sample data and each piece of transformed data using the pre-trained image classification model to obtain a label recognition result for each piece of sample data and each piece of transformed data, where a label recognition result includes the labels corresponding to the data and the probability corresponding to each label; and a label determination submodule configured to determine, for each piece of sample data, the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of its transformed data.
Optionally, the label determination submodule is specifically configured to: for each label, compute a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label; determine the maximum among the weighted average probabilities of the labels; determine the label corresponding to the maximum weighted average probability as the target label of the sample data; and determine the maximum weighted average probability as the target label probability of the sample data.
Optionally, the transformation module is specifically configured to transform each piece of sample data according to a preset transformation method to obtain the transformed data of each piece of sample data, where the preset transformation method includes at least one of: rotation, translation and shearing.
According to a further aspect of the present invention, a terminal is provided, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any one of the data screening methods described in the present invention.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the data screening methods described in the present invention.
Compared with the prior art, the present invention has the following advantages:
The data screening scheme provided by the embodiments of the present invention performs data screening periodically. At each screening, sample data is extracted from the data generated by users in the interval between two screenings, that is, the data to be screened; each piece of sample data is transformed for data augmentation; the target label and target label probability of each piece of sample data are determined from the augmented data and the sample data; and the sample data is screened according to the target label and target label probability of each piece of sample data to obtain a target database. With this scheme, users do not need to label and screen the data to be screened manually one by one; screening is carried out automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flow chart of the steps of a data screening method according to embodiment one of the present invention;
Fig. 2 is a flow chart of the steps of a data screening method according to embodiment two of the present invention;
Fig. 3 is a structural block diagram of a data screening device according to embodiment three of the present invention;
Fig. 4 is a structural block diagram of a terminal according to embodiment four of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Embodiment one
Referring to Fig. 1, a flow chart of the steps of a data screening method according to embodiment one of the present invention is shown.
The data screening method of this embodiment may comprise the following steps:
Step 101: extract multiple pieces of noise data from the data to be screened as sample data.
The data screening approach provided by the embodiments of the present invention is suitable for screening the large-scale noise data generated in users' historical operations, and the noise data may be images. For example, different users upload images to a platform, and the server periodically screens the images produced by users at a preset time interval; the images generated by user operations within the preset time interval are the data to be screened. The preset time interval may be one day, two days, 12 hours and so on, and is not specifically limited in the embodiments of the present invention. The embodiments describe a single data screening pass; in a specific implementation, the process described in the embodiments can be executed at each data screening.
In the embodiments of the present invention there is a pre-trained image classification model, which contains multiple labels and the training data corresponding to each label. When a data screening operation is executed, the management server uses this pre-trained image classification model to perform label prediction on the data.
The number of pieces of noise data extracted from the data to be screened can be adjusted by those skilled in the art according to actual needs. For example, noise data on the order of tens of millions or hundreds of millions of items may be extracted as sample data. The noise data may be extracted from the data to be screened at random.
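The random extraction described in step 101 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `extract_samples`, the `seed` parameter and the file names are hypothetical, and the counts are scaled down from the tens of millions mentioned above.

```python
import random

def extract_samples(data_to_screen, sample_count, seed=None):
    """Randomly draw up to sample_count items without replacement."""
    rng = random.Random(seed)  # a seeded generator keeps the draw reproducible
    k = min(sample_count, len(data_to_screen))
    return rng.sample(data_to_screen, k)

# Hypothetical data to be screened: identifiers of uploaded images.
pending = [f"img_{n}.jpg" for n in range(1000)]
samples = extract_samples(pending, 100, seed=0)
```

Items that are not drawn are simply left out of `samples`, which mirrors the way non-extracted noise data is later discarded.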
Step 102: transform each piece of sample data to obtain the transformed data of each piece of sample data.
The ways of transforming the sample data may include, but are not limited to, rotation, translation and shearing.
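As a rough sketch of the three transformations named above, the following pure-Python helpers operate on an image represented as a 2D list of pixel values. The representation, the function names, the fixed 90-degree rotation and the zero padding are all assumptions made for illustration; a real pipeline would normally use an image library.

```python
def rotate90(image):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate(image, dx, fill=0):
    """Shift every row dx pixels to the right (dx >= 0), padding with fill."""
    width = len(image[0])
    return [([fill] * dx + list(row))[:width] for row in image]

def shear(image, step=1, fill=0):
    """Horizontal shear: row r is shifted right by r * step pixels."""
    width = len(image[0])
    return [([fill] * (r * step) + list(row))[:width]
            for r, row in enumerate(image)]

def make_transformed(image):
    """Return one transformed copy per transformation method."""
    return [rotate90(image), translate(image, 1), shear(image)]

sample = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
transformed = make_transformed(sample)
```

Each sample here yields three transformed copies, so four inputs per sample would be passed to the classifier in the next step.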
Step 103: perform label prediction on each piece of sample data and each piece of transformed data using the pre-trained image classification model, and determine the target label and target label probability of each piece of sample data.
Each piece of sample data and its transformed data are input separately into the pre-trained image classification model for label prediction, yielding a label recognition result for each input. For the specific way of predicting data labels with a trained image classification model, refer to the related art; this is not specifically limited in the embodiments of the present invention.
The label recognition result of each piece of data includes at least one label and the probability of each label; the higher the probability of a label, the more likely the data belongs to the data category indicated by that label.
When determining the target label and target label probability of a piece of sample data, a single final label, that is, the target label, can be determined by voting over the label recognition results of the sample data and of its transformed data.
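One simple form of voting is a majority vote over the top label predicted for the sample data and for each of its transformed data. The sketch below assumes that form; using the vote share as the target label probability is also an assumption, since this embodiment does not specify how the probability is derived. The function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(top_labels):
    """Return the most frequent label and its share of the votes."""
    counts = Counter(top_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(top_labels)

# Top label predicted for the original sample and two transformed copies.
label, share = majority_vote(["cat", "cat", "dog"])
```

Embodiment two describes a probability-weighted alternative that uses the full per-label probabilities rather than only the top label.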
Step 104: screen the sample data according to the target label and target label probability of each piece of sample data to obtain a target database.
When screening the sample data, the sample data can be grouped by target label; a preset quantity of data is then selected from each group, and the selected sample data forms the target database.
After this round of data screening, only the sample data in the target database is retained; the sample data that was screened out, and the data in the data to be screened that was not extracted as sample data, is discarded. The sample data in the target database can then be used to expand the image classification model.
The preset quantity can be set by those skilled in the art according to actual needs and is not specifically limited in the embodiments of the present invention. The smaller the preset quantity, the more sample data is screened out and the less is retained; correspondingly, the precision of the sample data in the target database is higher.
The data screening method provided by this embodiment performs data screening periodically. At each screening, sample data is extracted from the data generated by users in the interval between two screenings, that is, the data to be screened; each piece of sample data is transformed for data augmentation; the target label and target label probability of each piece of sample data are determined from the augmented data and the sample data; and the sample data is screened according to the target label and target label probability of each piece of sample data to obtain a target database. With this method, users do not need to label and screen the data to be screened manually one by one; screening is carried out automatically by a computer program, which is convenient and fast, saves human resources, and improves data screening efficiency.
Embodiment two
Referring to Fig. 2, a flow chart of the steps of a data screening method according to embodiment two of the present invention is shown.
The data screening method of this embodiment may specifically comprise the following steps:
Step 201: extract multiple pieces of noise data from the data to be screened as sample data.
During historical operation, users can upload noise data such as images to the platform in real time, and the backend server that manages the platform can periodically screen the noise data generated during users' historical operations. The screening period can be set by those skilled in the art according to actual needs. The noise data generated by user operations in the interval between two adjacent screenings is the data to be screened.
In a specific implementation, multiple pieces of noise data can be extracted at random from the data to be screened as sample data, and the quantity of sample data extracted can be on the order of tens of millions or hundreds of millions. For example, the quantity of noise data that users generate on the platform each day is in the tens [...], but because database capacity is limited, several hundred million or several tens of millions of pieces of noise data are extracted as sample data and the remaining, non-extracted noise data is discarded.
The extracted sample data can form a database, which may be denoted $DB_{noise}$.
Step 202: transform each piece of sample data to obtain the transformed data of each piece of sample data.
After the sample data is transformed, each piece of sample data corresponds to one or more pieces of transformed data.
Sample data may be denoted $sample_i^{ori}$, and transformed data may be denoted $sample_i^{trans}$.
Preferably, for a piece of sample data, the total number of items formed by the sample data and its corresponding transformed data is odd.
Step 203: perform label prediction on each piece of sample data and each piece of transformed data using the pre-trained image classification model, obtaining a label recognition result for each piece of sample data and each piece of transformed data.
Before the data screening process is executed, the image classification model must be trained in advance. The trained image classification model contains multiple labels and the training data corresponding to each label, and the training data is clean data. For the specific way of training an image classification model on training data, refer to the related art; this is not specifically limited in the embodiments of the present invention. Training the image classification model essentially means updating the model parameters continuously until the model converges to a preset standard.
For example, stochastic gradient descent can be used: for the parameters θ of the image classification model, the gradient $\nabla_\theta L(\theta)$ of the loss function $L(\theta)$ is computed. This gradient is used to update the parameters of the image classification model continuously, and the value of θ is updated as $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where η is the learning rate, which controls the magnitude of each update of θ.
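The parameter update can be illustrated on a toy problem. The sketch below applies per-sample gradient descent to a two-parameter linear model with a squared-error loss; the model, the loss, the data and the learning rate are assumptions chosen only to keep the example small, and they stand in for the image classification model and its loss. Samples are visited in a fixed order so that the run is reproducible.

```python
def sgd_step(theta, grad, lr):
    """One update: theta <- theta - lr * grad, element by element."""
    return [t - lr * g for t, g in zip(theta, grad)]

def loss_grad(theta, x, y):
    """Gradient of L = (theta[0] * x + theta[1] - y) ** 2 for one sample."""
    err = theta[0] * x + theta[1] - y
    return [2 * err * x, 2 * err]

theta = [0.0, 0.0]
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
for epoch in range(2000):
    for x, y in data:
        theta = sgd_step(theta, loss_grad(theta, x, y), lr=0.02)
```

With a smaller learning rate each step moves θ less, so more passes are needed; with one that is too large the updates can overshoot and diverge.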
A label recognition result includes the labels corresponding to the data and the probability corresponding to each label. Sample data and transformed data are referred to collectively as data; when data is input into the image classification model for label prediction, the model outputs the label recognition result corresponding to the input data.
The image classification model can perform label prediction on the input data as follows:
First, determine the feature map of the input data;
Second, apply dimensionality reduction to the feature map to obtain an intermediate feature map;
Third, apply average pooling to the intermediate feature map to obtain the corresponding feature vector. The feature vector contains multiple points, each of which corresponds to one label and one probability; the labels with non-zero probability are output as the effective labels of the data, together with the probability corresponding to each effective label.
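The pooling and label-selection part of the three steps above can be sketched as follows. The example assumes that the intermediate feature map has one 2D channel per label and that the pooled value of a channel can be read directly as that label's probability; both are simplifying assumptions, and the function names and label names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each 2D map down to a single value, one per map."""
    vector = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        vector.append(sum(values) / len(values))
    return vector

def effective_labels(feature_maps, label_names):
    """Keep only the labels whose pooled probability is non-zero."""
    vector = global_average_pool(feature_maps)
    return {name: p for name, p in zip(label_names, vector) if p != 0}

# Hypothetical intermediate feature map with one channel per label.
maps = [
    [[0.8, 1.0], [0.6, 0.8]],  # channel for "cat"
    [[0.0, 0.0], [0.0, 0.0]],  # channel for "dog"
    [[0.2, 0.0], [0.4, 0.2]],  # channel for "car"
]
result = effective_labels(maps, ["cat", "dog", "car"])
```

Here "dog" pools to zero and is dropped, so only "cat" and "car" are output as effective labels.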
Step 204: for each piece of sample data, determine the target label and target label probability of the sample data according to the label recognition result of the sample data and the label recognition results of its transformed data.
After label recognition by the image classification model, each piece of sample data corresponds to at least one label. In this step, a unique target label and target label probability must be determined for each piece of sample data by voting. A preferred way of determining the target label and target label probability of the sample data by voting is as follows:
First, for each label of each piece of sample data, compute a weighted average of the probabilities of that label for the sample data and for the transformed data of the sample data, to obtain the weighted average probability of the label;
The weighted average probability of a single label of a single piece of sample data can be calculated by the following formula:
where i identifies the sample data, j identifies the label, and the result is the weighted average probability of label j for sample data i. In this formula, the probabilities of label j for the sample data and for each of its transformed data are weighted and averaged to give the weighted average probability of that label. $\#sample_i$ is the total number of $sample_i^{ori}$ and $sample_i^{trans}$ items, and S is the set of labels contained in the image classification model.
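The formula referred to above is not reproduced in this text. A form consistent with the surrounding definitions, assuming equal weights for the sample data and each of its transformed data, would be the following; the symbols $\bar{p}_i^{\,j}$ and $p^{j}(x)$ are notation introduced here for illustration and are not taken from the source.

```latex
\bar{p}_i^{\,j} \;=\; \frac{1}{\#sample_i}
  \sum_{x \,\in\, \{sample_i^{ori}\} \,\cup\, sample_i^{trans}} p^{j}(x),
  \qquad j \in S
```

Here $p^{j}(x)$ stands for the probability that the image classification model assigns to label j for input x, and $\bar{p}_i^{\,j}$ for the resulting average over the sample data and its transformed data.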
Second, determine the maximum among the weighted average probabilities of the labels;
Finally, determine the label corresponding to the maximum weighted average probability as the target label of the sample data, and determine the maximum weighted average probability as the target label probability of the sample data.
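The three voting steps above can be sketched in a few lines. This is an illustration under stated assumptions: each label recognition result is represented as a dictionary from label to probability, labels missing from a result are treated as probability zero, and the weights default to equal; the function name `soft_vote` and the `weights` parameter are hypothetical.

```python
def soft_vote(predictions, weights=None):
    """Weighted-average the per-label probabilities, then take the maximum."""
    if weights is None:
        weights = [1.0] * len(predictions)  # assumed: equal weights
    total = sum(weights)
    labels = set().union(*predictions)
    averaged = {
        label: sum(w * p.get(label, 0.0)
                   for w, p in zip(weights, predictions)) / total
        for label in labels
    }
    target = max(averaged, key=averaged.get)
    return target, averaged[target]

# Results for the original sample followed by two transformed copies.
preds = [
    {"cat": 0.7, "dog": 0.3},
    {"cat": 0.6, "dog": 0.4},
    {"cat": 0.4, "dog": 0.6},
]
label, prob = soft_vote(preds)
```

In this example one transformed copy favours "dog", but the averaged probability still favours "cat", at roughly 0.57.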
Repeating this procedure determines the target label and target label probability of each piece of sample data. Once these have been determined, the sample data is screened according to the target label and target label probability of each piece of sample data to obtain the target database. The specific screening process is described in the steps beginning with step 205.
Step 205: group the sample data by target label.
Each group corresponds to one target label and contains at least one piece of sample data; the transformed data corresponding to each piece of sample data is discarded directly and is not added to any group.
Step 206: sort the sample data within each group by target label probability.
Sample data ranked earlier has a larger target label probability.
Step 207: select the top-ranked preset quantity of sample data from each group to obtain the target database.
The preset quantity can be set by those skilled in the art according to actual needs and is not specifically limited in the embodiments of the present invention.
In this step, the sample data in each group is sorted by target label probability, and the top k pieces of sample data in each group are selected to form the target database. Only the sample data in the target database is retained; the sample data that is screened out, and the noise data in the data to be screened that was not extracted as sample data, is discarded. The sample data in the target database can then be used to expand the training of the image classification model.
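Steps 205 to 207 amount to a group, sort and truncate operation, which can be sketched as follows. The tuple layout `(sample_id, target_label, target_label_probability)`, the function name `screen` and the dictionary used as the target database are assumptions made for this illustration.

```python
from collections import defaultdict

def screen(samples, top_k):
    """Group by target label, sort by probability, keep the top_k per group."""
    groups = defaultdict(list)
    for sample_id, label, prob in samples:
        groups[label].append((sample_id, prob))
    target_db = {}
    for label, members in groups.items():
        members.sort(key=lambda m: m[1], reverse=True)  # highest first
        target_db[label] = members[:top_k]
    return target_db

scored = [
    ("a", "cat", 0.9),
    ("b", "cat", 0.4),
    ("c", "dog", 0.8),
    ("d", "cat", 0.7),
    ("e", "dog", 0.95),
]
db = screen(scored, top_k=2)
```

With `top_k=2`, sample "b" is screened out of the "cat" group; lowering `top_k` keeps fewer, higher-probability samples per label.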
In addition to the beneficial effects of the data screening method shown in embodiment one, the data screening method provided by this embodiment determines the target label and target label probability of the sample data by soft voting based on the probability of each label, which improves the accuracy of the target labels of the sample data.
Embodiment three
Referring to Fig. 3, a kind of structural block diagram of data screening device of the embodiment of the present invention three is shown.
The data screening device of the embodiment of the present invention may include: extraction module 301, be configured as from data to be screened Multiple noise datas are extracted as sample data;Conversion module 302 is configured as carrying out at transformation each sample data Reason, obtains the transformation data of each sample data;Determining module 303 is configured as through preparatory trained image classification Model carries out Tag Estimation to each sample data and each transformation data, determines the target mark of each sample data Label and target labels probability;Screening module 304 is configured as general according to the target labels and target labels of each sample data Rate screens each sample data, obtains target database.
Preferably, the screening module 304 may include: grouping submodule 3041, be configured as each sample number It is grouped according to according to target labels;Wherein, the corresponding target labels of each grouping;Sorting sub-module 3042, is configured as The sample data in same grouping is ranked up according to target labels probability;Wherein, the target for the preceding sample data that sorts Label probability value is big;Submodule 3043 is generated, is configured as screening the sample for the preceding preset quantity that obtains sorting in each grouping Data generate target database.
Preferably, the determining module 303 may include: identification submodule 3031, be configured as by training in advance Image classification model, Tag Estimation is carried out to each sample data and each transformation data, respectively obtains each sample The tag recognition result of notebook data and each transformation data;Wherein, tag recognition result includes: the corresponding each label of data Probability corresponding with each label;Label determines submodule 3032, is configured as each sample data, according to the sample The tag recognitions of data is as a result, with the tag recognitions of the transformation data of the sample data as a result, determining the sample data Target labels and target labels probability.
Preferably, the label determines that submodule 3032 is specifically configured to: each label is directed to, by the sample data The probability of the label corresponding with the transformation data of the sample data is weighted and averaged, and the weighting for obtaining the label is flat Equal probability;Determine the maximum value in the weighted average probability of each label;By the corresponding label of maximum weighted average probability, it is determined as The target labels of the sample data;The target labels that the maximum weighted average probability is determined as the sample data are general Rate.
Preferably, the conversion module 302 is specifically configured to: being carried out to each sample data according to default mapping mode Transformation, obtains the transformation data of each sample data;Wherein, default transform method includes at least one of: rotation, translation And shearing.
The data screening apparatus of this embodiment of the present invention is used to implement the corresponding data screening methods of Embodiment One and Embodiment Two above, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Embodiment Four
Referring to Fig. 4, a structural block diagram of a terminal for data screening according to Embodiment Four of the present invention is shown.
The terminal of this embodiment may include a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the computer program implements the steps of any of the data screening methods described in the present invention.
Fig. 4 is a block diagram of a data screening terminal 600 according to an exemplary embodiment. For example, the terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 4, the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 602 may include one or more processors 620 to execute instructions so as to perform all or some of the steps of the methods described above. In addition, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation of the terminal 600. Examples of such data include instructions for any application or method operating on the terminal 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 606 supplies power to the various components of the terminal 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC) that is configured to receive external audio signals when the terminal 600 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The input/output interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors that provide status assessments of various aspects of the terminal 600. For example, the sensor component 614 can detect the open/closed state of the terminal 600 and the relative positioning of components, such as the display and keypad of the terminal 600. The sensor component 614 can also detect a change in position of the terminal 600 or of one of its components, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and changes in the temperature of the terminal 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices. The terminal 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the data screening method. Specifically, the data screening method includes: extracting multiple noisy data items from the data to be screened as sample data; transforming each sample data item to obtain its transformed data; performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining the target label and target label probability of each sample data item; and screening the sample data according to the target label and target label probability of each sample data item to obtain the target database.
Preferably, the step of screening the sample data according to the target label and target label probability of each sample data item to obtain the target database includes: grouping the sample data by target label, where each group corresponds to one target label; sorting the sample data within each group by target label probability, where sample data ranked earlier have larger target label probabilities; and selecting the top preset number of sample data from each group to generate the target database.
Preferably, the step of performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model and determining the target label and target label probability of each sample data item includes: performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, where a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label; and, for each sample data item, determining its target label and target label probability from the label recognition result of that sample data item and the label recognition results of its transformed data.
Preferably, the step of determining the target label and target label probability of the sample data item from its label recognition result and the label recognition results of its transformed data includes: for each label, computing a weighted average of the probabilities of that label for the sample data item and for its transformed data, obtaining the weighted average probability of the label; determining the maximum among the weighted average probabilities of all labels; taking the label corresponding to the maximum weighted average probability as the target label of the sample data item; and taking the maximum weighted average probability as the target label probability of the sample data item.
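The step above can be written compactly as a formula. The notation is introduced here for illustration only and does not appear in the source: index i = 0 denotes the sample data item itself, i = 1, ..., n denote its transformed copies, p_{i,k} is the probability the model assigns to label k for copy i, and w_i are non-negative weights whose values the text leaves unspecified.

```latex
\bar{p}_k = \frac{\sum_{i=0}^{n} w_i \, p_{i,k}}{\sum_{i=0}^{n} w_i},
\qquad
k^{*} = \arg\max_{k} \bar{p}_k,
\qquad
P^{*} = \bar{p}_{k^{*}}
```

Here k* is the target label and P* is the target label probability. With equal weights the expression reduces to the plain mean over the sample and its transformed copies.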
Preferably, the step of transforming each sample data item to obtain its transformed data includes: transforming each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, where the preset transformation mode includes at least one of: rotation, translation, and shearing.
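Taken together, the method steps can be sketched end to end as below. Everything concrete in this example is an assumption made so that it runs on its own: `toy_classifier` is a stand-in for the pre-trained image classification model, `toy_transforms` stands in for the preset transformations, images are flat pixel lists, and the average uses equal weights.

```python
import random
from collections import defaultdict


def toy_classifier(pixels):
    """Hypothetical stand-in for the pre-trained model: label -> probability."""
    half = pixels[: len(pixels) // 2]
    bright = sum(half) / (255.0 * len(half))
    return {"bright": bright, "dark": 1.0 - bright}


def toy_transforms(pixels):
    """Hypothetical transformations: reversal and a cyclic shift by one pixel."""
    return [pixels[::-1], pixels[1:] + pixels[:1]]


def screen_pool(pool, sample_size, quota, seed=0):
    # Step 1: extract sample data from the data to be screened.
    samples = random.Random(seed).sample(pool, sample_size)
    groups = defaultdict(list)
    for pixels in samples:
        # Step 2: transform each sample to obtain its transformed data.
        views = [pixels] + toy_transforms(pixels)
        # Step 3: predict labels for the sample and its transformed copies,
        # then average per label (equal weights assumed) and take the maximum.
        preds = [toy_classifier(view) for view in views]
        averaged = {k: sum(p[k] for p in preds) / len(preds) for k in preds[0]}
        label = max(averaged, key=averaged.get)
        groups[label].append((averaged[label], pixels))
    # Step 4: per label, keep the `quota` samples with the highest probability.
    return {
        label: [pixels for _, pixels in
                sorted(items, key=lambda item: item[0], reverse=True)[:quota]]
        for label, items in groups.items()
    }


pool = [
    [250, 240, 10, 20],
    [10, 20, 30, 40],
    [200, 200, 200, 200],
    [0, 0, 0, 255],
]
database = screen_pool(pool, sample_size=4, quota=1)
```

Averaging over the transformed copies penalizes samples whose prediction is unstable under transformation, so the first image, which looks bright only in its original orientation, ranks below the uniformly bright third image.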
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 604 including instructions, where the instructions can be executed by the processor 620 of the terminal 600 to perform the data screening method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. When the instructions in the storage medium are executed by the processor of the terminal, the terminal is enabled to perform the steps of any of the data screening methods described in the present invention.
The terminal provided by this embodiment of the present invention performs data screening periodically. At each screening, it extracts sample data from the data generated by users in the interval between two consecutive screenings, that is, the data to be screened; transforms each sample data item to perform data augmentation; determines the target label and target label probability of each sample data item from the augmented data and the sample data; and screens the sample data according to those target labels and target label probabilities to obtain the target database. With the data screening scheme provided by this embodiment, users do not need to manually label and screen the data to be screened one by one; screening is performed automatically by a computer program, which is convenient to operate and takes little time, saving human resources and improving data screening efficiency.
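Selecting the data to be screened for one periodic run might look like the sketch below. The record layout (`id`, `created`), the function name `data_to_screen`, and the choice of a half-open interval (after the previous run, up to and including the current run) are assumptions; the text only states that the data generated between two screenings is used.

```python
from datetime import datetime


def data_to_screen(records, previous_run, current_run):
    """Return the records created after `previous_run` and up to `current_run`."""
    return [r for r in records if previous_run < r["created"] <= current_run]


# Hypothetical user-generated records with creation timestamps.
records = [
    {"id": 1, "created": datetime(2018, 3, 1, 9, 0)},
    {"id": 2, "created": datetime(2018, 3, 8, 9, 0)},
    {"id": 3, "created": datetime(2018, 3, 15, 9, 0)},
]
pending = data_to_screen(
    records,
    previous_run=datetime(2018, 3, 7),
    current_run=datetime(2018, 3, 14),
)
```

Using an interval that is open on one side means a record created exactly at a run boundary is picked up by exactly one run.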
Since the apparatus embodiment is substantially similar to the method embodiments, its description is relatively brief; for relevant details, refer to the description of the method embodiments.
The data screening scheme provided herein is not inherently tied to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct a system embodying the present scheme is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the data screening scheme according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program or computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (8)

1. A data screening method, characterized in that the method comprises:
extracting multiple noisy data items from data to be screened as sample data;
transforming each sample data item to obtain transformed data of each sample data item;
performing label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and determining a target label and a target label probability of each sample data item;
screening the sample data according to the target label and target label probability of each sample data item to obtain a target database;
wherein the step of performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model and determining the target label and target label probability of each sample data item comprises:
performing label prediction on each sample data item and each transformed data item using the pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, wherein a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label;
for each sample data item, determining the target label and target label probability of the sample data item from the label recognition result of the sample data item and the label recognition results of its transformed data;
wherein the step of determining the target label and target label probability of the sample data item from the label recognition result of the sample data item and the label recognition results of its transformed data comprises:
for each label, computing a weighted average of the probabilities of that label for the sample data item and for its transformed data, obtaining the weighted average probability of the label;
determining the maximum among the weighted average probabilities of all labels;
taking the label corresponding to the maximum weighted average probability as the target label of the sample data item, and taking the maximum weighted average probability as the target label probability of the sample data item.
2. The method according to claim 1, characterized in that the step of screening the sample data according to the target label and target label probability of each sample data item to obtain the target database comprises:
grouping the sample data by target label, wherein each group corresponds to one target label;
sorting the sample data within each group by target label probability, wherein sample data ranked earlier have larger target label probabilities;
selecting the top preset number of sample data from each group to generate the target database.
3. The method according to claim 1, characterized in that the step of transforming each sample data item to obtain transformed data of each sample data item comprises:
transforming each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, wherein the preset transformation mode includes at least one of: rotation, translation, and shearing.
4. A data screening apparatus, characterized in that the apparatus comprises:
an extraction module, configured to extract multiple noisy data items from data to be screened as sample data;
a conversion module, configured to transform each sample data item to obtain transformed data of each sample data item;
a determining module, configured to perform label prediction on each sample data item and each transformed data item using a pre-trained image classification model, and to determine a target label and a target label probability of each sample data item;
a screening module, configured to screen the sample data according to the target label and target label probability of each sample data item to obtain a target database;
wherein the determining module comprises:
an identification submodule, configured to perform label prediction on each sample data item and each transformed data item using the pre-trained image classification model, obtaining a label recognition result for each sample data item and each transformed data item, wherein a label recognition result includes the labels corresponding to the data item and the probability corresponding to each label;
a label determination submodule, configured to determine, for each sample data item, the target label and target label probability of the sample data item from the label recognition result of the sample data item and the label recognition results of its transformed data;
wherein the label determination submodule is specifically configured to:
for each label, compute a weighted average of the probabilities of that label for the sample data item and for its transformed data, obtaining the weighted average probability of the label; determine the maximum among the weighted average probabilities of all labels; take the label corresponding to the maximum weighted average probability as the target label of the sample data item; and take the maximum weighted average probability as the target label probability of the sample data item.
5. The apparatus according to claim 4, characterized in that the screening module comprises:
a grouping submodule, configured to group the sample data by target label, wherein each group corresponds to one target label;
a sorting submodule, configured to sort the sample data within each group by target label probability, wherein sample data ranked earlier have larger target label probabilities;
a generation submodule, configured to select the top preset number of sample data from each group to generate the target database.
6. The apparatus according to claim 4, characterized in that the conversion module is specifically configured to:
transform each sample data item according to a preset transformation mode to obtain the transformed data of each sample data item, wherein the preset transformation mode includes at least one of: rotation, translation, and shearing.
7. A terminal, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data screening method according to any one of claims 1 to 3.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the data screening method according to any one of claims 1 to 3.
CN201810220055.1A 2018-03-16 2018-03-16 Data screening method, apparatus and terminal Active CN108595497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810220055.1A CN108595497B (en) 2018-03-16 2018-03-16 Data screening method, apparatus and terminal

Publications (2)

Publication Number Publication Date
CN108595497A CN108595497A (en) 2018-09-28
CN108595497B true CN108595497B (en) 2019-09-27

Family

ID=63626547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810220055.1A Active CN108595497B (en) 2018-03-16 2018-03-16 Data screening method, apparatus and terminal

Country Status (1)

Country Link
CN (1) CN108595497B (en)

Also Published As

Publication number Publication date
CN108595497A (en) 2018-09-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant