CN109657710A - Data screening method, apparatus, server and storage medium - Google Patents

Data screening method, apparatus, server and storage medium

Info

Publication number
CN109657710A
CN109657710A (application number CN201811489982.XA; granted as CN109657710B)
Authority
CN
China
Prior art keywords
raw data
word vector
category label
data
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811489982.XA
Other languages
Chinese (zh)
Other versions
CN109657710B (en)
Inventor
张志伟
吴丽军
李铅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201811489982.XA priority Critical patent/CN109657710B/en
Publication of CN109657710A publication Critical patent/CN109657710A/en
Application granted granted Critical
Publication of CN109657710B publication Critical patent/CN109657710B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a data screening method, apparatus, server and storage medium, and belongs to the field of the Internet. The method includes: classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data; obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data; and determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data. By introducing the word segmentation tool and the word vector model, the text information in the raw data is represented in a vector form that a computer can process, which reduces the cost of manual annotation and increases the utilization of the massive raw data.

Description

Data screening method, apparatus, server and storage medium
Technical field
The present disclosure relates to the field of the Internet, and in particular to a data screening method, apparatus, server and storage medium.
Background
In the related art, deep learning has been widely applied in fields such as natural language processing and text translation. The accuracy of a deep learning model depends on the scale of its training data, so training data needs to be obtained by screening raw data collected from the Internet.
Taking image classification as an example, when training a deep model, the raw data first has to be manually annotated to obtain enough labeled data, and the training data is then filtered out of the labeled data. Because multiple groups of training data covering multiple labels are required, and the amount of training data for each label has to reach the order of thousands, only then can model training be carried out.
However, in the above process, to obtain training data on the order of thousands, 10-20 labeled samples need to be prepared for each piece of training data, so the labor cost invested in data annotation is very high. Moreover, since human resources are limited, it is impossible to manually annotate as much data as needed, so the massive raw data from the Internet is under-utilized.
Summary of the invention
The present disclosure provides a data screening method, apparatus, server and storage medium, which can overcome the problems of the high labor cost of data annotation and the insufficient utilization of data.
According to a first aspect of the embodiments of the present disclosure, a data screening method is provided. The method includes:
classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
In a possible implementation, obtaining, based on the word segmentation tool and the word vector model, the first word vector of each category label and the second word vector of the text information in each piece of raw data includes:
for each piece of raw data, extracting at least one word from the text information of the raw data using the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of each of the at least one word;
obtaining the average vector of the word vectors of the at least one word as the second word vector.
In a possible implementation, determining the target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data includes:
for each piece of raw data, obtaining the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data;
determining the raw data corresponding to a cosine distance smaller than a preset value as the target data.
In a possible implementation, the classification result further includes at least one prediction probability, and each prediction probability is used to indicate the possibility that a piece of raw data belongs to a category label.
In a possible implementation, classifying the multiple pieces of raw data using the classification model to obtain the classification result of each piece of raw data includes:
for each piece of raw data, inputting the raw data into the classification model, and outputting the prediction probabilities that the raw data belongs to the respective category labels, where each prediction probability corresponds to one category label;
obtaining the category labels corresponding to the prediction probabilities that satisfy a second preset condition as the at least one category label of the raw data.
In a possible implementation, obtaining the category labels corresponding to the prediction probabilities that satisfy the second preset condition as the at least one category label of the raw data includes:
when the maximum of the prediction probabilities is greater than a probability threshold, obtaining the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data; or,
when the maximum of the prediction probabilities is less than or equal to the probability threshold, obtaining the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data.
According to a second aspect of the embodiments of the present disclosure, a data screening apparatus is provided. The apparatus includes:
a classification unit, configured to classify multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
an obtaining unit, configured to obtain, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
a determination unit, configured to determine target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
In a possible implementation, the obtaining unit is further configured to:
for each piece of raw data, extract at least one word from the text information of the raw data using the word segmentation tool;
input each category label and the at least one word into the word vector model, and output the first word vector and a word vector of each of the at least one word;
obtain the average vector of the word vectors of the at least one word as the second word vector.
In a possible implementation, the determination unit is further configured to:
for each piece of raw data, obtain the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data;
determine the raw data corresponding to a cosine distance smaller than a preset value as the target data.
In a possible implementation, the classification result further includes at least one prediction probability, and each prediction probability is used to indicate the possibility that a piece of raw data belongs to a category label.
In a possible implementation, the classification unit includes:
an output subunit, configured to, for each piece of raw data, input the raw data into the classification model and output the prediction probabilities that the raw data belongs to the respective category labels, where each prediction probability corresponds to one category label;
an obtaining subunit, configured to obtain the category labels corresponding to the prediction probabilities that satisfy a second preset condition as the at least one category label of the raw data.
In a possible implementation, the obtaining subunit is further configured to:
when the maximum of the prediction probabilities is greater than a probability threshold, obtain the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data; or,
when the maximum of the prediction probabilities is less than or equal to the probability threshold, obtain the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data.
According to a third aspect of the embodiments of the present disclosure, a server is provided. The server includes:
a processor;
a memory for storing instructions executable by the processor;
where the processor is configured to:
classify multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
obtain, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
determine target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided. When instructions in the storage medium are executed by a processor of a server, the server is enabled to perform a data screening method, the method including:
classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
According to a fifth aspect of the embodiments of the present disclosure, an application program is provided, including one or more instructions. When the one or more instructions are executed by a processor of a server, the server is enabled to perform a data screening method, the method including:
classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
The raw data is classified using a classification model to obtain the classification result of each piece of raw data; based on a word segmentation tool and a word vector model, the first word vector of each category label and the second word vector of each piece of raw data are obtained, so that the raw data satisfying the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the raw data can be represented in a vector form that a computer can process, which reduces the cost of manual annotation, avoids the limitation on the utilization of raw data caused by limited human resources, and thus increases the utilization of the massive raw data.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
Fig. 1 is a flowchart of a data screening method according to an exemplary embodiment.
Fig. 2 is a flowchart of a data screening method according to an exemplary embodiment.
Fig. 3 is a schematic diagram of a data screening method according to an exemplary embodiment.
Fig. 4 is a block diagram of the logical structure of a data screening apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram of the logical structure of a server according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, and examples thereof are illustrated in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a flowchart of a data screening method according to an exemplary embodiment. As shown in Fig. 1, the data screening method is used in a server and includes the following steps:
In step 101, the server classifies multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label.
In step 102, the server obtains, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data.
In step 103, the server determines target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
In the method provided by the embodiments of the present disclosure, the raw data is classified using a classification model to obtain the classification result of each piece of raw data; based on a word segmentation tool and a word vector model, the first word vector of each category label and the second word vector of each piece of raw data are obtained, so that the raw data satisfying the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the raw data can be represented in a vector form that a computer can process, which reduces the cost of manual annotation, avoids the limitation on the utilization of raw data caused by limited human resources, and thus increases the utilization of the massive raw data.
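Purely as an illustration, the flow of steps 101-103 can be sketched in a few lines of Python. The helper names classify, segment and embed, the dictionary fields, and the preset value of 0.5 are assumptions made for this sketch and are not defined by the disclosure.

```python
# Minimal sketch of steps 101-103; every helper name here is a placeholder,
# not an API defined by the disclosure.
import numpy as np

def screen(raw_items, classify, segment, embed, preset_value=0.5):
    """raw_items: iterable of dicts with 'image' and 'text' fields."""
    targets = []
    for item in raw_items:
        labels = classify(item["image"])              # step 101: category labels
        words = segment(item["text"])                 # step 102: segment the text
        text_vec = np.mean([embed(w) for w in words], axis=0)
        for label in labels:                          # step 103: compare word vectors
            label_vec = embed(label)
            cos = float(np.dot(text_vec, label_vec) /
                        (np.linalg.norm(text_vec) * np.linalg.norm(label_vec)))
            # The disclosure computes cos(u, v), calls it the "cosine distance",
            # and keeps the item when that value is below the preset value.
            if cos < preset_value:
                targets.append((item, label))
                break
    return targets
```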
In a possible implementation, obtaining, based on the word segmentation tool and the word vector model, the first word vector of each category label and the second word vector of the text information in each piece of raw data includes:
for each piece of raw data, extracting at least one word from the text information of the raw data using the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of each of the at least one word;
obtaining the average vector of the word vectors of the at least one word as the second word vector.
In a possible implementation, determining the target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data includes:
for each piece of raw data, obtaining the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data;
determining the raw data corresponding to a cosine distance smaller than a preset value as the target data.
In a possible implementation, the classification result further includes at least one prediction probability, and each prediction probability is used to indicate the possibility that a piece of raw data belongs to a category label.
In a possible implementation, classifying the multiple pieces of raw data using the classification model to obtain the classification result of each piece of raw data includes:
for each piece of raw data, inputting the raw data into the classification model, and outputting the prediction probabilities that the raw data belongs to the respective category labels, where each prediction probability corresponds to one category label;
obtaining the category labels corresponding to the prediction probabilities that satisfy a second preset condition as the at least one category label of the raw data.
In a possible implementation, obtaining the category labels corresponding to the prediction probabilities that satisfy the second preset condition as the at least one category label of the raw data includes:
when the maximum of the prediction probabilities is greater than a probability threshold, obtaining the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data; or,
when the maximum of the prediction probabilities is less than or equal to the probability threshold, obtaining the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data.
All the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be repeated here one by one.
Fig. 2 is a flowchart of a data screening method according to an exemplary embodiment. As shown in Fig. 2, the data screening method is used in a server and includes the following steps:
In step 201, for each piece of raw data, the server inputs the raw data into a classification model and outputs the prediction probabilities that the raw data belongs to the respective category labels, where each prediction probability corresponds to one category label.
Each piece of raw data may include text information and image information. The data scale of the multiple pieces of raw data may be on the order of millions or even hundreds of millions; the embodiments of the present disclosure do not specifically limit the data scale of the multiple pieces of raw data. Optionally, the multiple pieces of raw data may be data obtained at random from a UGC (user generated content) website platform, or may be data extracted at random from an existing database; the embodiments of the present disclosure do not specifically limit the manner of obtaining the multiple pieces of raw data.
Optionally, the classification model may classify the image information input into the model through a convolutional neural network: a feature map of each piece of raw data is obtained through multiple convolutional layers, the feature map is processed non-linearly using an activation function, and the non-linearly processed output is then fed into a discrimination network to output the category labels and prediction probabilities. The activation function may be a sigmoid function, a tanh function or a ReLU function; the embodiments of the present disclosure do not limit the form of the activation function. For example, using sigmoid as the activation function maps the variable into the interval (0, 1), so that accurate category labels can be obtained from massive raw data whose features differ greatly.
The category label may indicate, in the form of a label, the category of the image information input into the model. For example, the category label may be "cat", "dog", "monkey" or "person", indicating the category of the image information. The prediction probability may indicate, as a numerical value, the possibility of belonging to a certain category label. For example, the prediction probability that a piece of raw data belongs to the category label "person" may be 0.8, which means the classification model predicts an 80% possibility that the raw data is a portrait.
Regarding step 201, Fig. 3 is a schematic diagram of a data screening method according to an exemplary embodiment. Referring to Fig. 3, assume that the classification model uses L category labels. Taking the i-th piece of raw data as an example, the i-th piece of raw data is input into the classification model, which outputs L prediction probabilities that the i-th piece of raw data belongs to the respective category labels, where each prediction probability indicates the possibility that the i-th piece of raw data belongs to one category label, L and i are positive integers, and the i-th piece of raw data is any piece of the multiple pieces of raw data. The above classification process can be carried out for each piece of raw data, which is not repeated here.
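The disclosure does not fix a particular network architecture or framework. The toy classifier below, written with PyTorch as an assumption, only illustrates how a convolutional model followed by a sigmoid activation can output L prediction probabilities in (0, 1) for one input image; the layer sizes and the value L = 4 are placeholders.

```python
# Illustrative only: a toy convolutional classifier that outputs L per-label
# prediction probabilities for one image. The architecture is a placeholder,
# not the model defined by the disclosure.
import torch
import torch.nn as nn

L = 4  # number of category labels, e.g. "cat", "dog", "monkey", "person"

class ToyClassifier(nn.Module):
    def __init__(self, num_labels=L):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),                # activation after the convolutional layer
            nn.AdaptiveAvgPool2d(1),  # collapse the feature map
        )
        self.head = nn.Linear(16, num_labels)

    def forward(self, x):
        h = self.features(x).flatten(1)
        # Sigmoid maps each logit into (0, 1), giving one prediction
        # probability per category label.
        return torch.sigmoid(self.head(h))

probs = ToyClassifier()(torch.randn(1, 3, 224, 224))  # shape: (1, L)
```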
In step 202, when the maximum of the prediction probabilities is greater than a probability threshold, the server obtains the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data.
Optionally, the probability threshold may be a default value of the server, or may be a value obtained according to a preset rule. Based on the above example, the preset rule may be to take the median of the L prediction probabilities as the probability threshold, or to take the average of the L prediction probabilities as the probability threshold; the embodiments of the present disclosure do not specifically limit the manner of obtaining the probability threshold.
The above step 202 is one possible implementation of obtaining the at least one category label of any piece of raw data, namely using a classification model to filter out category labels with relatively large prediction probabilities and relatively high classification accuracy. In some embodiments, step 202 may be replaced in the following manner: when the maximum of the prediction probabilities is less than or equal to the probability threshold, the server obtains the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data. That is, if all L prediction probabilities of a piece of raw data are less than or equal to the probability threshold, the category label corresponding to the maximum prediction probability is obtained, so as to avoid the raw data having no corresponding category label.
Therefore, the category labels obtained in step 202 can be expressed by the following expression: label_i = { prediction_i^l | prediction_i^l > prob_threshold }, where label_i is the at least one category label corresponding to the i-th piece of raw data, prediction_i^l is the l-th unscreened category label (i.e. its prediction probability) of the i-th piece of raw data, and prob_threshold is the probability threshold.
Correspondingly, the category label obtained in the alternative to step 202 can be expressed by the following expression: label_i = argmax(prediction_i), where label_i is the at least one category label corresponding to the i-th piece of raw data, prediction_i is any unscreened category label of the i-th piece of raw data, and the argmax(·) function indicates the index position of the maximum of its input.
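A small sketch of the selection rule of step 202 and its alternative, under the assumptions that the prediction probabilities arrive as a NumPy array and that the probability threshold is 0.5; both assumptions are illustrative only.

```python
# Sketch of step 202 and its fallback: keep every label index whose prediction
# probability exceeds the threshold; if none exceeds it, fall back to the
# single highest-probability label.
import numpy as np

def select_labels(predictions, prob_threshold=0.5):
    """predictions: 1-D array of L prediction probabilities for one item."""
    predictions = np.asarray(predictions)
    if predictions.max() > prob_threshold:
        keep = np.flatnonzero(predictions > prob_threshold)
    else:
        keep = [int(np.argmax(predictions))]  # argmax gives the index of the maximum
    return list(keep)

# e.g. select_labels([0.8, 0.6, 0.1], 0.5) -> [0, 1]
#      select_labels([0.4, 0.3, 0.2], 0.5) -> [0]
```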
In step 203, for each piece of raw data, the server extracts at least one word from the text information of the raw data using a word segmentation tool.
The word segmentation tool is used to extract the words in the text information. For example, the text information of the i-th piece of raw data is "我喜欢火锅" ("I like hot pot"); after the text information is processed by the word segmentation tool, the three words "我" ("I"), "喜欢" ("like") and "火锅" ("hot pot") can be extracted. The word segmentation tool may be jieba or the like; the embodiments of the present disclosure do not specifically limit the content of the text information or the implementation of the word segmentation tool.
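For illustration, the jieba tool mentioned above can be used as follows to segment the example text; the segmentation shown in the comment is the expected output for this input.

```python
# Example of extracting words from the text information with the jieba
# segmentation tool (pip install jieba).
import jieba

text = "我喜欢火锅"               # "I like hot pot"
words = list(jieba.cut(text))    # expected: ['我', '喜欢', '火锅']
```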
In step 204, the server inputs each category label and the at least one word into a word vector model, and outputs the first word vector and a word vector of each of the at least one word.
The word vector model (word embedding) can obtain the word vector of an input word through a word embedding, so that the text information is represented in a vector form that a computer can process. For example, the word vector model may be a Chinese word vector model such as ChineseWord2Vector. The first word vectors are the L word vectors corresponding to the L category labels, and the at least one word consists of the words that the server extracts from the text information of each piece of raw data using the word segmentation tool in the above step 203.
In step 205, the server obtains the average vector of the word vectors of the at least one word as the second word vector.
The second word vector is the word vector corresponding to the text information in each piece of raw data, and its expression can be: Vector_i^d = (1 / #Word_i) * Σ_word Embedding(word), where Vector_i^d is the second word vector of dimension d of the i-th piece of raw data, #Word_i is the number of words obtained after segmenting the text information of the i-th piece of raw data, Embedding is the word vector model, and d is the dimension of the word vectors.
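As a sketch of steps 204-205, the following code averages the word vectors of the segmented words, assuming a pretrained gensim KeyedVectors model; the model file name is a placeholder, and the disclosure does not prescribe gensim or any particular pretrained vectors.

```python
# Sketch of steps 204-205: look up a word vector for each segmented word and
# average them to obtain the item's second word vector.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("chinese_word2vec.kv")   # hypothetical pretrained model
words = ["我", "喜欢", "火锅"]                    # output of the segmentation step
vectors = [wv[w] for w in words if w in wv]      # d-dimensional word vectors
second_vector = np.mean(vectors, axis=0)         # Vector_i^d = average of word vectors
```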
Through the above steps 204-205, the word vector of each piece of raw data and the word vector of each category label can be obtained, so that the following step 206 can use the cosine distance to judge whether the first preset condition is satisfied between the first word vector and the second word vector. This avoids the labor cost consumed by manual annotation, and also avoids the limitation on the utilization of raw data caused by limited human resources, thereby increasing the utilization of the massive raw data.
In step 206, for each piece of raw data, the server obtains the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data.
In the above process, taking the i-th piece of raw data as an example, suppose the category labels output for the i-th piece of raw data through steps 201-202 are "cat", "dog" and "monkey", the first word vector of the category label "cat" is denoted Cat_vector_i, the first word vector of the category label "dog" is denoted Dog_vector_i, the first word vector of the category label "monkey" is denoted Monkey_vector_i, and the second word vector of the i-th piece of raw data is denoted Vector_i. Then the cosine distances between the second word vector and the first word vectors of the three category labels are: distance_i = cos(Vector_i, Cat_vector_i) = 0.9, distance_i = cos(Vector_i, Dog_vector_i) = 0.6, distance_i = cos(Vector_i, Monkey_vector_i) = 0.3. The i-th piece of raw data is used here only as an illustration; in practice, similar steps can be carried out for each piece of raw data to obtain the cosine distances between the second word vector of the raw data and the first word vectors of each corresponding category label, which is not repeated here.
In step 207, the server determines the raw data corresponding to a cosine distance smaller than a preset value as the target data.
The preset value may be a default threshold of the server; for example, the preset value may be 0.5. Based on the above example, since the cosine distance between the second word vector and the category label "monkey" is smaller than the preset value, that is, distance_i = cos(Vector_i, Monkey_vector_i) = 0.3 < 0.5, the first preset condition is considered to be satisfied between the second word vector of the i-th piece of raw data and the first word vector of the category label "monkey", so the i-th piece of raw data is obtained as target data, with "monkey" as the category label corresponding to the target data. In the above process, if all the cosine distances of a piece of raw data are greater than or equal to the preset value, the raw data will not be obtained as target data; the classification result is considered to be mispredicted, and the raw data is noise data.
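The filtering of steps 206-207 can be sketched as follows, keeping the disclosure's convention that the value cos(u, v) itself is called the "cosine distance" and that a piece of raw data is kept when that value is below the preset value; the helper and parameter names are illustrative only.

```python
# Sketch of steps 206-207: compute the "cosine distance" between the item's
# second word vector and each label's first word vector, and keep the labels
# whose value falls below the preset value (items with no hit are noise data).
import numpy as np

def cos_value(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_target_labels(second_vector, label_vectors, preset_value=0.5):
    """label_vectors: dict mapping each category label to its first word vector."""
    distances = {lab: cos_value(second_vector, vec)
                 for lab, vec in label_vectors.items()}
    return [lab for lab, d in distances.items() if d < preset_value]
```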
In the method provided by the embodiments of the present disclosure, the raw data is classified using a classification model to obtain the classification result of each piece of raw data; based on a word segmentation tool and a word vector model, the first word vector of each category label and the second word vector of each piece of raw data are obtained, so that the raw data satisfying the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the raw data can be represented in a vector form that a computer can process, which reduces the cost of manual annotation, avoids the limitation on the utilization of raw data caused by limited human resources, and thus increases the utilization of the massive raw data. Further, by obtaining the category labels whose prediction probabilities are greater than the probability threshold as the at least one category label of the raw data, noise data that is difficult to classify is filtered out of the raw data. In addition, by obtaining the average vector of the word vectors of the at least one word in the text information as the second word vector, each piece of raw data is described by a single word vector, and the target data is then determined according to the cosine distance between the first word vector and the second word vector, making the data screening more accurate.
All the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be repeated here one by one.
Fig. 4 is a block diagram of the logical structure of a data screening apparatus according to an exemplary embodiment. Referring to Fig. 4, the apparatus includes a classification unit 401, an obtaining unit 402 and a determination unit 403:
the classification unit 401 is configured to classify multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
the obtaining unit 402 is configured to obtain, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
the determination unit 403 is configured to determine target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
In the apparatus provided by the embodiments of the present disclosure, the raw data is classified using a classification model to obtain the classification result of each piece of raw data; based on a word segmentation tool and a word vector model, the first word vector of each category label and the second word vector of each piece of raw data are obtained, so that the raw data satisfying the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the raw data can be represented in a vector form that a computer can process, which reduces the cost of manual annotation, avoids the limitation on the utilization of raw data caused by limited human resources, and thus increases the utilization of the massive raw data.
In a possible implementation, the obtaining unit 402 is further configured to:
for each piece of raw data, extract at least one word from the text information of the raw data using the word segmentation tool;
input each category label and the at least one word into the word vector model, and output the first word vector and a word vector of each of the at least one word;
obtain the average vector of the word vectors of the at least one word as the second word vector.
In a possible implementation, the determination unit 403 is further configured to:
for each piece of raw data, obtain the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data;
determine the raw data corresponding to a cosine distance smaller than a preset value as the target data.
In a possible implementation, the classification result further includes at least one prediction probability, and each prediction probability is used to indicate the possibility that a piece of raw data belongs to a category label.
In a possible implementation, based on the apparatus composition of Fig. 4, the classification unit 401 includes:
an output subunit, configured to, for each piece of raw data, input the raw data into the classification model and output the prediction probabilities that the raw data belongs to the respective category labels, where each prediction probability corresponds to one category label;
an obtaining subunit, configured to obtain the category labels corresponding to the prediction probabilities that satisfy a second preset condition as the at least one category label of the raw data.
In a possible implementation, the obtaining subunit is further configured to:
when the maximum of the prediction probabilities is greater than a probability threshold, obtain the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data; or,
when the maximum of the prediction probabilities is less than or equal to the probability threshold, obtain the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data.
All the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be repeated here one by one.
Regarding the apparatus in the above embodiments, the specific manner in which each unit performs its operations has been described in detail in the embodiments related to the method, and will not be elaborated here.
Fig. 5 is a block diagram of the logical structure of a server according to an exemplary embodiment. The server 500 may vary greatly due to differences in configuration or performance, and may include one or more processors (central processing units, CPU) 501 and one or more memories 502, where the memory 502 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 501 to implement the data screening method provided by each of the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for performing input and output, and the server may further include other components for implementing device functions, which will not be described here.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 502 including instructions, and the above instructions can be executed by the processor 501 of the server 500 to complete the above data screening method. The method includes: classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label; obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data; determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label. Optionally, the above instructions can also be executed by the processor 501 of the server 500 to complete other steps involved in the above exemplary embodiments. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, an application program is also provided, including one or more instructions, and the one or more instructions can be executed by the processor 501 of the server 500 to complete the above data screening method. The method includes: classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, where each piece of raw data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label; obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data; determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, where a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label. Optionally, the above instructions can also be executed by the processor 501 of the server 500 to complete other steps involved in the above exemplary embodiments.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or conventional techniques in the art not disclosed by the present disclosure. The specification and examples are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are pointed out by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data screening method, characterized in that the method comprises:
classifying multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, wherein each piece of raw data comprises text information and image information, the classification model is used to classify the image information, and the classification result comprises at least one category label;
obtaining, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
determining target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, wherein a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
2. The data screening method according to claim 1, characterized in that obtaining, based on the word segmentation tool and the word vector model, the first word vector of each category label and the second word vector of the text information in each piece of raw data comprises:
for each piece of raw data, extracting at least one word from the text information of the raw data using the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of each of the at least one word;
obtaining the average vector of the word vectors of the at least one word as the second word vector.
3. The data screening method according to claim 1, characterized in that determining the target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data comprises:
for each piece of raw data, obtaining the cosine distance between the second word vector of the raw data and the first word vector of each category label corresponding to the raw data;
determining the raw data corresponding to a cosine distance smaller than a preset value as the target data.
4. The data screening method according to claim 1, characterized in that the classification result further comprises at least one prediction probability, and each prediction probability is used to indicate the possibility that a piece of raw data belongs to a category label.
5. The data screening method according to claim 4, characterized in that classifying the multiple pieces of raw data using the classification model to obtain the classification result of each piece of raw data comprises:
for each piece of raw data, inputting the raw data into the classification model, and outputting the prediction probabilities that the raw data belongs to the respective category labels, wherein each prediction probability corresponds to one category label;
obtaining the category labels corresponding to the prediction probabilities that satisfy a second preset condition as the at least one category label of the raw data.
6. The data screening method according to claim 5, characterized in that obtaining the category labels corresponding to the prediction probabilities that satisfy the second preset condition as the at least one category label of the raw data comprises:
when the maximum of the prediction probabilities is greater than a probability threshold, obtaining the category labels corresponding to the prediction probabilities greater than the probability threshold as the at least one category label of the raw data; or,
when the maximum of the prediction probabilities is less than or equal to the probability threshold, obtaining the category label corresponding to the maximum of the prediction probabilities as the category label of the raw data.
7. A data screening apparatus, characterized in that the apparatus comprises:
a classification unit, configured to classify multiple pieces of raw data using a classification model to obtain a classification result for each piece of raw data, wherein each piece of raw data comprises text information and image information, the classification model is used to classify the image information, and the classification result comprises at least one category label;
an obtaining unit, configured to obtain, based on a word segmentation tool and a word vector model, a first word vector for each category label and a second word vector for the text information in each piece of raw data;
a determination unit, configured to determine target data from the multiple pieces of raw data based on the first word vector of each category label and the second word vector of the text information in each piece of raw data, wherein a first preset condition is satisfied between the second word vector of the text information of the target data and the first word vector of a category label.
8. The data screening apparatus according to claim 7, characterized in that the obtaining unit is further configured to:
for each piece of raw data, extract at least one word from the text information of the raw data using the word segmentation tool;
input each category label and the at least one word into the word vector model, and output the first word vector and a word vector of each of the at least one word;
obtain the average vector of the word vectors of the at least one word as the second word vector.
9. A server, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the operations performed by the data screening method according to any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a server, the server is enabled to perform the operations performed by the data screening method according to any one of claims 1 to 6.
CN201811489982.XA 2018-12-06 2018-12-06 Data screening method and device, server and storage medium Active CN109657710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811489982.XA CN109657710B (en) 2018-12-06 2018-12-06 Data screening method and device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811489982.XA CN109657710B (en) 2018-12-06 2018-12-06 Data screening method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN109657710A true CN109657710A (en) 2019-04-19
CN109657710B CN109657710B (en) 2022-01-21

Family

ID=66112715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811489982.XA Active CN109657710B (en) 2018-12-06 2018-12-06 Data screening method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN109657710B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543920A (en) * 2019-09-12 2019-12-06 北京达佳互联信息技术有限公司 Performance detection method and device of image recognition model, server and storage medium
CN113159921A (en) * 2021-04-23 2021-07-23 上海晓途网络科技有限公司 Overdue prediction method and device, electronic equipment and storage medium

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140272822A1 (en) * 2013-03-14 2014-09-18 Canon Kabushiki Kaisha Systems and methods for generating a high-level visual vocabulary
CN105279517A (en) * 2015-09-30 2016-01-27 西安电子科技大学 Weak tag social image recognition method based on semi-supervision relation theme model
CN106529606A (en) * 2016-12-01 2017-03-22 中译语通科技(北京)有限公司 Method of improving image recognition accuracy
CN106528588A (en) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources for text information
CN106971154A (en) * 2017-03-16 2017-07-21 天津大学 Pedestrian's attribute forecast method based on length memory-type recurrent neural network
US20170255840A1 (en) * 2014-11-26 2017-09-07 Captricity, Inc. Analyzing content of digital images
US20170330054A1 (en) * 2016-05-10 2017-11-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
CN107391703A (en) * 2017-07-28 2017-11-24 北京理工大学 The method for building up and system of image library, image library and image classification method
CN107563444A (en) * 2017-09-05 2018-01-09 浙江大学 A kind of zero sample image sorting technique and system
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN108319672A (en) * 2018-01-25 2018-07-24 南京邮电大学 Mobile terminal malicious information filtering method and system based on cloud computing
CN108537240A (en) * 2017-03-01 2018-09-14 华东师范大学 Commodity image semanteme marking method based on domain body
CN108595497A (en) * 2018-03-16 2018-09-28 北京达佳互联信息技术有限公司 Data screening method, apparatus and terminal
CN108629043A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 Extracting method, device and the storage medium of webpage target information
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108734212A (en) * 2018-05-17 2018-11-02 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of determining classification results
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140272822A1 (en) * 2013-03-14 2014-09-18 Canon Kabushiki Kaisha Systems and methods for generating a high-level visual vocabulary
US20170255840A1 (en) * 2014-11-26 2017-09-07 Captricity, Inc. Analyzing content of digital images
CN105279517A (en) * 2015-09-30 2016-01-27 西安电子科技大学 Weak tag social image recognition method based on semi-supervision relation theme model
US20170330054A1 (en) * 2016-05-10 2017-11-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
CN106528588A (en) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources for text information
CN106529606A (en) * 2016-12-01 2017-03-22 中译语通科技(北京)有限公司 Method of improving image recognition accuracy
CN108537240A (en) * 2017-03-01 2018-09-14 华东师范大学 Commodity image semanteme marking method based on domain body
CN106971154A (en) * 2017-03-16 2017-07-21 天津大学 Pedestrian's attribute forecast method based on length memory-type recurrent neural network
CN107391703A (en) * 2017-07-28 2017-11-24 北京理工大学 The method for building up and system of image library, image library and image classification method
CN107563444A (en) * 2017-09-05 2018-01-09 浙江大学 A kind of zero sample image sorting technique and system
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN108319672A (en) * 2018-01-25 2018-07-24 南京邮电大学 Mobile terminal malicious information filtering method and system based on cloud computing
CN108595497A (en) * 2018-03-16 2018-09-28 北京达佳互联信息技术有限公司 Data screening method, apparatus and terminal
CN108664989A (en) * 2018-03-27 2018-10-16 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN108629043A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 Extracting method, device and the storage medium of webpage target information
CN108734212A (en) * 2018-05-17 2018-11-02 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of determining classification results

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ALI DIBA 等: "Deep visual words: Improved fisher vector for image classification", 《2017 FIFTEENTH IAPR INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS》 *
QIMIN CHENG 等: "A survey and analysis on automatic image annotation", 《PATTERN RECOGNITION》 *
张震宇: "Research on diversification of image retrieval based on semantic distance", China Master's Theses Full-text Database, Information Science and Technology *
胡琦瑶: "Research on image retrieval methods based on weakly supervised deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543920A (en) * 2019-09-12 2019-12-06 北京达佳互联信息技术有限公司 Performance detection method and device of image recognition model, server and storage medium
CN110543920B (en) * 2019-09-12 2022-04-22 北京达佳互联信息技术有限公司 Performance detection method and device of image recognition model, server and storage medium
CN113159921A (en) * 2021-04-23 2021-07-23 上海晓途网络科技有限公司 Overdue prediction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109657710B (en) 2022-01-21

Similar Documents

Publication Publication Date Title
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN107391760B (en) User interest recognition methods, device and computer readable storage medium
CN107766929B (en) Model analysis method and device
CN109857860A (en) File classification method, device, computer equipment and storage medium
CN109471938A (en) A kind of file classification method and terminal
CN110309514A (en) A kind of method for recognizing semantics and device
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN109299344A (en) The generation method of order models, the sort method of search result, device and equipment
Lei et al. Patent analytics based on feature vector space model: A case of IoT
CN113407694B (en) Method, device and related equipment for detecting ambiguity of customer service robot knowledge base
CN107436875A (en) File classification method and device
CN107168992A (en) Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN109933660B (en) API information search method towards natural language form based on handout and website
CN109872162A (en) A kind of air control classifying identification method and system handling customer complaint information
CN110163647A (en) A kind of data processing method and device
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN108090099B (en) Text processing method and device
CN108734212A (en) A kind of method and relevant apparatus of determining classification results
CN108959265A (en) Cross-domain texts sensibility classification method, device, computer equipment and storage medium
Li et al. Dating ancient paintings of Mogao Grottoes using deeply learnt visual codes
CN110287311A (en) File classification method and device, storage medium, computer equipment
CN112836509A (en) Expert system knowledge base construction method and system
CN110458600A (en) Portrait model training method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant