CN109657710B - Data screening method and device, server and storage medium - Google Patents

Publication number
CN109657710B
Authority
CN
China
Prior art keywords
original data
word vector
word
data
category label
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811489982.XA
Other languages
Chinese (zh)
Other versions
CN109657710A (en)
Inventor
张志伟
吴丽军
李铅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201811489982.XA
Publication of CN109657710A
Application granted
Publication of CN109657710B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data screening method, a data screening device, a server and a storage medium, and belongs to the field of internet. The method comprises the following steps: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data. By introducing a word segmentation tool and a word vector model, text information in the original data can be represented in a vector form which can be processed by a computer, so that the cost caused by manual labeling is reduced, and the utilization rate of massive original data is increased.

Description

Data screening method and device, server and storage medium
Technical Field
The present disclosure relates to the field of internet, and in particular, to a data screening method, apparatus, server, and storage medium.
Background
In the related art, deep learning is widely applied in the fields of natural language processing, text translation and the like, wherein the accuracy of a deep learning model depends on the scale of training data, and original data from the internet needs to be screened for obtaining the training data.
Taking image classification as an example, when a deep model is trained, the original data must first be labeled manually to obtain enough labeled data, and the training data is then screened out from the labeled data.
However, in the above process, obtaining training data on the order of thousands of samples requires preparing 10 to 20 pieces of labeled data for each piece of training data, so the labor cost of labeling is very high. Because human resources are limited, only a fraction of the available data can be labeled manually, and the massive raw data from the internet is not fully utilized.
Disclosure of Invention
The present disclosure provides a data screening method, an apparatus, a server and a storage medium, which can overcome the problems of high labor cost for data labeling and insufficient data utilization.
According to a first aspect of the embodiments of the present disclosure, there is provided a data screening method, including:
classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
In one possible implementation, based on the word segmentation tool and the word vector model, obtaining a first word vector of each category label and a second word vector of text information in each original data includes:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and the word vector of the at least one word;
and obtaining the average vector of the word vectors of the at least one word as the second word vector.
In one possible embodiment, determining the target data from the plurality of raw data based on the first word vector of the respective category label and the second word vector of the text information in each raw data comprises:
for each original data, obtaining the cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data;
and determining the original data corresponding to the cosine distance smaller than the preset value as the target data.
In a possible embodiment, the classification result further comprises at least one prediction probability, each prediction probability being used to indicate a likelihood that one original data belongs to one class label.
In one possible embodiment, classifying the plurality of raw data by using the classification model, and obtaining the classification result of each raw data includes:
for each original data, inputting the original data into the classification model, and outputting the prediction probability of the original data belonging to each class label, wherein each prediction probability corresponds to one class label;
and obtaining the class label corresponding to each prediction probability that meets a second preset condition as the at least one class label of the original data.
In a possible implementation manner, obtaining, as at least one class label of the original data, a class label corresponding to the prediction probability meeting the second preset condition includes:
when the maximum value of the prediction probabilities is greater than a probability threshold, obtaining each class label whose prediction probability is greater than the probability threshold as the at least one class label of the original data; or,
when the maximum value of the prediction probabilities is less than or equal to the probability threshold, obtaining the class label corresponding to the maximum prediction probability as the class label of the original data.
According to a second aspect of the embodiments of the present disclosure, there is provided a data screening apparatus, the apparatus including:
the classification unit is configured to classify a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
the obtaining unit is configured to obtain first word vectors of various category labels and second word vectors of text information in each original data based on a word segmentation tool and a word vector model;
and the determining unit is configured to determine target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
In one possible embodiment, the obtaining unit is further configured to perform:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and the word vector of the at least one word;
and obtaining the average vector of the word vectors of the at least one word as the second word vector.
In a possible embodiment, the determining unit is further configured to perform:
for each original data, obtaining the cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data;
and determining the original data corresponding to the cosine distance smaller than the preset value as the target data.
In a possible embodiment, the classification result further comprises at least one prediction probability, each prediction probability being used to indicate a likelihood that one original data belongs to one class label.
In one possible embodiment, the classification unit comprises:
an output subunit configured to perform, for each raw data, inputting the raw data into the classification model, and outputting prediction probabilities of the raw data belonging to each class label, each prediction probability corresponding to one class label;
and the obtaining subunit is configured to obtain the class label corresponding to each prediction probability that meets a second preset condition as the at least one class label of the original data.
In one possible embodiment, the obtaining subunit is further configured to perform:
when the maximum value of the prediction probabilities is greater than a probability threshold, obtaining each class label whose prediction probability is greater than the probability threshold as the at least one class label of the original data; or,
when the maximum value of the prediction probabilities is less than or equal to the probability threshold, obtaining the class label corresponding to the maximum prediction probability as the class label of the original data.
According to a third aspect of embodiments of the present disclosure, there is provided a server, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to:
classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a server, enable the server to perform a data screening method, the method comprising:
classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
According to a fifth aspect of embodiments of the present disclosure, there is provided an application program comprising one or more instructions which, when executed by a processor of a server, enable the server to perform a method of data screening, the method comprising:
classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the original data are classified by a classification model to obtain a classification result for each original data; a first word vector of each category label and a second word vector of each original data are obtained based on a word segmentation tool and a word vector model; and the original data meeting a first preset condition are determined as target data. Introducing the word segmentation tool and the word vector model allows the text information in the original data to be represented in a vector form that a computer can process, which reduces the cost of manual labeling, avoids limiting the utilization of the original data by the available human resources, and increases the utilization of massive original data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of data screening in accordance with an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of data screening in accordance with an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a method of data screening in accordance with an exemplary embodiment.
Fig. 4 is a block diagram illustrating a logical structure of a data screening apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a logical structure of a server in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a data screening method according to an exemplary embodiment. As shown in Fig. 1, the data screening method is used in a server and includes the following steps:
in step 101, the server classifies a plurality of original data by using a classification model to obtain a classification result of each original data, each original data includes text information and image information, the classification model is used for classifying the image information, and the classification result includes at least one category label.
In step 102, the server obtains a first word vector of each category label and a second word vector of text information in each original data based on the word segmentation tool and the word vector model.
In step 103, the server determines target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, and the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
According to the method provided by the embodiment of the disclosure, the original data are classified by a classification model to obtain a classification result for each original data; a first word vector of each category label and a second word vector of each original data are obtained based on a word segmentation tool and a word vector model; and the original data meeting the first preset condition are determined as target data. Introducing the word segmentation tool and the word vector model allows the text information in the original data to be represented in a vector form that a computer can process, which reduces the cost of manual labeling, avoids limiting the utilization of the original data by the available human resources, and increases the utilization of massive original data.
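The three steps of Fig. 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `classify`, `segment` and `embed`, the `(image, text)` record layout, and the convention that cosine distance equals 1 minus cosine similarity are all hypothetical choices made for the example.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Record = Tuple[str, str]  # (image reference, text information); hypothetical layout


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Assumed convention: 1 - cosine similarity (smaller means closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def screen(
    records: Sequence[Record],
    classify: Callable[[str], List[str]],  # step 101: image -> category labels
    segment: Callable[[str], List[str]],   # word segmentation tool
    embed: Callable[[str], Vector],        # word vector model
    preset_value: float,
) -> List[Tuple[str, str, List[str]]]:
    """Return the records whose text agrees with at least one predicted label."""
    targets = []
    for image, text in records:
        labels = classify(image)
        words = segment(text)
        if not words:
            continue  # no text to compare against the labels
        text_vector = mean_vector([embed(word) for word in words])  # step 102
        kept = [
            label
            for label in labels
            if cosine_distance(text_vector, embed(label)) < preset_value  # step 103
        ]
        if kept:
            targets.append((image, text, kept))
    return targets
```

Passing the three components as callables keeps the sketch independent of any particular classification model, segmentation tool, or word vector model.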
In one possible implementation, based on the word segmentation tool and the word vector model, obtaining a first word vector of each category label and a second word vector of text information in each original data includes:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and the word vector of the at least one word;
and obtaining the average vector of the word vectors of the at least one word as the second word vector.
In one possible embodiment, determining the target data from the plurality of raw data based on the first word vector of the respective category label and the second word vector of the text information in each raw data comprises:
for each original data, obtaining the cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data;
and determining the original data corresponding to the cosine distance smaller than the preset value as the target data.
In a possible embodiment, the classification result further comprises at least one prediction probability, each prediction probability being used to indicate a likelihood that one original data belongs to one class label.
In one possible embodiment, classifying the plurality of raw data by using the classification model, and obtaining the classification result of each raw data includes:
for each original data, inputting the original data into the classification model, and outputting the prediction probability of the original data belonging to each class label, wherein each prediction probability corresponds to one class label;
and obtaining the class label corresponding to each prediction probability that meets a second preset condition as the at least one class label of the original data.
In a possible implementation manner, obtaining, as at least one class label of the original data, a class label corresponding to the prediction probability meeting the second preset condition includes:
when the maximum value of the prediction probabilities is greater than a probability threshold, obtaining each class label whose prediction probability is greater than the probability threshold as the at least one class label of the original data; or,
when the maximum value of the prediction probabilities is less than or equal to the probability threshold, obtaining the class label corresponding to the maximum prediction probability as the class label of the original data.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 2 is a flowchart illustrating a data screening method according to an exemplary embodiment. As shown in Fig. 2, the data screening method is used in a server and includes the following steps:
in step 201, the server inputs the raw data into a classification model for each raw data, and outputs a prediction probability that the raw data belongs to each class label, wherein each prediction probability corresponds to one class label.
Each piece of original data may include text information and image information. The data scale of the plurality of pieces of original data may be on the order of tens of millions or hundreds of millions, and is not specifically limited in the embodiments of the present disclosure. Optionally, the plurality of pieces of original data may be randomly acquired from a UGC (user generated content) website platform or randomly extracted from an existing database; the embodiments of the present disclosure do not specifically limit the manner of acquiring the original data.
Optionally, the classification model may classify the image information input to the model through a convolutional neural network: a feature map of each original data is obtained through a plurality of convolutional layers, the feature map is processed nonlinearly by an activation function, and the nonlinearly processed feature map is input into a decision network, which outputs the class labels and prediction probabilities. The activation function may be a sigmoid function, a tanh function, or a ReLU function.
The category label may be a label indicating the category of the image information input to the model, for example "cat", "dog", "monkey" or "person". The prediction probability may be a numerical value indicating the probability that the original data belongs to a certain category label. For example, a prediction probability of 0.8 for the category label "person" means that the classification model predicts an 80% probability that the original data is an image of a person.
Fig. 3 is a schematic diagram of a data screening method according to an exemplary embodiment. Referring to Fig. 3, in step 201, assume that the classification model has L class labels. Taking the ith original data as an example, the ith original data is input into the classification model, which outputs L prediction probabilities, each indicating the possibility that the ith original data belongs to one class label. Here L and i are positive integers and the ith original data is any one of the plurality of original data; each original data may undergo the same classification process, which is not described herein again.
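For reference, the three activation functions named above can be written in a few lines. This is a standalone sketch in plain Python; the helper `apply_activation` and the list-of-lists feature map are hypothetical stand-ins for whatever tensor type a real convolutional network would use.

```python
import math
from typing import Callable, List


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent, mapping inputs to (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)


def apply_activation(
    feature_map: List[List[float]], activation: Callable[[float], float]
) -> List[List[float]]:
    """Apply the activation element-wise to a 2-D feature map."""
    return [[activation(value) for value in row] for row in feature_map]
```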
In step 202, when the maximum value of the prediction probabilities is greater than the probability threshold, the server obtains at least one category label of the original data, where the category label corresponds to the prediction probability greater than the probability threshold.
Optionally, the probability threshold may be a default value of the server, or may be obtained according to a preset rule. Based on the above example, the preset rule may be to take the median of the L prediction probabilities as the probability threshold, or to take their average as the probability threshold. The embodiments of the present disclosure do not specifically limit the manner of obtaining the probability threshold.
Step 202 is one possible implementation of obtaining at least one category label of any original data: the category labels with higher prediction probability, and hence higher classification accuracy, are screened out from the output of the classification model. In some embodiments, step 202 may be replaced by the following: when the maximum value of the prediction probabilities is less than or equal to the probability threshold, the server obtains the class label corresponding to the maximum prediction probability as the class label of the original data. That is, if all L prediction probabilities of an original data are less than or equal to the probability threshold, the server takes the class label with the maximum prediction probability so that the original data is not left without a class label.
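The threshold rules mentioned above (a server default, the median, or the average of the L prediction probabilities) could be sketched as follows. The function name, the `rule` strings and the default of 0.5 are assumptions made for the example; the source does not specify them.

```python
from statistics import mean, median
from typing import Sequence


def probability_threshold(
    probabilities: Sequence[float], rule: str = "median", default: float = 0.5
) -> float:
    """Derive the probability threshold for one original data.

    rule="median"  -> median of the L prediction probabilities
    rule="mean"    -> average of the L prediction probabilities
    rule="default" -> fixed server-side value (0.5 here is an assumed value)
    """
    if rule == "median":
        return median(probabilities)
    if rule == "mean":
        return mean(probabilities)
    if rule == "default":
        return default
    raise ValueError(f"unknown rule: {rule!r}")
```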
Thus, the class label obtained in step 202 can be expressed by the following formula:
$$label_i = \{\, l \mid prediction_i^l > prob_{threshold} \,\}$$
where $label_i$ is the at least one category label corresponding to the ith original data, $prediction_i^l$ is the prediction probability of the lth unscreened category label of the ith original data, and $prob_{threshold}$ is the probability threshold.
Accordingly, the class label obtained in the alternative of step 202 can be expressed by the following formula:
$$label_i = \arg\max(prediction_i)$$
where $label_i$ is the category label corresponding to the ith original data, $prediction_i$ is the set of prediction probabilities of the unscreened category labels of the ith original data, and the argmax() function returns the index position of the maximum value of its input.
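The two selection rules (step 202 and its alternative) can be combined into one small function. This is a sketch only; the dictionary from label to prediction probability and the function name `select_labels` are assumptions, not part of the source.

```python
from typing import Dict, List


def select_labels(probabilities: Dict[str, float], prob_threshold: float) -> List[str]:
    """Pick the category labels of one original data.

    If the largest prediction probability exceeds the threshold, keep every
    label whose probability exceeds it; otherwise fall back to the single
    label with the largest probability (the argmax case).
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > prob_threshold:
        return [label for label, p in probabilities.items() if p > prob_threshold]
    return [best]
```

With the fallback in place, every original data ends up with at least one category label, which is the stated purpose of the alternative rule.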
In step 203, the server extracts at least one word in the text information of each original data by using a word segmentation tool.
The word segmentation tool is used to extract the words in the text information. For example, if the text information of the ith original data is "I like hot pot", processing it with the word segmentation tool extracts the three words "I", "like" and "hot pot". The word segmentation tool may be jieba or the like; the embodiments of the present disclosure do not specifically limit the content of the text information or the implementation of the word segmentation tool.
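A word segmentation step of this kind might look like the sketch below. It assumes the third-party jieba package and its `jieba.lcut` function when available, and falls back to whitespace splitting otherwise; the wrapper name `segment` and the `cutter` parameter are hypothetical.

```python
from typing import Callable, List, Optional


def segment(text: str, cutter: Optional[Callable[[str], List[str]]] = None) -> List[str]:
    """Split text information into words.

    cutter: any callable mapping a string to a list of words. When omitted,
    jieba.lcut is used if jieba is installed (an assumption of this sketch),
    otherwise plain whitespace splitting.
    """
    if cutter is None:
        try:
            import jieba  # third-party Chinese word segmentation package

            cutter = jieba.lcut
        except ImportError:
            cutter = str.split
    # Drop empty or whitespace-only tokens that some cutters emit.
    return [word for word in cutter(text) if word.strip()]
```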
In step 204, the server inputs each category label and the at least one word into a word vector model, outputting a first word vector and a word vector for the at least one word.
The word vector model may obtain the word vectors of the input words through word embedding, so that the text information is represented in a vector form that a computer can process. For example, the word vector model may be a Chinese word vector model such as chinese word2Vector. The first word vectors are the L word vectors corresponding to the L category labels, and the at least one word is the word or words that the server extracted from the text information of each piece of original data with the word segmentation tool in step 203.
In step 205, the server obtains an average vector of the word vectors of the at least one word as a second word vector.
The second word vector is the word vector corresponding to the text information in each original data, and may be expressed as follows:
$$Vector_i^d = \frac{1}{\#Word_i} \sum_{w \in Word_i} Embedding(w)$$
where $Vector_i^d$ is the second word vector, of dimension d, of the ith original data; $\#Word_i$ is the number of words obtained after word segmentation of the text information of the ith original data; $w$ ranges over those words; Embedding is the word vector model; and d is the dimension of a word vector.
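The averaging in step 205 can be illustrated with a toy lookup table. In this sketch the `embedding` dictionary stands in for the word vector model, and skipping words that have no vector is an assumption of the example rather than something the source states.

```python
from typing import Dict, List, Sequence


def second_word_vector(
    words: Sequence[str], embedding: Dict[str, List[float]]
) -> List[float]:
    """Average the word vectors of the segmented words of one text.

    Words missing from the embedding table are skipped (assumed behaviour).
    """
    vectors = [embedding[word] for word in words if word in embedding]
    if not vectors:
        raise ValueError("no word of the text has a word vector")
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```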
Through the step 204 and the step 205, the word vector of each original data and the word vector of each category label can be obtained, so that whether the first word vector and the second word vector meet the first preset condition can be judged by adopting the cosine distance through the following step 206, the labor cost consumed by manual labeling is avoided, the limitation on the utilization rate of the original data due to limited human resources is also avoided, and the utilization rate of massive original data is increased.
In step 206, the server obtains, for each original data, a cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data.
In the above process, taking the ith original data as an example, suppose the category labels output for the ith original data through steps 201 and 202 are "cat", "dog" and "monkey". Denote the first word vector of the category label "cat" as $Cat\_vector_i$, that of "dog" as $Dog\_vector_i$, that of "monkey" as $Monkey\_vector_i$, and the second word vector of the ith original data as $Vector_i$. Then the cosine distances between the second word vector and the first word vectors of the 3 category labels are: $distance_i = \cos(Vector_i, Cat\_vector_i) = 0.9$, $distance_i = \cos(Vector_i, Dog\_vector_i) = 0.6$, and $distance_i = \cos(Vector_i, Monkey\_vector_i) = 0.3$. Only the ith original data is taken as an example here; similar steps may be performed on each original data to obtain the cosine distance between its second word vector and the first word vector of each corresponding category label, which is not described herein again.
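The source writes the distance as cos(·,·) without defining it further. The sketch below assumes the common convention that cosine distance is 1 minus cosine similarity, so a smaller value means the two vectors point in more similar directions; the function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed convention: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)
```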
In step 207, the server determines the original data corresponding to the cosine distance smaller than the preset value as the target data.
The preset value may be a default threshold of the server; for example, the preset value may be 0.5. Based on the above example, the cosine distance between the second word vector and the first word vector of the category label "monkey" is smaller than the preset value, that is, distance_i = cos(Vector_i, Monkey_vector_i) = 0.3 < 0.5, so the second word vector of the i-th original data and the first word vector of the category label "monkey" are considered to meet the first preset condition. The i-th original data is therefore acquired as target data, with "monkey" as the category label corresponding to the target data. If all cosine distances of a piece of original data are greater than or equal to the preset value, that original data is not acquired as target data: its classification result is considered a prediction error, and the original data is treated as noise data.
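The threshold comparison of step 207 can be sketched in Python using the values from the example above (0.9, 0.6, and 0.3 against a preset value of 0.5). The function name `matching_labels` and the dictionary layout are hypothetical choices made for illustration only.

```python
from typing import Dict, List

PRESET_VALUE = 0.5  # example default threshold taken from the text


def matching_labels(distances: Dict[str, float],
                    preset_value: float = PRESET_VALUE) -> List[str]:
    """Return the category labels whose cosine distance is below the preset value."""
    return [label for label, dist in distances.items() if dist < preset_value]


# Values from the worked example for the i-th original data.
distances_i = {"cat": 0.9, "dog": 0.6, "monkey": 0.3}
labels_i = matching_labels(distances_i)
is_target_i = bool(labels_i)   # target data if at least one label matches
print(labels_i, is_target_i)   # ['monkey'] True

# A record whose distances are all >= the preset value is treated as noise data.
noise_labels = matching_labels({"cat": 0.9, "dog": 0.6})
print(noise_labels)            # []
```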
According to the method provided by the embodiment of the disclosure, a classification model is adopted to classify the original data and obtain the classification result of each original data, and the first word vector of each category label and the second word vector of each original data are obtained based on a word segmentation tool and a word vector model, so that the original data meeting the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the original data can be represented in a vector form that a computer can process, which reduces the cost of manual labeling, avoids the limitation that limited human resources place on the utilization rate of the original data, and increases the utilization rate of massive original data. Further, the category labels whose prediction probabilities are greater than the probability threshold are acquired as the at least one category label of the original data, so that noise data that is difficult to classify is filtered out of the original data. In addition, the average vector of the word vectors of the at least one word in the text information is obtained as the second word vector, so that each original data can be described by one word vector; the target data is then determined according to the cosine distance between the first word vector and the second word vector, which makes the data screening more accurate.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 4 is a block diagram illustrating a logical structure of a data screening apparatus according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a classification unit 401, an obtaining unit 402, and a determining unit 403:
a classification unit 401 configured to perform classification on a plurality of original data by using a classification model, to obtain a classification result of each original data, where each original data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;
an obtaining unit 402 configured to perform obtaining a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
a determining unit 403 configured to perform determining, from the plurality of original data, target data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.
According to the device provided by the embodiment of the disclosure, a classification model is adopted to classify the original data and obtain the classification result of each original data, and the first word vector of each category label and the second word vector of each original data are obtained based on a word segmentation tool and a word vector model, so that the original data meeting the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model are introduced, the text information in the original data can be represented in a vector form that a computer can process, which reduces the cost of manual labeling, avoids the limitation that limited human resources place on the utilization rate of the original data, and increases the utilization rate of massive original data.
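The division of work among the three units can be sketched in Python as one function whose collaborators are passed in as callables. This is an illustrative sketch only: the function `screen`, the toy stand-ins for the classification model, word segmentation tool, and word vector model, and the toy distance (taken here, as an assumption, to be one minus the cosine similarity) are all hypothetical; any other distance definition can be passed in.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def screen(records: Sequence[Tuple[str, object]],
           classify: Callable[[object], List[str]],
           segment: Callable[[str], List[str]],
           embed: Callable[[str], Vector],
           distance: Callable[[Vector, Vector], float],
           preset_value: float) -> List[Tuple[int, str]]:
    """Return (record index, category label) pairs that pass the screen."""
    targets = []
    for index, (text, image) in enumerate(records):
        labels = classify(image)  # role of the classification unit 401
        # Role of the obtaining unit 402: word vectors, then their average.
        word_vectors = [embed(word) for word in segment(text)]
        if not word_vectors:
            continue  # assumption: records without usable text are skipped
        dim = len(word_vectors[0])
        second = [sum(vec[k] for vec in word_vectors) / len(word_vectors)
                  for k in range(dim)]
        # Role of the determining unit 403: compare against the preset value.
        for label in labels:
            if distance(second, embed(label)) < preset_value:
                targets.append((index, label))
    return targets


# Toy stand-ins, all hypothetical.
toy_vectors = {"cat": [1.0, 0.0], "monkey": [0.0, 1.0], "banana": [0.0, 1.0]}


def toy_distance(u: Vector, v: Vector) -> float:
    # Assumption: distance taken as 1 - cosine similarity.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


targets = screen(records=[("monkey banana", None)],
                 classify=lambda image: ["cat", "monkey"],
                 segment=str.split,
                 embed=toy_vectors.__getitem__,
                 distance=toy_distance,
                 preset_value=0.5)
print(targets)  # [(0, 'monkey')]
```

Passing the model, tool, and distance in as parameters keeps the sketch independent of any particular library.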
In a possible implementation, the obtaining unit 402 is further configured to perform:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and the word vector of the at least one word;
and obtaining the average vector of the word vectors of the at least one word as the second word vector.
In a possible implementation, the determining unit 403 is further configured to perform:
for each original data, obtaining the cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data;
and determining the original data corresponding to the cosine distance smaller than the preset value as the target data.
In a possible embodiment, the classification result further comprises at least one prediction probability, each prediction probability being used to indicate a likelihood that one original data belongs to one class label.
In a possible implementation, based on the apparatus composition of fig. 4, the classification unit 401 includes:
an output subunit configured to perform, for each raw data, inputting the raw data into the classification model, and outputting prediction probabilities of the raw data belonging to each class label, each prediction probability corresponding to one class label;
and an obtaining subunit configured to perform obtaining the category label corresponding to each prediction probability that meets a second preset condition as the at least one category label of the original data.
In one possible embodiment, the obtaining subunit is further configured to perform:
when the maximum value among the prediction probabilities is larger than a probability threshold, acquiring the category labels corresponding to the prediction probabilities larger than the probability threshold as the at least one category label of the original data; or,
when the maximum value among the prediction probabilities is smaller than or equal to the probability threshold, acquiring the category label corresponding to the maximum value among the prediction probabilities as the category label of the original data.
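The two branches handled by the obtaining subunit can be sketched in Python as follows. The function name `select_labels`, the probability values, and the threshold of 0.3 are hypothetical and chosen only to exercise both branches.

```python
from typing import Dict, List


def select_labels(probabilities: Dict[str, float],
                  probability_threshold: float) -> List[str]:
    """Pick category labels from per-label prediction probabilities."""
    if not probabilities:
        raise ValueError("at least one prediction probability is required")
    best_label = max(probabilities, key=probabilities.get)
    if probabilities[best_label] > probability_threshold:
        # Maximum exceeds the threshold: keep every label above the threshold.
        return [label for label, p in probabilities.items()
                if p > probability_threshold]
    # Maximum is at or below the threshold: keep only the most probable label.
    return [best_label]


# Hypothetical prediction probabilities and threshold.
print(select_labels({"cat": 0.5, "dog": 0.4, "monkey": 0.1}, 0.3))  # ['cat', 'dog']
print(select_labels({"cat": 0.25, "dog": 0.2}, 0.3))                # ['cat']
```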
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 5 is a block diagram illustrating a logical structure of a server according to an exemplary embodiment. The server 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 501 and one or more memories 502, where the memory 502 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 501 to implement the data screening method provided by the above method embodiments. The server 500 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, such as a memory 502, comprising instructions executable by a processor 501 of a server 500 to perform the data screening method described above, the method comprising: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition. Optionally, the instructions may also be executed by the processor 501 of the server 500 to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application program comprising one or more instructions executable by the processor 501 of the server 500 to perform the data screening method described above, the method comprising: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition. Optionally, the instructions may also be executed by the processor 501 of the server 500 to perform other steps involved in the exemplary embodiments described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A method of data screening, the method comprising:
classifying image information in a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein a first preset condition is met between the second word vector of the text information of the target data and the first word vector of the category label;
the classifying the image information in the plurality of original data by using the classification model to obtain the classification result of each original data comprises:
for each original data, inputting the image information in the original data into the classification model, and outputting a prediction probability that the image information belongs to each class label, wherein each prediction probability corresponds to one class label and is used for indicating the possibility that the image information belongs to the class label;
when the maximum value in the prediction probabilities is smaller than or equal to a probability threshold value, acquiring a category label corresponding to the maximum value in the prediction probabilities as a category label of the image information;
the determining, from the plurality of original data, target data based on the first word vector of the respective category label and the second word vector of the text information in each of the original data includes:
for each original data, obtaining the cosine distance between a second word vector of text information in the original data and a first word vector of each category label corresponding to image information in the original data;
and determining original data corresponding to the cosine distance smaller than a preset value as the target data.
2. The data screening method of claim 1, wherein the obtaining a first word vector of each category label and a second word vector of text information in each original data based on the word segmentation tool and the word vector model comprises:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of the at least one word;
obtaining an average vector of the word vectors of the at least one word as the second word vector.
3. The data screening method of claim 1, wherein the classifying the image information in the plurality of original data by using the classification model to obtain the classification result of each original data, further comprises:
and when the maximum value in the prediction probabilities is larger than the probability threshold, acquiring the category label corresponding to the prediction probability larger than the probability threshold as at least one category label of the image information in the original data.
4. An apparatus for data screening, the apparatus comprising:
a classification unit configured to perform classifying image information in a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;
an obtaining unit configured to perform obtaining a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;
a determining unit configured to perform determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein a first preset condition is met between the second word vector of the text information of the target data and the first word vector of the category label;
the classification unit includes:
an output subunit, configured to perform, for each of the raw data, inputting the image information in the raw data into the classification model, and outputting a prediction probability that the image information belongs to each class label, where each prediction probability corresponds to one class label for indicating a possibility that the image information belongs to the class label;
an obtaining subunit, configured to perform, when a maximum value of the prediction probabilities is smaller than or equal to a probability threshold, obtaining a category label corresponding to the maximum value of the prediction probabilities as a category label of the image information;
the determining unit is further configured to perform:
for each original data, obtaining the cosine distance between a second word vector of text information in the original data and a first word vector of each category label corresponding to image information in the original data;
and determining original data corresponding to the cosine distance smaller than a preset value as the target data.
5. The data screening apparatus according to claim 4, wherein the obtaining unit is further configured to perform:
for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;
inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of the at least one word;
obtaining an average vector of the word vectors of the at least one word as the second word vector.
6. The data screening apparatus of claim 4, wherein the obtaining subunit is further configured to perform:
and when the maximum value in the prediction probabilities is larger than the probability threshold, acquiring the category label corresponding to the prediction probability larger than the probability threshold as at least one category label of the image information in the original data.
7. A server, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform operations to implement the data screening method of any one of claims 1 to 3.
8. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform operations to implement the data screening method of any one of claims 1 to 3.
CN201811489982.XA 2018-12-06 2018-12-06 Data screening method and device, server and storage medium Active CN109657710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811489982.XA CN109657710B (en) 2018-12-06 2018-12-06 Data screening method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN109657710A (en) 2019-04-19
CN109657710B (en) 2022-01-21

Family

ID=66112715

Country Status (1)

Country Link
CN (1) CN109657710B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant