CN109657710B - Data screening method and device, server and storage medium

Info

Publication number: CN109657710B
Application number: CN201811489982.XA
Authority: CN
Inventors: 张志伟; 吴丽军; 李铅
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2018-12-06
Filing date: 2018-12-06
Publication date: 2022-01-21
Anticipated expiration: 2038-12-06
Also published as: CN109657710A

Abstract

The disclosure relates to a data screening method, a data screening device, a server and a storage medium, and belongs to the field of internet. The method comprises the following steps: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data. By introducing a word segmentation tool and a word vector model, text information in the original data can be represented in a vector form which can be processed by a computer, so that the cost caused by manual labeling is reduced, and the utilization rate of massive original data is increased.

Description

Data screening method and device, server and storage medium

Technical Field

The present disclosure relates to the field of internet, and in particular, to a data screening method, apparatus, server, and storage medium.

Background

In the related art, deep learning is widely applied in the fields of natural language processing, text translation and the like, wherein the accuracy of a deep learning model depends on the scale of training data, and original data from the internet needs to be screened for obtaining the training data.

Taking image classification as an example, when a deep model is trained, the original data must first be manually labeled to obtain enough labeled data, and the training data is then screened out from the labeled data.

However, in the above process, obtaining training data on the order of thousands of samples requires preparing 10 to 20 pieces of labeled data for each piece of training data, so the labor cost of labeling is very high. Moreover, because human resources are limited, only a small part of the available data can be labeled manually, so the massive raw data from the internet is not fully utilized.

Disclosure of Invention

The present disclosure provides a data screening method, an apparatus, a server and a storage medium, which can overcome the problems of high labor cost for data labeling and insufficient data utilization.

According to a first aspect of the embodiments of the present disclosure, there is provided a data screening method, including:

classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;

acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;

and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.

In one possible implementation, based on the word segmentation tool and the word vector model, obtaining a first word vector of each category label and a second word vector of text information in each original data includes:

for each original data, extracting at least one word in the text information of the original data by adopting the word segmentation tool;

inputting each category label and the at least one word into the word vector model, and outputting the first word vector and the word vector of the at least one word;

and obtaining the average vector of the word vectors of the at least one word as the second word vector.

In one possible embodiment, determining the target data from the plurality of raw data based on the first word vector of the respective category label and the second word vector of the text information in each raw data comprises:

for each original data, obtaining the cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data;

and determining the original data corresponding to the cosine distance smaller than the preset value as the target data.

In a possible embodiment, the classification result further comprises at least one prediction probability, each prediction probability being used to indicate a likelihood that one original data belongs to one class label.

In one possible embodiment, classifying the plurality of raw data by using the classification model, and obtaining the classification result of each raw data includes:

for each original data, inputting the original data into the classification model, and outputting the prediction probability of the original data belonging to each class label, wherein each prediction probability corresponds to one class label;

and acquiring, as the at least one class label of the original data, the class label corresponding to each prediction probability that meets a second preset condition.

In a possible implementation manner, obtaining, as at least one class label of the original data, a class label corresponding to the prediction probability meeting the second preset condition includes:

when the maximum value of the prediction probabilities is greater than a probability threshold, acquiring, as the at least one category label of the original data, each category label whose prediction probability is greater than the probability threshold; or

when the maximum value of the prediction probabilities is less than or equal to the probability threshold, acquiring the class label corresponding to the maximum value of the prediction probabilities as the class label of the original data.

According to a second aspect of the embodiments of the present disclosure, there is provided a data filtering apparatus, the apparatus including:

the classification unit is configured to classify a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label;

the obtaining unit is configured to obtain first word vectors of various category labels and second word vectors of text information in each original data based on a word segmentation tool and a word vector model;

and the determining unit is configured to determine target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.

In one possible embodiment, the obtaining unit is further configured to perform:

In a possible embodiment, the determining unit is further configured to perform:

In one possible embodiment, the classification unit comprises:

an output subunit configured to perform, for each raw data, inputting the raw data into the classification model, and outputting prediction probabilities of the raw data belonging to each class label, each prediction probability corresponding to one class label;

and the obtaining subunit is configured to obtain, as the at least one category label of the original data, the category label corresponding to each prediction probability that meets a second preset condition.

In one possible embodiment, the obtaining subunit is further configured to perform:

According to a third aspect of embodiments of the present disclosure, there is provided a server, including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to:

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a server, enable the server to perform a data screening method, the method comprising:

According to a fifth aspect of embodiments of the present disclosure, there is provided an application program comprising one or more instructions which, when executed by a processor of a server, enable the server to perform a method of data screening, the method comprising:

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:

the method comprises the steps of classifying original data by adopting a classification model to obtain a classification result of each original data, acquiring a first word vector of a class label and a second word vector of the original data based on a word segmentation tool and a word vector model, determining the original data meeting a first preset condition as target data, and introducing the word segmentation tool and the word vector model to enable text information in the original data to be represented in a vector form capable of being processed by a computer, so that the cost brought by manual labeling is reduced, the limitation on the utilization rate of the original data caused by limited human resources is avoided, and the utilization rate of massive original data is increased.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flow chart illustrating a method of data screening in accordance with an exemplary embodiment.

FIG. 2 is a flow chart illustrating a method of data screening in accordance with an exemplary embodiment.

FIG. 3 is a schematic diagram illustrating a method of data screening in accordance with an exemplary embodiment.

Fig. 4 is a block diagram illustrating a logical structure of a data filtering apparatus according to an exemplary embodiment.

Fig. 5 is a block diagram illustrating a logical structure of a server in accordance with an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Fig. 1 is a flowchart illustrating a data filtering method according to an exemplary embodiment, where the data filtering method is used in a server, as shown in fig. 1, and includes the following steps:

in step 101, the server classifies a plurality of original data by using a classification model to obtain a classification result of each original data, each original data includes text information and image information, the classification model is used for classifying the image information, and the classification result includes at least one category label.

In step 102, the server obtains a first word vector of each category label and a second word vector of text information in each original data based on the word segmentation tool and the word vector model.

In step 103, the server determines target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, and the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.

According to the method provided by the embodiment of the disclosure, the classification model is adopted to classify the original data, so that the classification result of each original data is obtained, the first word vector of the category label and the second word vector of the original data are obtained based on the word segmentation tool and the word vector model, so that the original data meeting the first preset condition is determined as the target data, and the word segmentation tool and the word vector model are introduced, so that the text information in the original data can be represented in a vector form capable of being processed by a computer, so that the cost brought by manual labeling is reduced, the limitation of the utilization rate of the original data caused by limited human resources is avoided, and the utilization rate of massive original data is increased.
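Steps 101 to 103 can be sketched as a small pipeline. This is only an illustrative sketch: the names `RawItem`, `screen`, `classify`, `embed_label` and `embed_text` are hypothetical, the classifier and embeddings are toy stand-ins passed in as callables, and the cosine distance is assumed to be 1 minus the cosine similarity, which the disclosure does not spell out.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RawItem:
    """One piece of original data: text information plus image information."""
    text: str
    image: object  # opaque image payload; the sketch never inspects it


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    # Assumption: distance = 1 - cosine similarity.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def screen(
    items: List[RawItem],
    classify: Callable[[object], List[str]],
    embed_label: Callable[[str], Sequence[float]],
    embed_text: Callable[[str], Sequence[float]],
    preset_value: float,
) -> List[Tuple[RawItem, List[str]]]:
    """Keep items whose text vector is close to at least one predicted label."""
    targets = []
    for item in items:
        labels = classify(item.image)        # step 101: classify the image
        text_vec = embed_text(item.text)     # step 102: second word vector
        matched = [
            label for label in labels        # step 103: first preset condition
            if cosine_distance(text_vec, embed_label(label)) < preset_value
        ]
        if matched:
            targets.append((item, matched))
    return targets


# Toy stand-ins (hypothetical values, not from the disclosure).
label_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
text_vectors = {"my cat": [1.0, 0.0], "my dog": [0.0, 1.0]}
items = [RawItem("my cat", "img-a"), RawItem("my dog", "img-b")]

result = screen(
    items,
    classify=lambda image: ["cat"],  # pretend every image is classified as a cat
    embed_label=lambda label: label_vectors[label],
    embed_text=lambda text: text_vectors[text],
    preset_value=0.5,
)
```

In this toy run the second item is dropped because its text vector is far from the predicted label "cat", which mirrors how text information is used to check the image classification result.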

All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

Fig. 2 is a flowchart illustrating a data filtering method according to an exemplary embodiment, where the data filtering method is used in a server, as shown in fig. 2, and includes the following steps:

in step 201, the server inputs the raw data into a classification model for each raw data, and outputs a prediction probability that the raw data belongs to each class label, wherein each prediction probability corresponds to one class label.

Each piece of original data may include text information and image information. The plurality of original data may number in the tens of millions or hundreds of millions, and the embodiment of the present disclosure does not specifically limit this scale. Optionally, the plurality of original data may be acquired at random from a UGC (user generated content) website platform or extracted at random from an existing database; the embodiment of the present disclosure does not specifically limit how the plurality of original data is acquired.

Optionally, the classification model may classify the input image information through a convolutional neural network: a plurality of convolutional layers produce a feature map for each original data, an activation function applies nonlinear processing to the feature map, and the processed feature map is input into a decision network that outputs the class labels and prediction probabilities. The activation function may be a sigmoid function, a tanh function, or a ReLU function.

The category label may be a label indicating the category of the input image information, for example, "cat", "dog", "monkey" or "person". The prediction probability may be a numerical value indicating the probability of belonging to a certain class label; for example, the prediction probability of the original data belonging to the class label "person" may be 0.8, that is, the classification model predicts with 80% probability that the original data is an image of a person.
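The three activation functions named above can be written out directly. The following is a minimal sketch in plain Python; the helper `apply_activation` and the sample feature map are hypothetical, and a real convolutional network would use a deep learning framework rather than nested lists.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable form: avoids overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    return math.tanh(x)


def relu(x: float) -> float:
    return max(0.0, x)


def apply_activation(feature_map, fn):
    """Apply an activation function element-wise to a 2-D feature map."""
    return [[fn(value) for value in row] for row in feature_map]


# Hypothetical 2x2 feature map, used only to show the element-wise effect.
feature_map = [[-1.0, 0.0], [0.5, 2.0]]
activated = apply_activation(feature_map, relu)
```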

Fig. 3 is a schematic diagram of a data screening method according to an exemplary embodiment. Referring to Fig. 3, for step 201, assume that the classification model uses L class labels. Taking the i-th original data as an example, the server inputs the i-th original data into the classification model, which outputs L prediction probabilities, each indicating the likelihood that the i-th original data belongs to one class label. Here L and i are positive integers, and the i-th original data is any one of the plurality of original data. The same classification process may be applied to every original data and is not described again here.

In step 202, when the maximum value of the prediction probabilities is greater than the probability threshold, the server obtains at least one category label of the original data, where the category label corresponds to the prediction probability greater than the probability threshold.

Optionally, the probability threshold may be a default value of the server, or may be a value obtained according to a preset rule. Based on the above example, the preset rule may be to take the median of the L prediction probabilities as the probability threshold, or to take the average of the L prediction probabilities as the probability threshold. The embodiment of the present disclosure does not specifically limit how the probability threshold is obtained.

Step 202 above is one possible implementation of obtaining at least one category label for any original data: the output of the classification model is used to screen out the category labels with higher prediction probability and therefore higher classification accuracy. In some embodiments, step 202 may be replaced by the following: when the maximum value of the prediction probabilities is less than or equal to the probability threshold, the server obtains the class label corresponding to that maximum value as the class label of the original data. That is, if all L prediction probabilities of a certain original data are less than or equal to the probability threshold, the server takes the class label with the largest prediction probability so that the original data is not left without a class label.
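The two preset rules mentioned above (median or average of the L prediction probabilities) can be sketched with the Python standard library. The function name `probability_threshold`, its `rule` parameter, the default of 0.5 and the sample probabilities are all illustrative assumptions.

```python
from statistics import mean, median


def probability_threshold(probs, rule="median", default=0.5):
    """Derive a threshold from one item's L prediction probabilities."""
    if rule == "median":
        return median(probs)
    if rule == "mean":
        return mean(probs)
    if rule == "default":
        return default  # server-side default value (assumed here to be 0.5)
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical prediction probabilities for L = 3 class labels.
probs = [0.7, 0.2, 0.1]
median_threshold = probability_threshold(probs, "median")
mean_threshold = probability_threshold(probs, "mean")
```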

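Thus, the class label obtained in step 202 can be expressed by the following formula:

$$\mathrm{label}_i = \{\, l \mid \mathrm{prediction}_i^{\,l} > \mathrm{prob\_threshold},\ 1 \le l \le L \,\}$$

where $\mathrm{label}_i$ is the at least one category label corresponding to the i-th original data, $\mathrm{prediction}_i^{\,l}$ is the prediction probability that the i-th original data belongs to the l-th unscreened class label, and $\mathrm{prob\_threshold}$ is the probability threshold.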

Accordingly, the class label obtained in the alternative to step 202 can be expressed by the following formula:

$$\mathrm{label}_i = \operatorname{argmax}(\mathrm{prediction}_i)$$

where $\mathrm{label}_i$ is the at least one class label corresponding to the i-th original data, $\mathrm{prediction}_i$ is the set of prediction probabilities of the i-th original data over all unscreened category labels, and the argmax() function returns the index position of the maximum value of its input.
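Step 202 and its alternative can be combined into one selection routine: keep every label above the threshold when the best probability exceeds it, otherwise fall back to the single most probable label. The sketch below assumes the probabilities arrive as a plain list aligned with a list of label names; `select_labels` and the sample values are hypothetical.

```python
def select_labels(probs, labels, prob_threshold):
    """Pick class labels for one item from its per-label prediction probabilities."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    # Index of the maximum prediction probability (the argmax).
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > prob_threshold:
        # Step 202: every label whose probability exceeds the threshold.
        return [label for label, p in zip(labels, probs) if p > prob_threshold]
    # Alternative: no label clears the threshold, so keep only the argmax label.
    return [labels[best]]


labels = ["cat", "dog", "monkey", "person"]
confident = select_labels([0.7, 0.6, 0.55, 0.1], labels, prob_threshold=0.5)
fallback = select_labels([0.3, 0.2, 0.4, 0.1], labels, prob_threshold=0.5)
```

The fallback branch guarantees that every original data receives at least one class label, which is the stated purpose of the alternative to step 202.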

In step 203, the server extracts at least one word in the text information of each original data by using a word segmentation tool.

The word segmentation tool is used to extract the words in the text information. For example, if the text information of the i-th original data is 'I like hot pot', processing it with the word segmentation tool yields three words: 'I', 'like' and 'hot pot'. The word segmentation tool may be jieba or the like; the embodiment of the present disclosure does not specifically limit the content of the text information or the implementation of the word segmentation tool.
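To illustrate what a word segmentation tool does, the sketch below uses forward maximum matching against a tiny vocabulary. This is a deliberately simple stand-in and not how jieba works internally; the function `segment`, the vocabulary, and the Chinese sentence (assumed here to be the original of the translated example) are all illustrative assumptions.

```python
def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest known word."""
    max_len = max((len(word) for word in vocabulary), default=1)
    words, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical vocabulary covering the example sentence "I like hot pot".
vocabulary = {"我", "喜欢", "火锅"}
words = segment("我喜欢火锅", vocabulary)
```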

In step 204, the server inputs each category label and the at least one word into a word vector model, outputting a first word vector and a word vector for the at least one word.

The word vector model may obtain the word vector of each input word through word embedding, so that text information is represented in a vector form that can be processed by a computer. For example, the word vector model may be a Chinese word vector model such as Chinese Word2Vector. The first word vectors are the L word vectors corresponding to the L category labels, and the at least one word is the word or words that the server extracted from the text information of each original data with the word segmentation tool in step 203.

In step 205, the server obtains an average vector of the word vectors of the at least one word as a second word vector.

The second word vector is the word vector corresponding to the text information in each original data, and may be expressed as follows:

$$\mathrm{Vector}_i^{d} = \frac{1}{\#\mathrm{Word}_i} \sum_{w \in \mathrm{Word}_i} \mathrm{Embedding}(w)$$

where $\mathrm{Vector}_i^{d}$ is the d-dimensional second word vector of the i-th original data, $\#\mathrm{Word}_i$ is the number of words obtained by segmenting the text information of the i-th original data, the sum runs over the words $w$ obtained from that segmentation, Embedding is the word vector model, and d is the dimension of a word vector.
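The averaging in step 205 can be sketched as follows. The lookup table `toy_embedding` is a hypothetical stand-in for a trained word vector model, and its three-dimensional vectors are made up for illustration.

```python
def average_vector(word_vectors):
    """Element-wise mean of equal-length word vectors (the second word vector)."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must have the same dimension d")
    count = len(word_vectors)
    return [sum(vec[k] for vec in word_vectors) / count for k in range(dim)]


# Hypothetical embeddings for the three words of the example sentence.
toy_embedding = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "hot pot": [0.0, 0.0, 1.0],
}
words = ["I", "like", "hot pot"]
second_word_vector = average_vector([toy_embedding[w] for w in words])
```

Averaging gives every original data a single fixed-length vector regardless of how many words its text contains, which is what allows a direct comparison with the label vectors.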

Through the step 204 and the step 205, the word vector of each original data and the word vector of each category label can be obtained, so that whether the first word vector and the second word vector meet the first preset condition can be judged by adopting the cosine distance through the following step 206, the labor cost consumed by manual labeling is avoided, the limitation on the utilization rate of the original data due to limited human resources is also avoided, and the utilization rate of massive original data is increased.

In step 206, the server obtains, for each original data, a cosine distance between the second word vector of the original data and the first word vector of each category label corresponding to the original data.

In the above process, taking the i-th original data as an example, suppose the category labels output for the i-th original data through steps 201 and 202 are "cat", "dog" and "monkey". Denote the first word vector of the category label "cat" as Cat_vector_i, that of the category label "dog" as Dog_vector_i, that of the category label "monkey" as Monkey_vector_i, and the second word vector of the i-th original data as Vector_i. The cosine distances between the second word vector and the first word vectors of the 3 category labels are then: distance_i = cos(Vector_i, Cat_vector_i) = 0.9, distance_i = cos(Vector_i, Dog_vector_i) = 0.6, and distance_i = cos(Vector_i, Monkey_vector_i) = 0.3. Only the i-th original data is used as an example here; in practice, similar steps may be performed on each original data to obtain the cosine distance between its second word vector and the first word vector of each corresponding category label, which is not described again here.
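The distance computation in step 206 can be sketched as below. The disclosure does not give a formula for the cosine distance, so this sketch assumes the common definition of 1 minus the cosine similarity; the vectors are hypothetical and do not reproduce the 0.9, 0.6 and 0.3 of the example.

```python
import math


def cosine_similarity(u, v):
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    # Assumption: distance = 1 - similarity, so a smaller value means closer.
    return 1.0 - cosine_similarity(u, v)


# Hypothetical second word vector and first word vectors.
text_vector = [1.0, 1.0, 0.0]
label_vectors = {
    "cat": [1.0, 1.0, 0.0],
    "dog": [1.0, 0.0, 0.0],
    "monkey": [0.0, 0.0, 1.0],
}
distances = {
    label: cosine_distance(text_vector, vec)
    for label, vec in label_vectors.items()
}
```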

In step 207, the server determines the original data corresponding to the cosine distance smaller than the preset value as the target data.

The preset value may be a default threshold of the server; for example, the preset value may be 0.5. Based on the above example, the cosine distance between the second word vector and the first word vector of the category label "monkey" is smaller than the preset value, that is, distance_i = cos(Vector_i, Monkey_vector_i) = 0.3 < 0.5, so the second word vector of the i-th original data and the first word vector of the category label "monkey" are considered to meet the first preset condition. The i-th original data is therefore acquired as target data, with "monkey" as the category label corresponding to the target data. If all cosine distances of an original data are greater than or equal to the preset value, the classification result is considered a prediction error, and the original data is treated as noise data and is not acquired as target data.

According to the method provided by the embodiment of the disclosure, the classification model is adopted to classify the original data to obtain the classification result of each original data, and the first word vector of each category label and the second word vector of each original data are obtained based on the word segmentation tool and the word vector model, so that the original data meeting the first preset condition is determined as the target data. Because the word segmentation tool and the word vector model allow the text information in the original data to be represented in a vector form that can be processed by a computer, the cost of manual labeling is reduced, the utilization of the original data is no longer limited by human resources, and the utilization rate of massive original data is increased. Further, the category labels whose prediction probabilities are greater than the probability threshold are obtained as the at least one category label of the original data, which filters out noise data that is difficult to classify. In addition, the average vector of the word vectors of the at least one word in the text information is obtained as the second word vector, so that each original data can be described by one word vector, and the target data is determined according to the cosine distance between the first word vector and the second word vector, which makes the data screening more accurate.
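The decision in step 207 can be sketched with the distances from the example (0.9, 0.6, 0.3) and the preset value 0.5. The function name `labels_within` and the second, all-noise item are illustrative assumptions.

```python
def labels_within(distances, preset_value):
    """Return the category labels whose cosine distance is below the preset value."""
    return [label for label, d in distances.items() if d < preset_value]


# Distances of the i-th original data, taken from the example in the text.
distances_i = {"cat": 0.9, "dog": 0.6, "monkey": 0.3}
kept_labels = labels_within(distances_i, preset_value=0.5)
is_target = bool(kept_labels)  # target data if at least one label qualifies

# Hypothetical item whose distances all reach the preset value: noise data.
noise_labels = labels_within({"cat": 0.9, "dog": 0.6}, preset_value=0.5)
```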

Fig. 4 is a block diagram illustrating a logical structure of a data filtering apparatus according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a classification unit 401, an acquisition unit 402, and a determination unit 403:

a classification unit 401 configured to perform classification on a plurality of original data by using a classification model, to obtain a classification result of each original data, where each original data includes text information and image information, the classification model is used to classify the image information, and the classification result includes at least one category label;

an obtaining unit 402 configured to perform obtaining a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;

a determining unit 403 configured to perform determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition.

According to the device provided by the embodiment of the disclosure, the classification model is adopted to classify the original data, so that the classification result of each original data is obtained, the first word vector of the category label and the second word vector of the original data are obtained based on the word segmentation tool and the word vector model, so that the original data meeting the first preset condition is determined as the target data, and due to the introduction of the word segmentation tool and the word vector model, the text information in the original data can be represented in a vector form capable of being processed by a computer, so that the cost brought by manual labeling is reduced, the limitation of the utilization rate of the original data caused by limited human resources is avoided, and the utilization rate of massive original data is increased.

In a possible implementation, the obtaining unit 402 is further configured to perform:

In a possible implementation, the determining unit 403 is further configured to perform:

In a possible implementation, based on the apparatus composition of fig. 4, the classification unit 401 includes:

With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.

Fig. 5 is a block diagram illustrating a logical structure of a server according to an exemplary embodiment. The server 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 501 and one or more memories 502, where the memory 502 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 501 to implement the data screening method provided by the above method embodiments. The server 500 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may include other components for implementing the functions of the device, which are not described again here.

In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, such as a memory 502, comprising instructions executable by a processor 501 of a server 500 to perform the data screening method described above, the method comprising: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition. Optionally, the instructions may also be executed by the processor 501 of the server 500 to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, there is also provided an application program comprising one or more instructions executable by the processor 501 of the server 500 to perform the data filtering method described above, the method comprising: classifying the plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one class label; acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model; and determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein the second word vector of the text information of the target data and the first word vector of the category label meet a first preset condition. Optionally, the instructions may also be executed by the processor 501 of the server 500 to perform other steps involved in the exemplary embodiments described above.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A data screening method, the method comprising:

classifying image information in a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one category label;

acquiring a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;

determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein a first preset condition is met between the second word vector of the text information of the target data and the first word vector of the category label;

the classifying the image information in the plurality of original data by using the classification model to obtain the classification result of each original data comprises:

for each original data, inputting the image information in the original data into the classification model, and outputting a prediction probability that the image information belongs to each category label, wherein each prediction probability corresponds to one category label and indicates the possibility that the image information belongs to that category label;

when the maximum value in the prediction probabilities is smaller than or equal to a probability threshold value, acquiring a category label corresponding to the maximum value in the prediction probabilities as a category label of the image information;

the determining, from the plurality of original data, target data based on the first word vector of the respective category label and the second word vector of the text information in each of the original data includes:

for each original data, obtaining the cosine distance between a second word vector of text information in the original data and a first word vector of each category label corresponding to image information in the original data;

and determining original data corresponding to the cosine distance smaller than a preset value as the target data.

2. The data screening method of claim 1, wherein the obtaining a first word vector of each category label and a second word vector of text information in each original data based on the word segmentation tool and the word vector model comprises:

inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of the at least one word;

obtaining an average vector of the word vectors of the at least one word as the second word vector.

3. The data screening method of claim 1, wherein the classifying the image information in the plurality of original data by using the classification model to obtain the classification result of each original data, further comprises:

and when the maximum value in the prediction probabilities is larger than the probability threshold, acquiring the category label corresponding to the prediction probability larger than the probability threshold as at least one category label of the image information in the original data.

4. A data screening apparatus, the apparatus comprising:

a classification unit configured to perform classifying image information in a plurality of original data by adopting a classification model to obtain a classification result of each original data, wherein each original data comprises text information and image information, the classification model is used for classifying the image information, and the classification result comprises at least one category label;

an obtaining unit configured to perform obtaining a first word vector of each category label and a second word vector of text information in each original data based on a word segmentation tool and a word vector model;

a determining unit configured to perform determining target data from the plurality of original data based on the first word vector of each category label and the second word vector of the text information in each original data, wherein a first preset condition is met between the second word vector of the text information of the target data and the first word vector of the category label;

the classification unit includes:

an output subunit configured to perform, for each original data, inputting the image information in the original data into the classification model, and outputting a prediction probability that the image information belongs to each category label, wherein each prediction probability corresponds to one category label and indicates the possibility that the image information belongs to that category label;

an obtaining subunit, configured to perform, when a maximum value of the prediction probabilities is smaller than or equal to a probability threshold, obtaining a category label corresponding to the maximum value of the prediction probabilities as a category label of the image information;

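the determining unit is further configured to perform:

for each original data, obtaining the cosine distance between a second word vector of text information in the original data and a first word vector of each category label corresponding to image information in the original data;

and determining original data corresponding to the cosine distance smaller than a preset value as the target data.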

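5. The data screening apparatus according to claim 4, wherein the obtaining unit is further configured to perform:

inputting each category label and the at least one word into the word vector model, and outputting the first word vector and a word vector of the at least one word;

obtaining an average vector of the word vectors of the at least one word as the second word vector.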

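6. The data screening apparatus of claim 4, wherein the obtaining subunit is further configured to perform:

when the maximum value in the prediction probabilities is larger than the probability threshold, acquiring the category label corresponding to the prediction probability larger than the probability threshold as at least one category label of the image information in the original data.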

7. A server, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform operations to implement the data screening method of any one of claims 1 to 3.

8. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform operations to implement the data screening method of any one of claims 1 to 3.