WO2022247448A1 - Data processing method and apparatus, computing device, and computer readable storage medium - Google Patents

Data processing method and apparatus, computing device, and computer readable storage medium

Info

Publication number
WO2022247448A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
weight distribution
processed
sample weight
data
Prior art date
Application number
PCT/CN2022/083841
Other languages
French (fr)
Chinese (zh)
Inventor
张诗杰
朱森华
Original Assignee
华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Publication of WO2022247448A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The present disclosure relates to the field of artificial intelligence, and more particularly, to a data processing method and apparatus, a computing device, and a computer-readable storage medium.
  • Data set bias is a widespread problem in machine learning, and especially in deep learning; it has a large negative impact, is difficult to detect, and is easily overlooked. In particular, for scenarios with high requirements for model safety, if training is based on a biased data set, the resulting model may cause serious accidents in actual use.
  • Conventionally, the bias of a data set is checked by guesswork or based on experience, but this consumes substantial human resources, is inefficient, has low accuracy, and cannot meet actual needs.
  • Exemplary embodiments of the present disclosure provide a data processing method including a scheme for assessing bias in a data set, enabling a more precise check of the bias in the data set.
  • In a first aspect, a data processing method includes: constructing an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed; training a classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • In this way, the significance of a data set's bias can be assessed more accurately.
  • This evaluation scheme makes it convenient for users to adjust the data set and perform other processing.
  • In some embodiments, the method further includes: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed. That is, embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain a recommended sample weight distribution. This process requires no user participation and is efficient and highly automated.
  • updating the sample weight distribution includes: updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • In some embodiments, the method further includes: using the sample weight distribution at the time when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In this way, embodiments of the present disclosure can update the sample weight distribution based on iterative training of the classification model, and can check how the bias of the data set changes as the sample weight distribution is updated, so that the data set to be processed can be examined iteratively and an effective, highly accurate recommended sample weight distribution can be obtained.
  • In some embodiments, the method further includes: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In this way, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. This unbiased data set can then be used to train a more robust, unbiased task-specific model that meets actual needs.
  • In some embodiments, updating the sample weight distribution includes at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining a user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: removing, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and using the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: performing image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and using the background image to construct an irrelevant data item in the irrelevant data set.
  • the background image is used as a representative of bias, so that the data set can be checked for bias.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: determining a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generating a background image of the video sequence based on the binary image; and using the background image of the video sequence to construct an irrelevant data item in the irrelevant data set.
  • In this way, the background image corresponding to the video sequence can be obtained, taking into account the similarity between the frame images in the video sequence and the fact that the background in the video sequence is basically unchanged.
  • In some embodiments, the method further includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; superimposing the CAM and the target irrelevant data item to obtain an overlay result; and displaying the overlay result.
  • the embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the significance of data set bias can be clearly characterized, and the specific location where bias occurs can be presented visually. In this way, users can more intuitively and comprehensively know the bias of the data set.
  • This solution does not require too much user participation, can be automated, and can improve the efficiency of processing while ensuring the accuracy of the quantitative assessment of bias.
  • In a second aspect, a data processing apparatus includes: a construction unit configured to construct an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed;
  • a dividing unit configured to divide the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed;
  • a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • an updating unit is further included, configured to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.
  • the update unit is configured to: update the portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • the update unit is configured to: use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In some embodiments, an adjustment unit is further included, configured to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In some embodiments, the update unit is configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining the user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, the construction unit is configured to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, the construction unit is configured to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, the construction unit is configured to: determine a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • In some embodiments, the apparatus further includes a unit configured to: obtain a CAM by inputting a target irrelevant data item into the trained classification model, and obtain an overlay result by superimposing the CAM and the target irrelevant data item; and a display unit configured to display the overlay result.
  • In a third aspect, a computing device includes a processor and a memory, where the memory stores instructions executable by the processor. When the instructions are executed by the processor, the computing device: constructs an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; divides the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed; trains a classification model based on the first data set and the first sample weight distribution; and evaluates the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.
  • the instructions when executed by the processor, cause the computing device to: update the portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: use the sample weight distribution at the time when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining the user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, the instructions, when executed by a processor, cause the computing device to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, the computing device is caused to: determine a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • In some embodiments, the instructions, when executed by a processor, cause the computing device to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; superimpose the CAM and the target irrelevant data item to obtain an overlay result; and display the overlay result.
  • In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the operations of the method in the first aspect or any embodiment above are realized.
  • In a fifth aspect, a chip or chip system includes a processing circuit configured to perform the operations of the method in the first aspect or any embodiment above.
  • In a sixth aspect, a computer program or computer program product is provided.
  • The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method in the first aspect or any embodiment above.
  • FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure
  • FIG. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
  • Artificial Intelligence uses computers to simulate certain human thinking processes and intelligent behaviors.
  • The history of artificial intelligence research follows a natural, clear progression from a focus on “reasoning”, to a focus on “knowledge”, and then to a focus on “learning”.
  • Artificial intelligence has been widely applied to various industries such as security, medical care, transportation, education, and finance.
  • Machine learning is a branch of artificial intelligence, which studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms during empirical learning.
  • Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main feature is to use multiple nonlinear transformation structures to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
  • Data bias
  • Factors in the data that are correlated with the task but not causal, such as sample imbalance or artificial markers in the data, can be considered data bias.
  • Dataset bias refers to the presence of spurious features in a dataset that some machine learning models may learn.
  • In an image data set, there may be information in the images related to the model of the acquisition device and the acquisition parameters, which has nothing to do with the acquisition task.
  • A machine learning model may then base its predictions on this information and directly guess the classification result, instead of learning the image features genuinely related to the target task.
  • When a machine learning model is trained on an image data set with data set bias, it may not learn the training task objectively and realistically as expected. As a result, the learned model struggles to complete the target task as expected in the actual use environment, resulting in serious performance degradation; even if performance does not degrade, the reasons for its errors may be unacceptable and can even lead to ethics lawsuits.
  • For example, covering the mouth in an input image hardly changes the predictions of a model for detecting lipstick, which shows that the model did not actually learn mouth-related features.
  • Another example is a medical image recognition model that infers the collection location from the markers placed by the doctor, which affects the prediction results.
  • To this end, embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the impact of data set bias can be effectively determined and the data set adjusted accordingly, ensuring that a model trained on the adjusted data set will not be negatively affected by data bias.
  • FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure.
  • As shown in FIG. 1, the system 100 includes an input/output (I/O) module 110, a data set processing module 120, and a model training module 130.
  • the system 100 may further include a model storage module 140 and a data storage module 150 .
  • the various modules shown in FIG. 1 can communicate with each other.
  • the input/output module 110 can be used to acquire data sets to be processed. For example, a data set to be processed input by a user may be received.
  • the data set to be processed may be stored in the data storage module 150 .
  • the data storage module 150 may be a data storage resource corresponding to an object storage service (Object Storage Service, OBS) provided by a cloud service provider.
  • the data set to be processed includes a large number of data items to be processed, and each data item to be processed has a label.
  • That is, the data set to be processed contains a plurality of data items to be processed, each marked with a label.
  • Labels may be annotated manually or obtained through machine learning, which is not limited in the present disclosure.
  • Labels may also be called task labels, annotation information, or other names, which will not be enumerated herein.
  • the annotation information may be annotated by an annotator for a specific part of the data item to be processed based on experience.
  • the annotation information may be annotated through an image recognition model and an annotation model.
  • For example, for a human face, labels such as gender, age, whether glasses are worn, whether a hat is worn, and the size of the face can be annotated.
  • For a medical image, such as an ultrasound image, whether a lesion is present can be marked for the examined part.
  • the data item to be processed may include a tag-related part and a tag-independent part.
  • For example, if the label describes the face, the face area in the image is the part related to the label, and the rest of the image is unrelated to the label.
  • If the label concerns the eyes (for example, the pupil color is marked as “black”, “brown”, etc.), the eye area in the image is the part related to the label, while the other areas of the image are unrelated to the label.
  • the data items to be processed in the data set to be processed may be any type of data, such as images, videos, voices, texts, and so on.
  • the embodiment of the present disclosure does not limit the source of the data items to be processed.
  • For example, images may be collected from open-source data sets, may be collected by different image acquisition devices or by the same image acquisition device at different times, may be image frames in a video sequence captured by an image capture device, or any combination of the above, or others.
  • the input/output module 110 may be implemented as an input module and an output module that are independent of each other, or may also be implemented as a coupling module having both input and output functions.
  • For example, the input/output module 110 may provide a graphical user interface (GUI) and/or a command-line interface (CLI).
  • the data set processing module 120 can obtain the data set to be processed from the input/output module 110 , or alternatively, can obtain the data set to be processed from the data storage module 150 . Further, the data set processing module 120 can construct an irrelevant data set based on the data set to be processed.
  • The irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the unrelated data set may be stored in the data storage module 150 .
  • As noted above, a data item to be processed has a label, and it includes a part related to the label and a part unrelated to the label. The part related to the label can therefore be removed, retaining only the label-irrelevant part as an irrelevant data item; the label of the irrelevant data item is the label of the data item to be processed.
  • This process may also be called splitting, division, separation or other names, etc., which is not limited in the present disclosure.
  • Specifically, for a target data item to be processed, the part associated with its label can be removed from the target data item to obtain the remainder of the target data item. The remainder is then used to construct an irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data item to be processed is a face image
  • the label represents the skin color of the face, such as "white”.
  • the face area in the face image can be removed, and the remaining part after removing the face area can be used as the corresponding irrelevant data item, and the irrelevant data item still has the label "white" of the face skin color.
  • the irrelevant data item can be obtained by means of image segmentation.
  • the part of the image associated with the label is the foreground area, and the other areas in the image except the foreground area are the background area, then the foreground-background separation can be used to determine irrelevant data items based only on the background area.
  • image segmentation is performed on the target data item to be processed (target image) in the data set to be processed to obtain a background image corresponding to the target image, and then use the background image to construct an irrelevant data item.
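  • As a minimal illustration (not the patent's prescribed implementation), assume a binary foreground mask has already been produced by some segmentation algorithm; the function and variable names below are hypothetical:

```python
import numpy as np

def build_irrelevant_item(image: np.ndarray, foreground_mask: np.ndarray) -> np.ndarray:
    """Remove the label-associated (foreground) region, keeping only the background.

    image: H x W x 3 array; foreground_mask: H x W boolean array where True
    marks pixels associated with the label (e.g., the face region).
    """
    background = image.copy()
    background[foreground_mask] = 0  # blank out the label-related part
    return background

# Usage: the resulting background image keeps the original item's label,
# e.g. irrelevant_label = face_label ("white" in the skin-colour example).
```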
  • the embodiment of the present disclosure does not limit the specific algorithm used for image segmentation.
  • For example, one or more of the following algorithms may be used, as may others: threshold-based image segmentation, region-based image segmentation, edge-detection-based image segmentation, image segmentation based on wavelet analysis and the wavelet transform, genetic-algorithm-based image segmentation, active-contour-model-based image segmentation, deep-learning-based image segmentation, and so on.
  • Deep-learning-based image segmentation algorithms include, but are not limited to: feature-encoder-based segmentation, region-proposal-based segmentation, RNN-based segmentation, upsampling/deconvolution-based segmentation, segmentation based on enhanced feature resolution, feature-enhancement-based segmentation, segmentation using Conditional Random Fields (CRF)/Markov Random Fields (MRF), etc.
  • the data item to be processed in the data set to be processed is a video sequence.
  • Different data items to be processed may have the same or different durations.
  • the first data item to be processed in the data set to be processed is a first video sequence, the length of which is m1 frames, including m1 frame images.
  • the second data item to be processed in the data set to be processed is a second video sequence with a length of m2 frames, including m2 frames of images. m1 and m2 may or may not be equal.
  • video segmentation is performed on the target data item to be processed (target video sequence) in the data set to be processed to obtain a background image corresponding to the target video sequence, and then use the background image to construct an irrelevant data item.
  • image segmentation may be performed for each frame image in the target video sequence, and the background regions after segmentation of each frame image are fused to obtain a background image corresponding to the target video sequence.
  • the background image corresponding to the target video sequence may be obtained based on the gradient between two adjacent frames in the target video sequence.
  • the binary image corresponding to the video sequence may be obtained based on the gradient information of the video sequence. The background image of the video sequence is then generated based on this binary image, as described below in conjunction with FIG. 2 .
  • FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure.
  • The data set processing module 200 can serve as an implementation of the data set processing module 120 in FIG. 1. It can be used to determine an irrelevant data set based on the data set to be processed, where each data item to be processed is a video sequence and each irrelevant data item in the irrelevant data set may be a background image corresponding to a video sequence.
  • the dataset processing module 200 may include a gradient calculation submodule 210 , a gradient superposition submodule 220 , a thresholding submodule 230 , a morphological processing submodule 240 and a separation submodule 250 .
  • the gradient calculation sub-module 210 can be used to calculate the gradient information between a frame image and the previous frame image in the target video sequence.
  • Suppose the target video sequence includes m1 frame images, namely frame 0, frame 1, ..., frame m1-1. The gradient information between every two adjacent frames can then be calculated: between frame 1 and frame 0, between frame 2 and frame 1, ..., and between frame m1-1 and frame m1-2.
  • the embodiments of the present disclosure do not limit the specific manner of calculating the gradient information, for example, the frame difference may be calculated.
  • the gradient of the feature vectors of two frames of images along a specific dimension (such as the time dimension T) can be calculated, so that fixed background parts, such as image borders, can be extracted from video sequences through motion information.
  • Alternatively, the difference between a frame image and its grayscale version can be calculated to extract the colored parts of the frame, so that colored marks, such as color annotations or text added after video capture, are prevented from being treated as foreground.
  • the gradient superposition sub-module 220 can be used to superimpose the gradient information obtained by the gradient calculation sub-module 210 to obtain a gradient superposition map.
  • the manner of superposition by the gradient superposition sub-module 220 may include but not limited to weighted summation (such as average value), maximum value, minimum value or others.
  • the thresholding sub-module 230 may be configured to perform thresholding processing on the gradient overlay image obtained by the gradient overlay sub-module 220 to obtain an initial binary image.
  • the morphological processing sub-module 240 may perform morphological processing on the initial binary image obtained by the thresholding sub-module 230 to obtain a binary image corresponding to the video sequence.
  • For example, if the value of a pixel in the gradient overlay image is below the threshold, the pixel value can be reset to 0.
  • morphological processing may include, but not limited to, morphological dilation, morphological erosion, and the like.
  • the morphological processing submodule 240 may perform several times of morphological expansion on the initial binary image obtained by the thresholding submodule 230, and then perform the same number of morphological erosions to obtain a binary image.
  • the separation sub-module 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing sub-module 240 .
  • a matting operation may be performed on a binary image to obtain a background image.
  • the background image can be obtained by matrix dot product.
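  • Putting submodules 210 to 250 together, a minimal sketch of the pipeline might look as follows. It assumes NumPy and OpenCV, uses the frame difference as the gradient, a per-pixel maximum for superposition, and illustrative parameter values; the patent does not prescribe these choices:

```python
import numpy as np
import cv2  # OpenCV, assumed available

def video_background(frames: list, thresh: int = 15, morph_iters: int = 2) -> np.ndarray:
    """Sketch: gradients between adjacent frames -> superposition ->
    thresholding -> morphology -> matting by (matrix) dot product."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # 1. Gradient (here: frame difference) between each frame and its predecessor.
    grads = [cv2.absdiff(gray[i], gray[i - 1]) for i in range(1, len(gray))]
    # 2. Superimpose the gradients; here a per-pixel maximum (mean also works).
    overlay = np.max(np.stack(grads), axis=0)
    # 3. Threshold: moving (foreground) pixels -> 1, static pixels -> 0.
    _, binary = cv2.threshold(overlay, thresh, 1, cv2.THRESH_BINARY)
    # 4. Morphological dilations followed by the same number of erosions.
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=morph_iters)
    binary = cv2.erode(binary, kernel, iterations=morph_iters)
    # 5. Matting: keep only the static pixels as the background image.
    background_mask = (1 - binary).astype(frames[0].dtype)
    return frames[0] * background_mask[..., None]
```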
  • the background image corresponding to the video sequence can be obtained by fully considering the similarity of the background among the frame images in the video sequence.
  • the background image is used as a representative of bias, so that the data set can be checked for bias. Understandably, if the dataset is not biased, then the features of the background image should not have any relationship to the labels associated with the foreground regions.
  • In some embodiments, the constructed irrelevant data set can be divided into two parts: a first part of irrelevant data items and a second part of irrelevant data items, where the first part can be used to train the model and the second part can be used to test the model.
  • the embodiment of the present disclosure does not limit the division method.
  • the irrelevant data set may be divided into the first part and the second part according to 9:1 or 1:1 or other ratios.
  • the set composed of the first part of irrelevant data items may be called an irrelevant training set, and the set composed of the second part of irrelevant data items may be called an irrelevant test set.
  • the first part of the set of irrelevant data items may include an irrelevant training set and an irrelevant verification set.
  • the unrelated data set can be divided into an unrelated training set, an unrelated verification set and an unrelated test set according to 7:2:1.
  • the set composed of the first part of irrelevant data items is called the first data set (or training set), and the set composed of the second part of irrelevant data items is called the second data set (or test set).
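  • A minimal illustration of such a split (random assignment and the 9:1 ratio are assumptions; the patent does not mandate how items are assigned):

```python
import random

def split_irrelevant_set(items, ratio=0.9, seed=0):
    """Split irrelevant data items into a first (training) set and a second
    (test) set, e.g. 9:1; 1:1, or 7:2:1 with a validation set, is equally
    possible."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```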
  • the dataset processing module 120 may preprocess the dataset to be processed first, and then construct an irrelevant dataset based on the preprocessed dataset to be processed. Preprocessing includes but is not limited to: cluster analysis, data denoising, etc.
  • the model training module 130 may include a training submodule 132 and an evaluation submodule 134 .
  • the training sub-module 132 can be used to train the classification model.
  • the classification model may be trained based on the first part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the first part.
  • In some embodiments, the first part of irrelevant data items used for training may be the whole of the irrelevant data set, so that more data items are used for training, making the trained classification model more robust.
  • the first part of irrelevant data items used for training may be part of an irrelevant data set. As mentioned above, the irrelevant data set is divided into the first part of irrelevant data items and the second part of irrelevant data items.
  • the set of the first part of irrelevant data items used for training is called a training set, and correspondingly, the first part of irrelevant data items may be training items.
  • the training here may be to train an initial classification model or may be to update a previously trained classification model, wherein the initial classification model may be a classification model that has not been trained.
  • the previously trained classification model may be obtained after training the initial classification model.
  • the training sub-module 132 can obtain an initial classification model or a previously trained classification model from the model storage module 140 .
  • The training sub-module 132 can obtain the first part of irrelevant data items used for training and the label of each irrelevant data item in the first part from the data set processing module 120 or the data storage module 150. Alternatively, the training sub-module 132 can obtain the first part of irrelevant data items from the data set processing module 120 and obtain the label of each irrelevant data item in the first part from the input/output module 110.
  • In some embodiments, the training submodule 132 can preprocess the training set, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, etc.
  • the training data item after feature extraction can be characterized as an S-dimensional feature vector, where S is greater than 1.
  • For example, the classification model can be a convolutional neural network (CNN) model, which may optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully connected layers, an output layer, and so on.
  • the classification model includes a large number of parameters, which can represent the calculation formula or the weight of the calculation factor in the model, and the parameters can be updated iteratively through training.
  • The classification model also has hyperparameters, which guide the construction or training of the model, such as the number of training iterations, the learning rate, the batch size, the number of model layers, and the number of neurons in each layer.
  • the hyperparameters can be parameters obtained by training the model through the training set, or they can be preset parameters, and the preset parameters will not be updated through the training of the model.
  • the process of training the classification model by the training sub-module 132 can refer to the existing training process.
  • For example, the training process can be as follows: input the training data items in the training set into the classification model, use the label corresponding to each training data item as a reference, and use a loss function to obtain the loss value between the output of the classification model and the corresponding label; then adjust the parameters of the classification model according to the loss value. The classification model is trained iteratively over the training data items in the training set, and its parameters are continuously adjusted until, for a given input training data item, the model outputs a value close to the corresponding label with high accuracy, for example when the loss function is minimal or smaller than a reference threshold.
  • The loss function used during training measures how well the classification model has been trained, that is, it computes the difference between the result predicted by the classification model and the true value.
  • During training, since the output of the classification model should be as close as possible to the true value (that is, the corresponding label), the value predicted by the current classification model can be compared with the true value, and the parameters of the model updated according to the difference between the two. Each training step uses the loss function to judge this difference and updates the parameters of the classification model; once the model can predict values very close to the true values, it is considered trained.
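  • A generic supervised loop of this kind might be sketched as follows; PyTorch, the Adam optimizer, and cross-entropy loss are assumptions for illustration, since the patent does not name a framework or a specific loss function:

```python
import torch
import torch.nn as nn

def train_classifier(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Forward pass, loss against the label, backward pass, parameter update."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                # irrelevant data item and its label
            optimizer.zero_grad()
            loss = criterion(model(x), y)  # difference between prediction and label
            loss.backward()                # gradients of the loss
            optimizer.step()               # adjust the model parameters
    return model
```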
  • the "classification model” in the embodiments of the present disclosure may also be called a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may also be referred to as a "model” for short. Publicity is not limited to this.
  • the trained classification model may be stored in the model storage module 140 .
  • model storage module 140 may be part of model training module 130 .
  • the evaluation sub-module 134 can be used to evaluate the classification model. Specifically, the evaluation result of the trained classification model may be determined based on the second part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the second part. The evaluation results can be used to characterize the significance of data bias in the data set to be processed.
  • the set of the second part of irrelevant data items may be a test set, and correspondingly, the second part of irrelevant data items may be test data items.
  • the evaluation process may include: inputting a test data item into a trained classification model, obtaining a prediction result about the test data item, and determining an evaluation result based on a comparison result of the prediction result with a label of the test data item.
  • In some embodiments, the evaluation result may include at least one of the following: accuracy, precision, recall, the F1 index, the precision-recall (P-R) curve, the average precision (AP) index, the false positive rate, the false negative rate, and so on.
  • a confusion matrix may be constructed, which shows the number of positive examples (Positive, also called positive) and negative examples (Negative, also called negative), real values, predicted values, and the like.
  • the accuracy rate refers to the proportion of correctly classified samples to the total samples. For example, the number of test data items in the test set is N2, and the number of predicted results consistent with the label is N21, then the accuracy rate can be expressed as N21/N2.
  • The precision rate refers to the proportion of samples predicted to be positive that are actually positive. For example, if the number of test data items in the test set is N2, the number of items predicted as positive is N22, and the number of those N22 items that are actually positive is N23, then the precision rate can be expressed as N23/N22.
  • Recall refers to the proportion of samples that are actually positive that are predicted to be positive. For example, if the number of test data items in the test set is N2, the number labeled as positive examples is N31, and the number of those predicted as positive is N32, then the recall rate can be expressed as N32/N31.
  • the P-R curve defines the horizontal axis as the recall rate and the vertical axis as the precision rate.
  • A point on the P-R curve represents the following: under a certain threshold, the model judges results greater than the threshold as positive samples and results smaller than the threshold as negative samples, and the point gives the recall and precision corresponding to that threshold. The entire P-R curve is generated by moving the threshold from high to low; points near the origin represent the precision and recall of the model when the threshold is at its maximum.
  • The F1 index, also known as the F1 score, is the harmonic mean of precision and recall: the ratio of twice the product of the precision rate and the recall rate to the sum of the precision rate and the recall rate.
  • the evaluation result may include a positive example characterization value, such as a first accuracy rate and/or a first recall rate.
  • the first correct rate indicates the proportion of samples that are actually positive among the samples that are predicted to be positive.
  • the first recall rate represents the proportion of the samples that are actually positive that are predicted to be positive.
  • the evaluation result may include a negative example characterization value, such as a second accuracy rate and/or a second recall rate.
  • the second correct rate indicates the proportion of samples that are actually negative among the samples that are predicted to be negative.
  • the second recall rate represents the proportion of samples that are actually negative that are predicted to be negative.
  • the evaluation result may include a first predicted mean value and/or a second predicted mean value.
  • the first predicted mean represents the average of predicted values for samples that are actually positive.
  • the second predicted mean represents the average of predicted values for samples that are actually negative.
  • In some embodiments, the evaluation result may include a mean difference, which represents the difference between the first predicted mean and the second predicted mean, expressed for example as the difference between the two or as their ratio.
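  • The metrics above can be computed directly from prediction/label pairs; a sketch, with hypothetical names (positive = 1, negative = 0):

```python
def evaluation_results(y_true, y_pred, scores=None):
    """Confusion-matrix based metrics named in the text; `scores` are raw
    predicted values, used for the predicted means and the mean difference."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # "first correct rate"
    recall = tp / (tp + fn) if tp + fn else 0.0     # "first recall rate"
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    result = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    if scores is not None:
        pos = [s for s, t in zip(scores, y_true) if t == 1]  # actually-positive scores
        neg = [s for s, t in zip(scores, y_true) if t == 0]  # actually-negative scores
        result["mean_difference"] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return result
```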
  • the evaluation result can be presented to the user by the input/output module 110 .
  • it may be presented through a graphical user interface, which is convenient for users to view.
  • the bias significance of the data set can be characterized in a quantitative form.
  • This quantitative evaluation scheme can provide users with a clear reference, which is convenient for users to adjust the data set and other processing.
  • the input/output module 110 can also visually present representations of dataset biases through the graphical user interface.
  • Specifically, by inputting a target irrelevant data item into the trained classification model, a class activation map (CAM) is obtained. An overlay result is then obtained by superimposing the CAM and the target irrelevant data item, and the overlay result is displayed.
  • the class activation map is the class activation heat map.
  • Embodiments of the present disclosure can use the CAM to characterize the attention areas of the classification model, specifically, which areas (that is, the areas the model attends to) cause the bias.
  • For example, the CAM can be obtained using gradient-based CAM (Grad-CAM).
  • Specifically, the output of the last convolutional layer of the classification model (that is, the last-layer feature maps) can be extracted, and the extracted feature maps can be weighted and summed to obtain the CAM.
  • the weighted and summed results can also be used as a CAM after being processed by a Rectified Linear Unit (ReLU) activation function.
  • the weights for weighted summation here can be the weights of the top fully connected layer.
  • For example, the partial derivatives of the classification model's final softmax output with respect to all pixels of the last-layer feature maps can be calculated, and the global average over the width and height dimensions taken as the corresponding weights.
  • Embodiments of the present disclosure do not limit the manner in which the CAM and the target-independent data item (such as the background image) are superimposed.
  • weighted summation may be used for superimposition.
  • For example, the weights of the CAM and the background image may be equal.
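  • A Grad-CAM sketch along these lines follows; the hook-based feature extraction, the `last_conv` handle, and the equal-weight grayscale overlay are assumptions for illustration, not the patent's prescribed implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam_overlay(model, image, target_class, last_conv, alpha=0.5):
    """Weight the last conv feature maps by the global average of the class
    score's gradients, apply ReLU, then blend the CAM with the image."""
    feats, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image.unsqueeze(0))     # image: C x H x W float tensor
    logits[0, target_class].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # global average pooling
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted sum + ReLU
    cam = cam / (cam.max() + 1e-8)                       # normalise to [0, 1]
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:], mode="bilinear")[0]
    gray = image.mean(dim=0, keepdim=True)               # grayscale background
    return alpha * cam + (1 - alpha) * gray              # equal weights at alpha=0.5
```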
  • the embodiments of the present disclosure provide a solution for quantitatively evaluating and visually presenting data set bias, so that the significance of data set bias can be clearly characterized, and the specific location where bias occurs can be visually presented. In this way, users can more intuitively and comprehensively know the bias of the data set.
  • This solution does not require too much user participation, can be automated, and can improve the efficiency of processing while ensuring the accuracy of the quantitative assessment of bias.
  • the model training module 130 can also be used to adjust the data set to be processed based on the classification model.
  • the data set to be processed may have an initial sample weight distribution, correspondingly, the first data set has a first sample weight distribution, and the second data set has a second sample weight distribution.
  • For example, if the initial sample weight of a target data item to be processed is a, then the sample weight of the irrelevant data item generated from that target data item is also a.
  • The model training module 130 can be used to obtain the recommended sample weight distribution based on iterative training of the classification model, as described below in conjunction with FIG. 3.
  • FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure.
  • At 310, a first data set having a first sample weight distribution and a second data set having a second sample weight distribution are determined.
  • an unrelated data set may be constructed based on the data set to be processed, and the unrelated data set may be divided into a first data set and a second data set, as described in the above embodiments.
  • the data items to be processed in the data set to be processed may have initial sample weights, that is, the data set to be processed may have an initial sample weight distribution.
  • the initial sample weight may be input by the user through the input/output module 110 .
  • initialization sample weights may be determined through an initialization process.
  • The sample weight can be used to indicate the sampling probability of a data item to be processed. For example, assuming the sample weight of the i-th data item to be processed is w_i, the sampling probability of the i-th data item to be processed is w_i / Σ_j w_j, where the sum runs over all data items to be processed.
  • the initial sample weight distribution may indicate that the sampling probabilities of each data item to be processed in the data set to be processed are equal. Assuming that the data set to be processed includes N data items to be processed, and the initial sample weight of each data item to be processed is 1, then the sampling probability of each data item to be processed is initialized to 1/N.
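  • In code, the weight-to-probability mapping and the resulting weighted sampling could look like this (NumPy; purely illustrative):

```python
import numpy as np

def sampling_probabilities(weights: np.ndarray) -> np.ndarray:
    """Sample weight w_i maps to sampling probability w_i / sum_j w_j."""
    return weights / weights.sum()

# Initialisation: N items with weight 1 each -> probability 1/N per item.
w = np.ones(1000)
p = sampling_probabilities(w)                   # every entry is 1/1000
batch = np.random.choice(len(w), size=64, p=p)  # weighted sampling of a batch
```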
  • the first sample weight distribution and the second sample weight distribution can be correspondingly determined.
  • At 320, the first data set is sampled based on the first sample weight distribution, and the classification model is trained iteratively.
  • At 330, the classification model trained at 320 is evaluated based on the second data set, and an evaluation result is obtained.
  • the evaluation result may be obtained based on the comparison of the predicted result of the trained classification model for the irrelevant data item in the second data set with the label of the irrelevant data item.
  • irrelevant data items can be input into the trained classification model to obtain prediction results about the irrelevant data items, and the evaluation results are determined based on the comparison results of the prediction results of the irrelevant data items and the labels of the irrelevant data items.
  • the evaluation result may include at least one of the following: accuracy rate, precision rate, recall rate, F1 index, precision rate-recall rate curve, average precision index, false positive rate, false negative rate, and the like.
  • If the evaluation result is not greater than the preset threshold, the process may proceed to 360.
  • the preset threshold can be set based on the processing accuracy and application scenarios of the data set to be processed.
  • the preset threshold may be related to the specific meaning of the evaluation result. For example, the evaluation result includes a correct rate, and the preset threshold may be set to, for example, 30% or 50% or other numerical values.
  • Otherwise, if the evaluation result is greater than the preset threshold, at 350 the sample weight distribution is updated.
  • it may return to 310 to continue execution, that is to say, rebuild the first data set and the second data set.
  • an irrelevant data item may belong to the first data set in the previous cycle, but the irrelevant data item may belong to the first data set or the second data set in the next cycle.
  • it may return to 320 to continue execution, that is to say, the irrelevant data items in the first data set and the second data set do not change, but the first sample weight distribution and/or the second sample weight distribution are updated.
  • the first data set may be re-sampled based on the updated first sample weight distribution, and the classification model may be re-trained iteratively. And the retrained classification model is evaluated based on the second data set, and the evaluation result is obtained again.
  • 310 to 350 or 320 to 350 may be iteratively performed until the evaluation result indicates that the bias is not significant (for example, the evaluation result is not greater than the preset threshold).
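  • The overall loop of process 300 can be sketched as follows; `build_sets`, `train`, `evaluate`, and `update_weights` are hypothetical placeholders for the steps described above, not functions defined by the patent:

```python
def recommend_weights(dataset, weights, threshold, max_iters=50):
    """Iterate 310-350 until the bias is no longer significant (360)."""
    for _ in range(max_iters):
        (set1, w1), (set2, w2) = build_sets(dataset, weights)  # 310 (placeholder)
        model = train(set1, w1)                                # 320 (placeholder)
        result = evaluate(model, set2, w2)                     # 330 (placeholder)
        if result <= threshold:            # bias not significant: done
            return weights                 # recommended sample weight distribution
        weights = update_weights(weights, result)              # 350 (placeholder)
    return weights
```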
  • the sample weight distribution may be updated in a random manner.
  • For example, the sample weights of some data items to be processed can be randomly updated: the sample weight of one data item to be processed is updated from 1 to 2, the sample weight of another is updated from 1 to 3, and so on. It can be understood that the random method is nondeterministic, which may make the process of obtaining the recommended sample weight distribution take a long time.
  • a predetermined rule may be used to update the sample weight distribution.
  • For example, the second sample weight distribution may be updated. If the evaluation result indicates that the classification model's prediction for an irrelevant data item in the second data set differs from that item's label, the sample weight of the irrelevant data item may be increased, for example from a1 to a1+1 or 2*a1 or another value. In this example, the first sample weight distribution may remain unchanged, or it may be updated in other ways. A sketch of such a rule is given below.
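  • A sketch of such a predetermined rule (doubling the weight is one of the example updates above; the index bookkeeping is an assumption):

```python
def rule_based_update(weights, predictions, labels, indices):
    """If the model's prediction for an irrelevant item in the second set
    disagrees with its label, increase that item's sample weight."""
    new_weights = dict(weights)
    for i, pred, label in zip(indices, predictions, labels):
        if pred != label:
            new_weights[i] = 2 * new_weights[i]  # or w + 1, etc.
    return new_weights
```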
  • the second data set may be exchanged with the first data set before performing the next cycle. For example, in the next cycle, the classification model will be trained based on the second data set of the previous cycle and the updated second sample weight distribution.
  • the distribution of sample weights may be optimized through a genetic algorithm to update the distribution of sample weights.
  • the sample weight distribution can be used as the initial value of the genetic algorithm, and the objective function can be constructed based on the evaluation result obtained at 330; the genetic algorithm can then be used to optimize the sample weight distribution, and the optimized sample weight distribution serves as the updated sample weight distribution.
  • the embodiment of the present disclosure does not limit the construction method of the objective function of the genetic algorithm.
  • for example, if the evaluation result includes the mean difference and the accuracy for the positive samples and the negative samples, then the sum of the mean difference and the accuracy can be used as the objective function. It is understandable that other methods can also be used to construct the objective function; they are not listed here one by one.
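  • The following is one possible sketch of such a genetic-algorithm update, under stated assumptions: `bias_objective` is only a stand-in for the real train-and-evaluate pipeline at 330, and the population size, mutation rate, and clipping bounds are illustrative choices, not values fixed by the disclosure:

```python
# Hypothetical sketch of the genetic-algorithm update: the current sample
# weight distribution seeds the initial population, and the objective
# (to be minimized) stands in for the evaluation result.
import numpy as np

rng = np.random.default_rng(0)

def bias_objective(weights: np.ndarray) -> float:
    # Stand-in for "train on the first set, evaluate on the second set";
    # replace with the real pipeline. Lower means less significant bias.
    return float(np.abs(weights - weights.mean()).mean())

def ga_update(weights, pop_size=20, generations=30, sigma=0.3):
    n = len(weights)
    pop = np.clip(weights + sigma * rng.standard_normal((pop_size, n)), 0.01, None)
    pop[0] = weights                                   # seed with current distribution
    for _ in range(generations):
        scores = np.array([bias_objective(p) for p in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]          # selection
        cut = n // 2
        children = np.concatenate(                                   # one-point crossover
            [parents[rng.integers(len(parents), size=pop_size)][:, :cut],
             parents[rng.integers(len(parents), size=pop_size)][:, cut:]], axis=1)
        mutate = rng.random(children.shape) < 0.1                    # sparse mutation
        children += sigma * rng.standard_normal(children.shape) * mutate
        pop = np.clip(children, 0.01, None)
    scores = np.array([bias_objective(p) for p in pop])
    return pop[np.argmin(scores)]

print(ga_update(np.ones(8)))
```

  • Note that in the real pipeline, scoring each candidate weight vector would require retraining and re-evaluating the classification model, so the population size and number of generations would likely be kept small.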
  • the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain the recommended sample weight distribution. This process does not require user participation and has a high degree of automation.
  • user modifications to the sample weight distribution may be acquired to update the sample weight distribution.
  • the user can empirically infer what modification to the sample weight distribution should be made by referring to the evaluation results and/or the displayed overlay results (as described above), and then input the modification through the input/output module 110 to update the sample weight distribution.
  • this method can fully consider the user's needs, and update the sample weight distribution based on the user's modification, so that the obtained recommended sample weight distribution can better meet the user's expectations and improve user satisfaction.
  • the sample weight distribution obtained from the current evaluation result may be used as the recommended sample weight distribution.
  • the embodiments of the present disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe how the data set bias changes as the sample weight distribution is updated; in this way the data set to be processed can be inspected iteratively, yielding an effective recommended sample weight distribution with high reference value.
  • the input/output module 110 can also present the recommended sample weight distribution for the user as a reference for further adjustment of the data set to be processed.
  • the recommended sample weight distribution is presented visually through a graphical user interface.
  • the data set processing module 120 may add or delete the data set to be processed based on the obtained recommended sample weight distribution, so as to construct an unbiased data set.
  • the data set processing module 120 may copy the data items to be processed with a large recommended sample weight, so as to expand the number of data items to be processed in the data set to be processed.
  • the data set processing module 120 may delete data items to be processed whose recommended sample weights are small, so as to reduce the number of data items to be processed in the data set to be processed.
  • a user's deletion instruction for some data items to be processed may be obtained via the input/output module 110, so as to delete some data items to be processed.
  • Other data items input by the user may be obtained via the input/output module 110 to be added to the current data set to be processed.
  • users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
  • other similar samples may be other images collected by the same image collection device (or the same model of device) in a similar environment (such as under similar care conditions).
  • data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased data set can be used to train more robust and unbiased task-specific models.
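  • A minimal sketch of this add/delete adjustment is given below; the thresholds `hi` and `lo` and the rounding rule are illustrative assumptions, not values fixed by the disclosure:

```python
# Hypothetical sketch: adjusting the data set to be processed from the
# recommended sample weights, duplicating high-weight items and dropping
# low-weight ones.
def adjust_dataset(items, weights, hi=2.0, lo=0.5):
    adjusted = []
    for item, w in zip(items, weights):
        if w <= lo:                             # delete items with small recommended weight
            continue
        copies = round(w) if w >= hi else 1     # copy items with large recommended weight
        adjusted.extend([item] * copies)
    return adjusted

print(adjust_dataset(["a", "b", "c"], [3.0, 1.0, 0.2]))  # -> ['a', 'a', 'a', 'b']
```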
  • the system 100 shown in FIG. 1 may be a system capable of interacting with users; the system 100 may be a software system, a hardware system, or a system combining hardware and software.
  • the system 100 can be implemented as a computing device or a part of a computing device, where the computing device includes but is not limited to a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
  • the system 100 shown in FIG. 1 can be implemented as an artificial intelligence platform (AI platform).
  • AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users.
  • Various AI models or AI sub-models for solving different problems can be built into the AI platform, and the AI platform can establish an applicable AI model according to the needs input by users. That is, users only need to specify their needs on the AI platform and, following the prompts, prepare a data set and upload it to the AI platform; the AI platform can then train an AI model for the user that can be used to fulfill the user's needs.
  • the AI model in the embodiments of the present disclosure can be used to evaluate the data bias of the data set to be processed input by the user.
  • FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure.
  • the system 100 is fully deployed in the cloud environment 410 .
  • the cloud environment 410 is an entity that provides cloud services to users by using basic resources in the cloud computing mode.
  • the cloud environment 410 includes a cloud data center 412 and a cloud service platform 414.
  • the cloud data center 412 includes a large number of basic resources (comprising computing resources, storage resources and network resources) owned by the cloud service provider.
  • the computing resources included in the cloud data center 412 may be a large number of computing devices (such as servers).
  • the system 100 can be independently deployed on a server or a virtual machine in the cloud data center 412; the system 100 can also be deployed in a distributed manner on multiple servers in the cloud data center 412, on multiple virtual machines in the cloud data center 412, or across both servers and virtual machines in the cloud data center 412.
  • the system 100 can be abstracted into an AI development cloud service 424 by the cloud service provider on the cloud service platform 414 and provided to the user (with settlement based on usage, for example); the cloud environment 410 then utilizes the system 100 deployed in the cloud data center 412 to provide the AI development cloud service 424 to the user.
  • the user can upload the data set to be processed through an application program interface (application program interface, API) or GUI.
  • the system 100 in the cloud environment 410 receives the data set to be processed uploaded by the user, and can perform operations such as data set processing, model training, and data set adjustment.
  • the system 100 can return the evaluation result of the model, the weight distribution of recommended samples, etc. to the user through API or GUI.
  • when the system 100 in the cloud environment 410 is abstracted into the AI development cloud service 424 and provided to users, it can be divided into two parts, for example a data set bias evaluation cloud service and a data set adjustment cloud service.
  • the user can only purchase the data set bias evaluation cloud service.
  • the cloud service platform 414 can construct an irrelevant data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return the evaluation result of the classification model to the user, so that the user is informed of the bias significance of the data set to be processed.
  • the user can also further purchase the data set adjustment cloud service on the cloud service platform 414.
  • the cloud service platform 414 can iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user can add data items to or delete data items from the data set to be processed with reference to the recommended sample weight distribution, to construct an unbiased data set.
  • FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure.
  • the system 100 is deployed in a distributed manner across different environments, which may include but are not limited to at least two of the cloud environment 510, the edge environment 520, and the terminal computing device 530.
  • System 100 may be logically divided into multiple sections, each section having a different function.
  • the system 100 includes an input/output module 110 , a data set processing module 120 , a model training module 130 , a model storage module 140 and a data storage module 150 .
  • Each part of the system 100 can be deployed in any two or three environments of the terminal computing device 530 , the edge environment 520 and the cloud environment 510 .
  • Various parts of the system 100 deployed in different environments cooperate to provide users with various functions.
  • the input/output module 110 and the data storage module 150 of the system 100 are deployed in the terminal computing device 530, the data set processing module 120 of the system 100 is deployed in an edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510.
  • the user sends the data set to be processed to the input/output module 110 in the terminal computing device 530 , and the terminal computing device 530 stores the data set to be processed to the data storage module 150 .
  • the data set processing module 120 in the edge computing device of the edge environment 520 constructs an irrelevant data set based on the data set to be processed from the terminal computing device 530 .
  • the model training module 130 in the cloud environment 510 trains a classification model based on an unrelated dataset from the edge environment 520 .
  • the cloud environment 510 may also store the trained classification model in the model storage module 140. It should be understood that this application does not limit which parts of the system 100 are deployed in which environment; in actual applications, the deployment can be adapted according to the computing capability of the terminal computing device 530, the resource occupancy of the edge environment 520 and the cloud environment 510, or specific application requirements.
  • the edge environment 520 is an environment including a collection of edge computing devices that are closer to the terminal computing device 530 , and the edge computing devices include but are not limited to: edge servers, edge small stations with computing capabilities, and the like. It can be understood that the system 100 may also be independently deployed on one edge server in the edge environment 520 , or may be deployed on multiple edge servers in the edge environment 520 in a distributed manner.
  • the terminal computing device 530 includes, but is not limited to: a terminal server, a smart phone, a notebook computer, a tablet computer, a personal desktop computer, a smart camera, and the like. It can be understood that the system 100 may also be independently deployed on one terminal computing device 530 , or may be deployed on multiple terminal computing devices 530 in a distributed manner.
  • FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure.
  • the computing device 600 in FIG. 6 may be implemented as a device in the cloud environment 510 in FIG. 5 , a device in the edge environment 520 , or a terminal computing device 530 .
  • the computing device 600 shown in FIG. 6 can also be regarded as a computing device cluster; that is, the computing device 600 may include one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and terminal computing devices 530.
  • the computing device 600 includes a memory 610 , a processor 620 , a communication interface 630 and a bus 640 , wherein the bus 640 is used for communication between various components of the computing device 600 .
  • the memory 610 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory or any combination thereof.
  • the memory 610 can store programs, and when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that can be performed by the various modules in the system 100 as described above. It should be understood that the processor 620 and the communication interface 630 may also be used to execute part or all of the content in the embodiments of the data processing method described below in this specification.
  • the memory can also store datasets and classification models.
  • a part of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as the data set to be processed and the irrelevant data set, and a part of the storage resources in the memory 610 is divided into a model storage module for storing classification models.
  • the processor 620 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU) or any combination thereof.
  • Processor 620 may include one or more chips.
  • the processor 620 may include an accelerator, such as a Neural Processing Unit (Neural Processing Unit, NPU).
  • the communication interface 630 uses a transceiver module such as a transceiver to implement communication between the computing device 600 and other devices or communication networks. For example, data may be acquired through communication interface 630 .
  • Bus 640 may include pathways for communicating information between various components of computing device 600 (eg, memory 610 , processor 620 , communication interface 630 ).
  • FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure.
  • the method 700 shown in FIG. 7 can be executed by the system 100 .
  • an irrelevant data set is constructed based on the data set to be processed; the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the data set to be processed includes a plurality of data items to be processed, and each data item to be processed has a label.
  • Data items to be processed may include tag-related parts and tag-independent parts.
  • the part associated with the label of the target data item to be processed may be removed from the target data item to be processed in the data set to be processed, to obtain the remaining part of the target data item to be processed; the remaining part is then used to construct an irrelevant data item in the irrelevant data set, and the label of that irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set, that is, the data item to be processed is an image.
  • image segmentation may be performed on the target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed.
  • a background image is used to construct an unrelated data item in an unrelated data set.
  • the part of the image associated with the label is the foreground area, and the other areas in the image except the foreground area are the background area; foreground-background separation can thus be used to determine irrelevant data items based only on the background area.
  • the data items to be processed in the data set to be processed may be video sequences. A binary image of a video sequence can then be determined based on the gradient information between a frame image in the video sequence and the previous frame image, a background image of the video sequence is generated based on the binary image, and the background image of the video sequence is used to construct an irrelevant data item in the irrelevant data set.
  • FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure. Specifically, what is shown in FIG. 8 is the process of constructing irrelevant data items based on the data items to be processed (video sequences).
  • the gradient information between two adjacent frames of images in the target video sequence is calculated.
  • the gradient of the feature vectors of the two frames of images along the time dimension may be calculated, so as to obtain gradient information.
  • the static and unchanging background parts in the video sequence can be obtained, such as image borders and the like.
  • a gradient overlay map is obtained based on the overlay of the gradient information.
  • the gradient information obtained in step 810 may be weighted and summed, maximized, or minimized, etc., to complete the superposition and obtain the gradient overlay map.
  • thresholding is performed on the gradient overlay image to obtain an initial binary image.
  • the initial binary image is subjected to several iterations of morphological dilation, followed by the same number of iterations of morphological erosion, so as to obtain the binary image.
  • a background image is obtained based on the binary image, and the background image is used as an irrelevant data item corresponding to the video sequence.
  • a matting operation may be performed with the binary image, for example using a matrix dot product, to obtain the background image.
  • the background image corresponding to the video sequence can be obtained in consideration of the similarity between the frame images in the video sequence and the fact that the background in the video sequence is basically unchanged.
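  • The following Python sketch assembles steps 810 to 850 for a grayscale video, assuming SciPy morphology is available; the threshold and iteration counts are illustrative assumptions, not values fixed by the disclosure:

```python
# Hypothetical sketch of process 800, assuming `frames` is a (T, H, W)
# grayscale video sequence.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def background_item(frames: np.ndarray, thresh: float = 10.0, n_morph: int = 3) -> np.ndarray:
    grads = np.abs(np.diff(frames.astype(float), axis=0))   # 810: adjacent-frame gradients
    overlay = grads.sum(axis=0)                             # 820: gradient overlay map
    moving = overlay > thresh                               # 830: thresholding
    moving = binary_dilation(moving, iterations=n_morph)    # 840: dilation, then ...
    moving = binary_erosion(moving, iterations=n_morph)     # ... the same number of erosions
    mask = ~moving                                          # static background mask
    return frames[0] * mask                                 # 850: matting via element-wise product

frames = np.random.randint(0, 255, size=(8, 32, 32))
print(background_item(frames).shape)
```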
  • the label of the irrelevant data item is determined based on the label of the data item to be processed. Specifically, if the target data item to be processed has label A, and the target irrelevant data item is obtained by processing the target data item to be processed (such as image segmentation), then the label of the target unrelated data item is also label A.
  • the irrelevant data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
  • the sample weights of the irrelevant data items are determined based on the sample weights of the data items to be processed. Specifically, if the target data item to be processed has a sample weight w, and the target irrelevant data item is obtained by processing the target data item to be processed (such as by image segmentation), then the sample weight of the target irrelevant data item is also w.
  • the manner of dividing the first data set and the second data set is not limited. For example, it may be divided in a manner of 9:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is about 9:1. For example, it may be divided in a manner of 1:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is approximately 1:1.
  • the first data set can also be further divided into the first sub-data set and the second sub-data set, for example, the ratio of the number of irrelevant data items in the first sub-data set to the number of irrelevant data items in the second sub-data set is about 7:2. It can be understood that the ratios listed here are only for illustration, and are not intended to limit the embodiments of the present disclosure.
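  • For example, a simple sketch of such a weighted 9:1 split (the `split` helper and the fixed random seed are illustrative assumptions) could look as follows:

```python
# Hypothetical sketch: a 9:1 split of the irrelevant data set, with each
# irrelevant item carrying the sample weight of the data item it was built from.
import numpy as np

rng = np.random.default_rng(0)

def split(items, weights, ratio=0.9):
    idx = rng.permutation(len(items))
    cut = int(ratio * len(items))
    first, second = idx[:cut], idx[cut:]
    return ([items[i] for i in first], [weights[i] for i in first],
            [items[i] for i in second], [weights[i] for i in second])

items, weights = list(range(10)), [1.0] * 10
d1, w1, d2, w2 = split(items, weights)
print(len(d1), len(d2))  # -> 9 1
```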
  • the classification model is trained based on the first data set and the first sample weight distribution.
  • the first data set may be sampled based on the first sample weight distribution, and the classification model may be trained on the sampled first data set using the labels of the irrelevant data items in the first data set.
  • the classification model can be trained by using the first data set as a training set.
  • the first data set may be preprocessed, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, and the like.
  • the embodiment of the present disclosure does not limit the specific structure of the classification model, for example, it may be a convolutional neural network, including at least a convolutional layer and a fully connected layer.
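  • As one hedged sketch of this training step, the PyTorch code below samples the first data set according to the first sample weight distribution and trains a small classifier with a convolutional layer and a fully connected layer; the network shape, stand-in tensors, and hyperparameters are illustrative assumptions, since the disclosure does not fix a specific structure:

```python
# Hypothetical sketch: weighted sampling of the first data set plus one
# epoch of training for a minimal conv + fully-connected classifier.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

x = torch.randn(90, 1, 32, 32)            # stand-in irrelevant data items
y = torch.randint(0, 2, (90,))            # labels carried over from the items
w = torch.ones(90)                        # first sample weight distribution

sampler = WeightedRandomSampler(w, num_samples=len(w), replacement=True)
loader = DataLoader(TensorDataset(x, y), batch_size=16, sampler=sampler)

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(8 * 4 * 4, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:                     # one epoch of weight-driven training
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```

  • Sampling with replacement is one natural way to realize a weight distribution: items with larger weights are simply drawn more often during training.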
  • the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
  • the second data set can be used as a test set to obtain an evaluation result.
  • the evaluation result may be obtained based on a prediction result of the classification model for the irrelevant data item in the second data set and a comparison result between labels of the unrelated data item in the second data set.
  • the evaluation result may include a first accuracy rate for positive samples in the second data set and a second accuracy rate for negative samples in the second data set.
  • the sample weight distribution of the data set to be processed may be updated.
  • the sample weight distribution of the data set to be processed is updated. After this, the process returns to 720 to obtain the first data set and the second data set again, and 730 and 740 are repeatedly executed until the evaluation result obtained at block 740 indicates that the bias is not significant (or that there is no significant bias), for example, the evaluation result is not greater than a preset threshold. Subsequently, the sample weight distribution when the evaluation result is not greater than the preset threshold may be used as the recommended sample weight distribution, and the recommended sample weight distribution may be output.
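  • A compact sketch of this train-evaluate-update loop is shown below; all three helper functions are stubs standing in for the real 720/730/740 steps, and the threshold is illustrative:

```python
# Hypothetical sketch of the 720-740 loop; the stubs below only stand in
# for the real split / train / evaluate / update steps described above.
def split_sets(weights): return weights, weights                  # stub for 720
def train_and_evaluate(w1, w2): return sum(w2) / (len(w2) * 4)    # stub: score in [0, 1]
def update_weights(weights): return [max(0.1, w * 0.9) for w in weights]  # stub update

def recommend(weights, threshold=0.2, max_iters=50):
    for _ in range(max_iters):
        w1, w2 = split_sets(weights)
        result = train_and_evaluate(w1, w2)
        if result <= threshold:            # bias no longer significant
            return weights                 # recommended sample weight distribution
        weights = update_weights(weights)  # update and loop again
    return weights

print(recommend([1.0] * 8))
```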
  • the embodiment of the present disclosure does not limit the specific method of updating the sample weight distribution.
  • at least one of the following methods can be used for the update: updating the sample weight distribution using predetermined rules, updating the sample weight distribution in a random manner, acquiring the user's modification to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution through a genetic algorithm to update the sample weight distribution.
  • updating the sample weight distribution may update the first sample weight distribution of the first data set, so that when the process returns to 720, the first sample weight distribution of the first data set in the re-executed 720 is updated, and the classification model trained at 730 is therefore also updated.
  • updating the sample weight distribution may update the first sample weight distribution of the first data set and update the second sample weight distribution of the second data set.
  • the sample weight distribution of the data set to be processed can be updated, and the irrelevant data set can be re-partitioned.
  • the sample weight distribution of the data set to be processed may be updated so as to adaptively update the first sample weight distribution and the second sample weight distribution, while the irrelevant data items in the first data set and the second data set remain unchanged. In this way, when the process returns to 720, the first data set in the re-executed 720 is updated or its first sample weight distribution is updated, and the classification model trained at 730 is then also updated.
  • updating the sample weight distribution may update the second sample weight distribution of the second data set.
  • the first sample weight distribution may remain unchanged.
  • the first data set and the second data set from the last execution of 720 may be exchanged, so that the first data set when returning to 730 is the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, making the classification model's evaluation result for the bias significance more accurate.
  • FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
  • an irrelevant data set is constructed based on the data set to be processed, the irrelevant data set includes irrelevant data items with labels, and the labels of the unrelated data items are determined based on the labels of the data items to be processed in the data set to be processed of.
  • the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first sample weight distribution and The second sample weight distribution is determined based on the sample weights of the data items to be processed in the data set to be processed.
  • a classification model is trained based on the first data set and the first sample weight distribution.
  • the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
  • the second sample weight distribution for the second data set is updated.
  • sample weights of all irrelevant data items in the second data set may be updated, or the sample weights of some irrelevant data items in the second data set may be updated.
  • the weight distribution of the second sample may be updated based on a prediction result of the classification model for irrelevant data items in the second data set at 940 .
  • the sample weights of irrelevant data items in the second data set with correct predictions may be increased, or the sample weights of irrelevant data items in the second data set with wrong predictions may be decreased. For example, assuming that the sample weight of a first irrelevant data item in the second data set is 2, and the prediction result obtained by inputting this first irrelevant data item into the classification model is consistent with its label, then the sample weight of the first irrelevant data item in the second data set is increased, for example, from 2 to 3, 4, or another value.
  • conversely, if the prediction result of a second irrelevant data item in the second data set is inconsistent with its label, the sample weight of that second irrelevant data item may be reduced, for example, from 2 to 1.
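  • A minimal sketch of this weight-update rule (the step size and floor are illustrative assumptions) might be:

```python
# Hypothetical sketch of the update at 960: raise the weight of items the
# model predicted correctly and lower the weight of items it got wrong.
import numpy as np

def update_second_weights(weights, preds, labels, step=1.0, floor=0.1):
    correct = preds == labels
    new = weights.copy()
    new[correct] += step                                      # e.g. 2 -> 3
    new[~correct] = np.maximum(new[~correct] - step, floor)   # e.g. 2 -> 1
    return new

w = np.array([2.0, 2.0, 2.0])
print(update_second_weights(w, np.array([1, 0, 1]), np.array([1, 1, 1])))  # -> [3. 1. 3.]
```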
  • the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
  • the first data set after exchange is the second data set in block 920
  • the first sample weight distribution of the first data set after exchange is the second sample weight distribution updated in block 960
  • the second data set after the swap is the first data set in block 920
  • the second sample weight distribution of the second data set after the swap is the first sample weight distribution in block 920.
  • execution returns to 930 . That is, the classification model is retrained using the first data set after the exchange in 970 .
  • the recommended sample weight distribution is output.
  • the sample weight distribution when the evaluation result is not greater than the preset threshold is used as the recommended sample weight distribution.
  • the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
  • the focus areas of data set bias can be presented in a visual manner. Specifically, a class activation map can be obtained by inputting a target irrelevant data item into the trained classification model; an overlay result is then obtained by superimposing the class activation map on the target irrelevant data item, and the overlay result is displayed.
  • the overlay result can be obtained by weighted summation with the heat map; by displaying the overlay result, it is possible to see visually which areas the classification model attends to, and these attention areas are important factors that cause bias.
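  • For illustration, the sketch below computes a classic class activation map from the last convolutional feature maps and the fully-connected weights, then blends it over the input image; the nearest-neighbour upsampling and the blending factor are assumptions, not requirements of the disclosure:

```python
# Hypothetical sketch: classic CAM computation and overlay for one class.
import numpy as np

def class_activation_map(features, fc_weights, cls):
    """features: (C, h, w) conv feature maps; fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[cls], features, axes=1)    # weighted sum over channels
    cam = np.maximum(cam, 0.0)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam

def overlay(image, cam, alpha=0.5):
    """Nearest-neighbour upsample of the CAM, then blend with the image."""
    reps = (image.shape[0] // cam.shape[0], image.shape[1] // cam.shape[1])
    heat = np.kron(cam, np.ones(reps))
    return (1 - alpha) * image + alpha * heat * image.max()

img = np.random.rand(32, 32)
feats = np.random.rand(8, 8, 8)
print(overlay(img, class_activation_map(feats, np.random.rand(2, 8), 1)).shape)
```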
  • after the recommended sample weight distribution is obtained, the method may optionally further include adjusting the data set to be processed based on the recommended sample weight distribution, to obtain an unbiased data set.
  • an unbiased data set can be constructed by adding data items to or deleting data items from the data set to be processed.
  • data items to be processed with a large recommended sample weight may be copied to expand the number of data items to be processed in the data set to be processed.
  • data items to be processed with small recommended sample weights may be deleted, so as to reduce the number of data items to be processed in the data set to be processed.
  • a user's deletion instruction for some data items to be processed may be obtained, so as to delete some data items to be processed.
  • Other data items entered by the user can be obtained to be added to the current pending data set.
  • users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
  • other similar samples may be other images collected by the same image collection device (or the same model of device) in a similar environment (such as under similar care conditions).
  • data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased data set can be used to train more robust and unbiased task-specific models.
  • Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
  • Apparatus 1000 may be implemented by software, hardware or a combination of both.
  • the device 1000 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
  • the device 1000 includes a construction unit 1010 , a division unit 1020 , a training unit 1030 and an evaluation unit 1040 .
  • the construction unit 1010 is configured to construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the division unit 1020 is configured to divide the irrelevant data set into a first data set and a second data set, the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight The distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
  • the training unit 1030 is configured to train the classification model based on the first data set and the first sample weight distribution.
  • the evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of bias in the data set to be processed with the sample weight distribution.
  • the device 1000 may further include an update unit 1050 , an adjustment unit 1060 and a display unit 1070 .
  • the update unit 1050 is configured to update the sample weight distribution of the data set to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.
  • the updating unit 1050 may be configured to update a part of the sample weight distribution, so that the second sample weight distribution is updated without updating the first sample weight distribution.
  • the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: update the sample weight distribution using a predetermined rule, update the sample weight distribution in a random manner, and acquire user modification to the sample weight distribution to update the sample weight distribution, or optimize the sample weight distribution by genetic algorithm to update the sample weight distribution.
  • the updating unit 1050 may be configured to use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • the adjustment unit 1060 is configured to add or delete the data set to be processed based on the weight distribution of the recommended samples, so as to construct an unbiased data set.
  • the update unit 1050 is further configured to: obtain a class activation map by inputting a target irrelevant data item into the trained classification model; and obtain an overlay result by superimposing the class activation map on the target irrelevant data item.
  • the display unit 1070 is configured to display the recommended sample weight distribution and/or the superposition result.
  • the construction unit 1010 may be configured to remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item to be processed, so as to obtain the remaining part of the target data item to be processed; and to use the remaining part to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • the construction unit 1010 may be configured to perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and to use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • the construction unit 1010 may be configured to determine a binary image of the video sequence based on gradient information between a frame image in the video sequence and the previous frame image; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • the division of units in the embodiments of the present disclosure is schematic, and it is only a logical function division. In actual implementation, there may be other division methods.
  • the functional units in the disclosed embodiments can be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the data processing device 1000 shown in FIG. 10 can be used to implement the above data processing process shown in conjunction with FIGS. 7 to 9 .
  • the present disclosure can also be implemented as a computer program product.
  • a computer program product may include computer readable program instructions for carrying out various aspects of the present disclosure.
  • the present disclosure may be implemented as a computer-readable storage medium, on which computer-readable program instructions are stored, and when a processor executes the instructions, the processor is made to execute the above-mentioned data processing process.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer-readable program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer-readable program instructions.

Abstract

Provided are a data processing method and apparatus, a computing device, and a computer readable storage medium. In the method, an irrelevant dataset having a tag is constructed on the basis of a dataset to be processed; the irrelevant dataset is divided into a first dataset having a first sample weight distribution and a second dataset having a second sample weight distribution, the first and second sample weight distributions being determined on the basis of a sample weight of a data item to be processed in the dataset to be processed; a classification model is trained on the basis of the first dataset and the first sample weight distribution; the classification model is evaluated on the basis of the second dataset and the second sample weight distribution to obtain an evaluation result which indicates the bias significance of the dataset to be processed having sample weight distributions.

Description

Summary of the Invention
In a first aspect, a data processing method is provided. The method includes: constructing an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed; training a classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.

In this way, the embodiments of the present disclosure can assess the bias significance of a data set more accurately. This evaluation scheme also makes it convenient for users to adjust the data set or perform other processing.
In some embodiments of the first aspect, the method further includes: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and, based on the updated sample weight distribution, repeating the training and the evaluation until the evaluation result is not greater than the preset threshold.

In this way, the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain a recommended sample weight distribution. This process does not require user participation and is efficient and highly automated.
In some embodiments of the first aspect, updating the sample weight distribution includes: updating a part of the sample weight distribution, such that the second sample weight distribution is updated without updating the first sample weight distribution.

In some embodiments of the first aspect, the method further includes: using the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.

In this way, the embodiments of the present disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe how the data set bias changes as the sample weight distribution is updated, so that the data set to be processed can be inspected iteratively and an effective, highly accurate recommended sample weight distribution can be obtained.
In some embodiments of the first aspect, the method further includes: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.

In this way, in the embodiments of the present disclosure, the data set to be processed can be augmented or pruned based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Further, the unbiased data set can be used to train a more robust, unbiased task-specific model, thereby meeting actual needs.
In some embodiments of the first aspect, updating the sample weight distribution includes at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, acquiring the user's modification to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution through a genetic algorithm to update the sample weight distribution.
In some embodiments of the first aspect, constructing the irrelevant data set based on the data set to be processed includes: removing, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item to be processed, to obtain the remaining part of the target data item to be processed; and using the remaining part to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments of the first aspect, the data set to be processed is an image data set, and constructing the irrelevant data set based on the data set to be processed includes: performing image segmentation on a target data item to be processed in the data set to be processed, to obtain a background image corresponding to the target data item to be processed; and using the background image to construct an irrelevant data item in the irrelevant data set.

In this way, in the embodiments of the present disclosure, the background image serves as a representative of bias, so that the data set can be checked for bias.
In some embodiments of the first aspect, the data items to be processed in the data set to be processed are video sequences, and constructing the irrelevant data set based on the data set to be processed includes: determining a binary image of a video sequence based on gradient information between a frame image in the video sequence and the previous frame image; generating a background image of the video sequence based on the binary image; and using the background image of the video sequence to construct an irrelevant data item in the irrelevant data set.

In this way, considering the similarity between the frame images in a video sequence and the fact that the background in a video sequence is basically unchanged, the background image corresponding to the video sequence can be obtained.
In some embodiments of the first aspect, the method further includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; obtaining an overlay result by superimposing the CAM on the target irrelevant data item; and displaying the overlay result.

In this way, the embodiments of the present disclosure provide a scheme for quantitatively evaluating data set bias, which can explicitly characterize the significance of the bias and visually present the specific locations where the bias arises, so that users can understand the data set bias more intuitively and comprehensively. The scheme does not require much user participation and can be carried out automatically, improving processing efficiency while ensuring the accuracy of the quantitative evaluation of bias.
In a second aspect, a data processing apparatus is provided. The apparatus includes: a construction unit configured to construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; a division unit configured to divide the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed; a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
In some embodiments of the second aspect, the apparatus further includes an update unit configured to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.

In some embodiments of the second aspect, the update unit is configured to: update a part of the sample weight distribution, such that the second sample weight distribution is updated without updating the first sample weight distribution.

In some embodiments of the second aspect, the update unit is configured to: use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
在第二方面的一些实施例中,还包括调整单元,被配置为:基于推荐样本权重分布,对待处理数据集进行增加或删除,以构建无偏数据集。In some embodiments of the second aspect, an adjustment unit is further included, configured to: add or delete the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
在第二方面的一些实施例中,其中更新单元被配置为通过以下至少一项来更新样本权重分布:采用预定的规则更新样本权重分布,采用随机的方式更新样本权重分布,获取用户对样本权重分布的修改以更新样本权重分布,或者通过遗传算法对样本权重分布进行优化以更新样本权重分布。In some embodiments of the second aspect, the update unit is configured to update the sample weight distribution by at least one of the following: update the sample weight distribution by using a predetermined rule, update the sample weight distribution in a random manner, and obtain the user's weight on the sample The distribution is modified to update the sample weight distribution, or the sample weight distribution is optimized by the genetic algorithm to update the sample weight distribution.
在第二方面的一些实施例中,其中构建单元被配置为:从待处理数据集的目标待处理数据项中去除与目标待处理数据项的标签相关联的部分,以得到目标待处理数据项中的剩余部分;以及利用剩余部分来构建无关数据集中的一条无关数据项,一条无关数据项的标签对应于目标待处理数据项的标签。In some embodiments of the second aspect, wherein the construction unit is configured to: remove the part associated with the label of the target data item to be processed from the target data item to be processed in the data set to obtain the target data item to be processed and using the remaining part to construct an irrelevant data item in the irrelevant data set, the label of an irrelevant data item corresponds to the label of the target data item to be processed.
在第二方面的一些实施例中,其中待处理数据集为图像数据集,并且其中构建单元被配置为:对待处理数据集中的目标待处理数据项执行图像分割,以得到与目标待处理数据项对应的背景图像;以及利用背景图像来构建无关数据集中的一条无关数据项。In some embodiments of the second aspect, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to: perform image segmentation on the target data item to be processed in the data set to be processed to obtain the target data item to be processed a corresponding background image; and using the background image to construct an unrelated data item in the unrelated data set.
在第二方面的一些实施例中,其中待处理数据集中的待处理数据项为视频序列,并且其中构建单元被配置为:基于视频序列中一帧图像与一帧图像的前一帧图像之间的梯度信息,确定视频序列的二值图像;基于二值图像,生成视频序列的背景图像;以及利用视频序列的背景图像来构建无关数据集中的一条无关数据项。In some embodiments of the second aspect, wherein the data item to be processed in the data set to be processed is a video sequence, and wherein the construction unit is configured to: The gradient information of the video sequence is determined to determine the binary image of the video sequence; based on the binary image, the background image of the video sequence is generated; and an irrelevant data item in the irrelevant data set is constructed by using the background image of the video sequence.
在第二方面的一些实施例中,还包括:更新单元,被配置为:通过将目标无关数据项输入经训练的分类模型,得到CAM;以及通过将CAM与目标无关数据项叠加,得到叠加结果;以及显示单元,被配置为显示叠加结果。In some embodiments of the second aspect, further comprising: an update unit configured to: obtain a CAM by inputting target-independent data items into the trained classification model; and obtain an overlay result by superimposing the CAM and the target-independent data items ; and a display unit configured to display an overlay result.
In a third aspect, a computing device is provided, including a processor and a memory, the memory storing instructions executable by the processor. When executed by the processor, the instructions cause the computing device to: construct an irrelevant data set based on a data set to be processed, the irrelevant data set including labeled irrelevant data items whose labels are determined based on the labels of the data items to be processed in the data set to be processed; divide the irrelevant data set into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on the sample weights of the data items to be processed in the data set to be processed; train a classification model based on the first data set and the first sample weight distribution; and evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed with the sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: update part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is left unchanged.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: take the sample weight distribution for which the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to update the sample weight distribution by at least one of: updating the sample weight distribution according to a predetermined rule, updating the sample weight distribution randomly, obtaining a user's modification of the sample weight distribution, or optimizing the sample weight distribution with a genetic algorithm.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, so as to obtain the remaining part of the target data item; and construct one irrelevant data item in the irrelevant data set using the remaining part, the label of the irrelevant data item corresponding to the label of the target data item.
In some embodiments of the third aspect, the data set to be processed is an image data set, and the instructions, when executed by the processor, cause the computing device to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and construct one irrelevant data item in the irrelevant data set using the background image.
In some embodiments of the third aspect, the data items to be processed in the data set to be processed are video sequences, and the instructions, when executed by the processor, cause the computing device to: determine a binary image of a video sequence based on gradient information between a frame image in the video sequence and the frame image preceding it; generate a background image of the video sequence based on the binary image; and construct one irrelevant data item in the irrelevant data set using the background image of the video sequence.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; obtain an overlay result by superimposing the CAM on the target irrelevant data item; and display the overlay result.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the operations of the method according to the first aspect or any of its embodiments.
In a fifth aspect, a chip or chip system is provided. The chip or chip system includes a processing circuit configured to perform the operations of the method according to the first aspect or any of its embodiments.
In a sixth aspect, a computer program or computer program product is provided. The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method according to the first aspect or any of its embodiments.
Description of Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements, in which:
FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure;
FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure;
FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of a process 800 of constructing irrelevant data items according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a process 900 of updating the sample weight distribution of a data set to be processed according to an embodiment of the present disclosure; and
FIG. 10 shows a schematic block diagram of a data processing apparatus 1000 according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "including" and similar expressions should be interpreted as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be read as "at least one embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
Artificial intelligence (AI) uses computers to simulate certain human thought processes and intelligent behaviors. The history of AI research follows a natural and clear path from a focus on "reasoning" to a focus on "knowledge" and then to a focus on "learning". AI has been widely applied in industries such as security, healthcare, transportation, education, and finance.
Machine learning is a branch of AI that studies how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms through learning from experience.
Deep learning is a class of machine learning techniques based on deep neural network algorithms, whose main characteristic is the use of multiple nonlinear transformation structures to process and analyze data. It is mainly applied in perception and decision-making scenarios in the field of AI, such as image and speech recognition, natural language translation, and computer game playing.
Data and algorithms are the two main pillars of AI; correspondingly, data bias is a key concern in the AI field. For a specific machine learning task, the data may contain factors that are correlated with the task but have no causal relationship with it, such as sample imbalance or artificial markers in the data; such factors can be regarded as data bias.
Data set bias refers to spurious features in a data set that a machine learning model may learn. Taking an image data set as an example, the images may contain information related to the acquisition device model, acquisition parameters, and the like, which is irrelevant to the acquisition task. Due to such data acquisition defects, a machine learning model may make inferences based on this information and directly guess the classification result instead of learning the image features that are truly relevant to the target task.
When a machine learning model is trained on an image data set with data set bias, it may fail to learn the training task objectively and truthfully as expected. As a result, the learned model may struggle to complete the target task as expected in the actual use environment, suffering a serious performance drop; or, even if performance does not drop, the causes of its errors may be unacceptable and may even lead to ethical disputes. For example, a model that predicts lipstick may be almost unaffected when the mouth is masked out, which shows that the model has not actually learned mouth-related features. As another example, a medical image recognition model may infer the acquisition site from markers placed by doctors, thereby affecting its predictions.
One current solution is to crop out regions that may affect model learning, or, for image data, to adjust color, grayscale, and so on, so as to avoid the influence of such data bias on model training. However, it is difficult to enumerate all biases in this way, and the approach is labor-intensive, consuming considerable manpower and time.
In view of this, embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the influence of the bias can be determined effectively and the data set can then be adjusted accordingly, ensuring that the adjusted data set will not negatively affect a model through data bias.
FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 includes an input/output (I/O) module 110, a data set processing module 120, and a model training module 130. Optionally, as shown in FIG. 1, the system 100 may further include a model storage module 140 and a data storage module 150. The modules shown in FIG. 1 can communicate with each other.
The input/output module 110 can be used to acquire a data set to be processed, for example by receiving a data set to be processed input by a user.
Optionally, the data set to be processed may be stored in the data storage module 150. As an example, the data storage module 150 may be a data storage resource corresponding to an Object Storage Service (OBS) provided by a cloud service provider.
The data set to be processed includes a large number of data items to be processed, each of which has a label. In other words, the data set to be processed contains a plurality of labeled data items to be processed.
Labels may be annotated manually or obtained through machine learning or other means, which is not limited in the present disclosure. A label may also be called a task label, annotation information, or other names, which are not enumerated one by one herein.
In some examples, the annotation information may be annotated by annotators, based on their experience, for specific parts of the data items to be processed. Alternatively, the annotation information may be produced by an image recognition model and an annotation model.
For example, for image data containing human faces, labels such as gender, age, whether glasses are worn, whether a hat is worn, and face size may be annotated for the face part. For a medical image (such as an ultrasound image), the examined part may be annotated with whether a lesion is present.
It can be understood that a data item to be processed may include a part related to its label and a part unrelated to its label. Taking the above face image as an example, if the label concerns the face (for example, a bounding box marking the face position), then the face region in the image is the label-related part, while the regions outside the face region are the label-unrelated part. If the label concerns the eyes (for example, pupil color annotated as "black", "brown", and so on), then the eye region in the image is the label-related part, while the regions outside the eye region are the label-unrelated part.
The data items to be processed in the data set to be processed may be of any data type, such as images, videos, speech, or text. For ease of description, images are taken as an example below.
The embodiments of the present disclosure do not limit the source of the data items to be processed. Taking images as an example, they may be collected from open-source data sets, captured by different image acquisition devices, captured by the same image acquisition device at different times, image frames in a video sequence captured by an image acquisition device, any combination of the above, or others.
The input/output module 110 may be implemented as an input module and an output module independent of each other, or as a coupled module with both input and output functions. As an example, it may be implemented with a graphical user interface (GUI) or a command-line interface (CLI).
The data set processing module 120 can obtain the data set to be processed from the input/output module 110 or, optionally, from the data storage module 150. Further, the data set processing module 120 can construct an irrelevant data set based on the data set to be processed. The irrelevant data set includes labeled irrelevant data items, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
Optionally, the irrelevant data set may be stored in the data storage module 150.
As described above, a data item to be processed has a label and includes a label-related part and a label-unrelated part. The label-related part can therefore be removed from the data item to be processed, keeping only the label-unrelated part as an irrelevant data item, and the label of the irrelevant data item is the label of the data item to be processed. This process may also be called splitting, segmentation, separation, or other names, which is not limited in the present disclosure.
That is, for a certain data item to be processed in the data set to be processed (called the target data item to be processed), the part associated with its label can be removed from it to obtain the remaining part of the target data item. The remaining part is then used to construct one irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item to be processed.
For example, suppose the data item to be processed is a face image and the label indicates the face skin color, such as "white". Then the face region in the face image can be removed, and the part remaining after removal is taken as the corresponding irrelevant data item, which still carries the skin-color label "white".
In some implementations, if the data items to be processed in the data set to be processed are images, irrelevant data items can be obtained by image segmentation. The part of an image associated with the label is the foreground region, and the rest of the image is the background region; through foreground-background separation, irrelevant data items can be determined based on the background region alone.
Specifically, image segmentation is performed on a target data item to be processed (a target image) in the data set to be processed to obtain the background image corresponding to the target image, and the background image is then used to construct an irrelevant data item.
The embodiments of the present disclosure do not limit the specific algorithm used for image segmentation. For example, one or more of the following algorithms, or other algorithms, may be used: threshold-based image segmentation, region-based image segmentation, edge-detection-based image segmentation, image segmentation based on wavelet analysis and wavelet transform, genetic-algorithm-based image segmentation, active-contour-model-based image segmentation, and deep-learning-based image segmentation, where deep-learning-based image segmentation algorithms include but are not limited to: feature-encoder-based segmentation, regional-proposal-based segmentation, RNN-based segmentation, upsampling/deconvolution-based segmentation, segmentation based on increased feature resolution, feature-enhancement-based segmentation, and segmentation using conditional random fields (CRF)/Markov random fields (MRF).
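As an illustration of this foreground-background separation, the following minimal sketch (not part of the disclosure) assumes a binary foreground mask has already been produced by one of the segmentation algorithms above; NumPy is assumed, and all names are illustrative:

```python
import numpy as np

def build_irrelevant_item(image: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """image: HxWx3 array; fg_mask: HxW array, 1 on label-related foreground pixels."""
    bg_mask = (fg_mask == 0).astype(image.dtype)  # 1 on background pixels
    # Zero out the foreground so only the label-unrelated background remains.
    return image * bg_mask[..., None]
```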
In other implementations, the data items to be processed in the data set to be processed are video sequences. Different data items to be processed may have the same or different durations. For example, the first data item to be processed in the data set is a first video sequence of length m1 frames, including m1 frame images, and the second data item to be processed is a second video sequence of length m2 frames, including m2 frame images, where m1 and m2 may or may not be equal.
Specifically, video segmentation is performed on a target data item to be processed (a target video sequence) in the data set to be processed to obtain the background image corresponding to the target video sequence, and the background image is then used to construct an irrelevant data item.
The embodiments of the present disclosure do not limit the specific algorithm used for video segmentation. As one example, image segmentation may be performed on each frame image in the target video sequence, and the segmented background regions of the frames may be fused to obtain the background image corresponding to the target video sequence. As another example, the background image corresponding to the target video sequence may be obtained based on the gradients between adjacent frames. Specifically, a binary image corresponding to the video sequence may be obtained based on the gradient information of the video sequence, and the background image of the video sequence may then be generated based on the binary image, as described below in conjunction with FIG. 2.
FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure. The data set processing module 200 may serve as one implementation of the data set processing module 120 in FIG. 1 and may be used to determine an irrelevant data set based on a data set to be processed, where the data items to be processed are video sequences and the irrelevant data items in the irrelevant data set may be background images corresponding to the video sequences.
As shown in FIG. 2, the data set processing module 200 may include a gradient computation submodule 210, a gradient superposition submodule 220, a thresholding submodule 230, a morphological processing submodule 240, and a separation submodule 250.
The gradient computation submodule 210 can be used to compute gradient information between a frame image in the target video sequence and the frame image preceding it.
For example, suppose the target video sequence includes m1 frame images, namely frame 0, frame 1, ..., frame m1-1. The gradient information between every two adjacent frames can then be computed, specifically between frame 1 and frame 0, between frame 2 and frame 1, ..., and between frame m1-1 and frame m1-2.
The embodiments of the present disclosure do not limit the specific way of computing the gradient information; for example, a frame difference may be computed. As another example, the gradient of the feature vectors of two frames along a specific dimension (such as the time dimension T) may be computed, which makes it possible to extract fixed, unchanging background parts (such as image borders) from the video sequence via motion information. As yet another example, the difference between an image and its grayscale version may be computed so as to extract the colored parts of the video frame, which avoids treating colored marks as foreground, for example colored marks or text added after video capture.
The gradient superposition submodule 220 can be used to superimpose the gradient information obtained by the gradient computation submodule 210 to obtain a gradient superposition map.
The superposition performed by the gradient superposition submodule 220 may include, but is not limited to, weighted summation (such as averaging), taking the maximum, taking the minimum, or others.
The thresholding submodule 230 can be used to threshold the gradient superposition map obtained by the gradient superposition submodule 220 to obtain an initial binary image.
Specifically, for each pixel in the gradient superposition map, pixels whose value is greater than a threshold are marked as 1, and pixels whose value is less than or equal to the threshold are marked as 0, yielding an initial binary image in which every pixel value is either 1 or 0.
The morphological processing submodule 240 can perform morphological processing on the initial binary image obtained by the thresholding submodule 230 to obtain the binary image corresponding to the video sequence.
For example, if a pixel in the initial binary image has a value of 1 but all of its neighboring pixels have a value of 0, the value of that pixel can be reset to 0.
Exemplarily, the morphological processing may include, but is not limited to, morphological dilation, morphological erosion, and the like. For example, the morphological processing submodule 240 may apply several iterations of morphological dilation to the initial binary image obtained by the thresholding submodule 230 and then apply the same number of iterations of morphological erosion, thereby obtaining the binary image.
The separation submodule 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing submodule 240.
Exemplarily, a matting operation may be performed using the binary image to obtain the background image, for example by element-wise matrix multiplication.
In this way, the similarity of the background across the frame images of a video sequence can be fully exploited to obtain the background image corresponding to the video sequence.
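Assuming OpenCV and NumPy are available and the frames are grayscale, the cooperation of submodules 210 through 250 can be sketched as follows (a condensed illustration, not the disclosure's implementation; the threshold and iteration counts are arbitrary example values):

```python
import cv2
import numpy as np

def video_background(frames, thresh=10, morph_iters=2):
    """frames: list of HxW uint8 grayscale images from one video sequence."""
    # Gradient computation (submodule 210): frame difference with the previous frame.
    grads = [cv2.absdiff(frames[i], frames[i - 1]).astype(np.float32)
             for i in range(1, len(frames))]
    # Gradient superposition (submodule 220): averaging; max or min would also work.
    grad_map = np.mean(grads, axis=0)
    # Thresholding (submodule 230): values above the threshold are marked 1.
    initial = (grad_map > thresh).astype(np.uint8)
    # Morphological processing (submodule 240): dilate, then erode the same number of times.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(initial, kernel, iterations=morph_iters)
    mask = cv2.erode(mask, kernel, iterations=morph_iters)
    # Separation (submodule 250): element-wise product keeps only pixels that never moved.
    return frames[0] * (1 - mask)
```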
Thus, in the embodiments of the present disclosure, the background image is taken as a proxy for bias, so that the data set can be checked for bias. Understandably, if the data set is unbiased, the features of the background images should have no relationship whatsoever with the labels associated with the foreground regions.
Suppose the data set to be processed includes N data items to be processed and the irrelevant data set includes N1 irrelevant data items. If processing is performed on every data item to be processed to obtain a corresponding irrelevant data item, then N1 = N. If processing is performed on only some of the data items, then N1 < N. Understandably, processing all data items to be processed yields an irrelevant data set with more irrelevant data items, which in turn allows a more complete and comprehensive analysis and evaluation of the data set to be processed.
In one implementation, the constructed irrelevant data set can be divided into two parts: a first part of irrelevant data items and a second part of irrelevant data items, where the first part can be used to train the model and the second part can be used to test the model. The embodiments of the present disclosure do not limit the division method; as one example, the irrelevant data set may be divided into the first part and the second part at a ratio of 9:1, 1:1, or another ratio.
Exemplarily, the set of the first part of irrelevant data items may be called the irrelevant training set, and the set of the second part may be called the irrelevant test set. Alternatively, the set of the first part of irrelevant data items may include an irrelevant training set and an irrelevant validation set. As one example, the irrelevant data set may be divided into an irrelevant training set, an irrelevant validation set, and an irrelevant test set at a ratio of 7:2:1.
To simplify the description, the set of the first part of irrelevant data items is hereinafter called the first data set (or training set), and the set of the second part is called the second data set (or test set).
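A minimal sketch of such a split (the ratio is one of the example ratios above; the helper name is illustrative):

```python
import random

def split_irrelevant_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle the irrelevant data items and split them into first/second parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]  # (first data set, second data set)
```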
In some embodiments, the data set processing module 120 may first preprocess the data set to be processed and then construct the irrelevant data set based on the preprocessed data set. Preprocessing includes, but is not limited to, cluster analysis and data denoising.
The model training module 130 may include a training submodule 132 and an evaluation submodule 134.
The training submodule 132 can be used to train the classification model. Specifically, the classification model can be trained based on the first part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in that part.
In one implementation, the first part of irrelevant data items used for training may be the entire irrelevant data set, so that more data items participate in the training and the trained classification model is more robust. In another implementation, the first part of irrelevant data items used for training may be a portion of the irrelevant data set; as described above, the irrelevant data set is divided into a first part of irrelevant data items and a second part of irrelevant data items.
For ease of description below, the set of the first part of irrelevant data items used for training is called the training set, and correspondingly, the irrelevant data items in the first part may be training items.
It should be noted that the training here may mean training an initial classification model or updating a previously trained classification model, where the initial classification model may be an untrained classification model and a previously trained classification model may be obtained by training the initial classification model. As an example, the training submodule 132 may obtain the initial classification model or a previously trained classification model from the model storage module 140.
The training submodule 132 may obtain, from the data set processing module 120 or the data storage module 150, the first part of irrelevant data items in the irrelevant data set used for training together with the label of each irrelevant data item in that part. Alternatively, the training submodule 132 may obtain the first part of irrelevant data items from the data set processing module 120 and obtain the label of each of those irrelevant data items from the input/output module 110.
Optionally, before training on the training set (the first part of irrelevant data items in the irrelevant data set), the training submodule 132 may preprocess the training set, including but not limited to feature extraction, cluster analysis, edge detection, and image denoising. For example, a training data item after feature extraction may be represented as an S-dimensional feature vector, where S is greater than 1.
It can be understood that the embodiments of the present disclosure do not limit the model structure of the classification model. As one example, the classification model may be a convolutional neural network (CNN) model, which may optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully connected layers, an output layer, and so on.
The classification model includes a large number of parameters, which may represent the weights of computation formulas or computation factors in the model and which can be updated iteratively through training. The classification model also includes hyper-parameters, which guide the construction or training of the model, such as the number of training iterations, the learning rate, the batch size, the number of layers of the model, and the number of neurons per layer. A hyper-parameter may be a parameter obtained by training the model on the training set or a preset parameter, where a preset parameter is one that is not updated by training the model.
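As a purely illustrative sketch of such a CNN classifier (the layer counts and sizes below are arbitrary assumptions, not taken from the disclosure; PyTorch is assumed):

```python
import torch.nn as nn

class BackgroundClassifier(nn.Module):
    """A toy CNN with convolutional, pooling, and fully connected layers."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool feature maps down to 1x1
        )
        self.fc = nn.Linear(32, num_classes)  # fully connected output layer

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```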
Exemplarily, the process by which the training submodule 132 trains the classification model may follow existing training procedures. As a schematic description, the training process may be: the training data items in the training set are input into the classification model; with the labels corresponding to the training data as reference, a loss function is used to obtain the loss value between the output of the classification model and the corresponding labels, and the parameters of the classification model are adjusted according to the loss value. The classification model is trained iteratively with each training data item in the training set, and its parameters are continuously adjusted until the classification model can, with high accuracy, produce outputs close to the labels corresponding to the input training data items, for example until the loss function reaches a minimum or falls below a reference threshold.
The loss function in the training process is a function used to measure how well the classification model has been trained (that is, to compute the difference between the result predicted by the classification model and the true value). In training the classification model, because the output of the classification model is expected to be as close as possible to the true value (that is, the corresponding label), the predicted value of the current classification model can be compared with the true value, and the parameters of the classification model can then be updated according to the difference between the two. In each training iteration, the loss function is used to judge the difference between the value predicted by the current classification model and the true value and the parameters are updated, until the classification model can predict values very close to the true values, at which point the classification model is considered trained.
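Under the assumption that PyTorch is used and that a DataLoader `train_loader` yields batches of (irrelevant data item, label) pairs, this loop can be sketched as follows (an illustration, not the disclosure's implementation):

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                  # the loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for items, labels in train_loader:
            logits = model(items)                      # model prediction
            loss = criterion(logits, labels)           # gap to the true labels
            optimizer.zero_grad()
            loss.backward()                            # gradients of the loss
            optimizer.step()                           # parameter adjustment
    return model
```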
The "classification model" in the embodiments of the present disclosure may also be called a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may be referred to simply as a "model"; the present disclosure is not limited in this regard. Optionally, the trained classification model may be stored in the model storage module 140. In some examples, the model storage module 140 may be part of the model training module 130.
The evaluation submodule 134 can be used to evaluate the classification model. Specifically, an evaluation result for the trained classification model can be determined based on the second part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in that part. The evaluation result can be used to characterize the significance of the data bias of the data set to be processed.
As described above, the set of the second part of irrelevant data items may be the test set, and correspondingly, the irrelevant data items in the second part may be test data items.
As an example, the evaluation process may include: inputting a test data item into the trained classification model to obtain a prediction result for that test data item, and determining the evaluation result based on a comparison of the prediction result with the label of the test data item.
In the embodiments of the present disclosure, the evaluation result may include at least one of the following: precision, accuracy, recall, F1 score, a precision-recall (P-R) curve, an average precision (AP) metric, false alarm rate, missed detection rate, and so on.
Specifically, a confusion matrix can be constructed, showing the numbers of positive and negative examples, the ground-truth values, the predicted values, and so on.
Accuracy refers to the proportion of correctly classified samples among all samples. For example, if the number of test data items in the test set is N2 and the number whose prediction result agrees with the label is N21, then the accuracy can be expressed as N21/N2.
Precision refers to the proportion of samples predicted as positive that are actually positive. For example, if the number of test data items in the test set is N2, the number predicted as positive is N22, and the number of those N22 test data items labeled as positive is N23, then the precision can be expressed as N23/N22.
Recall refers to the proportion of actually positive samples that are predicted as positive. For example, if the number of test data items in the test set is N2 and the number labeled as positive is N31, and if N32 of those N31 positive examples are also predicted as positive, then the recall can be expressed as N32/N31.
The P-R curve takes recall as the horizontal axis and precision as the vertical axis. A point on the P-R curve represents: at a certain threshold, the model judges results greater than the threshold as positive samples and results less than the threshold as negative samples, and the point gives the corresponding recall and precision. The whole P-R curve is generated by sweeping the threshold from high to low; the part near the origin represents the precision and recall of the model when the threshold is at its maximum.
The F1 score is the harmonic mean of precision and recall. For example, twice the product of precision and recall, divided by the sum of precision and recall, can be taken as the F1 score.
In some embodiments of the present disclosure, the evaluation result may include positive-example characterization values, such as a first precision and/or a first recall, where the first precision is the proportion of samples predicted as positive that are actually positive and the first recall is the proportion of actually positive samples that are predicted as positive. The evaluation result may include negative-example characterization values, such as a second precision and/or a second recall, where the second precision is the proportion of samples predicted as negative that are actually negative and the second recall is the proportion of actually negative samples that are predicted as negative.
In some embodiments of the present disclosure, the evaluation result may include a first predicted mean and/or a second predicted mean. The first predicted mean is the average of the predicted values for actually positive samples, and the second predicted mean is the average of the predicted values for actually negative samples. The evaluation result may include a mean difference representing the difference between the first and second predicted means, expressed for example as the difference between, or the ratio of, the first predicted mean and the second predicted mean.
It should be understood that the above are only some examples of evaluation results; other characterizations may also serve as evaluation results, which the present disclosure does not enumerate one by one.
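As an illustration, several of the quantities listed above can be computed from test-set predictions roughly as follows (NumPy assumed; the 0.5 decision threshold and the function name are example choices):

```python
import numpy as np

def evaluate(scores, labels, thresh=0.5):
    """scores: predicted positive-class probabilities; labels: 0/1 ground truth."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    preds = (scores > thresh).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    accuracy = np.mean(preds == labels)                 # N21 / N2
    precision = tp / (tp + fp) if tp + fp else 0.0      # N23 / N22
    recall = tp / (tp + fn) if tp + fn else 0.0         # N32 / N31
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Predicted means over actually-positive / actually-negative samples.
    mean_pos = scores[labels == 1].mean() if (labels == 1).any() else 0.0
    mean_neg = scores[labels == 0].mean() if (labels == 0).any() else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mean_difference": mean_pos - mean_neg}
```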
Exemplarily, the evaluation result can be presented to the user by the input/output module 110, for example through a graphical user interface, for easy viewing.
In this way, by means of the embodiments of the present disclosure, the bias significance of a data set can be characterized in a quantified form. Such a quantitative evaluation scheme provides users with a clear reference, facilitating processing such as adjusting the data set.
In scenarios where the input/output module 110 includes a graphical user interface, the input/output module 110 can also visually present a characterization of the data set bias through the graphical user interface.
具体的,通过将目标无关数据项输入经训练的分类模型,得到类激活图(Class Activation Map,CAM)。随后通过将CAM与目标无关数据项叠加而得到叠加结果,并显示该叠加结果。Specifically, by inputting target-independent data items into the trained classification model, a Class Activation Map (CAM) is obtained. Then an overlay result is obtained by overlaying the CAM and the target-independent data item, and the overlay result is displayed.
类激活图即类激活热力图,这样,本公开的实施例能够通过CAM表征分类模型的关注区域,具体的,是哪些区域(即模型的关注区域)导致了偏见。The class activation map is the class activation heat map. In this way, the embodiments of the present disclosure can use the CAM to characterize the attention areas of the classification model, specifically, which areas (ie, the attention areas of the model) cause bias.
本公开的实施例对得到CAM的具体方式不作限定。作为一例,可以采用基于梯度的类激活图方法(Gradient-based CAM,Grad-CAM)得到CAM。例如,可以提取分类模型的最 后一个卷积层的输出,即最后一层特征图,将提取出的最后一层特征图加权求和,得到CAM。可选地,也可以将加权求和后的结果再经线性整流单元(Rectified Linear Unit,ReLU)激活函数的处理后,作为CAM。这里进行加权求和的权重可以是顶层全连接层的权值。作为一例,可以计算分类模型的最后一层柔性最大值(Softmax)的输出对最后一层特征图所有像素的偏导数,再取宽高维度上的全局平均,作为对应的权重。The embodiments of the present disclosure do not limit the specific manner of obtaining the CAM. As an example, CAM can be obtained by using Gradient-based CAM (Grad-CAM). For example, the output of the last convolutional layer of the classification model, that is, the feature map of the last layer, can be extracted, and the extracted feature maps of the last layer can be weighted and summed to obtain CAM. Optionally, the weighted and summed results can also be used as a CAM after being processed by a Rectified Linear Unit (ReLU) activation function. The weights for weighted summation here can be the weights of the top fully connected layer. As an example, the partial derivative of the output of the last layer of softmax (Softmax) of the classification model to all pixels of the last layer feature map can be calculated, and then the global average in the width and height dimensions can be taken as the corresponding weight.
The embodiments of the present disclosure do not limit the manner in which the CAM is superimposed on the target-independent data item (e.g., a background image); for example, the superposition may be performed as a weighted sum. As an example, the weights of the CAM and the background image may be equal.
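Continuing the sketch above, the CAM could be blended with the background image by a weighted sum; the 0.5/0.5 default below follows the equal-weight example in the text, and the sketch assumes a grayscale background image of the same size as the CAM.

```python
import numpy as np

def overlay(cam, background, alpha=0.5):
    """Blend a [0, 1] CAM of shape (H, W) with a uint8 grayscale
    background image of the same shape; alpha=0.5 gives equal weights."""
    background = background.astype(float) / 255.0
    blended = alpha * cam + (1.0 - alpha) * background
    return (blended * 255).astype(np.uint8)
```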
In this way, the embodiments of the present disclosure provide a solution for quantitatively evaluating and visually presenting data set bias, so that the significance of the bias can be clearly characterized and the specific locations that give rise to the bias can be presented visually. The user can thus understand the bias of the data set more intuitively and comprehensively. The solution requires little user involvement, can be carried out automatically, and improves processing efficiency while maintaining the accuracy of the quantitative bias evaluation.
The model training module 130 may also be used to adjust the data set to be processed based on the classification model.
Specifically, the data set to be processed may have an initial sample weight distribution; correspondingly, the first data set has a first sample weight distribution and the second data set has a second sample weight distribution. For example, if the initial sample weight of a target data item to be processed is a, then the sample weight of the irrelevant data item generated from that target data item is also a.
Exemplarily, the model training module 130 may be used to obtain a recommended sample weight distribution based on iterative training of the classification model, as described below in conjunction with FIG. 3.
FIG. 3 shows a schematic diagram of a process 300 by which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure.
At 310, a first data set having a first sample weight distribution and a second data set having a second sample weight distribution are determined.
Specifically, an irrelevant data set may be constructed based on the data set to be processed, and the irrelevant data set may be divided into the first data set and the second data set, as described in the embodiments above.
Exemplarily, the data items to be processed in the data set to be processed may have initial sample weights; that is, the data set to be processed may have an initial sample weight distribution. As an example, the initial sample weights may be input by the user through the input/output module 110. As another example, the initial sample weights may be determined through an initialization process.
The sample weight may be used to indicate the sampling probability of a data item to be processed. For example, if the sample weight of the i-th data item to be processed is $w_i$, then the sampling probability of the i-th data item to be processed is

$$p_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$

As an example, the initial sample weight distribution may indicate that the sampling probabilities of the data items to be processed in the data set are equal. Assuming that the data set to be processed includes N data items to be processed, and the initial sample weight of each data item is 1, then the sampling probability of each data item is initialized to 1/N.
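For illustration only, a brief sketch of drawing item indices according to the weight-derived sampling probabilities above; the helper name is illustrative.

```python
import numpy as np

def weighted_sample(weights, num_samples, rng=None):
    """Draw item indices with probability proportional to sample weight."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # p_i = w_i / sum_j w_j
    return rng.choice(len(w), size=num_samples, p=p)

# With all weights initialized to 1, every item is drawn with probability 1/N.
print(weighted_sample([1, 1, 1, 1], 8))
```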
It can be understood that, when the initial sample weight distribution is determined, the first sample weight distribution and the second sample weight distribution may be determined accordingly.
At 320, the first data set is sampled based on the first sample weight distribution, and the classification model is trained iteratively.
At 330, the classification model trained at 320 is evaluated based on the second data set to obtain an evaluation result.
Exemplarily, the evaluation result may be obtained by comparing the prediction results of the trained classification model for the irrelevant data items in the second data set with the labels of those irrelevant data items. As an example, an irrelevant data item may be input into the trained classification model to obtain a prediction result for that item, and the evaluation result may be determined based on the comparison between the prediction result and the item's label. The evaluation result may include at least one of the following: accuracy, precision, recall, F1 score, precision-recall curve, average precision, false positive rate, false negative rate, and the like. For details of the evaluation result, reference may be made to the description above, which is not repeated here.
At 340, it is determined whether the bias significance indicated by the evaluation result is high.
If it is determined at 340 that the evaluation result indicates high bias significance, for example, the evaluation result is greater than a preset threshold, the process may proceed to 350. Otherwise, if it is determined at 340 that the evaluation result indicates that the bias significance is not high, for example, the evaluation result is not greater than the preset threshold, the process may proceed to 360.
The preset threshold may be set based on the processing accuracy required for the data set to be processed, the application scenario, and the like. The preset threshold may be related to the specific meaning of the evaluation result; for example, if the evaluation result includes an accuracy, the preset threshold may be set to, e.g., 30%, 50%, or another value.
At 350, the sample weight distribution is updated.
Referring to FIG. 3, as shown by the dashed arrows in FIG. 3, after 350, the process may return to 310 or 320 to continue.
In one example, the process may return to 310, i.e., the first data set and the second data set are reconstructed. In this case, an irrelevant data item that belonged to the first data set in the previous iteration may belong to either the first data set or the second data set in the next iteration.
In another example, the process may return to 320, i.e., the irrelevant data items in the first data set and the second data set remain unchanged, but the first sample weight distribution and/or the second sample weight distribution are updated.
After 350, the first data set may be re-sampled based on the updated first sample weight distribution, and the classification model may be trained iteratively again. The retrained classification model is then evaluated based on the second data set to obtain a new evaluation result.
In this way, 310 to 350 or 320 to 350 may be executed iteratively until the evaluation result indicates that the bias significance is not high (for example, the evaluation result is not greater than the preset threshold).
The embodiments of the present disclosure do not limit the specific implementation of updating the sample weight distribution.
As an example, the sample weight distribution may be updated in a random manner. For instance, the sample weights of some data items to be processed may be updated randomly, e.g., the sample weight of one data item is updated from 1 to 2 and that of another from 1 to 3, and so on. It can be understood that the random approach is nondeterministic and may make the process of obtaining the recommended sample weight distribution take a long time.
As another example, a predetermined rule may be used to update the sample weight distribution. For example, the second sample weight distribution may be updated: if the evaluation result indicates that the prediction result of the classification model for an irrelevant data item in the second data set differs from the label of that item, the sample weight of that item may be increased, e.g., updated from a1 to a1+1, 2*a1, or the like. In this example, the first sample weight distribution may remain unchanged, or it may be updated in some other manner. Optionally, in this example, after the sample weight distribution is updated, the second data set may be exchanged with the first data set before the next iteration. For example, in the next iteration, the classification model is trained based on the second data set of the previous iteration and the updated second sample weight distribution.
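For illustration only, one possible reading of the predetermined rule above, sketched in Python; doubling is just one of the alternatives mentioned in the text (a1+1, 2*a1, etc.), and the function name is illustrative.

```python
def update_weights_by_rule(weights, predictions, labels, factor=2.0):
    """Raise the sample weight of each second-set item whose prediction
    differs from its label, leaving correctly predicted items unchanged."""
    return [w * factor if pred != label else w
            for w, pred, label in zip(weights, predictions, labels)]

# An item predicted incorrectly has its weight doubled: 1 -> 2.0.
print(update_weights_by_rule([1, 1], [0, 1], [1, 1]))  # [2.0, 1]
```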
As another example, the sample weight distribution may be optimized by a genetic algorithm in order to update it. For instance, the sample weight distribution may be used as the initial gene of the genetic algorithm, and an objective function may be constructed based on the evaluation result obtained at 330; the genetic algorithm can then be used to optimize the sample weight distribution, and the optimized sample weight distribution is the updated sample weight distribution. The embodiments of the present disclosure do not limit the specific construction of the objective function of the genetic algorithm; for example, if the evaluation result includes the mean difference between positive and negative samples as well as the accuracy, the sum of the mean difference and the accuracy may be used as the objective function. It can be understood that the objective function may also be constructed in other ways, which are not enumerated here.
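For illustration only, a heavily simplified genetic-algorithm loop in the spirit of the paragraph above. The fitness function in the usage line is a stand-in: in the disclosure it would be built from the evaluation result (e.g., mean difference plus accuracy), which would require retraining and re-evaluating the classifier for each candidate weight vector. All names and hyperparameters are illustrative.

```python
import numpy as np

def genetic_weight_search(init_weights, fitness, generations=20,
                          pop_size=16, mutation_scale=0.2, seed=0):
    """Minimize `fitness` (lower = less significant bias) over
    sample-weight vectors, starting from `init_weights` as the gene."""
    rng = np.random.default_rng(seed)
    # Initial population: mutated copies of the current weight distribution.
    pop = np.abs(init_weights + rng.normal(0, mutation_scale,
                                           (pop_size, len(init_weights))))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Selection: keep the better half as parents.
        parents = pop[np.argsort(scores)[: pop_size // 2]]
        # Mutation: perturb the parents to create children.
        children = parents + rng.normal(0, mutation_scale, parents.shape)
        pop = np.abs(np.vstack([parents, children]))  # keep weights >= 0
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

# Stand-in fitness for demonstration; real use would score the retrained
# classifier (e.g., mean difference + accuracy on the second data set).
best = genetic_weight_search(np.ones(4), lambda w: abs(w.sum() - 6.0))
print(best.round(2))
```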
In this way, the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, thereby obtaining the recommended sample weight distribution. This process requires no user involvement and is highly automated.
As another example, a user's modification to the sample weight distribution may be obtained in order to update it. For instance, referring to the evaluation result and/or the displayed superposition result (as described above), the user may infer from experience what modification should be made to the sample weight distribution, and then input that modification through the input/output module 110 to update the sample weight distribution.
In this way, user needs can be fully taken into account and the sample weight distribution updated based on the user's modification, so that the resulting recommended sample weight distribution better matches the user's expectations and improves user satisfaction.
At 360, the recommended sample weight distribution is obtained.
If it is determined at 340 that the evaluation result indicates that the bias significance is not high, for example, the evaluation result is not greater than the preset threshold, the sample weight distribution under which the current evaluation result was obtained may be used as the recommended sample weight distribution.
In this way, the embodiments of the present disclosure can update the sample weight distribution based on iterative training of the classification model, and the change of the data set bias as the sample weight distribution is updated can be observed. The data set to be processed can thus be examined iteratively, yielding an effective recommended sample weight distribution with high reference value.
The input/output module 110 may also present the recommended sample weight distribution for the user as a reference for further adjusting the data set to be processed, for example, by visually presenting the recommended sample weight distribution through a graphical user interface.
Exemplarily, the data set processing module 120 may add data items to or delete data items from the data set to be processed based on the obtained recommended sample weight distribution, so as to construct an unbiased data set.
As an example, the data set processing module 120 may duplicate data items with large recommended sample weights to expand the number of data items in the data set to be processed. The data set processing module 120 may delete data items with small recommended sample weights to reduce the number of data items in the data set to be processed.
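For illustration only, a sketch of the duplicate/delete adjustment described above; the thresholds are illustrative, as the disclosure does not fix specific cut-offs.

```python
def adjust_dataset(items, weights, high=2.0, low=0.5):
    """Duplicate items whose recommended weight is high and drop items
    whose recommended weight is low; thresholds are illustrative."""
    adjusted = []
    for item, w in zip(items, weights):
        if w >= high:
            adjusted.extend([item, item])  # duplicate to boost its share
        elif w > low:
            adjusted.append(item)          # keep as-is
        # w <= low: delete (skip)
    return adjusted

print(adjust_dataset(["a", "b", "c"], [2.5, 1.0, 0.2]))  # ['a', 'a', 'b']
```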
As an example, a user's deletion instruction for some of the data items to be processed may be obtained via the input/output module 110, so that those data items are deleted. Other data items input by the user may also be obtained via the input/output module 110 and added to the current data set to be processed.
For example, the user may add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution. For instance, the user may find other samples similar to data items with large recommended sample weights and add them to the data set as new data items, thereby supplementing the data set. As an example, similar samples may be other images collected by the same image acquisition device (or one of the same model) in a similar environment (e.g., under similar lighting conditions).
In this way, in the embodiments of the present disclosure, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, the unbiased data set can be used to train a more robust, unbiased task-specific model.
It can be understood that the system 100 shown in FIG. 1 may be a system capable of interacting with users, and the system 100 may be a software system, a hardware system, or a system combining software and hardware.
In some examples, the system 100 may be implemented as a computing device or a part of a computing device, where the computing device includes, but is not limited to, a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
It can be understood that the system 100 shown in FIG. 1 may be implemented as an artificial intelligence platform (AI platform). An AI platform provides AI developers and users with a convenient AI development environment and convenient development tools. Various AI models or AI sub-models for solving different problems may be built into the AI platform, and the AI platform can establish a suitable AI model according to the requirements input by the user. That is, the user only needs to specify the requirements on the AI platform and, following the prompts, prepare and upload a data set; the AI platform can then train an AI model that meets the user's needs. The AI model in the embodiments of the present disclosure can be used to evaluate the data bias of a data set to be processed that is input by the user.
FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure. In the scenario 400, the system 100 is deployed entirely in the cloud environment 410.
The cloud environment 410 is an entity that uses basic resources to provide cloud services to users under the cloud computing model. The cloud environment 410 includes a cloud data center 412 and a cloud service platform 414. The cloud data center 412 includes a large number of basic resources owned by the cloud service provider (including computing resources, storage resources, and network resources); the computing resources included in the cloud data center 412 may be a large number of computing devices (e.g., servers). The system 100 may be deployed independently on a server or virtual machine in the cloud data center 412, or may be deployed in a distributed manner on multiple servers in the cloud data center 412, on multiple virtual machines in the cloud data center 412, or on both servers and virtual machines in the cloud data center 412.
As shown in FIG. 4, the system 100 may be abstracted by the cloud service provider on the cloud service platform 414 into an AI development cloud service 424 and provided to users. After a user purchases this cloud service on the cloud service platform 414 (the account may be pre-charged and then settled according to the final resource usage), the cloud environment 410 uses the system 100 deployed in the cloud data center 412 to provide the AI development cloud service 424 to the user. When using the AI development cloud service 424, the user may upload the data set to be processed through an application program interface (API) or a GUI. The system 100 in the cloud environment 410 receives the data set to be processed uploaded by the user and may perform operations such as data set processing, model training, and data set adjustment. The system 100 may return the evaluation result of the model, the recommended sample weight distribution, and the like to the user through the API or the GUI.
In another embodiment of the present application, when the system 100 in the cloud environment 410 is abstracted into the AI development cloud service 424 and provided to users, it may be divided into two parts, for example, a data set bias evaluation cloud service and a data set adjustment cloud service. A user may purchase only the data set bias evaluation cloud service on the cloud service platform 414; in that case, the cloud service platform 414 may construct an irrelevant data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return the evaluation result of the classification model to the user, so that the user learns the bias significance of the data set to be processed. The user may also further purchase the data set adjustment cloud service on the cloud service platform 414; in that case, the cloud service platform 414 may iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user may refer to the recommended sample weight distribution to add data items to or delete data items from the data set to be processed, so as to construct an unbiased data set.
FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure. In the scenario 500, the system 100 is deployed in a distributed manner across different environments, which may include, but are not limited to, at least two of a cloud environment 510, an edge environment 520, and a terminal computing device 530.
The system 100 may be logically divided into multiple parts, each with a different function. For example, as shown in FIG. 1, the system 100 includes the input/output module 110, the data set processing module 120, the model training module 130, the model storage module 140, and the data storage module 150. The parts of the system 100 may be deployed in any two or three of the terminal computing device 530, the edge environment 520, and the cloud environment 510. The parts of the system 100 deployed in different environments cooperate to provide users with various functions. For example, in one scenario, the input/output module 110 and the data storage module 150 of the system 100 are deployed on the terminal computing device 530, the data set processing module 120 of the system 100 is deployed on an edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510. The user sends the data set to be processed to the input/output module 110 on the terminal computing device 530, and the terminal computing device 530 stores the data set to be processed in the data storage module 150. The data set processing module 120 on the edge computing device of the edge environment 520 constructs an irrelevant data set based on the data set to be processed from the terminal computing device 530. The model training module 130 in the cloud environment 510 trains the classification model based on the irrelevant data set from the edge environment 520. The cloud environment 510 may also store the trained classification model in the model storage module 140. It should be understood that the present application does not limit which parts of the system 100 are deployed in which environments; in actual applications, the deployment may be adapted according to the computing capability of the terminal computing device 530, the resource occupancy of the edge environment 520 and the cloud environment 510, or specific application requirements.
The edge environment 520 is an environment that includes a collection of edge computing devices relatively close to the terminal computing device 530; edge computing devices include, but are not limited to, edge servers, edge stations with computing capability, and the like. It can be understood that the system 100 may also be deployed independently on one edge server in the edge environment 520, or deployed in a distributed manner on multiple edge servers in the edge environment 520.
The terminal computing device 530 includes, but is not limited to, a terminal server, a smartphone, a notebook computer, a tablet computer, a personal desktop computer, a smart camera, and the like. It can be understood that the system 100 may also be deployed independently on one terminal computing device 530, or deployed in a distributed manner on multiple terminal computing devices 530.
FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure. The computing device 600 in FIG. 6 may be implemented as a device in the cloud environment 510 in FIG. 5, a device in the edge environment 520, or the terminal computing device 530. It should be understood that the computing device 600 shown in FIG. 6 may also be regarded as a computing device cluster, i.e., the computing device 600 includes one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and terminal computing devices 530.
As shown in FIG. 6, the computing device 600 includes a memory 610, a processor 620, a communication interface 630, and a bus 640, where the bus 640 is used for communication between the components of the computing device 600.
The memory 610 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory, or any combination thereof. The memory 610 may store programs; when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that the modules of the system 100 described above can perform. It should be understood that the processor 620 and the communication interface 630 may also be used to execute some or all of the content of the data processing method embodiments described below in this specification. The memory may also store data sets and classification models. For example, a part of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as the data set to be processed and the irrelevant data set, and another part of the storage resources in the memory 610 is divided into a model storage module for storing the classification model.
The processor 620 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU), or any combination thereof. The processor 620 may include one or more chips. The processor 620 may include an accelerator, such as a neural processing unit (Neural Processing Unit, NPU).
The communication interface 630 uses a transceiver module, such as a transceiver, to implement communication between the computing device 600 and other devices or communication networks. For example, data may be acquired through the communication interface 630.
The bus 640 may include a path for transferring information between the components of the computing device 600 (e.g., the memory 610, the processor 620, and the communication interface 630).
FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure. The method 700 shown in FIG. 7 may be executed by the system 100.
As shown in FIG. 7, at block 710, an irrelevant data set is constructed based on the data set to be processed. The irrelevant data set includes irrelevant data items with labels, and the label of an irrelevant data item is determined based on the label of a data item to be processed in the data set to be processed.
Exemplarily, the data set to be processed includes a plurality of data items to be processed, each having a label. A data item to be processed may include a part related to the label and a part unrelated to the label.
In some embodiments, the part associated with the label of a target data item to be processed may be removed from that target data item in the data set to be processed, to obtain the remaining part of the target data item. The remaining part is used to construct one irrelevant data item in the irrelevant data set, and the label of that irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, that is, the data items to be processed are images. In that case, image segmentation may be performed on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item, and the background image is used to construct one irrelevant data item in the irrelevant data set.
Specifically, the part of the image associated with the label is the foreground region, and the regions of the image other than the foreground region form the background region; through foreground-background separation, the irrelevant data item can be determined based on the background region only.
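For illustration only, a sketch of removing the label-associated foreground from an image given a segmentation mask; how the mask itself is produced (the image-segmentation step) is model-specific and omitted, and the function name is illustrative.

```python
import numpy as np

def background_only(image, foreground_mask):
    """Zero out the foreground region, keeping only the background.

    image           -- (H, W, C) image array
    foreground_mask -- (H, W) boolean array, True where the labeled
                       object (the foreground) is located
    """
    background = image.copy()
    background[foreground_mask] = 0  # blank out the label-related region
    return background
```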
In some embodiments, the data items to be processed in the data set to be processed are video sequences. In that case, a binary image of a video sequence may be determined based on the gradient information between each frame of the video sequence and its preceding frame. A background image of the video sequence is generated based on the binary image, and the background image of the video sequence is then used to construct one irrelevant data item in the irrelevant data set.
FIG. 8 shows a schematic flowchart of a process 800 of constructing an irrelevant data item according to an embodiment of the present disclosure. Specifically, FIG. 8 shows the process of constructing an irrelevant data item based on a data item to be processed (a video sequence).
As shown in FIG. 8, at block 810, the gradient information between two adjacent frames of images in the target video sequence is calculated.
Exemplarily, the gradient of the feature vectors of the two frames along the time dimension may be calculated to obtain the gradient information. In this way, the static, unchanging background parts of the video sequence, such as image borders, can be obtained.
At block 820, a gradient superposition map is obtained based on the superposition of the gradient information.
Exemplarily, the gradient information obtained at 810 may be combined by weighted summation, by taking the maximum, by taking the minimum, or the like, to complete the superposition and obtain the gradient superposition map.
At block 830, thresholding is performed on the gradient superposition map to obtain an initial binary map.
At block 840, morphological processing is performed on the initial binary map to obtain a binary image.
Exemplarily, several iterations of morphological dilation are applied to the initial binary map, followed by the same number of iterations of morphological erosion, to obtain the binary image.
At block 850, a background image is obtained based on the binary image, and the background image is used as the irrelevant data item corresponding to the video sequence.
Exemplarily, a matting operation may be performed with the binary image, for example by element-wise matrix multiplication, to obtain the background image.
In this way, considering the similarity between the frames of a video sequence and the fact that the background of a video sequence is essentially unchanged, the background image corresponding to the video sequence can be obtained.
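For illustration only, a compact NumPy/SciPy sketch of process 800, assuming the video is already loaded as a (T, H, W) grayscale array. Simple per-pixel frame differences stand in for the feature-vector gradients described above, and the threshold and iteration counts are illustrative.

```python
import numpy as np
from scipy import ndimage

def extract_background(frames, thresh=10.0, morph_iters=3):
    """frames: (T, H, W) grayscale video; returns a background image."""
    frames = frames.astype(float)
    # 810: gradient between each frame and the previous one.
    grads = np.abs(np.diff(frames, axis=0))        # (T-1, H, W)
    # 820: superimpose the gradients (here: per-pixel maximum).
    grad_map = grads.max(axis=0)                   # (H, W)
    # 830: thresholding yields the initial binary map; moving regions
    # exceed the threshold, so the static background is its complement.
    moving = grad_map > thresh
    # 840: morphological dilation followed by the same number of erosions.
    moving = ndimage.binary_dilation(moving, iterations=morph_iters)
    moving = ndimage.binary_erosion(moving, iterations=morph_iters)
    # 850: matting by element-wise multiplication keeps background pixels.
    return frames[0] * (~moving)
```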
In addition, the label of an irrelevant data item is determined based on the label of the corresponding data item to be processed. Specifically, if a target data item to be processed has label A, and a target irrelevant data item is obtained by processing that target data item (e.g., by image segmentation), then the label of the target irrelevant data item is also label A.
At block 720, the irrelevant data set is divided into a first data set and a second data set. The first data set has a first sample weight distribution and the second data set has a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
The sample weight of an irrelevant data item is determined based on the sample weight of the corresponding data item to be processed. Specifically, if a target data item to be processed has sample weight w, and a target irrelevant data item is obtained by processing that target data item (e.g., by image segmentation), then the sample weight of the target irrelevant data item is also w.
The embodiments of the present disclosure do not limit the manner in which the irrelevant data set is divided into the first data set and the second data set. For example, the division may follow a 9:1 split, so that the ratio of the number of irrelevant data items in the first data set to that in the second data set is about 9:1. Alternatively, the division may follow a 1:1 split, so that the ratio is about 1:1. In addition, the first data set may be further divided into a first sub-data set and a second sub-data set; for example, the ratio of the number of irrelevant data items in the first sub-data set to that in the second sub-data set may be about 7:2. It can be understood that the ratios listed here are only illustrative and do not limit the embodiments of the present disclosure.
At block 730, the classification model is trained based on the first data set and the first sample weight distribution.
Specifically, the first data set may be sampled based on the first sample weight distribution, and the classification model may be trained based on the first data set and the labels of the irrelevant data items in the first data set.
That is to say, the first data set may be used as a training set to train the classification model. Optionally, before training, the first data set may be preprocessed, including, but not limited to, feature extraction, cluster analysis, edge detection, image denoising, and the like.
The embodiments of the present disclosure do not limit the specific structure of the classification model; for example, it may be a convolutional neural network including at least a convolutional layer and a fully connected layer.
At block 740, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed under its sample weight distribution.
That is to say, the second data set may be used as a test set to obtain the evaluation result. Specifically, the evaluation result may be obtained based on the comparison between the prediction results of the classification model for the irrelevant data items in the second data set and the labels of those irrelevant data items.
As an example, the evaluation result may include a first accuracy for the positive samples in the second data set and a second accuracy for the negative samples in the second data set.
In this way, in the embodiments of the present disclosure, by constructing an irrelevant data set and training and evaluating based on the irrelevant data set, a quantitative characterization of the bias significance of the data set to be processed can be obtained. This provides a quantitative bias reference, facilitating further adjustment of the data set to be processed and other operations.
Exemplarily, if the evaluation result obtained at block 740 indicates that the bias significance is large (i.e., that a significant bias exists), the sample weight distribution of the data set to be processed may be updated.
In some embodiments, if the evaluation result is greater than a preset threshold, the sample weight distribution of the data set to be processed is updated. Further, after this, the process may return to 720 to obtain the first data set and the second data set again, and 730 and 740 may be repeated until the evaluation result obtained at block 740 indicates that the bias significance is not large (i.e., that no significant bias exists), for example, until the evaluation result is not greater than the preset threshold. The sample weight distribution under which the evaluation result is not greater than the preset threshold may then be used as the recommended sample weight distribution, and that recommended sample weight distribution is output.
The embodiments of the present disclosure do not limit the specific manner of updating the sample weight distribution; for example, at least one of the following may be used: updating the sample weight distribution according to a predetermined rule, updating the sample weight distribution in a random manner, obtaining a user's modification to the sample weight distribution, or optimizing the sample weight distribution by a genetic algorithm.
In some implementations of the present disclosure, updating the sample weight distribution may update the first sample weight distribution of the first data set. In this case, when the process returns to 720, the first sample weight distribution of the first data set at the re-executed 720 has been updated, and consequently the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update both the first sample weight distribution of the first data set and the second sample weight distribution of the second data set. As an example, the sample weight distribution of the data set to be processed may be updated, and the irrelevant data set may be re-divided. As another example, the sample weight distribution of the data set to be processed may be updated, so that the first sample weight distribution and the second sample weight distribution are updated accordingly, while the irrelevant data items in the first data set and the second data set remain unchanged. In this case, when the process returns to 720, the first data set at the re-executed 720 has been updated, or its first sample weight distribution has been updated, and consequently the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update the second sample weight distribution of the second data set. Optionally, the first sample weight distribution may remain unchanged. As an example, in this implementation, when the process returns to 720, the first data set and the second data set of the previous execution of 720 may be exchanged, so that the first data set when the process returns to 730 is the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, making the classification model's evaluation of the bias significance more accurate.
FIG. 9 shows a schematic diagram of a process 900 for updating the sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
As shown in FIG. 9, at block 910, an irrelevant data set is constructed based on the data set to be processed. The irrelevant data set includes irrelevant data items with labels, and the label of an irrelevant data item is determined based on the label of a data item to be processed in the data set to be processed.
At block 920, the irrelevant data set is divided into a first data set and a second data set. The first data set has a first sample weight distribution and the second data set has a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
At block 930, the classification model is trained based on the first data set and the first sample weight distribution.
At block 940, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed under its sample weight distribution.
For 910 to 940 in FIG. 9, reference may be made to 710 to 740 described above in conjunction with FIG. 7, which are not repeated here for brevity.
In FIG. 9, at block 950, it is determined whether the evaluation result is greater than a preset threshold. If it is determined that the evaluation result is greater than the preset threshold, 960 is executed. If it is determined that the evaluation result is not greater than the preset threshold, 980 is executed.
At block 960, the second sample weight distribution of the second data set is updated.
As some examples, the sample weights of all irrelevant data items in the second data set may be updated, or the sample weights of only some of the irrelevant data items in the second data set may be updated.
As some examples, the second sample weight distribution may be updated based on the prediction results of the classification model at 940 for the irrelevant data items in the second data set.
Specifically, the sample weights of correctly predicted irrelevant data items in the second data set may be increased, or the sample weights of incorrectly predicted irrelevant data items in the second data set may be decreased. For example, if the sample weight of a first irrelevant data item in the second data set is 2, and the prediction result obtained by inputting that item into the classification model is consistent with its label, then its sample weight may be increased, e.g., from 2 to 3, 4, or another value. Likewise, if the sample weight of a second irrelevant data item in the second data set is 2, and the prediction result obtained by inputting that item into the classification model is inconsistent with its label, then its sample weight may be decreased, e.g., from 2 to 1.
At block 970, the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
It can be understood that the first data set after the exchange is the second data set of block 920, and the first sample weight distribution of the first data set after the exchange is the second sample weight distribution updated at block 960. The second data set after the exchange is the first data set of block 920, and the second sample weight distribution of the second data set after the exchange is the first sample weight distribution of block 920.
After block 970, the process returns to 930; that is, the classification model is retrained using the first data set obtained after the exchange at 970.
At block 980, the recommended sample weight distribution is output.
Exemplarily, the sample weight distribution under which the evaluation result is not greater than the preset threshold is used as the recommended sample weight distribution. Specifically, the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
In some embodiments of the present disclosure, the attention regions associated with the data set bias may be presented visually. Specifically, a class activation map may be obtained by inputting a target-independent data item into the trained classification model; a superposition result is then obtained by superimposing the class activation map on the target-independent data item, and the superposition result is displayed. As an example, the superposition result may be obtained by a weighted sum with the heat map. By displaying the superposition result, it becomes possible to see intuitively which regions the classification model attends to, and these attention regions are an important factor causing the bias.
In some embodiments of the present disclosure, after the recommended sample weight distribution is obtained, the method may optionally further include adjusting the data set to be processed based on the recommended sample weight distribution to obtain an unbiased data set.
Exemplarily, the unbiased data set may be constructed by adding data items to or deleting data items from the data set to be processed.
As an example, data items with large recommended sample weights may be duplicated to expand the number of data items in the data set to be processed. As another example, data items with small recommended sample weights may be deleted to reduce the number of data items in the data set to be processed.
As an example, a user's deletion instruction for some of the data items to be processed may be obtained, so that those data items are deleted. Other data items input by the user may also be obtained and added to the current data set to be processed.
For example, the user may add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution. For instance, the user may find other samples similar to data items with large recommended sample weights and add them to the data set as new data items, thereby supplementing the data set. As an example, similar samples may be other images collected by the same image acquisition device (or one of the same model) in a similar environment (e.g., under similar lighting conditions).
In this way, in the embodiments of the present disclosure, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, the unbiased data set can be used to train a more robust, unbiased task-specific model.
It can be understood that, for the processes described in the embodiments of the present disclosure in conjunction with FIG. 7 to FIG. 9, reference may be made to the functions of the modules and the like described above in conjunction with FIG. 1 to FIG. 6, which are not repeated for brevity.
图10示出了根据本公开的实施例的数据处理装置1000的示意框图。装置1000可以通过软件、硬件或者两者结合的方式实现。在一些实施例中,装置1000可以为实现图1所示的系统100中的部分或全部功能的软件或硬件装置。Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure. Apparatus 1000 may be implemented by software, hardware or a combination of both. In some embodiments, the device 1000 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
如图10所示,装置1000包括构建单元1010、划分单元1020、训练单元1030和评估单元1040。As shown in FIG. 10 , the device 1000 includes a construction unit 1010 , a division unit 1020 , a training unit 1030 and an evaluation unit 1040 .
构建单元1010被配置为基于待处理数据集构建无关数据集,无关数据集包括具有标签的无关数据项,无关数据项的标签是基于待处理数据集中的待处理数据项的标签确定的。The construction unit 1010 is configured to construct an unrelated data set based on the unprocessed data set, the unrelated data set includes unrelated data items with labels, and the labels of the unrelated data items are determined based on the labels of the unprocessed data items in the unprocessed data set.
划分单元1020被配置为将无关数据集划分为第一数据集和第二数据集,第一数据集具有第一样本权重分布,第二数据集具有第二样本权重分布,第一样本权重分布和第二样本权重分布是基于待处理数据集中的待处理数据项的样本权重确定的。The division unit 1020 is configured to divide the irrelevant data set into a first data set and a second data set, the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight The distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
The training unit 1030 is configured to train a classification model based on the first data set and the first sample weight distribution.
The evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed having the sample weight distribution. A sketch of this train-then-evaluate flow is given below.
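For illustration only, the following minimal sketch shows how units 1010 to 1040 could fit together, using scikit-learn. The random 50/50 split, the logistic-regression classifier, and the accuracy-minus-chance bias score are illustrative assumptions, not details fixed by the disclosure.

```python
# Minimal sketch of the train-then-evaluate flow of units 1010-1040.
# X_irrelevant: feature vectors of the irrelevant data items;
# y_labels: their labels; sample_weights: the sample weight distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def assess_bias(X_irrelevant, y_labels, sample_weights, split=0.5, seed=0):
    # Divide the irrelevant data set into a first and a second data set,
    # carrying the corresponding parts of the sample weight distribution.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_irrelevant))
    cut = int(len(idx) * split)
    first, second = idx[:cut], idx[cut:]

    # Train the classification model on the first data set with the
    # first sample weight distribution.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_irrelevant[first], y_labels[first],
              sample_weight=sample_weights[first])

    # Evaluate on the second data set with the second sample weight
    # distribution. Accuracy above chance on label-irrelevant data
    # suggests the labels are predictable from irrelevant content,
    # i.e. the data set to be processed is biased.
    score = accuracy_score(y_labels[second],
                           model.predict(X_irrelevant[second]),
                           sample_weight=sample_weights[second])
    chance = 1.0 / len(np.unique(y_labels))  # assumes balanced classes
    return score - chance  # larger value = more significant bias
```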
In some embodiments, the apparatus 1000 may further include an update unit 1050, an adjustment unit 1060, and a display unit 1070.
The update unit 1050 is configured to update the sample weight distribution of the data set to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.
As an example, the update unit 1050 may be configured to update only part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not.
In some embodiments, the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution according to a predetermined rule; updating the sample weight distribution in a random manner; obtaining a user's modification of the sample weight distribution to update the sample weight distribution; or optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution. A sketch of two of these strategies is given below.
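For illustration only, the following minimal sketch covers two of the listed strategies: random updating and a single genetic-algorithm step. The mutation scale, selection rule, and crossover scheme are illustrative assumptions.

```python
# Minimal sketch of random and genetic-algorithm weight updates.
import numpy as np

def update_random(weights, scale=0.1, rng=None):
    """Update the sample weight distribution in a random manner."""
    rng = rng or np.random.default_rng()
    perturbed = weights * np.exp(rng.normal(0.0, scale, size=weights.shape))
    return perturbed / perturbed.mean()  # keep the overall scale stable

def update_genetic(population, fitness, rng=None):
    """One genetic-algorithm step over candidate weight distributions.

    population: array of shape (n_candidates, n_samples).
    fitness: evaluation results; a lower bias significance is better.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(fitness)  # select the least-biased candidates
    parents = population[order[: max(1, len(order) // 2)]]
    children = []
    for _ in range(len(population) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        mask = rng.random(a.shape) < 0.5        # uniform crossover
        child = np.where(mask, a, b)
        children.append(update_random(child, rng=rng))  # mutation
    return np.vstack([parents, np.array(children)])
```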
In some embodiments, the update unit 1050 may be configured to take the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
The adjustment unit 1060 is configured to add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
The update unit 1050 is further configured to: obtain a class activation map by inputting a target irrelevant data item into the trained classification model; and obtain a superposition result by superimposing the class activation map on the target irrelevant data item.
The display unit 1070 is configured to display the recommended sample weight distribution and/or the superposition result. A sketch of such a class activation map superposition is given below.
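For illustration only, the following minimal sketch shows one way the superposition result could be rendered with OpenCV; the resizing, the JET colormap, and the equal blending weights are illustrative assumptions about presentation, not requirements of the disclosure.

```python
# Minimal sketch: overlay a class activation map on its source image.
import cv2
import numpy as np

def overlay_cam(image_bgr, cam):
    """Superimpose a class activation map on a target irrelevant image.

    image_bgr: uint8 image, shape (H, W, 3).
    cam: float activation map of any spatial size; normalized below.
    """
    cam = cv2.resize(cam.astype(np.float32),
                     (image_bgr.shape[1], image_bgr.shape[0]))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    heatmap = cv2.applyColorMap((cam * 255).astype(np.uint8),
                                cv2.COLORMAP_JET)
    # Blend heatmap and image to obtain the superposition result.
    return cv2.addWeighted(image_bgr, 0.5, heatmap, 0.5, 0)
```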
In some embodiments, the construction unit 1010 may be configured to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, so as to obtain the remaining part of the target data item; and construct one irrelevant data item in the irrelevant data set using the remaining part, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, and the construction unit 1010 may be configured to: perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item; and construct one irrelevant data item in the irrelevant data set using the background image.
In some embodiments, a data item to be processed in the data set to be processed is a video sequence, and the construction unit 1010 may be configured to: determine a binary image of the video sequence based on gradient information between a frame of the video sequence and its previous frame; generate a background image of the video sequence based on the binary image; and construct one irrelevant data item in the irrelevant data set using the background image of the video sequence. A sketch of this video branch follows.
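For illustration only, the following minimal sketch implements the video branch described above; the fixed difference threshold and the per-pixel averaging of still regions are illustrative assumptions about how the background image is aggregated.

```python
# Minimal sketch: frame differencing -> binary image -> background image.
import cv2
import numpy as np

def video_background(frames, thresh=25):
    """frames: list of uint8 grayscale numpy arrays of identical shape."""
    acc = np.zeros(frames[0].shape, dtype=np.float64)
    count = np.zeros(frames[0].shape, dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        # Gradient information between a frame and its previous frame.
        diff = cv2.absdiff(cur, prev)
        # Binary image: 255 marks moving (foreground) pixels.
        _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        still = binary == 0
        acc[still] += cur[still]
        count[still] += 1
    # Average the still pixels across the sequence into a background image.
    return (acc / np.maximum(count, 1)).astype(np.uint8)
```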
The division of units in the embodiments of the present disclosure is schematic and is merely a division by logical function; other division manners are possible in actual implementation. In addition, the functional units in the disclosed embodiments may be integrated into one processor, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The data processing apparatus 1000 shown in FIG. 10 can be used to implement the data processing processes described above in conjunction with FIG. 7 to FIG. 9.
The present disclosure may also be implemented as a computer program product. The computer program product may include computer-readable program instructions for carrying out various aspects of the present disclosure. The present disclosure may also be implemented as a computer-readable storage medium on which computer-readable program instructions are stored; when a processor executes the instructions, the processor is caused to perform the data processing processes described above.
A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
The computer-readable program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing state information of the computer-readable program instructions; the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored thereon comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device, so that a series of operational steps are performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer-readable program instructions.

Claims (22)

  1. A data processing method, comprising:
    constructing an irrelevant data set based on a data set to be processed, wherein the irrelevant data set comprises irrelevant data items with labels, and the labels of the irrelevant data items are determined based on labels of data items to be processed in the data set to be processed;
    dividing the irrelevant data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on sample weights of the data items to be processed in the data set to be processed;
    training a classification model based on the first data set and the first sample weight distribution; and
    evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates a bias significance of the data set to be processed having the sample weight distribution.
  2. The method according to claim 1, further comprising:
    if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and
    repeating the training and the evaluation based on the updated sample weight distribution until the evaluation result is not greater than the preset threshold.
  3. The method according to claim 2, wherein updating the sample weight distribution comprises:
    updating part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not updated.
  4. The method according to claim 2 or 3, further comprising:
    taking the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
  5. The method according to claim 4, further comprising:
    adding data items to, or deleting data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  6. The method according to any one of claims 2 to 5, wherein updating the sample weight distribution comprises at least one of the following:
    updating the sample weight distribution according to a predetermined rule,
    updating the sample weight distribution in a random manner,
    obtaining a user's modification of the sample weight distribution to update the sample weight distribution, or
    optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
  7. The method according to any one of claims 1 to 6, wherein constructing the irrelevant data set based on the data set to be processed comprises:
    removing, from a target data item to be processed in the data set to be processed, a part associated with the label of the target data item to be processed, so as to obtain a remaining part of the target data item to be processed; and
    constructing one irrelevant data item in the irrelevant data set using the remaining part, wherein the label of the one irrelevant data item corresponds to the label of the target data item to be processed.
  8. The method according to any one of claims 1 to 6, wherein the data set to be processed is an image data set, and wherein constructing the irrelevant data set based on the data set to be processed comprises:
    performing image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and
    constructing one irrelevant data item in the irrelevant data set using the background image.
  9. The method according to any one of claims 1 to 6, wherein a data item to be processed in the data set to be processed is a video sequence, and wherein constructing the irrelevant data set based on the data set to be processed comprises:
    determining a binary image of the video sequence based on gradient information between a frame of the video sequence and a previous frame of the frame;
    generating a background image of the video sequence based on the binary image; and
    constructing one irrelevant data item in the irrelevant data set using the background image of the video sequence.
  10. The method according to any one of claims 1 to 9, further comprising:
    obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model;
    obtaining a superposition result by superimposing the CAM on the target irrelevant data item; and
    displaying the superposition result.
  11. A data processing apparatus, comprising:
    a construction unit configured to construct an irrelevant data set based on a data set to be processed, wherein the irrelevant data set comprises irrelevant data items with labels, and the labels of the irrelevant data items are determined based on labels of data items to be processed in the data set to be processed;
    a division unit configured to divide the irrelevant data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on sample weights of the data items to be processed in the data set to be processed;
    a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and
    an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates a bias significance of the data set to be processed having the sample weight distribution.
  12. The apparatus according to claim 11, further comprising an update unit configured to:
    update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
  13. The apparatus according to claim 12, wherein the update unit is configured to:
    update part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not updated.
  14. The apparatus according to claim 12 or 13, wherein the update unit is configured to:
    take the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
  15. The apparatus according to claim 14, further comprising an adjustment unit configured to:
    add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  16. The apparatus according to any one of claims 12 to 15, wherein the update unit is configured to update the sample weight distribution by at least one of the following:
    updating the sample weight distribution according to a predetermined rule,
    updating the sample weight distribution in a random manner,
    obtaining a user's modification of the sample weight distribution to update the sample weight distribution, or
    optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
  17. The apparatus according to any one of claims 11 to 16, wherein the construction unit is configured to:
    remove, from a target data item to be processed in the data set to be processed, a part associated with the label of the target data item to be processed, so as to obtain a remaining part of the target data item to be processed; and
    construct one irrelevant data item in the irrelevant data set using the remaining part, wherein the label of the one irrelevant data item corresponds to the label of the target data item to be processed.
  18. The apparatus according to any one of claims 11 to 16, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to:
    perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and
    construct one irrelevant data item in the irrelevant data set using the background image.
  19. The apparatus according to any one of claims 11 to 16, wherein a data item to be processed in the data set to be processed is a video sequence, and wherein the construction unit is configured to:
    determine a binary image of the video sequence based on gradient information between a frame of the video sequence and a previous frame of the frame;
    generate a background image of the video sequence based on the binary image; and
    construct one irrelevant data item in the irrelevant data set using the background image of the video sequence.
  20. The apparatus according to any one of claims 11 to 19, further comprising:
    an update unit configured to: obtain a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; and obtain a superposition result by superimposing the CAM on the target irrelevant data item; and
    a display unit configured to display the superposition result.
  21. A computing device, comprising a processor and a memory, wherein the processor reads and executes a computer program stored in the memory, so that the computing device performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.
PCT/CN2022/083841 2021-05-25 2022-03-29 Data processing method and apparatus, computing device, and computer readable storage medium WO2022247448A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110574231.3 2021-05-25
CN202110574231.3A CN115471714A (en) 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2022247448A1 true WO2022247448A1 (en) 2022-12-01

Family

ID=84229488

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083841 WO2022247448A1 (en) 2021-05-25 2022-03-29 Data processing method and apparatus, computing device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115471714A (en)
WO (1) WO2022247448A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915450A (en) * 2012-09-28 2013-02-06 常州工学院 Online adaptive adjustment tracking method for target image regions
CN112639843A (en) * 2018-09-10 2021-04-09 谷歌有限责任公司 Suppression of deviation data using machine learning models
US20200167653A1 (en) * 2018-11-27 2020-05-28 Wipro Limited Method and device for de-prejudicing artificial intelligence based anomaly detection
US20200372406A1 (en) * 2019-05-22 2020-11-26 Oracle International Corporation Enforcing Fairness on Unlabeled Data to Improve Modeling Performance
CN112115963A (en) * 2020-07-30 2020-12-22 浙江工业大学 Method for generating unbiased deep learning model based on transfer learning
CN112508580A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Model construction method and device based on rejection inference method and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CORRELL MICHAEL; HEER JEFFREY: "Surprise! Bayesian Weighting for De-Biasing Thematic Maps", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE, USA, vol. 23, no. 1, 1 January 2017 (2017-01-01), USA, pages 651 - 660, XP011634791, ISSN: 1077-2626, DOI: 10.1109/TVCG.2016.2598618 *
JINYIN CHEN, CHEN YIPENG; CHEN YIMING; ZHENG HAIBIN; JI SHOULING; SHI JIE; CHENG YAO: "Fairness Research on Deep Learning", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, KEXUE CHUBANSHE, BEIJING, CN, vol. 58, no. 2, 8 February 2021 (2021-02-08), CN , pages 264 - 280, XP093007463, ISSN: 1000-1239, DOI: 10.7544/issn1000-1239.2021.20200758 *

Also Published As

Publication number Publication date
CN115471714A (en) 2022-12-13

Similar Documents

Publication Publication Date Title
WO2018121690A1 (en) Object attribute detection method and device, neural network training method and device, and regional detection method and device
US11416772B2 (en) Integrated bottom-up segmentation for semi-supervised image segmentation
CN111724083A (en) Training method and device for financial risk recognition model, computer equipment and medium
US20150325046A1 (en) Evaluation of Three-Dimensional Scenes Using Two-Dimensional Representations
CN109993102B (en) Similar face retrieval method, device and storage medium
US11875512B2 (en) Attributionally robust training for weakly supervised localization and segmentation
US20220261659A1 (en) Method and Apparatus for Determining Neural Network
CN111582409A (en) Training method of image label classification network, image label classification method and device
Ayyar et al. Review of white box methods for explanations of convolutional neural networks in image classification tasks
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
WO2024060416A1 (en) End-to-end weakly supervised semantic segmentation and labeling method for pathological image
Lin et al. An analysis of English classroom behavior by intelligent image recognition in IoT
Kajabad et al. YOLOv4 for urban object detection: Case of electronic inventory in St. Petersburg
Pang et al. Salient object detection via effective background prior and novel graph
WO2022247448A1 (en) Data processing method and apparatus, computing device, and computer readable storage medium
CN114255381B (en) Training method of image recognition model, image recognition method, device and medium
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN116029760A (en) Message pushing method, device, computer equipment and storage medium
CN113763313A (en) Text image quality detection method, device, medium and electronic equipment
Heidari et al. Forest roads damage detection based on deep learning algorithms
Jiao et al. A visual consistent adaptive image thresholding method
КАЛИТА Information technology of facial emotion recognition for visual safety surveillance
CN117095180B (en) Embryo development stage prediction and quality assessment method based on stage identification
Anggoro et al. Classification of Solo Batik patterns using deep learning convolutional neural networks algorithm
CN116703933A (en) Image segmentation training method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22810181

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE