WO2022247448A1 - Data processing method and apparatus, computing device, and computer readable storage medium - Google Patents

Data processing method and apparatus, computing device, and computer readable storage medium

Info

Publication number
WO2022247448A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
weight distribution
processed
sample weight
data
Prior art date
Application number
PCT/CN2022/083841
Other languages
French (fr)
Chinese (zh)
Inventor
张诗杰
朱森华
Original Assignee
华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Publication of WO2022247448A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The present disclosure relates to the field of artificial intelligence, and more particularly, to a data processing method and apparatus, a computing device, and a computer-readable storage medium.
  • Data set bias is a widespread problem in machine learning, and especially in deep learning; it has a large negative impact, is difficult to detect, and is easily overlooked. In particular, for scenarios with high requirements for model safety, if training is based on a biased data set, the resulting model may cause serious accidents in actual use.
  • Conventionally, the bias of a data set is checked by guesswork or based on experience, but this consumes substantial human resources, is inefficient, has low accuracy, and cannot meet actual needs.
  • Exemplary embodiments of the present disclosure provide a data processing method including a scheme for assessing bias in a data set, enabling a more precise check of the bias in the data set.
  • In a first aspect, a data processing method includes: constructing an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed; training a classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • In this way, the significance of a data set's bias can be assessed more accurately.
  • This evaluation scheme makes it convenient for users to adjust the data set and perform other processing.
  • In some embodiments, the method further includes: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed. That is, embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain a recommended sample weight distribution. This process requires no user participation and is efficient and highly automated.
  • updating the sample weight distribution includes: updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • In some embodiments, the method further includes: using the sample weight distribution at the time when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In this way, embodiments of the present disclosure can update the sample weight distribution based on iterative training of the classification model, and can check how the bias of the data set changes as the sample weight distribution is updated, so that the data set to be processed can be examined iteratively and an effective, highly accurate recommended sample weight distribution can be obtained.
  • In some embodiments, the method further includes: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In this way, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. This unbiased data set can then be used to train a more robust, unbiased task-specific model that meets actual needs.
  • In some embodiments, updating the sample weight distribution includes at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining a user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: removing, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and using the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: performing image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and using the background image to construct an irrelevant data item in the irrelevant data set.
  • the background image is used as a representative of bias, so that the data set can be checked for bias.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: determining a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generating a background image of the video sequence based on the binary image; and using the background image of the video sequence to construct an irrelevant data item in the irrelevant data set.
  • In this way, the background image corresponding to the video sequence can be obtained, taking into account the similarity between the frame images in the video sequence and the fact that the background in the video sequence is basically unchanged.
  • In some embodiments, the method further includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; superimposing the CAM and the target irrelevant data item to obtain an overlay result; and displaying the overlay result.
  • the embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the significance of data set bias can be clearly characterized, and the specific location where bias occurs can be presented visually. In this way, users can more intuitively and comprehensively know the bias of the data set.
  • This solution does not require too much user participation, can be automated, and can improve the efficiency of processing while ensuring the accuracy of the quantitative assessment of bias.
  • In a second aspect, a data processing apparatus includes: a construction unit configured to construct an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed;
  • a dividing unit configured to divide the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed;
  • a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • an updating unit is further included, configured to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.
  • the update unit is configured to: update the portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • the update unit is configured to: use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In some embodiments, an adjustment unit is further included, configured to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In some embodiments, the update unit is configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining the user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, the construction unit is configured to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, the construction unit is configured to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, the construction unit is configured to: determine a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • In some embodiments, the apparatus further includes a unit configured to: obtain a CAM by inputting a target irrelevant data item into the trained classification model, and obtain an overlay result by superimposing the CAM and the target irrelevant data item; and a display unit configured to display the overlay result.
  • In a third aspect, a computing device includes a processor and a memory, where the memory stores instructions executable by the processor. When the instructions are executed by the processor, the computing device: constructs an irrelevant data set based on a data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; divides the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution and the second data set has a second sample weight distribution, both determined based on the sample weights of the data items to be processed in the data set to be processed; trains a classification model based on the first data set and the first sample weight distribution; and evaluates the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.
  • the instructions when executed by the processor, cause the computing device to: update the portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: use the sample weight distribution at the time when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining the user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm to update it.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • In some embodiments, the instructions, when executed by a processor, cause the computing device to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • In some embodiments, the computing device is caused to: determine a binary image of the video sequence based on the gradient information between a frame image and its previous frame image in the video sequence; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • In some embodiments, the instructions, when executed by a processor, cause the computing device to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; superimpose the CAM and the target irrelevant data item to obtain an overlay result; and display the overlay result.
  • In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the operations of the method in the first aspect or any embodiment above are realized.
  • In a fifth aspect, a chip or chip system includes a processing circuit configured to perform the operations of the method in the first aspect or any embodiment above.
  • In a sixth aspect, a computer program or computer program product is provided.
  • The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method in the first aspect or any embodiment above.
  • FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure
  • FIG. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
  • Artificial Intelligence uses computers to simulate certain human thinking processes and intelligent behaviors.
  • The history of artificial intelligence research follows a natural, clear progression from a focus on “reasoning”, to a focus on “knowledge”, and then to a focus on “learning”.
  • Artificial intelligence has been widely applied to various industries such as security, medical care, transportation, education, and finance.
  • Machine learning is a branch of artificial intelligence, which studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms during empirical learning.
  • Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main feature is to use multiple nonlinear transformation structures to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
  • Data bias
  • Factors in the data that are correlated with the task but not causal, such as sample imbalance or artificial markers in the data, can be considered data bias.
  • Dataset bias refers to the presence of spurious features in a dataset that some machine learning models may learn.
  • In an image data set, there may be information in the images related to the model of the acquisition device and the acquisition parameters, which has nothing to do with the acquisition task.
  • A machine learning model may then base its predictions on this information and directly guess the classification result, instead of learning the image features genuinely related to the target task.
  • When a machine learning model is trained on an image data set with data set bias, it may not learn the training task objectively and realistically as expected. As a result, the learned model struggles to complete the target task as expected in the actual use environment, resulting in serious performance degradation; even if performance does not degrade, the reasons for its errors may be unacceptable and can even lead to ethics lawsuits.
  • For example, covering the mouth in an input image hardly changes the predictions of a model for detecting lipstick, which shows that the model did not actually learn mouth-related features.
  • Another example is a medical image recognition model that infers the collection location from the markers placed by the doctor, which affects the prediction results.
  • To this end, embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the impact of data set bias can be effectively determined and the data set adjusted accordingly, ensuring that a model trained on the adjusted data set will not be negatively affected by data bias.
  • FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure.
  • As shown in FIG. 1, the system 100 includes an input/output (I/O) module 110, a data set processing module 120, and a model training module 130.
  • the system 100 may further include a model storage module 140 and a data storage module 150 .
  • the various modules shown in FIG. 1 can communicate with each other.
  • the input/output module 110 can be used to acquire data sets to be processed. For example, a data set to be processed input by a user may be received.
  • the data set to be processed may be stored in the data storage module 150 .
  • the data storage module 150 may be a data storage resource corresponding to an object storage service (Object Storage Service, OBS) provided by a cloud service provider.
  • the data set to be processed includes a large number of data items to be processed, and each data item to be processed has a label.
  • That is, the data set to be processed contains a plurality of data items to be processed, each marked with a label.
  • Labels may be annotated manually or obtained through machine learning, which is not limited in the present disclosure.
  • Labels may also be called task labels, annotation information, or other names, which will not be enumerated herein.
  • the annotation information may be annotated by an annotator for a specific part of the data item to be processed based on experience.
  • the annotation information may be annotated through an image recognition model and an annotation model.
  • For example, for a human face, labels such as gender, age, whether glasses are worn, whether a hat is worn, and the size of the face can be annotated.
  • For a medical image, such as an ultrasound image, whether a lesion is present can be marked for the examined part.
  • the data item to be processed may include a tag-related part and a tag-independent part.
  • For example, if the label describes the face, the face area in the image is the part related to the label, and the rest of the image is unrelated to the label.
  • If the label concerns the eyes (for example, the pupil color is marked as “black”, “brown”, etc.), the eye area in the image is the part related to the label, while the other areas of the image are unrelated to the label.
  • the data items to be processed in the data set to be processed may be any type of data, such as images, videos, voices, texts, and so on.
  • the embodiment of the present disclosure does not limit the source of the data items to be processed.
  • For example, images may be collected from open-source data sets, may be collected by different image acquisition devices or by the same image acquisition device at different times, may be image frames in a video sequence captured by an image capture device, or any combination of the above, or others.
  • the input/output module 110 may be implemented as an input module and an output module that are independent of each other, or may also be implemented as a coupling module having both input and output functions.
  • For example, the input/output module 110 may provide a graphical user interface (GUI) and/or a command-line interface (CLI).
  • the data set processing module 120 can obtain the data set to be processed from the input/output module 110 , or alternatively, can obtain the data set to be processed from the data storage module 150 . Further, the data set processing module 120 can construct an irrelevant data set based on the data set to be processed.
  • The irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the unrelated data set may be stored in the data storage module 150 .
  • As noted above, a data item to be processed has a label, and it includes a part related to the label and a part unrelated to the label. The part related to the label can therefore be removed, retaining only the label-irrelevant part as an irrelevant data item; the label of the irrelevant data item is the label of the data item to be processed.
  • This process may also be called splitting, division, separation or other names, etc., which is not limited in the present disclosure.
  • Specifically, for a target data item to be processed, the part associated with its label can be removed from the target data item to obtain the remainder of the target data item. The remainder is then used to construct an irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data item to be processed is a face image
  • the label represents the skin color of the face, such as "white”.
  • the face area in the face image can be removed, and the remaining part after removing the face area can be used as the corresponding irrelevant data item, and the irrelevant data item still has the label "white" of the face skin color.
  • the irrelevant data item can be obtained by means of image segmentation.
  • the part of the image associated with the label is the foreground area, and the other areas in the image except the foreground area are the background area, then the foreground-background separation can be used to determine irrelevant data items based only on the background area.
  • image segmentation is performed on the target data item to be processed (target image) in the data set to be processed to obtain a background image corresponding to the target image, and then use the background image to construct an irrelevant data item.
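  • As a minimal illustration (not the patent's prescribed implementation), assume a binary foreground mask has already been produced by some segmentation algorithm; the function and variable names below are hypothetical:

```python
import numpy as np

def build_irrelevant_item(image: np.ndarray, foreground_mask: np.ndarray) -> np.ndarray:
    """Remove the label-associated (foreground) region, keeping only the background.

    image: H x W x 3 array; foreground_mask: H x W boolean array where True
    marks pixels associated with the label (e.g., the face region).
    """
    background = image.copy()
    background[foreground_mask] = 0  # blank out the label-related part
    return background

# Usage: the resulting background image keeps the original item's label,
# e.g. irrelevant_label = face_label ("white" in the skin-colour example).
```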
  • the embodiment of the present disclosure does not limit the specific algorithm used for image segmentation.
  • For example, one or more of the following algorithms may be used, as may others: threshold-based image segmentation, region-based image segmentation, edge-detection-based image segmentation, image segmentation based on wavelet analysis and the wavelet transform, genetic-algorithm-based image segmentation, active-contour-model-based image segmentation, deep-learning-based image segmentation, and so on.
  • Deep-learning-based image segmentation algorithms include, but are not limited to: feature-encoder-based segmentation, region-proposal-based segmentation, RNN-based segmentation, upsampling/deconvolution-based segmentation, segmentation based on enhanced feature resolution, feature-enhancement-based segmentation, segmentation using Conditional Random Fields (CRF)/Markov Random Fields (MRF), etc.
  • the data item to be processed in the data set to be processed is a video sequence.
  • Different data items to be processed may have the same or different durations.
  • the first data item to be processed in the data set to be processed is a first video sequence, the length of which is m1 frames, including m1 frame images.
  • the second data item to be processed in the data set to be processed is a second video sequence with a length of m2 frames, including m2 frames of images. m1 and m2 may or may not be equal.
  • video segmentation is performed on the target data item to be processed (target video sequence) in the data set to be processed to obtain a background image corresponding to the target video sequence, and then use the background image to construct an irrelevant data item.
  • image segmentation may be performed for each frame image in the target video sequence, and the background regions after segmentation of each frame image are fused to obtain a background image corresponding to the target video sequence.
  • the background image corresponding to the target video sequence may be obtained based on the gradient between two adjacent frames in the target video sequence.
  • the binary image corresponding to the video sequence may be obtained based on the gradient information of the video sequence. The background image of the video sequence is then generated based on this binary image, as described below in conjunction with FIG. 2 .
  • FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure.
  • The data set processing module 200 can serve as an implementation of the data set processing module 120 in FIG. 1. It can be used to determine an irrelevant data set based on the data set to be processed, where each data item to be processed is a video sequence and each irrelevant data item in the irrelevant data set may be a background image corresponding to a video sequence.
  • the dataset processing module 200 may include a gradient calculation submodule 210 , a gradient superposition submodule 220 , a thresholding submodule 230 , a morphological processing submodule 240 and a separation submodule 250 .
  • the gradient calculation sub-module 210 can be used to calculate the gradient information between a frame image and the previous frame image in the target video sequence.
  • Suppose the target video sequence includes m1 frame images, namely frame 0, frame 1, ..., frame m1-1. The gradient information between every two adjacent frames can then be calculated: between frame 1 and frame 0, between frame 2 and frame 1, ..., and between frame m1-1 and frame m1-2.
  • the embodiments of the present disclosure do not limit the specific manner of calculating the gradient information, for example, the frame difference may be calculated.
  • the gradient of the feature vectors of two frames of images along a specific dimension (such as the time dimension T) can be calculated, so that fixed background parts, such as image borders, can be extracted from video sequences through motion information.
  • Alternatively, the difference between a frame image and its grayscale version can be calculated to extract the colored parts of the frame, so that colored marks, such as color annotations or text added after video capture, are prevented from being treated as foreground.
  • the gradient superposition sub-module 220 can be used to superimpose the gradient information obtained by the gradient calculation sub-module 210 to obtain a gradient superposition map.
  • the manner of superposition by the gradient superposition sub-module 220 may include but not limited to weighted summation (such as average value), maximum value, minimum value or others.
  • the thresholding sub-module 230 may be configured to perform thresholding processing on the gradient overlay image obtained by the gradient overlay sub-module 220 to obtain an initial binary image.
  • the morphological processing sub-module 240 may perform morphological processing on the initial binary image obtained by the thresholding sub-module 230 to obtain a binary image corresponding to the video sequence.
  • For example, if the value of a pixel in the gradient overlay image is below the threshold, the pixel value can be reset to 0.
  • morphological processing may include, but not limited to, morphological dilation, morphological erosion, and the like.
  • the morphological processing submodule 240 may perform several times of morphological expansion on the initial binary image obtained by the thresholding submodule 230, and then perform the same number of morphological erosions to obtain a binary image.
  • the separation sub-module 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing sub-module 240 .
  • a matting operation may be performed on a binary image to obtain a background image.
  • the background image can be obtained by matrix dot product.
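  • Putting submodules 210 to 250 together, a minimal sketch of the pipeline might look as follows. It assumes NumPy and OpenCV, uses the frame difference as the gradient, a per-pixel maximum for superposition, and illustrative parameter values; the patent does not prescribe these choices:

```python
import numpy as np
import cv2  # OpenCV, assumed available

def video_background(frames: list, thresh: int = 15, morph_iters: int = 2) -> np.ndarray:
    """Sketch: gradients between adjacent frames -> superposition ->
    thresholding -> morphology -> matting by (matrix) dot product."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # 1. Gradient (here: frame difference) between each frame and its predecessor.
    grads = [cv2.absdiff(gray[i], gray[i - 1]) for i in range(1, len(gray))]
    # 2. Superimpose the gradients; here a per-pixel maximum (mean also works).
    overlay = np.max(np.stack(grads), axis=0)
    # 3. Threshold: moving (foreground) pixels -> 1, static pixels -> 0.
    _, binary = cv2.threshold(overlay, thresh, 1, cv2.THRESH_BINARY)
    # 4. Morphological dilations followed by the same number of erosions.
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=morph_iters)
    binary = cv2.erode(binary, kernel, iterations=morph_iters)
    # 5. Matting: keep only the static pixels as the background image.
    background_mask = (1 - binary).astype(frames[0].dtype)
    return frames[0] * background_mask[..., None]
```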
  • the background image corresponding to the video sequence can be obtained by fully considering the similarity of the background among the frame images in the video sequence.
  • the background image is used as a representative of bias, so that the data set can be checked for bias. Understandably, if the dataset is not biased, then the features of the background image should not have any relationship to the labels associated with the foreground regions.
  • In some embodiments, the constructed irrelevant data set can be divided into two parts: a first part of irrelevant data items and a second part of irrelevant data items, where the first part can be used to train the model and the second part can be used to test the model.
  • the embodiment of the present disclosure does not limit the division method.
  • the irrelevant data set may be divided into the first part and the second part according to 9:1 or 1:1 or other ratios.
  • the set composed of the first part of irrelevant data items may be called an irrelevant training set, and the set composed of the second part of irrelevant data items may be called an irrelevant test set.
  • the first part of the set of irrelevant data items may include an irrelevant training set and an irrelevant verification set.
  • the unrelated data set can be divided into an unrelated training set, an unrelated verification set and an unrelated test set according to 7:2:1.
  • the set composed of the first part of irrelevant data items is called the first data set (or training set), and the set composed of the second part of irrelevant data items is called the second data set (or test set).
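  • A minimal illustration of such a split (random assignment and the 9:1 ratio are assumptions; the patent does not mandate how items are assigned):

```python
import random

def split_irrelevant_set(items, ratio=0.9, seed=0):
    """Split irrelevant data items into a first (training) set and a second
    (test) set, e.g. 9:1; 1:1, or 7:2:1 with a validation set, is equally
    possible."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```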
  • the dataset processing module 120 may preprocess the dataset to be processed first, and then construct an irrelevant dataset based on the preprocessed dataset to be processed. Preprocessing includes but is not limited to: cluster analysis, data denoising, etc.
  • the model training module 130 may include a training submodule 132 and an evaluation submodule 134 .
  • the training sub-module 132 can be used to train the classification model.
  • the classification model may be trained based on the first part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the first part.
  • In some embodiments, the first part of irrelevant data items used for training may be the whole of the irrelevant data set, so that more data items are used for training, making the trained classification model more robust.
  • the first part of irrelevant data items used for training may be part of an irrelevant data set. As mentioned above, the irrelevant data set is divided into the first part of irrelevant data items and the second part of irrelevant data items.
  • the set of the first part of irrelevant data items used for training is called a training set, and correspondingly, the first part of irrelevant data items may be training items.
  • the training here may be to train an initial classification model or may be to update a previously trained classification model, wherein the initial classification model may be a classification model that has not been trained.
  • the previously trained classification model may be obtained after training the initial classification model.
  • the training sub-module 132 can obtain an initial classification model or a previously trained classification model from the model storage module 140 .
  • The training sub-module 132 can obtain the first part of irrelevant data items used for training and the label of each irrelevant data item in the first part from the data set processing module 120 or the data storage module 150. Alternatively, the training sub-module 132 can obtain the first part of irrelevant data items from the data set processing module 120 and obtain the label of each irrelevant data item in the first part from the input/output module 110.
  • In some embodiments, the training submodule 132 can preprocess the training set, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, etc.
  • the training data item after feature extraction can be characterized as an S-dimensional feature vector, where S is greater than 1.
  • For example, the classification model can be a convolutional neural network (CNN) model, which may optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully connected layers, an output layer, and so on.
  • the classification model includes a large number of parameters, which can represent the calculation formula or the weight of the calculation factor in the model, and the parameters can be updated iteratively through training.
  • The classification model also has hyperparameters, which guide the construction or training of the model, such as the number of training iterations, the learning rate, the batch size, the number of model layers, and the number of neurons in each layer.
  • the hyperparameters can be parameters obtained by training the model through the training set, or they can be preset parameters, and the preset parameters will not be updated through the training of the model.
  • the process of training the classification model by the training sub-module 132 can refer to the existing training process.
  • For example, the training process can be as follows: input the training data items in the training set into the classification model, use the label corresponding to each training data item as a reference, and use a loss function to obtain the loss value between the output of the classification model and the corresponding label; then adjust the parameters of the classification model according to the loss value. The classification model is trained iteratively over the training data items in the training set, and its parameters are continuously adjusted until, for a given input training data item, the model outputs a value close to the corresponding label with high accuracy, for example when the loss function is minimal or smaller than a reference threshold.
  • The loss function used during training measures how well the classification model has been trained, that is, it computes the difference between the result predicted by the classification model and the true value.
  • During training, since the output of the classification model should be as close as possible to the true value (that is, the corresponding label), the value predicted by the current classification model can be compared with the true value, and the parameters of the model updated according to the difference between the two. Each training step uses the loss function to judge this difference and updates the parameters of the classification model; once the model can predict values very close to the true values, it is considered trained.
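  • A generic supervised loop of this kind might be sketched as follows; PyTorch, the Adam optimizer, and cross-entropy loss are assumptions for illustration, since the patent does not name a framework or a specific loss function:

```python
import torch
import torch.nn as nn

def train_classifier(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Forward pass, loss against the label, backward pass, parameter update."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                # irrelevant data item and its label
            optimizer.zero_grad()
            loss = criterion(model(x), y)  # difference between prediction and label
            loss.backward()                # gradients of the loss
            optimizer.step()               # adjust the model parameters
    return model
```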
  • the "classification model” in the embodiments of the present disclosure may also be called a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may also be referred to as a "model” for short. Publicity is not limited to this.
  • the trained classification model may be stored in the model storage module 140 .
  • model storage module 140 may be part of model training module 130 .
  • the evaluation sub-module 134 can be used to evaluate the classification model. Specifically, the evaluation result of the trained classification model may be determined based on the second part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the second part. The evaluation results can be used to characterize the significance of data bias in the data set to be processed.
  • the set of the second part of irrelevant data items may be a test set, and correspondingly, the second part of irrelevant data items may be test data items.
  • the evaluation process may include: inputting a test data item into a trained classification model, obtaining a prediction result about the test data item, and determining an evaluation result based on a comparison result of the prediction result with a label of the test data item.
  • In some embodiments, the evaluation result may include at least one of the following: accuracy, precision, recall, the F1 index, the precision-recall (P-R) curve, the average precision (AP) index, the false positive rate, the false negative rate, and so on.
  • a confusion matrix may be constructed, which shows the number of positive examples (Positive, also called positive) and negative examples (Negative, also called negative), real values, predicted values, and the like.
  • the accuracy rate refers to the proportion of correctly classified samples to the total samples. For example, the number of test data items in the test set is N2, and the number of predicted results consistent with the label is N21, then the accuracy rate can be expressed as N21/N2.
  • The precision rate refers to the proportion of samples predicted to be positive that are actually positive. For example, if the number of test data items in the test set is N2, the number of items predicted as positive is N22, and the number of those N22 items that are actually positive is N23, then the precision rate can be expressed as N23/N22.
  • Recall refers to the proportion of samples that are actually positive that are predicted to be positive. For example, if the number of test data items in the test set is N2, the number labeled as positive examples is N31, and the number of those predicted as positive is N32, then the recall rate can be expressed as N32/N31.
  • the P-R curve defines the horizontal axis as the recall rate and the vertical axis as the precision rate.
  • A point on the P-R curve represents the following: under a certain threshold, the model judges results greater than the threshold as positive samples and results smaller than the threshold as negative samples, and the point gives the recall and precision corresponding to that threshold. The entire P-R curve is generated by moving the threshold from high to low; points near the origin represent the precision and recall of the model when the threshold is at its maximum.
  • The F1 index, also known as the F1 score, is the harmonic mean of precision and recall: the ratio of twice the product of the precision rate and the recall rate to the sum of the precision rate and the recall rate.
  • the evaluation result may include a positive example characterization value, such as a first accuracy rate and/or a first recall rate.
  • the first correct rate indicates the proportion of samples that are actually positive among the samples that are predicted to be positive.
  • the first recall rate represents the proportion of the samples that are actually positive that are predicted to be positive.
  • the evaluation result may include a negative example characterization value, such as a second accuracy rate and/or a second recall rate.
  • the second correct rate indicates the proportion of samples that are actually negative among the samples that are predicted to be negative.
  • the second recall rate represents the proportion of samples that are actually negative that are predicted to be negative.
  • the evaluation result may include a first predicted mean value and/or a second predicted mean value.
  • the first predicted mean represents the average of predicted values for samples that are actually positive.
  • the second predicted mean represents the average of predicted values for samples that are actually negative.
  • In some embodiments, the evaluation result may include a mean difference, which represents the difference between the first predicted mean and the second predicted mean, expressed for example as the difference between the two or as their ratio.
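  • The metrics above can be computed directly from prediction/label pairs; a sketch, with hypothetical names (positive = 1, negative = 0):

```python
def evaluation_results(y_true, y_pred, scores=None):
    """Confusion-matrix based metrics named in the text; `scores` are raw
    predicted values, used for the predicted means and the mean difference."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # "first correct rate"
    recall = tp / (tp + fn) if tp + fn else 0.0     # "first recall rate"
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    result = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    if scores is not None:
        pos = [s for s, t in zip(scores, y_true) if t == 1]  # actually-positive scores
        neg = [s for s, t in zip(scores, y_true) if t == 0]  # actually-negative scores
        result["mean_difference"] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return result
```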
  • the evaluation result can be presented to the user by the input/output module 110 .
  • it may be presented through a graphical user interface, which is convenient for users to view.
  • the bias significance of the data set can be characterized in a quantitative form.
  • This quantitative evaluation scheme can provide users with a clear reference, which is convenient for users to adjust the data set and other processing.
  • the input/output module 110 can also visually present representations of dataset biases through the graphical user interface.
  • Specifically, by inputting a target irrelevant data item into the trained classification model, a class activation map (CAM) is obtained. An overlay result is then obtained by superimposing the CAM and the target irrelevant data item, and the overlay result is displayed.
  • the class activation map is the class activation heat map.
  • Embodiments of the present disclosure can use the CAM to characterize the attention areas of the classification model, specifically, which areas (that is, the areas the model attends to) cause the bias.
  • For example, the CAM can be obtained using gradient-based CAM (Grad-CAM).
  • Specifically, the output of the last convolutional layer of the classification model (that is, the last-layer feature maps) can be extracted, and the extracted feature maps can be weighted and summed to obtain the CAM.
  • the weighted and summed results can also be used as a CAM after being processed by a Rectified Linear Unit (ReLU) activation function.
  • the weights for weighted summation here can be the weights of the top fully connected layer.
  • For example, the partial derivatives of the classification model's final softmax output with respect to all pixels of the last-layer feature maps can be calculated, and the global average over the width and height dimensions taken as the corresponding weights.
  • Embodiments of the present disclosure do not limit the manner in which the CAM and the target-independent data item (such as the background image) are superimposed.
  • weighted summation may be used for superimposition.
  • For example, the weights of the CAM and the background image may be equal.
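  • A Grad-CAM sketch along these lines follows; the hook-based feature extraction, the `last_conv` handle, and the equal-weight grayscale overlay are assumptions for illustration, not the patent's prescribed implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam_overlay(model, image, target_class, last_conv, alpha=0.5):
    """Weight the last conv feature maps by the global average of the class
    score's gradients, apply ReLU, then blend the CAM with the image."""
    feats, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image.unsqueeze(0))     # image: C x H x W float tensor
    logits[0, target_class].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # global average pooling
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted sum + ReLU
    cam = cam / (cam.max() + 1e-8)                       # normalise to [0, 1]
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:], mode="bilinear")[0]
    gray = image.mean(dim=0, keepdim=True)               # grayscale background
    return alpha * cam + (1 - alpha) * gray              # equal weights at alpha=0.5
```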
  • the embodiments of the present disclosure provide a solution for quantitatively evaluating and visually presenting data set bias, so that the significance of data set bias can be clearly characterized, and the specific location where bias occurs can be visually presented. In this way, users can more intuitively and comprehensively know the bias of the data set.
  • This solution does not require too much user participation, can be automated, and can improve the efficiency of processing while ensuring the accuracy of the quantitative assessment of bias.
  • the model training module 130 can also be used to adjust the data set to be processed based on the classification model.
  • the data set to be processed may have an initial sample weight distribution, correspondingly, the first data set has a first sample weight distribution, and the second data set has a second sample weight distribution.
  • For example, if the initial sample weight of a target data item to be processed is a, then the sample weight of the irrelevant data item generated from that target data item is also a.
  • The model training module 130 can be used to obtain the recommended sample weight distribution based on iterative training of the classification model, as described below in conjunction with FIG. 3.
  • FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure.
  • At 310, a first data set having a first sample weight distribution and a second data set having a second sample weight distribution are determined.
  • an unrelated data set may be constructed based on the data set to be processed, and the unrelated data set may be divided into a first data set and a second data set, as described in the above embodiments.
  • the data items to be processed in the data set to be processed may have initial sample weights, that is, the data set to be processed may have an initial sample weight distribution.
  • the initial sample weight may be input by the user through the input/output module 110 .
  • initialization sample weights may be determined through an initialization process.
  • The sample weight can be used to indicate the sampling probability of a data item to be processed. For example, assuming the sample weight of the i-th data item to be processed is w_i, the sampling probability of the i-th data item to be processed is w_i / Σ_j w_j, where the sum runs over all data items to be processed.
  • the initial sample weight distribution may indicate that the sampling probabilities of each data item to be processed in the data set to be processed are equal. Assuming that the data set to be processed includes N data items to be processed, and the initial sample weight of each data item to be processed is 1, then the sampling probability of each data item to be processed is initialized to 1/N.
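  • In code, the weight-to-probability mapping and the resulting weighted sampling could look like this (NumPy; purely illustrative):

```python
import numpy as np

def sampling_probabilities(weights: np.ndarray) -> np.ndarray:
    """Sample weight w_i maps to sampling probability w_i / sum_j w_j."""
    return weights / weights.sum()

# Initialisation: N items with weight 1 each -> probability 1/N per item.
w = np.ones(1000)
p = sampling_probabilities(w)                   # every entry is 1/1000
batch = np.random.choice(len(w), size=64, p=p)  # weighted sampling of a batch
```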
  • the first sample weight distribution and the second sample weight distribution can be correspondingly determined.
  • At 320, the first data set is sampled based on the first sample weight distribution, and the classification model is trained iteratively.
  • At 330, the classification model trained at 320 is evaluated based on the second data set, and an evaluation result is obtained.
  • the evaluation result may be obtained based on the comparison of the predicted result of the trained classification model for the irrelevant data item in the second data set with the label of the irrelevant data item.
  • irrelevant data items can be input into the trained classification model to obtain prediction results about the irrelevant data items, and the evaluation results are determined based on the comparison results of the prediction results of the irrelevant data items and the labels of the irrelevant data items.
  • the evaluation result may include at least one of the following: accuracy rate, precision rate, recall rate, F1 index, precision rate-recall rate curve, average precision index, false positive rate, false negative rate, and the like.
  • If the evaluation result is not greater than the preset threshold, the process may proceed to 360.
  • the preset threshold can be set based on the processing accuracy and application scenarios of the data set to be processed.
  • the preset threshold may be related to the specific meaning of the evaluation result. For example, the evaluation result includes a correct rate, and the preset threshold may be set to, for example, 30% or 50% or other numerical values.
  • Otherwise, if the evaluation result is greater than the preset threshold, at 350 the sample weight distribution is updated.
  • it may return to 310 to continue execution, that is to say, rebuild the first data set and the second data set.
  • an irrelevant data item may belong to the first data set in the previous cycle, but the irrelevant data item may belong to the first data set or the second data set in the next cycle.
  • it may return to 320 to continue execution, that is to say, the irrelevant data items in the first data set and the second data set do not change, but the first sample weight distribution and/or the second sample weight distribution are updated.
  • the first data set may be re-sampled based on the updated first sample weight distribution, and the classification model may be re-trained iteratively. And the retrained classification model is evaluated based on the second data set, and the evaluation result is obtained again.
  • 310 to 350 or 320 to 350 may be iteratively performed until the evaluation result indicates that the bias is not significant (for example, the evaluation result is not greater than the preset threshold).
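  • The overall loop of process 300 can be sketched as follows; `build_sets`, `train`, `evaluate`, and `update_weights` are hypothetical placeholders for the steps described above, not functions defined by the patent:

```python
def recommend_weights(dataset, weights, threshold, max_iters=50):
    """Iterate 310-350 until the bias is no longer significant (360)."""
    for _ in range(max_iters):
        (set1, w1), (set2, w2) = build_sets(dataset, weights)  # 310 (placeholder)
        model = train(set1, w1)                                # 320 (placeholder)
        result = evaluate(model, set2, w2)                     # 330 (placeholder)
        if result <= threshold:            # bias not significant: done
            return weights                 # recommended sample weight distribution
        weights = update_weights(weights, result)              # 350 (placeholder)
    return weights
```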
  • the sample weight distribution may be updated in a random manner.
  • For example, the sample weights of some data items to be processed can be randomly updated: the sample weight of one data item to be processed is updated from 1 to 2, the sample weight of another is updated from 1 to 3, and so on. It can be understood that the random method is nondeterministic, which may make the process of obtaining the recommended sample weight distribution take a long time.
  • a predetermined rule may be used to update the sample weight distribution.
  • For example, the second sample weight distribution may be updated. If the evaluation result indicates that the classification model's prediction for an irrelevant data item in the second data set differs from that item's label, the sample weight of the irrelevant data item may be increased, for example from a1 to a1+1 or 2*a1 or another value. In this example, the first sample weight distribution may remain unchanged, or it may be updated in other ways. A sketch of such a rule is given below.
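  • A sketch of such a predetermined rule (doubling the weight is one of the example updates above; the index bookkeeping is an assumption):

```python
def rule_based_update(weights, predictions, labels, indices):
    """If the model's prediction for an irrelevant item in the second set
    disagrees with its label, increase that item's sample weight."""
    new_weights = dict(weights)
    for i, pred, label in zip(indices, predictions, labels):
        if pred != label:
            new_weights[i] = 2 * new_weights[i]  # or w + 1, etc.
    return new_weights
```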
  • the second data set may be exchanged with the first data set before performing the next cycle. For example, in the next cycle, the classification model will be trained based on the second data set of the previous cycle and the updated second sample weight distribution.
  • the distribution of sample weights may be optimized through a genetic algorithm to update the distribution of sample weights.
  • the sample weight distribution can be used as the initial value of the genetic algorithm, and the objective function can be constructed based on the evaluation result obtained at 330; the genetic algorithm can then be used to optimize the sample weight distribution, and the optimized sample weight distribution serves as the updated sample weight distribution.
  • the embodiment of the present disclosure does not limit the construction method of the objective function of the genetic algorithm.
  • for example, if the evaluation result includes the mean difference and the accuracy for the positive samples and the negative samples, then the sum of the mean difference and the accuracy can be used as the objective function. It is understandable that other methods can also be used to construct the objective function; they are not listed here one by one.
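  • The following is one possible sketch of such a genetic-algorithm update, under stated assumptions: `bias_objective` is only a stand-in for the real train-and-evaluate pipeline at 330, and the population size, mutation rate, and clipping bounds are illustrative choices, not values fixed by the disclosure:

```python
# Hypothetical sketch of the genetic-algorithm update: the current sample
# weight distribution seeds the initial population, and the objective
# (to be minimized) stands in for the evaluation result.
import numpy as np

rng = np.random.default_rng(0)

def bias_objective(weights: np.ndarray) -> float:
    # Stand-in for "train on the first set, evaluate on the second set";
    # replace with the real pipeline. Lower means less significant bias.
    return float(np.abs(weights - weights.mean()).mean())

def ga_update(weights, pop_size=20, generations=30, sigma=0.3):
    n = len(weights)
    pop = np.clip(weights + sigma * rng.standard_normal((pop_size, n)), 0.01, None)
    pop[0] = weights                                   # seed with current distribution
    for _ in range(generations):
        scores = np.array([bias_objective(p) for p in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]          # selection
        cut = n // 2
        children = np.concatenate(                                   # one-point crossover
            [parents[rng.integers(len(parents), size=pop_size)][:, :cut],
             parents[rng.integers(len(parents), size=pop_size)][:, cut:]], axis=1)
        mutate = rng.random(children.shape) < 0.1                    # sparse mutation
        children += sigma * rng.standard_normal(children.shape) * mutate
        pop = np.clip(children, 0.01, None)
    scores = np.array([bias_objective(p) for p in pop])
    return pop[np.argmin(scores)]

print(ga_update(np.ones(8)))
```

  • Note that in the real pipeline, scoring each candidate weight vector would require retraining and re-evaluating the classification model, so the population size and number of generations would likely be kept small.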
  • the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain the recommended sample weight distribution. This process does not require user participation and has a high degree of automation.
  • user modifications to the sample weight distribution may be acquired to update the sample weight distribution.
  • the user can empirically infer what modification to the sample weight distribution should be made by referring to the evaluation results and/or the displayed overlay results (as described above), and then input the modification through the input/output module 110 to update the sample weight distribution.
  • this method can fully consider the user's needs, and update the sample weight distribution based on the user's modification, so that the obtained recommended sample weight distribution can better meet the user's expectations and improve user satisfaction.
  • the sample weight distribution obtained from the current evaluation result may be used as the recommended sample weight distribution.
  • the embodiments of the present disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe how the data set bias changes as the sample weight distribution is updated; in this way the data set to be processed can be inspected iteratively, yielding an effective recommended sample weight distribution with high reference value.
  • the input/output module 110 can also present the recommended sample weight distribution for the user as a reference for further adjustment of the data set to be processed.
  • the recommended sample weight distribution is presented visually through a graphical user interface.
  • the data set processing module 120 may add or delete the data set to be processed based on the obtained recommended sample weight distribution, so as to construct an unbiased data set.
  • the data set processing module 120 may copy the data items to be processed with a large recommended sample weight, so as to expand the number of data items to be processed in the data set to be processed.
  • the data set processing module 120 may delete data items to be processed whose recommended sample weights are small, so as to reduce the number of data items to be processed in the data set to be processed.
  • a user's deletion instruction for some data items to be processed may be obtained via the input/output module 110, so as to delete some data items to be processed.
  • Other data items input by the user may be obtained via the input/output module 110 to be added to the current data set to be processed.
  • users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
  • other similar samples may be other images collected by the same image collection device (or the same model of device) in a similar environment (such as under similar care conditions).
  • data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased data set can be used to train more robust and unbiased task-specific models.
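  • A minimal sketch of this add/delete adjustment is given below; the thresholds `hi` and `lo` and the rounding rule are illustrative assumptions, not values fixed by the disclosure:

```python
# Hypothetical sketch: adjusting the data set to be processed from the
# recommended sample weights, duplicating high-weight items and dropping
# low-weight ones.
def adjust_dataset(items, weights, hi=2.0, lo=0.5):
    adjusted = []
    for item, w in zip(items, weights):
        if w <= lo:                             # delete items with small recommended weight
            continue
        copies = round(w) if w >= hi else 1     # copy items with large recommended weight
        adjusted.extend([item] * copies)
    return adjusted

print(adjust_dataset(["a", "b", "c"], [3.0, 1.0, 0.2]))  # -> ['a', 'a', 'a', 'b']
```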
  • the system 100 shown in FIG. 1 may be a system capable of interacting with users; the system 100 may be a software system, a hardware system, or a system combining hardware and software.
  • the system 100 can be implemented as a computing device or a part of a computing device, where the computing device includes but is not limited to a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
  • the system 100 shown in FIG. 1 can be implemented as an artificial intelligence platform (AI platform).
  • AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users.
  • Various AI models or AI sub-models for solving different problems can be built into the AI platform, and the AI platform can establish an applicable AI model according to the needs input by users. That is, users only need to specify their needs on the AI platform and, following the prompts, prepare a data set and upload it to the AI platform; the AI platform can then train an AI model for the user that can be used to fulfill the user's needs.
  • the AI model in the embodiments of the present disclosure can be used to evaluate the data bias of the data set to be processed input by the user.
  • FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure.
  • the system 100 is fully deployed in the cloud environment 410 .
  • the cloud environment 410 is an entity that provides cloud services to users by using basic resources in the cloud computing mode.
  • the cloud environment 410 includes a cloud data center 412 and a cloud service platform 414.
  • the cloud data center 412 includes a large number of basic resources (comprising computing resources, storage resources and network resources) owned by the cloud service provider.
  • the computing resources included in the cloud data center 412 may be a large number of computing devices (such as servers).
  • the system 100 can be independently deployed on a server or a virtual machine in the cloud data center 412; the system 100 can also be deployed in a distributed manner on multiple servers in the cloud data center 412, on multiple virtual machines in the cloud data center 412, or across both servers and virtual machines in the cloud data center 412.
  • the system 100 can be abstracted into an AI development cloud service 424 by the cloud service provider on the cloud service platform 414 and provided to the user (with settlement based on usage, for example); the cloud environment 410 then utilizes the system 100 deployed in the cloud data center 412 to provide the AI development cloud service 424 to the user.
  • the user can upload the data set to be processed through an application program interface (application program interface, API) or GUI.
  • the system 100 in the cloud environment 410 receives the data set to be processed uploaded by the user, and can perform operations such as data set processing, model training, and data set adjustment.
  • the system 100 can return the evaluation result of the model, the weight distribution of recommended samples, etc. to the user through API or GUI.
  • when the system 100 in the cloud environment 410 is abstracted into the AI development cloud service 424 and provided to users, it can be divided into two parts, for example a data set bias evaluation cloud service and a data set adjustment cloud service.
  • the user can only purchase the data set bias evaluation cloud service.
  • the cloud service platform 414 can construct an irrelevant data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return the evaluation result of the classification model to the user, so that the user is informed of the bias significance of the data set to be processed.
  • the user can also further purchase the data set adjustment cloud service on the cloud service platform 414.
  • the cloud service platform 414 can iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user can add data items to or delete data items from the data set to be processed with reference to the recommended sample weight distribution, to construct an unbiased data set.
  • FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure.
  • the system 100 is deployed in a distributed manner across different environments, which may include but are not limited to at least two of the cloud environment 510, the edge environment 520, and the terminal computing device 530.
  • System 100 may be logically divided into multiple sections, each section having a different function.
  • the system 100 includes an input/output module 110 , a data set processing module 120 , a model training module 130 , a model storage module 140 and a data storage module 150 .
  • Each part of the system 100 can be deployed in any two or three environments of the terminal computing device 530 , the edge environment 520 and the cloud environment 510 .
  • Various parts of the system 100 deployed in different environments cooperate to provide users with various functions.
  • the input/output module 110 and the data storage module 150 of the system 100 are deployed in the terminal computing device 530, the data set processing module 120 of the system 100 is deployed in an edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510.
  • the user sends the data set to be processed to the input/output module 110 in the terminal computing device 530 , and the terminal computing device 530 stores the data set to be processed to the data storage module 150 .
  • the data set processing module 120 in the edge computing device of the edge environment 520 constructs an irrelevant data set based on the data set to be processed from the terminal computing device 530 .
  • the model training module 130 in the cloud environment 510 trains a classification model based on an unrelated dataset from the edge environment 520 .
  • the cloud environment 510 may also store the trained classification model in the model storage module 140. It should be understood that this application does not limit which parts of the system 100 are deployed in which environment; in actual applications, the deployment can be adapted according to the computing capability of the terminal computing device 530, the resource occupancy of the edge environment 520 and the cloud environment 510, or specific application requirements.
  • the edge environment 520 is an environment including a collection of edge computing devices that are closer to the terminal computing device 530 , and the edge computing devices include but are not limited to: edge servers, edge small stations with computing capabilities, and the like. It can be understood that the system 100 may also be independently deployed on one edge server in the edge environment 520 , or may be deployed on multiple edge servers in the edge environment 520 in a distributed manner.
  • the terminal computing device 530 includes, but is not limited to: a terminal server, a smart phone, a notebook computer, a tablet computer, a personal desktop computer, a smart camera, and the like. It can be understood that the system 100 may also be independently deployed on one terminal computing device 530 , or may be deployed on multiple terminal computing devices 530 in a distributed manner.
  • FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure.
  • the computing device 600 in FIG. 6 may be implemented as a device in the cloud environment 510 in FIG. 5 , a device in the edge environment 520 , or a terminal computing device 530 .
  • the computing device 600 shown in FIG. 6 can also be regarded as a computing device cluster; that is, the computing device 600 may include one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and terminal computing devices 530.
  • the computing device 600 includes a memory 610 , a processor 620 , a communication interface 630 and a bus 640 , wherein the bus 640 is used for communication between various components of the computing device 600 .
  • the memory 610 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory or any combination thereof.
  • the memory 610 can store programs, and when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that can be performed by the various modules in the system 100 as described above. It should be understood that the processor 620 and the communication interface 630 may also be used to execute part or all of the content in the embodiments of the data processing method described below in this specification.
  • the memory can also store datasets and classification models.
  • a part of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as the data set to be processed and the irrelevant data set, and a part of the storage resources in the memory 610 is divided into a model storage module for storing classification models.
  • the processor 620 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU) or any combination thereof.
  • Processor 620 may include one or more chips.
  • the processor 620 may include an accelerator, such as a Neural Processing Unit (Neural Processing Unit, NPU).
  • the communication interface 630 uses a transceiver module such as a transceiver to implement communication between the computing device 600 and other devices or communication networks. For example, data may be acquired through communication interface 630 .
  • Bus 640 may include pathways for communicating information between various components of computing device 600 (eg, memory 610 , processor 620 , communication interface 630 ).
  • FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure.
  • the method 700 shown in FIG. 7 can be executed by the system 100 .
  • an irrelevant data set is constructed based on the data set to be processed; the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the data set to be processed includes a plurality of data items to be processed, and each data item to be processed has a label.
  • Data items to be processed may include tag-related parts and tag-independent parts.
  • the part associated with the label of the target data item to be processed may be removed from the target data item to be processed in the data set to be processed, to obtain the remaining part of the target data item to be processed; the remaining part is then used to construct an irrelevant data item in the irrelevant data set, and the label of that irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set, that is, the data item to be processed is an image.
  • image segmentation may be performed on the target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed.
  • a background image is used to construct an unrelated data item in an unrelated data set.
  • the part of the image associated with the label is the foreground area, and the other areas in the image except the foreground area are the background area; foreground-background separation can thus be used to determine irrelevant data items based only on the background area.
  • the data items to be processed in the data set to be processed may be video sequences. A binary image of a video sequence can then be determined based on the gradient information between a frame image in the video sequence and the previous frame image, a background image of the video sequence is generated based on the binary image, and the background image of the video sequence is used to construct an irrelevant data item in the irrelevant data set.
  • FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure. Specifically, what is shown in FIG. 8 is the process of constructing irrelevant data items based on the data items to be processed (video sequences).
  • the gradient information between two adjacent frames of images in the target video sequence is calculated.
  • the gradient of the feature vectors of the two frames of images along the time dimension may be calculated, so as to obtain gradient information.
  • the static and unchanging background parts in the video sequence can be obtained, such as image borders and the like.
  • a gradient overlay map is obtained based on the overlay of the gradient information.
  • the gradient information obtained in step 810 may be weighted and summed, maximized, or minimized, etc., to complete the superposition and obtain the gradient overlay map.
  • thresholding is performed on the gradient overlay image to obtain an initial binary image.
  • the initial binary image is subjected to several iterations of morphological dilation, followed by the same number of iterations of morphological erosion, so as to obtain the binary image.
  • a background image is obtained based on the binary image, and the background image is used as an irrelevant data item corresponding to the video sequence.
  • a matting operation may be performed with the binary image, for example using a matrix dot product, to obtain the background image.
  • the background image corresponding to the video sequence can be obtained in consideration of the similarity between the frame images in the video sequence and the fact that the background in the video sequence is basically unchanged.
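  • The following Python sketch assembles steps 810 to 850 for a grayscale video, assuming SciPy morphology is available; the threshold and iteration counts are illustrative assumptions, not values fixed by the disclosure:

```python
# Hypothetical sketch of process 800, assuming `frames` is a (T, H, W)
# grayscale video sequence.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def background_item(frames: np.ndarray, thresh: float = 10.0, n_morph: int = 3) -> np.ndarray:
    grads = np.abs(np.diff(frames.astype(float), axis=0))   # 810: adjacent-frame gradients
    overlay = grads.sum(axis=0)                             # 820: gradient overlay map
    moving = overlay > thresh                               # 830: thresholding
    moving = binary_dilation(moving, iterations=n_morph)    # 840: dilation, then ...
    moving = binary_erosion(moving, iterations=n_morph)     # ... the same number of erosions
    mask = ~moving                                          # static background mask
    return frames[0] * mask                                 # 850: matting via element-wise product

frames = np.random.randint(0, 255, size=(8, 32, 32))
print(background_item(frames).shape)
```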
  • the label of the irrelevant data item is determined based on the label of the data item to be processed. Specifically, if the target data item to be processed has label A, and the target irrelevant data item is obtained by processing the target data item to be processed (such as image segmentation), then the label of the target unrelated data item is also label A.
  • the irrelevant data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
  • the sample weights of the irrelevant data items are determined based on the sample weights of the data items to be processed. Specifically, if the target data item to be processed has a sample weight w, and the target irrelevant data item is obtained by processing the target data item to be processed (such as by image segmentation), then the sample weight of the target irrelevant data item is also w.
  • the manner of dividing the first data set and the second data set is not limited. For example, it may be divided in a manner of 9:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is about 9:1. For example, it may be divided in a manner of 1:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is approximately 1:1.
  • the first data set can also be further divided into the first sub-data set and the second sub-data set, for example, the ratio of the number of irrelevant data items in the first sub-data set to the number of irrelevant data items in the second sub-data set is about 7:2. It can be understood that the ratios listed here are only for illustration, and are not intended to limit the embodiments of the present disclosure.
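  • For example, a simple sketch of such a weighted 9:1 split (the `split` helper and the fixed random seed are illustrative assumptions) could look as follows:

```python
# Hypothetical sketch: a 9:1 split of the irrelevant data set, with each
# irrelevant item carrying the sample weight of the data item it was built from.
import numpy as np

rng = np.random.default_rng(0)

def split(items, weights, ratio=0.9):
    idx = rng.permutation(len(items))
    cut = int(ratio * len(items))
    first, second = idx[:cut], idx[cut:]
    return ([items[i] for i in first], [weights[i] for i in first],
            [items[i] for i in second], [weights[i] for i in second])

items, weights = list(range(10)), [1.0] * 10
d1, w1, d2, w2 = split(items, weights)
print(len(d1), len(d2))  # -> 9 1
```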
  • the classification model is trained based on the first data set and the first sample weight distribution.
  • the first data set may be sampled based on the first sample weight distribution, and the classification model may be trained on the sampled first data set using the labels of the irrelevant data items in the first data set.
  • the classification model can be trained by using the first data set as a training set.
  • the first data set may be preprocessed, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, and the like.
  • the embodiment of the present disclosure does not limit the specific structure of the classification model, for example, it may be a convolutional neural network, including at least a convolutional layer and a fully connected layer.
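  • As one hedged sketch of this training step, the PyTorch code below samples the first data set according to the first sample weight distribution and trains a small classifier with a convolutional layer and a fully connected layer; the network shape, stand-in tensors, and hyperparameters are illustrative assumptions, since the disclosure does not fix a specific structure:

```python
# Hypothetical sketch: weighted sampling of the first data set plus one
# epoch of training for a minimal conv + fully-connected classifier.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

x = torch.randn(90, 1, 32, 32)            # stand-in irrelevant data items
y = torch.randint(0, 2, (90,))            # labels carried over from the items
w = torch.ones(90)                        # first sample weight distribution

sampler = WeightedRandomSampler(w, num_samples=len(w), replacement=True)
loader = DataLoader(TensorDataset(x, y), batch_size=16, sampler=sampler)

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(8 * 4 * 4, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:                     # one epoch of weight-driven training
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```

  • Sampling with replacement is one natural way to realize a weight distribution: items with larger weights are simply drawn more often during training.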
  • the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
  • the second data set can be used as a test set to obtain an evaluation result.
  • the evaluation result may be obtained based on a prediction result of the classification model for the irrelevant data item in the second data set and a comparison result between labels of the unrelated data item in the second data set.
  • the evaluation result may include a first accuracy rate for positive samples in the second data set and a second accuracy rate for negative samples in the second data set.
  • the sample weight distribution of the data set to be processed may be updated.
  • the sample weight distribution of the data set to be processed is updated. After this, the process returns to 720 to obtain the first data set and the second data set again, and 730 and 740 are repeatedly executed until the evaluation result obtained at block 740 indicates that the bias is not significant (or that there is no significant bias), for example, the evaluation result is not greater than a preset threshold. Subsequently, the sample weight distribution when the evaluation result is not greater than the preset threshold may be used as the recommended sample weight distribution, and the recommended sample weight distribution may be output.
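  • A compact sketch of this train-evaluate-update loop is shown below; all three helper functions are stubs standing in for the real 720/730/740 steps, and the threshold is illustrative:

```python
# Hypothetical sketch of the 720-740 loop; the stubs below only stand in
# for the real split / train / evaluate / update steps described above.
def split_sets(weights): return weights, weights                  # stub for 720
def train_and_evaluate(w1, w2): return sum(w2) / (len(w2) * 4)    # stub: score in [0, 1]
def update_weights(weights): return [max(0.1, w * 0.9) for w in weights]  # stub update

def recommend(weights, threshold=0.2, max_iters=50):
    for _ in range(max_iters):
        w1, w2 = split_sets(weights)
        result = train_and_evaluate(w1, w2)
        if result <= threshold:            # bias no longer significant
            return weights                 # recommended sample weight distribution
        weights = update_weights(weights)  # update and loop again
    return weights

print(recommend([1.0] * 8))
```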
  • the embodiment of the present disclosure does not limit the specific method of updating the sample weight distribution.
  • at least one of the following methods can be used for the update: updating the sample weight distribution using predetermined rules, updating the sample weight distribution in a random manner, acquiring the user's modification to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution through a genetic algorithm to update the sample weight distribution.
  • updating the sample weight distribution may update the first sample weight distribution of the first data set, so that when the process returns to 720, the first sample weight distribution of the first data set in the re-executed 720 is updated, and the classification model trained at 730 is therefore also updated.
  • updating the sample weight distribution may update the first sample weight distribution of the first data set and update the second sample weight distribution of the second data set.
  • the sample weight distribution of the data set to be processed can be updated, and the irrelevant data set can be re-partitioned.
  • the sample weight distribution of the data set to be processed may be updated so as to adaptively update the first sample weight distribution and the second sample weight distribution, while the irrelevant data items in the first data set and the second data set remain unchanged. In this way, when the process returns to 720, the first data set in the re-executed 720 is updated or its first sample weight distribution is updated, and the classification model trained at 730 is then also updated.
  • updating the sample weight distribution may update the second sample weight distribution of the second data set.
  • the first sample weight distribution may remain unchanged.
  • the first data set and the second data set from the last execution of 720 may be exchanged, so that the first data set when returning to 730 is the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, making the classification model's evaluation result for the bias significance more accurate.
  • FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
  • an irrelevant data set is constructed based on the data set to be processed, the irrelevant data set includes irrelevant data items with labels, and the labels of the unrelated data items are determined based on the labels of the data items to be processed in the data set to be processed of.
  • the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first sample weight distribution and The second sample weight distribution is determined based on the sample weights of the data items to be processed in the data set to be processed.
  • a classification model is trained based on the first data set and the first sample weight distribution.
  • the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
  • the second sample weight distribution for the second data set is updated.
  • sample weights of all irrelevant data items in the second data set may be updated, or the sample weights of some irrelevant data items in the second data set may be updated.
  • the weight distribution of the second sample may be updated based on a prediction result of the classification model for irrelevant data items in the second data set at 940 .
  • the sample weights of irrelevant data items in the second data set with correct predictions may be increased, or the sample weights of irrelevant data items in the second data set with wrong predictions may be decreased. For example, assuming that the sample weight of a first irrelevant data item in the second data set is 2, and the prediction result obtained by inputting this first irrelevant data item into the classification model is consistent with its label, then the sample weight of the first irrelevant data item in the second data set is increased, for example, from 2 to 3, 4, or another value.
  • conversely, if the prediction result of a second irrelevant data item in the second data set is inconsistent with its label, the sample weight of that second irrelevant data item may be reduced, for example, from 2 to 1.
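  • A minimal sketch of this weight-update rule (the step size and floor are illustrative assumptions) might be:

```python
# Hypothetical sketch of the update at 960: raise the weight of items the
# model predicted correctly and lower the weight of items it got wrong.
import numpy as np

def update_second_weights(weights, preds, labels, step=1.0, floor=0.1):
    correct = preds == labels
    new = weights.copy()
    new[correct] += step                                      # e.g. 2 -> 3
    new[~correct] = np.maximum(new[~correct] - step, floor)   # e.g. 2 -> 1
    return new

w = np.array([2.0, 2.0, 2.0])
print(update_second_weights(w, np.array([1, 0, 1]), np.array([1, 1, 1])))  # -> [3. 1. 3.]
```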
  • the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
  • the first data set after exchange is the second data set in block 920
  • the first sample weight distribution of the first data set after exchange is the second sample weight distribution updated in block 960
  • the second data set after the swap is the first data set in block 920
  • the second sample weight distribution of the second data set after the swap is the first sample weight distribution in block 920.
  • execution returns to 930 . That is, the classification model is retrained using the first data set after the exchange in 970 .
  • the recommended sample weight distribution is output.
  • the sample weight distribution when the evaluation result is not greater than the preset threshold is used as the recommended sample weight distribution.
  • the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
  • the focus areas of data set bias can be presented in a visual manner. Specifically, a class activation map can be obtained by inputting a target irrelevant data item into the trained classification model; an overlay result is then obtained by superimposing the class activation map on the target irrelevant data item, and the overlay result is displayed.
  • the overlay result can be obtained by weighted summation with the heat map; by displaying the overlay result, it is possible to see visually which areas the classification model attends to, and these attention areas are important factors that cause bias.
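  • For illustration, the sketch below computes a classic class activation map from the last convolutional feature maps and the fully-connected weights, then blends it over the input image; the nearest-neighbour upsampling and the blending factor are assumptions, not requirements of the disclosure:

```python
# Hypothetical sketch: classic CAM computation and overlay for one class.
import numpy as np

def class_activation_map(features, fc_weights, cls):
    """features: (C, h, w) conv feature maps; fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[cls], features, axes=1)    # weighted sum over channels
    cam = np.maximum(cam, 0.0)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam

def overlay(image, cam, alpha=0.5):
    """Nearest-neighbour upsample of the CAM, then blend with the image."""
    reps = (image.shape[0] // cam.shape[0], image.shape[1] // cam.shape[1])
    heat = np.kron(cam, np.ones(reps))
    return (1 - alpha) * image + alpha * heat * image.max()

img = np.random.rand(32, 32)
feats = np.random.rand(8, 8, 8)
print(overlay(img, class_activation_map(feats, np.random.rand(2, 8), 1)).shape)
```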
  • after the recommended sample weight distribution is obtained, the method may optionally further include adjusting the data set to be processed based on the recommended sample weight distribution, to obtain an unbiased data set.
  • an unbiased data set can be constructed by adding data items to or deleting data items from the data set to be processed.
  • data items to be processed with a large recommended sample weight may be copied to expand the number of data items to be processed in the data set to be processed.
  • data items to be processed with small recommended sample weights may be deleted, so as to reduce the number of data items to be processed in the data set to be processed.
  • a user's deletion instruction for some data items to be processed may be obtained, so as to delete some data items to be processed.
  • Other data items entered by the user can be obtained to be added to the current pending data set.
  • users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
  • other similar samples may be other images collected by the same image collection device (or the same model of device) in a similar environment (such as under similar care conditions).
  • data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased data set can be used to train more robust and unbiased task-specific models.
  • Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
  • Apparatus 1000 may be implemented by software, hardware or a combination of both.
  • the device 1000 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
  • the device 1000 includes a construction unit 1010 , a division unit 1020 , a training unit 1030 and an evaluation unit 1040 .
  • the construction unit 1010 is configured to construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
  • the division unit 1020 is configured to divide the irrelevant data set into a first data set and a second data set, the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight The distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
  • the training unit 1030 is configured to train the classification model based on the first data set and the first sample weight distribution.
  • the evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of bias in the data set to be processed with the sample weight distribution.
  • the device 1000 may further include an update unit 1050 , an adjustment unit 1060 and a display unit 1070 .
  • the update unit 1050 is configured to update the sample weight distribution of the data set to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.
  • the updating unit 1050 may be configured to update a part of the sample weight distribution, so that the second sample weight distribution is updated without updating the first sample weight distribution.
  • the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: update the sample weight distribution using a predetermined rule, update the sample weight distribution in a random manner, and acquire user modification to the sample weight distribution to update the sample weight distribution, or optimize the sample weight distribution by genetic algorithm to update the sample weight distribution.
  • the updating unit 1050 may be configured to use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
  • the adjustment unit 1060 is configured to add or delete the data set to be processed based on the weight distribution of the recommended samples, so as to construct an unbiased data set.
  • the update unit 1050 is further configured to: obtain a class activation map by inputting a target irrelevant data item into the trained classification model; and obtain an overlay result by superimposing the class activation map on the target irrelevant data item.
  • the display unit 1070 is configured to display the recommended sample weight distribution and/or the superposition result.
  • the construction unit 1010 may be configured to remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item to be processed, so as to obtain the remaining part of the target data item to be processed; and to use the remaining part to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
  • the data set to be processed is an image data set
  • the construction unit 1010 may be configured to perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and to use the background image to construct an irrelevant data item in the irrelevant data set.
  • the data item to be processed in the data set to be processed is a video sequence
  • the construction unit 1010 may be configured to determine a binary image of the video sequence based on gradient information between a frame image in the video sequence and the previous frame image; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image of the video sequence.
  • the division of units in the embodiments of the present disclosure is schematic, and it is only a logical function division. In actual implementation, there may be other division methods.
  • the functional units in the disclosed embodiments can be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the data processing device 1000 shown in FIG. 10 can be used to implement the above data processing process shown in conjunction with FIGS. 7 to 9 .
  • the present disclosure can also be implemented as a computer program product.
  • a computer program product may include computer readable program instructions for carrying out various aspects of the present disclosure.
  • the present disclosure may be implemented as a computer-readable storage medium, on which computer-readable program instructions are stored, and when a processor executes the instructions, the processor is made to execute the above-mentioned data processing process.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer-readable program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer-readable program instructions.

Abstract

Provided are a data processing method and apparatus, a computing device, and a computer readable storage medium. In the method, an irrelevant dataset having a tag is constructed on the basis of a dataset to be processed; the irrelevant dataset is divided into a first dataset having a first sample weight distribution and a second dataset having a second sample weight distribution, the first and second sample weight distributions being determined on the basis of a sample weight of a data item to be processed in the dataset to be processed; a classification model is trained on the basis of the first dataset and the first sample weight distribution; the classification model is evaluated on the basis of the second dataset and the second sample weight distribution to obtain an evaluation result which indicates the bias significance of the dataset to be processed having sample weight distributions.

Description

Summary of the Invention
In a first aspect, a data processing method is provided. The method includes: constructing an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed; training a classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.

In this way, the embodiments of the present disclosure can assess the bias significance of a data set more accurately. This evaluation scheme also makes it convenient for users to adjust the data set or perform other processing.
In some embodiments of the first aspect, the method further includes: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and, based on the updated sample weight distribution, repeating the training and the evaluation until the evaluation result is not greater than the preset threshold.

In this way, the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain a recommended sample weight distribution. This process does not require user participation and is efficient and highly automated.
In some embodiments of the first aspect, updating the sample weight distribution includes: updating a part of the sample weight distribution, such that the second sample weight distribution is updated without updating the first sample weight distribution.

In some embodiments of the first aspect, the method further includes: using the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.

In this way, the embodiments of the present disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe how the data set bias changes as the sample weight distribution is updated, so that the data set to be processed can be inspected iteratively and an effective, highly accurate recommended sample weight distribution can be obtained.
In some embodiments of the first aspect, the method further includes: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.

In this way, in the embodiments of the present disclosure, the data set to be processed can be augmented or pruned based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Further, the unbiased data set can be used to train a more robust, unbiased task-specific model, thereby meeting actual needs.
In some embodiments of the first aspect, updating the sample weight distribution includes at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, acquiring the user's modification to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution through a genetic algorithm to update the sample weight distribution.
In some embodiments of the first aspect, constructing the irrelevant data set based on the data set to be processed includes: removing, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item to be processed, to obtain the remaining part of the target data item to be processed; and using the remaining part to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments of the first aspect, the data set to be processed is an image data set, and constructing the irrelevant data set based on the data set to be processed includes: performing image segmentation on a target data item to be processed in the data set to be processed, to obtain a background image corresponding to the target data item to be processed; and using the background image to construct an irrelevant data item in the irrelevant data set.

In this way, in the embodiments of the present disclosure, the background image serves as a representative of bias, so that the data set can be checked for bias.
In some embodiments of the first aspect, the data items to be processed in the data set to be processed are video sequences, and constructing the irrelevant data set based on the data set to be processed includes: determining a binary image of a video sequence based on gradient information between a frame image in the video sequence and the previous frame image; generating a background image of the video sequence based on the binary image; and using the background image of the video sequence to construct an irrelevant data item in the irrelevant data set.

In this way, considering the similarity between the frame images in a video sequence and the fact that the background in a video sequence is basically unchanged, the background image corresponding to the video sequence can be obtained.
In some embodiments of the first aspect, the method further includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; obtaining an overlay result by superimposing the CAM on the target irrelevant data item; and displaying the overlay result.

In this way, the embodiments of the present disclosure provide a scheme for quantitatively evaluating data set bias, which can explicitly characterize the significance of the bias and visually present the specific locations where the bias arises, so that users can understand the data set bias more intuitively and comprehensively. The scheme does not require much user participation and can be carried out automatically, improving processing efficiency while ensuring the accuracy of the quantitative evaluation of bias.
In a second aspect, a data processing apparatus is provided. The apparatus includes: a construction unit configured to construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; a division unit configured to divide the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed; a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed having the sample weight distribution.
In some embodiments of the second aspect, the apparatus further includes an update unit configured to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.

In some embodiments of the second aspect, the update unit is configured to: update a part of the sample weight distribution, such that the second sample weight distribution is updated without updating the first sample weight distribution.

In some embodiments of the second aspect, the update unit is configured to: use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
在第二方面的一些实施例中,还包括调整单元,被配置为:基于推荐样本权重分布,对待处理数据集进行增加或删除,以构建无偏数据集。In some embodiments of the second aspect, an adjustment unit is further included, configured to: add or delete the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
在第二方面的一些实施例中,其中更新单元被配置为通过以下至少一项来更新样本权重分布:采用预定的规则更新样本权重分布,采用随机的方式更新样本权重分布,获取用户对样本权重分布的修改以更新样本权重分布,或者通过遗传算法对样本权重分布进行优化以更新样本权重分布。In some embodiments of the second aspect, the update unit is configured to update the sample weight distribution by at least one of the following: update the sample weight distribution by using a predetermined rule, update the sample weight distribution in a random manner, and obtain the user's weight on the sample The distribution is modified to update the sample weight distribution, or the sample weight distribution is optimized by the genetic algorithm to update the sample weight distribution.
在第二方面的一些实施例中,其中构建单元被配置为:从待处理数据集的目标待处理数据项中去除与目标待处理数据项的标签相关联的部分,以得到目标待处理数据项中的剩余部分;以及利用剩余部分来构建无关数据集中的一条无关数据项,一条无关数据项的标签对应于目标待处理数据项的标签。In some embodiments of the second aspect, wherein the construction unit is configured to: remove the part associated with the label of the target data item to be processed from the target data item to be processed in the data set to obtain the target data item to be processed and using the remaining part to construct an irrelevant data item in the irrelevant data set, the label of an irrelevant data item corresponds to the label of the target data item to be processed.
在第二方面的一些实施例中,其中待处理数据集为图像数据集,并且其中构建单元被配置为:对待处理数据集中的目标待处理数据项执行图像分割,以得到与目标待处理数据项对应的背景图像;以及利用背景图像来构建无关数据集中的一条无关数据项。In some embodiments of the second aspect, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to: perform image segmentation on the target data item to be processed in the data set to be processed to obtain the target data item to be processed a corresponding background image; and using the background image to construct an unrelated data item in the unrelated data set.
在第二方面的一些实施例中,其中待处理数据集中的待处理数据项为视频序列,并且其中构建单元被配置为:基于视频序列中一帧图像与一帧图像的前一帧图像之间的梯度信息,确定视频序列的二值图像;基于二值图像,生成视频序列的背景图像;以及利用视频序列的背景图像来构建无关数据集中的一条无关数据项。In some embodiments of the second aspect, wherein the data item to be processed in the data set to be processed is a video sequence, and wherein the construction unit is configured to: The gradient information of the video sequence is determined to determine the binary image of the video sequence; based on the binary image, the background image of the video sequence is generated; and an irrelevant data item in the irrelevant data set is constructed by using the background image of the video sequence.
在第二方面的一些实施例中,还包括:更新单元,被配置为:通过将目标无关数据项输入经训练的分类模型,得到CAM;以及通过将CAM与目标无关数据项叠加,得到叠加结果;以及显示单元,被配置为显示叠加结果。In some embodiments of the second aspect, further comprising: an update unit configured to: obtain a CAM by inputting target-independent data items into the trained classification model; and obtain an overlay result by superimposing the CAM and the target-independent data items ; and a display unit configured to display an overlay result.
In a third aspect, a computing device is provided, including a processor and a memory, the memory storing instructions executable by the processor. When executed by the processor, the instructions cause the computing device to: construct an irrelevant data set based on a data set to be processed, the irrelevant data set including labeled irrelevant data items whose labels are determined based on the labels of the data items to be processed in the data set to be processed; divide the irrelevant data set into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on the sample weights of the data items to be processed in the data set to be processed; train a classification model based on the first data set and the first sample weight distribution; and evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed with the sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: update part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is left unchanged.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: take the sample weight distribution for which the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to update the sample weight distribution by at least one of: updating the sample weight distribution according to a predetermined rule, updating the sample weight distribution randomly, obtaining a user's modification of the sample weight distribution, or optimizing the sample weight distribution with a genetic algorithm.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, so as to obtain the remaining part of the target data item; and construct one irrelevant data item in the irrelevant data set using the remaining part, the label of the irrelevant data item corresponding to the label of the target data item.
In some embodiments of the third aspect, the data set to be processed is an image data set, and the instructions, when executed by the processor, cause the computing device to: perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and construct one irrelevant data item in the irrelevant data set using the background image.
In some embodiments of the third aspect, the data items to be processed in the data set to be processed are video sequences, and the instructions, when executed by the processor, cause the computing device to: determine a binary image of a video sequence based on gradient information between a frame image in the video sequence and the frame image preceding it; generate a background image of the video sequence based on the binary image; and construct one irrelevant data item in the irrelevant data set using the background image of the video sequence.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; obtain an overlay result by superimposing the CAM on the target irrelevant data item; and display the overlay result.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the operations of the method according to the first aspect or any of its embodiments.
In a fifth aspect, a chip or chip system is provided. The chip or chip system includes a processing circuit configured to perform the operations of the method according to the first aspect or any of its embodiments.
In a sixth aspect, a computer program or computer program product is provided. The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method according to the first aspect or any of its embodiments.
Description of Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements, in which:
FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure;
FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure;
FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of a process 800 of constructing irrelevant data items according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a process 900 of updating the sample weight distribution of a data set to be processed according to an embodiment of the present disclosure; and
FIG. 10 shows a schematic block diagram of a data processing apparatus 1000 according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "including" and similar expressions should be interpreted as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be read as "at least one embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
Artificial intelligence (AI) uses computers to simulate certain human thought processes and intelligent behaviors. The history of AI research follows a natural and clear path from a focus on "reasoning" to a focus on "knowledge" and then to a focus on "learning". AI has been widely applied in industries such as security, healthcare, transportation, education, and finance.
Machine learning is a branch of AI that studies how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms through learning from experience.
Deep learning is a class of machine learning techniques based on deep neural network algorithms, whose main characteristic is the use of multiple nonlinear transformation structures to process and analyze data. It is mainly applied in perception and decision-making scenarios in the field of AI, such as image and speech recognition, natural language translation, and computer game playing.
Data and algorithms are the two main pillars of AI; correspondingly, data bias is a key concern in the AI field. For a specific machine learning task, the data may contain factors that are correlated with the task but have no causal relationship with it, such as sample imbalance or artificial markers in the data; such factors can be regarded as data bias.
Data set bias refers to spurious features in a data set that a machine learning model may learn. Taking an image data set as an example, the images may contain information related to the acquisition device model, acquisition parameters, and the like, which is irrelevant to the acquisition task. Due to such data acquisition defects, a machine learning model may make inferences based on this information and directly guess the classification result instead of learning the image features that are truly relevant to the target task.
When a machine learning model is trained on an image data set with data set bias, it may fail to learn the training task objectively and truthfully as expected. As a result, the learned model may struggle to complete the target task as expected in the actual use environment, suffering a serious performance drop; or, even if performance does not drop, the causes of its errors may be unacceptable and may even lead to ethical disputes. For example, a model that predicts lipstick may be almost unaffected when the mouth is masked out, which shows that the model has not actually learned mouth-related features. As another example, a medical image recognition model may infer the acquisition site from markers placed by doctors, thereby affecting its predictions.
One current solution is to crop out regions that may affect model learning, or, for image data, to adjust color, grayscale, and so on, so as to avoid the influence of such data bias on model training. However, it is difficult to enumerate all biases in this way, and the approach is labor-intensive, consuming considerable manpower and time.
In view of this, embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, so that the influence of the bias can be determined effectively and the data set can then be adjusted accordingly, ensuring that the adjusted data set will not negatively affect a model through data bias.
FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 includes an input/output (I/O) module 110, a data set processing module 120, and a model training module 130. Optionally, as shown in FIG. 1, the system 100 may further include a model storage module 140 and a data storage module 150. The modules shown in FIG. 1 can communicate with each other.
The input/output module 110 can be used to acquire a data set to be processed, for example by receiving a data set to be processed input by a user.
Optionally, the data set to be processed may be stored in the data storage module 150. As an example, the data storage module 150 may be a data storage resource corresponding to an Object Storage Service (OBS) provided by a cloud service provider.
The data set to be processed includes a large number of data items to be processed, each of which has a label. In other words, the data set to be processed contains a plurality of labeled data items to be processed.
Labels may be annotated manually or obtained through machine learning or other means, which is not limited in the present disclosure. A label may also be called a task label, annotation information, or other names, which are not enumerated one by one herein.
In some examples, the annotation information may be annotated by annotators, based on their experience, for specific parts of the data items to be processed. Alternatively, the annotation information may be produced by an image recognition model and an annotation model.
For example, for image data containing human faces, labels such as gender, age, whether glasses are worn, whether a hat is worn, and face size may be annotated for the face part. For a medical image (such as an ultrasound image), the examined part may be annotated with whether a lesion is present.
It can be understood that a data item to be processed may include a part related to its label and a part unrelated to its label. Taking the above face image as an example, if the label concerns the face (for example, a bounding box marking the face position), then the face region in the image is the label-related part, while the regions outside the face region are the label-unrelated part. If the label concerns the eyes (for example, pupil color annotated as "black", "brown", and so on), then the eye region in the image is the label-related part, while the regions outside the eye region are the label-unrelated part.
The data items to be processed in the data set to be processed may be of any data type, such as images, videos, speech, or text. For ease of description, images are taken as an example below.
The embodiments of the present disclosure do not limit the source of the data items to be processed. Taking images as an example, they may be collected from open-source data sets, captured by different image acquisition devices, captured by the same image acquisition device at different times, image frames in a video sequence captured by an image acquisition device, any combination of the above, or others.
The input/output module 110 may be implemented as an input module and an output module independent of each other, or as a coupled module with both input and output functions. As an example, it may be implemented with a graphical user interface (GUI) or a command-line interface (CLI).
The data set processing module 120 can obtain the data set to be processed from the input/output module 110 or, optionally, from the data storage module 150. Further, the data set processing module 120 can construct an irrelevant data set based on the data set to be processed. The irrelevant data set includes labeled irrelevant data items, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
Optionally, the irrelevant data set may be stored in the data storage module 150.
As described above, a data item to be processed has a label and includes a label-related part and a label-unrelated part. The label-related part can therefore be removed from the data item to be processed, keeping only the label-unrelated part as an irrelevant data item, and the label of the irrelevant data item is the label of the data item to be processed. This process may also be called splitting, segmentation, separation, or other names, which is not limited in the present disclosure.
That is, for a certain data item to be processed in the data set to be processed (called the target data item to be processed), the part associated with its label can be removed from it to obtain the remaining part of the target data item. The remaining part is then used to construct one irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item to be processed.
For example, suppose the data item to be processed is a face image and the label indicates the face skin color, such as "white". Then the face region in the face image can be removed, and the part remaining after removal is taken as the corresponding irrelevant data item, which still carries the skin-color label "white".
In some implementations, if the data items to be processed in the data set to be processed are images, irrelevant data items can be obtained by image segmentation. The part of an image associated with the label is the foreground region, and the rest of the image is the background region; through foreground-background separation, irrelevant data items can be determined based on the background region alone.
Specifically, image segmentation is performed on a target data item to be processed (a target image) in the data set to be processed to obtain the background image corresponding to the target image, and the background image is then used to construct an irrelevant data item.
The embodiments of the present disclosure do not limit the specific algorithm used for image segmentation. For example, one or more of the following algorithms, or other algorithms, may be used: threshold-based image segmentation, region-based image segmentation, edge-detection-based image segmentation, image segmentation based on wavelet analysis and wavelet transform, genetic-algorithm-based image segmentation, active-contour-model-based image segmentation, and deep-learning-based image segmentation, where deep-learning-based image segmentation algorithms include but are not limited to: feature-encoder-based segmentation, regional-proposal-based segmentation, RNN-based segmentation, upsampling/deconvolution-based segmentation, segmentation based on increased feature resolution, feature-enhancement-based segmentation, and segmentation using conditional random fields (CRF)/Markov random fields (MRF).
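As an illustration of this foreground-background separation, the following minimal sketch (not part of the disclosure) assumes a binary foreground mask has already been produced by one of the segmentation algorithms above; NumPy is assumed, and all names are illustrative:

```python
import numpy as np

def build_irrelevant_item(image: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """image: HxWx3 array; fg_mask: HxW array, 1 on label-related foreground pixels."""
    bg_mask = (fg_mask == 0).astype(image.dtype)  # 1 on background pixels
    # Zero out the foreground so only the label-unrelated background remains.
    return image * bg_mask[..., None]
```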
In other implementations, the data items to be processed in the data set to be processed are video sequences. Different data items to be processed may have the same or different durations. For example, the first data item to be processed in the data set is a first video sequence of length m1 frames, including m1 frame images, and the second data item to be processed is a second video sequence of length m2 frames, including m2 frame images, where m1 and m2 may or may not be equal.
Specifically, video segmentation is performed on a target data item to be processed (a target video sequence) in the data set to be processed to obtain the background image corresponding to the target video sequence, and the background image is then used to construct an irrelevant data item.
The embodiments of the present disclosure do not limit the specific algorithm used for video segmentation. As one example, image segmentation may be performed on each frame image in the target video sequence, and the segmented background regions of the frames may be fused to obtain the background image corresponding to the target video sequence. As another example, the background image corresponding to the target video sequence may be obtained based on the gradients between adjacent frames. Specifically, a binary image corresponding to the video sequence may be obtained based on the gradient information of the video sequence, and the background image of the video sequence may then be generated based on the binary image, as described below in conjunction with FIG. 2.
FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure. The data set processing module 200 may serve as one implementation of the data set processing module 120 in FIG. 1 and may be used to determine an irrelevant data set based on a data set to be processed, where the data items to be processed are video sequences and the irrelevant data items in the irrelevant data set may be background images corresponding to the video sequences.
As shown in FIG. 2, the data set processing module 200 may include a gradient computation submodule 210, a gradient superposition submodule 220, a thresholding submodule 230, a morphological processing submodule 240, and a separation submodule 250.
The gradient computation submodule 210 can be used to compute gradient information between a frame image in the target video sequence and the frame image preceding it.
For example, suppose the target video sequence includes m1 frame images, namely frame 0, frame 1, ..., frame m1-1. The gradient information between every two adjacent frames can then be computed, specifically between frame 1 and frame 0, between frame 2 and frame 1, ..., and between frame m1-1 and frame m1-2.
The embodiments of the present disclosure do not limit the specific way of computing the gradient information; for example, a frame difference may be computed. As another example, the gradient of the feature vectors of two frames along a specific dimension (such as the time dimension T) may be computed, which makes it possible to extract fixed, unchanging background parts (such as image borders) from the video sequence via motion information. As yet another example, the difference between an image and its grayscale version may be computed so as to extract the colored parts of the video frame, which avoids treating colored marks as foreground, for example colored marks or text added after video capture.
The gradient superposition submodule 220 can be used to superimpose the gradient information obtained by the gradient computation submodule 210 to obtain a gradient superposition map.
The superposition performed by the gradient superposition submodule 220 may include, but is not limited to, weighted summation (such as averaging), taking the maximum, taking the minimum, or others.
The thresholding submodule 230 can be used to threshold the gradient superposition map obtained by the gradient superposition submodule 220 to obtain an initial binary image.
Specifically, for each pixel in the gradient superposition map, pixels whose value is greater than a threshold are marked as 1, and pixels whose value is less than or equal to the threshold are marked as 0, yielding an initial binary image in which every pixel value is either 1 or 0.
The morphological processing submodule 240 can perform morphological processing on the initial binary image obtained by the thresholding submodule 230 to obtain the binary image corresponding to the video sequence.
For example, if a pixel in the initial binary image has a value of 1 but all of its neighboring pixels have a value of 0, the value of that pixel can be reset to 0.
Exemplarily, the morphological processing may include, but is not limited to, morphological dilation, morphological erosion, and the like. For example, the morphological processing submodule 240 may apply several iterations of morphological dilation to the initial binary image obtained by the thresholding submodule 230 and then apply the same number of iterations of morphological erosion, thereby obtaining the binary image.
The separation submodule 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing submodule 240.
Exemplarily, a matting operation may be performed using the binary image to obtain the background image, for example by element-wise matrix multiplication.
In this way, the similarity of the background across the frame images of a video sequence can be fully exploited to obtain the background image corresponding to the video sequence.
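Assuming OpenCV and NumPy are available and the frames are grayscale, the cooperation of submodules 210 through 250 can be sketched as follows (a condensed illustration, not the disclosure's implementation; the threshold and iteration counts are arbitrary example values):

```python
import cv2
import numpy as np

def video_background(frames, thresh=10, morph_iters=2):
    """frames: list of HxW uint8 grayscale images from one video sequence."""
    # Gradient computation (submodule 210): frame difference with the previous frame.
    grads = [cv2.absdiff(frames[i], frames[i - 1]).astype(np.float32)
             for i in range(1, len(frames))]
    # Gradient superposition (submodule 220): averaging; max or min would also work.
    grad_map = np.mean(grads, axis=0)
    # Thresholding (submodule 230): values above the threshold are marked 1.
    initial = (grad_map > thresh).astype(np.uint8)
    # Morphological processing (submodule 240): dilate, then erode the same number of times.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(initial, kernel, iterations=morph_iters)
    mask = cv2.erode(mask, kernel, iterations=morph_iters)
    # Separation (submodule 250): element-wise product keeps only pixels that never moved.
    return frames[0] * (1 - mask)
```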
Thus, in the embodiments of the present disclosure, the background image is taken as a proxy for bias, so that the data set can be checked for bias. Understandably, if the data set is unbiased, the features of the background images should have no relationship whatsoever with the labels associated with the foreground regions.
Suppose the data set to be processed includes N data items to be processed and the irrelevant data set includes N1 irrelevant data items. If processing is performed on every data item to be processed to obtain a corresponding irrelevant data item, then N1 = N. If processing is performed on only some of the data items, then N1 < N. Understandably, processing all data items to be processed yields an irrelevant data set with more irrelevant data items, which in turn allows a more complete and comprehensive analysis and evaluation of the data set to be processed.
In one implementation, the constructed irrelevant data set can be divided into two parts: a first part of irrelevant data items and a second part of irrelevant data items, where the first part can be used to train the model and the second part can be used to test the model. The embodiments of the present disclosure do not limit the division method; as one example, the irrelevant data set may be divided into the first part and the second part at a ratio of 9:1, 1:1, or another ratio.
Exemplarily, the set of the first part of irrelevant data items may be called the irrelevant training set, and the set of the second part may be called the irrelevant test set. Alternatively, the set of the first part of irrelevant data items may include an irrelevant training set and an irrelevant validation set. As one example, the irrelevant data set may be divided into an irrelevant training set, an irrelevant validation set, and an irrelevant test set at a ratio of 7:2:1.
To simplify the description, the set of the first part of irrelevant data items is hereinafter called the first data set (or training set), and the set of the second part is called the second data set (or test set).
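A minimal sketch of such a split (the ratio is one of the example ratios above; the helper name is illustrative):

```python
import random

def split_irrelevant_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle the irrelevant data items and split them into first/second parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]  # (first data set, second data set)
```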
In some embodiments, the data set processing module 120 may first preprocess the data set to be processed and then construct the irrelevant data set based on the preprocessed data set. Preprocessing includes, but is not limited to, cluster analysis and data denoising.
The model training module 130 may include a training submodule 132 and an evaluation submodule 134.
The training submodule 132 can be used to train the classification model. Specifically, the classification model can be trained based on the first part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in that part.
In one implementation, the first part of irrelevant data items used for training may be the entire irrelevant data set, so that more data items participate in the training and the trained classification model is more robust. In another implementation, the first part of irrelevant data items used for training may be a portion of the irrelevant data set; as described above, the irrelevant data set is divided into a first part of irrelevant data items and a second part of irrelevant data items.
For ease of description below, the set of the first part of irrelevant data items used for training is called the training set, and correspondingly, the irrelevant data items in the first part may be training items.
It should be noted that the training here may mean training an initial classification model or updating a previously trained classification model, where the initial classification model may be an untrained classification model and a previously trained classification model may be obtained by training the initial classification model. As an example, the training submodule 132 may obtain the initial classification model or a previously trained classification model from the model storage module 140.
The training submodule 132 may obtain, from the data set processing module 120 or the data storage module 150, the first part of irrelevant data items in the irrelevant data set used for training together with the label of each irrelevant data item in that part. Alternatively, the training submodule 132 may obtain the first part of irrelevant data items from the data set processing module 120 and obtain the label of each of those irrelevant data items from the input/output module 110.
Optionally, before training on the training set (the first part of irrelevant data items in the irrelevant data set), the training submodule 132 may preprocess the training set, including but not limited to feature extraction, cluster analysis, edge detection, and image denoising. For example, a training data item after feature extraction may be represented as an S-dimensional feature vector, where S is greater than 1.
It can be understood that the embodiments of the present disclosure do not limit the model structure of the classification model. As one example, the classification model may be a convolutional neural network (CNN) model, which may optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully connected layers, an output layer, and so on.
The classification model includes a large number of parameters, which may represent the weights of computation formulas or computation factors in the model and which can be updated iteratively through training. The classification model also includes hyper-parameters, which guide the construction or training of the model, such as the number of training iterations, the learning rate, the batch size, the number of layers of the model, and the number of neurons per layer. A hyper-parameter may be a parameter obtained by training the model on the training set or a preset parameter, where a preset parameter is one that is not updated by training the model.
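As a purely illustrative sketch of such a CNN classifier (the layer counts and sizes below are arbitrary assumptions, not taken from the disclosure; PyTorch is assumed):

```python
import torch.nn as nn

class BackgroundClassifier(nn.Module):
    """A toy CNN with convolutional, pooling, and fully connected layers."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool feature maps down to 1x1
        )
        self.fc = nn.Linear(32, num_classes)  # fully connected output layer

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```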
Exemplarily, the process by which the training submodule 132 trains the classification model may follow existing training procedures. As a schematic description, the training process may be: the training data items in the training set are input into the classification model; with the labels corresponding to the training data as reference, a loss function is used to obtain the loss value between the output of the classification model and the corresponding labels, and the parameters of the classification model are adjusted according to the loss value. The classification model is trained iteratively with each training data item in the training set, and its parameters are continuously adjusted until the classification model can, with high accuracy, produce outputs close to the labels corresponding to the input training data items, for example until the loss function reaches a minimum or falls below a reference threshold.
The loss function in the training process is a function used to measure how well the classification model has been trained (that is, to compute the difference between the result predicted by the classification model and the true value). In training the classification model, because the output of the classification model is expected to be as close as possible to the true value (that is, the corresponding label), the predicted value of the current classification model can be compared with the true value, and the parameters of the classification model can then be updated according to the difference between the two. In each training iteration, the loss function is used to judge the difference between the value predicted by the current classification model and the true value and the parameters are updated, until the classification model can predict values very close to the true values, at which point the classification model is considered trained.
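Under the assumption that PyTorch is used and that a DataLoader `train_loader` yields batches of (irrelevant data item, label) pairs, this loop can be sketched as follows (an illustration, not the disclosure's implementation):

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                  # the loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for items, labels in train_loader:
            logits = model(items)                      # model prediction
            loss = criterion(logits, labels)           # gap to the true labels
            optimizer.zero_grad()
            loss.backward()                            # gradients of the loss
            optimizer.step()                           # parameter adjustment
    return model
```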
The "classification model" in the embodiments of the present disclosure may also be called a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may be referred to simply as a "model"; the present disclosure is not limited in this regard. Optionally, the trained classification model may be stored in the model storage module 140. In some examples, the model storage module 140 may be part of the model training module 130.
The evaluation submodule 134 can be used to evaluate the classification model. Specifically, an evaluation result for the trained classification model can be determined based on the second part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in that part. The evaluation result can be used to characterize the significance of the data bias of the data set to be processed.
As described above, the set of the second part of irrelevant data items may be the test set, and correspondingly, the irrelevant data items in the second part may be test data items.
As an example, the evaluation process may include: inputting a test data item into the trained classification model to obtain a prediction result for that test data item, and determining the evaluation result based on a comparison of the prediction result with the label of the test data item.
In the embodiments of the present disclosure, the evaluation result may include at least one of the following: precision, accuracy, recall, F1 score, a precision-recall (P-R) curve, an average precision (AP) metric, false alarm rate, missed detection rate, and so on.
Specifically, a confusion matrix can be constructed, showing the numbers of positive and negative examples, the ground-truth values, the predicted values, and so on.
Accuracy refers to the proportion of correctly classified samples among all samples. For example, if the number of test data items in the test set is N2 and the number whose prediction result agrees with the label is N21, then the accuracy can be expressed as N21/N2.
Precision refers to the proportion of samples predicted as positive that are actually positive. For example, if the number of test data items in the test set is N2, the number predicted as positive is N22, and the number of those N22 test data items labeled as positive is N23, then the precision can be expressed as N23/N22.
Recall refers to the proportion of actually positive samples that are predicted as positive. For example, if the number of test data items in the test set is N2 and the number labeled as positive is N31, and if N32 of those N31 positive examples are also predicted as positive, then the recall can be expressed as N32/N31.
The P-R curve takes recall as the horizontal axis and precision as the vertical axis. A point on the P-R curve represents: at a certain threshold, the model judges results greater than the threshold as positive samples and results less than the threshold as negative samples, and the point gives the corresponding recall and precision. The whole P-R curve is generated by sweeping the threshold from high to low; the part near the origin represents the precision and recall of the model when the threshold is at its maximum.
The F1 score is the harmonic mean of precision and recall. For example, twice the product of precision and recall, divided by the sum of precision and recall, can be taken as the F1 score.
In some embodiments of the present disclosure, the evaluation result may include positive-example characterization values, such as a first precision and/or a first recall, where the first precision is the proportion of samples predicted as positive that are actually positive and the first recall is the proportion of actually positive samples that are predicted as positive. The evaluation result may include negative-example characterization values, such as a second precision and/or a second recall, where the second precision is the proportion of samples predicted as negative that are actually negative and the second recall is the proportion of actually negative samples that are predicted as negative.
In some embodiments of the present disclosure, the evaluation result may include a first predicted mean and/or a second predicted mean. The first predicted mean is the average of the predicted values for actually positive samples, and the second predicted mean is the average of the predicted values for actually negative samples. The evaluation result may include a mean difference representing the difference between the first and second predicted means, expressed for example as the difference between, or the ratio of, the first predicted mean and the second predicted mean.
It should be understood that the above are only some examples of evaluation results; other characterizations may also serve as evaluation results, which the present disclosure does not enumerate one by one.
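As an illustration, several of the quantities listed above can be computed from test-set predictions roughly as follows (NumPy assumed; the 0.5 decision threshold and the function name are example choices):

```python
import numpy as np

def evaluate(scores, labels, thresh=0.5):
    """scores: predicted positive-class probabilities; labels: 0/1 ground truth."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    preds = (scores > thresh).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    accuracy = np.mean(preds == labels)                 # N21 / N2
    precision = tp / (tp + fp) if tp + fp else 0.0      # N23 / N22
    recall = tp / (tp + fn) if tp + fn else 0.0         # N32 / N31
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Predicted means over actually-positive / actually-negative samples.
    mean_pos = scores[labels == 1].mean() if (labels == 1).any() else 0.0
    mean_neg = scores[labels == 0].mean() if (labels == 0).any() else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mean_difference": mean_pos - mean_neg}
```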
Exemplarily, the evaluation result can be presented to the user by the input/output module 110, for example through a graphical user interface, for easy viewing.
In this way, by means of the embodiments of the present disclosure, the bias significance of a data set can be characterized in a quantified form. Such a quantitative evaluation scheme provides users with a clear reference, facilitating processing such as adjusting the data set.
In scenarios where the input/output module 110 includes a graphical user interface, the input/output module 110 can also visually present a characterization of the data set bias through the graphical user interface.
具体的,通过将目标无关数据项输入经训练的分类模型,得到类激活图(Class Activation Map,CAM)。随后通过将CAM与目标无关数据项叠加而得到叠加结果,并显示该叠加结果。Specifically, by inputting target-independent data items into the trained classification model, a Class Activation Map (CAM) is obtained. Then an overlay result is obtained by overlaying the CAM and the target-independent data item, and the overlay result is displayed.
类激活图即类激活热力图,这样,本公开的实施例能够通过CAM表征分类模型的关注区域,具体的,是哪些区域(即模型的关注区域)导致了偏见。The class activation map is the class activation heat map. In this way, the embodiments of the present disclosure can use the CAM to characterize the attention areas of the classification model, specifically, which areas (ie, the attention areas of the model) cause bias.
本公开的实施例对得到CAM的具体方式不作限定。作为一例,可以采用基于梯度的类激活图方法(Gradient-based CAM,Grad-CAM)得到CAM。例如,可以提取分类模型的最 后一个卷积层的输出,即最后一层特征图,将提取出的最后一层特征图加权求和,得到CAM。可选地,也可以将加权求和后的结果再经线性整流单元(Rectified Linear Unit,ReLU)激活函数的处理后,作为CAM。这里进行加权求和的权重可以是顶层全连接层的权值。作为一例,可以计算分类模型的最后一层柔性最大值(Softmax)的输出对最后一层特征图所有像素的偏导数,再取宽高维度上的全局平均,作为对应的权重。The embodiments of the present disclosure do not limit the specific manner of obtaining the CAM. As an example, CAM can be obtained by using Gradient-based CAM (Grad-CAM). For example, the output of the last convolutional layer of the classification model, that is, the feature map of the last layer, can be extracted, and the extracted feature maps of the last layer can be weighted and summed to obtain CAM. Optionally, the weighted and summed results can also be used as a CAM after being processed by a Rectified Linear Unit (ReLU) activation function. The weights for weighted summation here can be the weights of the top fully connected layer. As an example, the partial derivative of the output of the last layer of softmax (Softmax) of the classification model to all pixels of the last layer feature map can be calculated, and then the global average in the width and height dimensions can be taken as the corresponding weight.
The embodiments of the present disclosure do not limit the manner in which the CAM is superimposed on the target-independent data item (e.g., a background image); for example, the superposition may be performed as a weighted sum. As an example, the weights of the CAM and the background image may be equal.
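Continuing the sketch above, the CAM could be blended with the background image by a weighted sum; the 0.5/0.5 default below follows the equal-weight example in the text, and the sketch assumes a grayscale background image of the same size as the CAM.

```python
import numpy as np

def overlay(cam, background, alpha=0.5):
    """Blend a [0, 1] CAM of shape (H, W) with a uint8 grayscale
    background image of the same shape; alpha=0.5 gives equal weights."""
    background = background.astype(float) / 255.0
    blended = alpha * cam + (1.0 - alpha) * background
    return (blended * 255).astype(np.uint8)
```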
In this way, the embodiments of the present disclosure provide a solution for quantitatively evaluating and visually presenting data set bias, so that the significance of the bias can be clearly characterized and the specific locations that give rise to the bias can be presented visually. The user can thus understand the bias of the data set more intuitively and comprehensively. The solution requires little user involvement, can be carried out automatically, and improves processing efficiency while maintaining the accuracy of the quantitative bias evaluation.
The model training module 130 may also be used to adjust the data set to be processed based on the classification model.
Specifically, the data set to be processed may have an initial sample weight distribution; correspondingly, the first data set has a first sample weight distribution and the second data set has a second sample weight distribution. For example, if the initial sample weight of a target data item to be processed is a, then the sample weight of the irrelevant data item generated from that target data item is also a.
Exemplarily, the model training module 130 may be used to obtain a recommended sample weight distribution based on iterative training of the classification model, as described below in conjunction with FIG. 3.
FIG. 3 shows a schematic diagram of a process 300 by which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure.
At 310, a first data set having a first sample weight distribution and a second data set having a second sample weight distribution are determined.
Specifically, an irrelevant data set may be constructed based on the data set to be processed, and the irrelevant data set may be divided into the first data set and the second data set, as described in the embodiments above.
Exemplarily, the data items to be processed in the data set to be processed may have initial sample weights; that is, the data set to be processed may have an initial sample weight distribution. As an example, the initial sample weights may be input by the user through the input/output module 110. As another example, the initial sample weights may be determined through an initialization process.
The sample weight may be used to indicate the sampling probability of a data item to be processed. For example, if the sample weight of the i-th data item to be processed is $w_i$, then the sampling probability of the i-th data item to be processed is

$$p_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$

As an example, the initial sample weight distribution may indicate that the sampling probabilities of the data items to be processed in the data set are equal. Assuming that the data set to be processed includes N data items to be processed, and the initial sample weight of each data item is 1, then the sampling probability of each data item is initialized to 1/N.
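For illustration only, a brief sketch of drawing item indices according to the weight-derived sampling probabilities above; the helper name is illustrative.

```python
import numpy as np

def weighted_sample(weights, num_samples, rng=None):
    """Draw item indices with probability proportional to sample weight."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # p_i = w_i / sum_j w_j
    return rng.choice(len(w), size=num_samples, p=p)

# With all weights initialized to 1, every item is drawn with probability 1/N.
print(weighted_sample([1, 1, 1, 1], 8))
```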
It can be understood that, when the initial sample weight distribution is determined, the first sample weight distribution and the second sample weight distribution may be determined accordingly.
At 320, the first data set is sampled based on the first sample weight distribution, and the classification model is trained iteratively.
At 330, the classification model trained at 320 is evaluated based on the second data set to obtain an evaluation result.
Exemplarily, the evaluation result may be obtained by comparing the prediction results of the trained classification model for the irrelevant data items in the second data set with the labels of those irrelevant data items. As an example, an irrelevant data item may be input into the trained classification model to obtain a prediction result for that item, and the evaluation result may be determined based on the comparison between the prediction result and the item's label. The evaluation result may include at least one of the following: accuracy, precision, recall, F1 score, precision-recall curve, average precision, false positive rate, false negative rate, and the like. For details of the evaluation result, reference may be made to the description above, which is not repeated here.
At 340, it is determined whether the bias significance indicated by the evaluation result is high.
If it is determined at 340 that the evaluation result indicates high bias significance, for example, the evaluation result is greater than a preset threshold, the process may proceed to 350. Otherwise, if it is determined at 340 that the evaluation result indicates that the bias significance is not high, for example, the evaluation result is not greater than the preset threshold, the process may proceed to 360.
The preset threshold may be set based on the processing accuracy required for the data set to be processed, the application scenario, and the like. The preset threshold may be related to the specific meaning of the evaluation result; for example, if the evaluation result includes an accuracy, the preset threshold may be set to, e.g., 30%, 50%, or another value.
At 350, the sample weight distribution is updated.
Referring to FIG. 3, as shown by the dashed arrows in FIG. 3, after 350, the process may return to 310 or 320 to continue.
In one example, the process may return to 310, i.e., the first data set and the second data set are reconstructed. In this case, an irrelevant data item that belonged to the first data set in the previous iteration may belong to either the first data set or the second data set in the next iteration.
In another example, the process may return to 320, i.e., the irrelevant data items in the first data set and the second data set remain unchanged, but the first sample weight distribution and/or the second sample weight distribution are updated.
After 350, the first data set may be re-sampled based on the updated first sample weight distribution, and the classification model may be trained iteratively again. The retrained classification model is then evaluated based on the second data set to obtain a new evaluation result.
In this way, 310 to 350 or 320 to 350 may be executed iteratively until the evaluation result indicates that the bias significance is not high (for example, the evaluation result is not greater than the preset threshold).
The embodiments of the present disclosure do not limit the specific implementation of updating the sample weight distribution.
As an example, the sample weight distribution may be updated in a random manner. For instance, the sample weights of some data items to be processed may be updated randomly, e.g., the sample weight of one data item is updated from 1 to 2 and that of another from 1 to 3, and so on. It can be understood that the random approach is nondeterministic and may make the process of obtaining the recommended sample weight distribution take a long time.
As another example, a predetermined rule may be used to update the sample weight distribution. For example, the second sample weight distribution may be updated: if the evaluation result indicates that the prediction result of the classification model for an irrelevant data item in the second data set differs from the label of that item, the sample weight of that item may be increased, e.g., updated from a1 to a1+1, 2*a1, or the like. In this example, the first sample weight distribution may remain unchanged, or it may be updated in some other manner. Optionally, in this example, after the sample weight distribution is updated, the second data set may be exchanged with the first data set before the next iteration. For example, in the next iteration, the classification model is trained based on the second data set of the previous iteration and the updated second sample weight distribution.
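For illustration only, one possible reading of the predetermined rule above, sketched in Python; doubling is just one of the alternatives mentioned in the text (a1+1, 2*a1, etc.), and the function name is illustrative.

```python
def update_weights_by_rule(weights, predictions, labels, factor=2.0):
    """Raise the sample weight of each second-set item whose prediction
    differs from its label, leaving correctly predicted items unchanged."""
    return [w * factor if pred != label else w
            for w, pred, label in zip(weights, predictions, labels)]

# An item predicted incorrectly has its weight doubled: 1 -> 2.0.
print(update_weights_by_rule([1, 1], [0, 1], [1, 1]))  # [2.0, 1]
```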
As another example, the sample weight distribution may be optimized by a genetic algorithm in order to update it. For instance, the sample weight distribution may be used as the initial gene of the genetic algorithm, and an objective function may be constructed based on the evaluation result obtained at 330; the genetic algorithm can then be used to optimize the sample weight distribution, and the optimized sample weight distribution is the updated sample weight distribution. The embodiments of the present disclosure do not limit the specific construction of the objective function of the genetic algorithm; for example, if the evaluation result includes the mean difference between positive and negative samples as well as the accuracy, the sum of the mean difference and the accuracy may be used as the objective function. It can be understood that the objective function may also be constructed in other ways, which are not enumerated here.
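For illustration only, a heavily simplified genetic-algorithm loop in the spirit of the paragraph above. The fitness function in the usage line is a stand-in: in the disclosure it would be built from the evaluation result (e.g., mean difference plus accuracy), which would require retraining and re-evaluating the classifier for each candidate weight vector. All names and hyperparameters are illustrative.

```python
import numpy as np

def genetic_weight_search(init_weights, fitness, generations=20,
                          pop_size=16, mutation_scale=0.2, seed=0):
    """Minimize `fitness` (lower = less significant bias) over
    sample-weight vectors, starting from `init_weights` as the gene."""
    rng = np.random.default_rng(seed)
    # Initial population: mutated copies of the current weight distribution.
    pop = np.abs(init_weights + rng.normal(0, mutation_scale,
                                           (pop_size, len(init_weights))))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Selection: keep the better half as parents.
        parents = pop[np.argsort(scores)[: pop_size // 2]]
        # Mutation: perturb the parents to create children.
        children = parents + rng.normal(0, mutation_scale, parents.shape)
        pop = np.abs(np.vstack([parents, children]))  # keep weights >= 0
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

# Stand-in fitness for demonstration; real use would score the retrained
# classifier (e.g., mean difference + accuracy on the second data set).
best = genetic_weight_search(np.ones(4), lambda w: abs(w.sum() - 6.0))
print(best.round(2))
```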
In this way, the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, thereby obtaining the recommended sample weight distribution. This process requires no user involvement and is highly automated.
As another example, a user's modification to the sample weight distribution may be obtained in order to update it. For instance, referring to the evaluation result and/or the displayed superposition result (as described above), the user may infer from experience what modification should be made to the sample weight distribution, and then input that modification through the input/output module 110 to update the sample weight distribution.
In this way, user needs can be fully taken into account and the sample weight distribution updated based on the user's modification, so that the resulting recommended sample weight distribution better matches the user's expectations and improves user satisfaction.
At 360, the recommended sample weight distribution is obtained.
If it is determined at 340 that the evaluation result indicates that the bias significance is not high, for example, the evaluation result is not greater than the preset threshold, the sample weight distribution under which the current evaluation result was obtained may be used as the recommended sample weight distribution.
In this way, the embodiments of the present disclosure can update the sample weight distribution based on iterative training of the classification model, and the change of the data set bias as the sample weight distribution is updated can be observed. The data set to be processed can thus be examined iteratively, yielding an effective recommended sample weight distribution with high reference value.
The input/output module 110 may also present the recommended sample weight distribution for the user as a reference for further adjusting the data set to be processed, for example, by visually presenting the recommended sample weight distribution through a graphical user interface.
Exemplarily, the data set processing module 120 may add data items to or delete data items from the data set to be processed based on the obtained recommended sample weight distribution, so as to construct an unbiased data set.
As an example, the data set processing module 120 may duplicate data items with large recommended sample weights to expand the number of data items in the data set to be processed. The data set processing module 120 may delete data items with small recommended sample weights to reduce the number of data items in the data set to be processed.
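For illustration only, a sketch of the duplicate/delete adjustment described above; the thresholds are illustrative, as the disclosure does not fix specific cut-offs.

```python
def adjust_dataset(items, weights, high=2.0, low=0.5):
    """Duplicate items whose recommended weight is high and drop items
    whose recommended weight is low; thresholds are illustrative."""
    adjusted = []
    for item, w in zip(items, weights):
        if w >= high:
            adjusted.extend([item, item])  # duplicate to boost its share
        elif w > low:
            adjusted.append(item)          # keep as-is
        # w <= low: delete (skip)
    return adjusted

print(adjust_dataset(["a", "b", "c"], [2.5, 1.0, 0.2]))  # ['a', 'a', 'b']
```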
As an example, a user's deletion instruction for some of the data items to be processed may be obtained via the input/output module 110, so that those data items are deleted. Other data items input by the user may also be obtained via the input/output module 110 and added to the current data set to be processed.
For example, the user may add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution. For instance, the user may find other samples similar to data items with large recommended sample weights and add them to the data set as new data items, thereby supplementing the data set. As an example, similar samples may be other images collected by the same image acquisition device (or one of the same model) in a similar environment (e.g., under similar lighting conditions).
In this way, in the embodiments of the present disclosure, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, the unbiased data set can be used to train a more robust, unbiased task-specific model.
It can be understood that the system 100 shown in FIG. 1 may be a system capable of interacting with users, and the system 100 may be a software system, a hardware system, or a system combining software and hardware.
In some examples, the system 100 may be implemented as a computing device or a part of a computing device, where the computing device includes, but is not limited to, a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
It can be understood that the system 100 shown in FIG. 1 may be implemented as an artificial intelligence platform (AI platform). An AI platform provides AI developers and users with a convenient AI development environment and convenient development tools. Various AI models or AI sub-models for solving different problems may be built into the AI platform, and the AI platform can establish a suitable AI model according to the requirements input by the user. That is, the user only needs to specify the requirements on the AI platform and, following the prompts, prepare and upload a data set; the AI platform can then train an AI model that meets the user's needs. The AI model in the embodiments of the present disclosure can be used to evaluate the data bias of a data set to be processed that is input by the user.
FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure. In the scenario 400, the system 100 is deployed entirely in the cloud environment 410.
The cloud environment 410 is an entity that uses basic resources to provide cloud services to users under the cloud computing model. The cloud environment 410 includes a cloud data center 412 and a cloud service platform 414. The cloud data center 412 includes a large number of basic resources owned by the cloud service provider (including computing resources, storage resources, and network resources); the computing resources included in the cloud data center 412 may be a large number of computing devices (e.g., servers). The system 100 may be deployed independently on a server or virtual machine in the cloud data center 412, or may be deployed in a distributed manner on multiple servers in the cloud data center 412, on multiple virtual machines in the cloud data center 412, or on both servers and virtual machines in the cloud data center 412.
As shown in FIG. 4, the system 100 may be abstracted by the cloud service provider on the cloud service platform 414 into an AI development cloud service 424 and provided to users. After a user purchases this cloud service on the cloud service platform 414 (the account may be pre-charged and then settled according to the final resource usage), the cloud environment 410 uses the system 100 deployed in the cloud data center 412 to provide the AI development cloud service 424 to the user. When using the AI development cloud service 424, the user may upload the data set to be processed through an application program interface (API) or a GUI. The system 100 in the cloud environment 410 receives the data set to be processed uploaded by the user and may perform operations such as data set processing, model training, and data set adjustment. The system 100 may return the evaluation result of the model, the recommended sample weight distribution, and the like to the user through the API or the GUI.
In another embodiment of the present application, when the system 100 in the cloud environment 410 is abstracted into the AI development cloud service 424 and provided to users, it may be divided into two parts, for example, a data set bias evaluation cloud service and a data set adjustment cloud service. A user may purchase only the data set bias evaluation cloud service on the cloud service platform 414; in that case, the cloud service platform 414 may construct an irrelevant data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return the evaluation result of the classification model to the user, so that the user learns the bias significance of the data set to be processed. The user may also further purchase the data set adjustment cloud service on the cloud service platform 414; in that case, the cloud service platform 414 may iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user may refer to the recommended sample weight distribution to add data items to or delete data items from the data set to be processed, so as to construct an unbiased data set.
FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure. In the scenario 500, the system 100 is deployed in a distributed manner across different environments, which may include, but are not limited to, at least two of a cloud environment 510, an edge environment 520, and a terminal computing device 530.
The system 100 may be logically divided into multiple parts, each with a different function. For example, as shown in FIG. 1, the system 100 includes the input/output module 110, the data set processing module 120, the model training module 130, the model storage module 140, and the data storage module 150. The parts of the system 100 may be deployed in any two or three of the terminal computing device 530, the edge environment 520, and the cloud environment 510. The parts of the system 100 deployed in different environments cooperate to provide users with various functions. For example, in one scenario, the input/output module 110 and the data storage module 150 of the system 100 are deployed on the terminal computing device 530, the data set processing module 120 of the system 100 is deployed on an edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510. The user sends the data set to be processed to the input/output module 110 on the terminal computing device 530, and the terminal computing device 530 stores the data set to be processed in the data storage module 150. The data set processing module 120 on the edge computing device of the edge environment 520 constructs an irrelevant data set based on the data set to be processed from the terminal computing device 530. The model training module 130 in the cloud environment 510 trains the classification model based on the irrelevant data set from the edge environment 520. The cloud environment 510 may also store the trained classification model in the model storage module 140. It should be understood that the present application does not limit which parts of the system 100 are deployed in which environments; in actual applications, the deployment may be adapted according to the computing capability of the terminal computing device 530, the resource occupancy of the edge environment 520 and the cloud environment 510, or specific application requirements.
The edge environment 520 is an environment that includes a collection of edge computing devices relatively close to the terminal computing device 530; edge computing devices include, but are not limited to, edge servers, edge stations with computing capability, and the like. It can be understood that the system 100 may also be deployed independently on one edge server in the edge environment 520, or deployed in a distributed manner on multiple edge servers in the edge environment 520.
The terminal computing device 530 includes, but is not limited to, a terminal server, a smartphone, a notebook computer, a tablet computer, a personal desktop computer, a smart camera, and the like. It can be understood that the system 100 may also be deployed independently on one terminal computing device 530, or deployed in a distributed manner on multiple terminal computing devices 530.
FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure. The computing device 600 in FIG. 6 may be implemented as a device in the cloud environment 510 in FIG. 5, a device in the edge environment 520, or the terminal computing device 530. It should be understood that the computing device 600 shown in FIG. 6 may also be regarded as a computing device cluster, i.e., the computing device 600 includes one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and terminal computing devices 530.
As shown in FIG. 6, the computing device 600 includes a memory 610, a processor 620, a communication interface 630, and a bus 640, where the bus 640 is used for communication between the components of the computing device 600.
The memory 610 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory, or any combination thereof. The memory 610 may store programs; when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that the modules of the system 100 described above can perform. It should be understood that the processor 620 and the communication interface 630 may also be used to execute some or all of the content of the data processing method embodiments described below in this specification. The memory may also store data sets and classification models. For example, a part of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as the data set to be processed and the irrelevant data set, and another part of the storage resources in the memory 610 is divided into a model storage module for storing the classification model.
The processor 620 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU), or any combination thereof. The processor 620 may include one or more chips. The processor 620 may include an accelerator, such as a neural processing unit (Neural Processing Unit, NPU).
The communication interface 630 uses a transceiver module, such as a transceiver, to implement communication between the computing device 600 and other devices or communication networks. For example, data may be acquired through the communication interface 630.
The bus 640 may include a path for transferring information between the components of the computing device 600 (e.g., the memory 610, the processor 620, and the communication interface 630).
FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure. The method 700 shown in FIG. 7 may be executed by the system 100.
As shown in FIG. 7, at block 710, an irrelevant data set is constructed based on the data set to be processed. The irrelevant data set includes irrelevant data items with labels, and the label of an irrelevant data item is determined based on the label of a data item to be processed in the data set to be processed.
Exemplarily, the data set to be processed includes a plurality of data items to be processed, each having a label. A data item to be processed may include a part related to the label and a part unrelated to the label.
In some embodiments, the part associated with the label of a target data item to be processed may be removed from that target data item in the data set to be processed, to obtain the remaining part of the target data item. The remaining part is used to construct one irrelevant data item in the irrelevant data set, and the label of that irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, that is, the data items to be processed are images. In that case, image segmentation may be performed on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item, and the background image is used to construct one irrelevant data item in the irrelevant data set.
Specifically, the part of the image associated with the label is the foreground region, and the regions of the image other than the foreground region form the background region; through foreground-background separation, the irrelevant data item can be determined based on the background region only.
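For illustration only, a sketch of removing the label-associated foreground from an image given a segmentation mask; how the mask itself is produced (the image-segmentation step) is model-specific and omitted, and the function name is illustrative.

```python
import numpy as np

def background_only(image, foreground_mask):
    """Zero out the foreground region, keeping only the background.

    image           -- (H, W, C) image array
    foreground_mask -- (H, W) boolean array, True where the labeled
                       object (the foreground) is located
    """
    background = image.copy()
    background[foreground_mask] = 0  # blank out the label-related region
    return background
```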
In some embodiments, the data items to be processed in the data set to be processed are video sequences. In that case, a binary image of a video sequence may be determined based on the gradient information between each frame of the video sequence and its preceding frame. A background image of the video sequence is generated based on the binary image, and the background image of the video sequence is then used to construct one irrelevant data item in the irrelevant data set.
FIG. 8 shows a schematic flowchart of a process 800 of constructing an irrelevant data item according to an embodiment of the present disclosure. Specifically, FIG. 8 shows the process of constructing an irrelevant data item based on a data item to be processed (a video sequence).
As shown in FIG. 8, at block 810, the gradient information between two adjacent frames of images in the target video sequence is calculated.
Exemplarily, the gradient of the feature vectors of the two frames along the time dimension may be calculated to obtain the gradient information. In this way, the static, unchanging background parts of the video sequence, such as image borders, can be obtained.
At block 820, a gradient superposition map is obtained based on the superposition of the gradient information.
Exemplarily, the gradient information obtained at 810 may be combined by weighted summation, by taking the maximum, by taking the minimum, or the like, to complete the superposition and obtain the gradient superposition map.
At block 830, thresholding is performed on the gradient superposition map to obtain an initial binary map.
At block 840, morphological processing is performed on the initial binary map to obtain a binary image.
Exemplarily, several iterations of morphological dilation are applied to the initial binary map, followed by the same number of iterations of morphological erosion, to obtain the binary image.
At block 850, a background image is obtained based on the binary image, and the background image is used as the irrelevant data item corresponding to the video sequence.
Exemplarily, a matting operation may be performed with the binary image, for example by element-wise matrix multiplication, to obtain the background image.
In this way, considering the similarity between the frames of a video sequence and the fact that the background of a video sequence is essentially unchanged, the background image corresponding to the video sequence can be obtained.
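For illustration only, a compact NumPy/SciPy sketch of process 800, assuming the video is already loaded as a (T, H, W) grayscale array. Simple per-pixel frame differences stand in for the feature-vector gradients described above, and the threshold and iteration counts are illustrative.

```python
import numpy as np
from scipy import ndimage

def extract_background(frames, thresh=10.0, morph_iters=3):
    """frames: (T, H, W) grayscale video; returns a background image."""
    frames = frames.astype(float)
    # 810: gradient between each frame and the previous one.
    grads = np.abs(np.diff(frames, axis=0))        # (T-1, H, W)
    # 820: superimpose the gradients (here: per-pixel maximum).
    grad_map = grads.max(axis=0)                   # (H, W)
    # 830: thresholding yields the initial binary map; moving regions
    # exceed the threshold, so the static background is its complement.
    moving = grad_map > thresh
    # 840: morphological dilation followed by the same number of erosions.
    moving = ndimage.binary_dilation(moving, iterations=morph_iters)
    moving = ndimage.binary_erosion(moving, iterations=morph_iters)
    # 850: matting by element-wise multiplication keeps background pixels.
    return frames[0] * (~moving)
```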
In addition, the label of an irrelevant data item is determined based on the label of the corresponding data item to be processed. Specifically, if a target data item to be processed has label A, and a target irrelevant data item is obtained by processing that target data item (e.g., by image segmentation), then the label of the target irrelevant data item is also label A.
At block 720, the irrelevant data set is divided into a first data set and a second data set. The first data set has a first sample weight distribution and the second data set has a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
The sample weight of an irrelevant data item is determined based on the sample weight of the corresponding data item to be processed. Specifically, if a target data item to be processed has sample weight w, and a target irrelevant data item is obtained by processing that target data item (e.g., by image segmentation), then the sample weight of the target irrelevant data item is also w.
The embodiments of the present disclosure do not limit the manner in which the irrelevant data set is divided into the first data set and the second data set. For example, the division may follow a 9:1 split, so that the ratio of the number of irrelevant data items in the first data set to that in the second data set is about 9:1. Alternatively, the division may follow a 1:1 split, so that the ratio is about 1:1. In addition, the first data set may be further divided into a first sub-data set and a second sub-data set; for example, the ratio of the number of irrelevant data items in the first sub-data set to that in the second sub-data set may be about 7:2. It can be understood that the ratios listed here are only illustrative and do not limit the embodiments of the present disclosure.
At block 730, the classification model is trained based on the first data set and the first sample weight distribution.
Specifically, the first data set may be sampled based on the first sample weight distribution, and the classification model may be trained based on the first data set and the labels of the irrelevant data items in the first data set.
That is to say, the first data set may be used as a training set to train the classification model. Optionally, before training, the first data set may be preprocessed, including, but not limited to, feature extraction, cluster analysis, edge detection, image denoising, and the like.
The embodiments of the present disclosure do not limit the specific structure of the classification model; for example, it may be a convolutional neural network including at least a convolutional layer and a fully connected layer.
At block 740, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed under its sample weight distribution.
That is to say, the second data set may be used as a test set to obtain the evaluation result. Specifically, the evaluation result may be obtained based on the comparison between the prediction results of the classification model for the irrelevant data items in the second data set and the labels of those irrelevant data items.
As an example, the evaluation result may include a first accuracy for the positive samples in the second data set and a second accuracy for the negative samples in the second data set.
In this way, in the embodiments of the present disclosure, by constructing an irrelevant data set and training and evaluating based on the irrelevant data set, a quantitative characterization of the bias significance of the data set to be processed can be obtained. This provides a quantitative bias reference, facilitating further adjustment of the data set to be processed and other operations.
Exemplarily, if the evaluation result obtained at block 740 indicates that the bias significance is large (i.e., that a significant bias exists), the sample weight distribution of the data set to be processed may be updated.
In some embodiments, if the evaluation result is greater than a preset threshold, the sample weight distribution of the data set to be processed is updated. Further, after this, the process may return to 720 to obtain the first data set and the second data set again, and 730 and 740 may be repeated until the evaluation result obtained at block 740 indicates that the bias significance is not large (i.e., that no significant bias exists), for example, until the evaluation result is not greater than the preset threshold. The sample weight distribution under which the evaluation result is not greater than the preset threshold may then be used as the recommended sample weight distribution, and that recommended sample weight distribution is output.
The embodiments of the present disclosure do not limit the specific manner of updating the sample weight distribution; for example, at least one of the following may be used: updating the sample weight distribution according to a predetermined rule, updating the sample weight distribution in a random manner, obtaining a user's modification to the sample weight distribution, or optimizing the sample weight distribution by a genetic algorithm.
In some implementations of the present disclosure, updating the sample weight distribution may update the first sample weight distribution of the first data set. In this case, when the process returns to 720, the first sample weight distribution of the first data set at the re-executed 720 has been updated, and consequently the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update both the first sample weight distribution of the first data set and the second sample weight distribution of the second data set. As an example, the sample weight distribution of the data set to be processed may be updated, and the irrelevant data set may be re-divided. As another example, the sample weight distribution of the data set to be processed may be updated, so that the first sample weight distribution and the second sample weight distribution are updated accordingly, while the irrelevant data items in the first data set and the second data set remain unchanged. In this case, when the process returns to 720, the first data set at the re-executed 720 has been updated, or its first sample weight distribution has been updated, and consequently the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update the second sample weight distribution of the second data set. Optionally, the first sample weight distribution may remain unchanged. As an example, in this implementation, when the process returns to 720, the first data set and the second data set of the previous execution of 720 may be exchanged, so that the first data set when the process returns to 730 is the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, making the classification model's evaluation of the bias significance more accurate.
FIG. 9 shows a schematic diagram of a process 900 for updating the sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
As shown in FIG. 9, at block 910, an irrelevant data set is constructed based on the data set to be processed. The irrelevant data set includes irrelevant data items with labels, and the label of an irrelevant data item is determined based on the label of a data item to be processed in the data set to be processed.
At block 920, the irrelevant data set is divided into a first data set and a second data set. The first data set has a first sample weight distribution and the second data set has a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
At block 930, the classification model is trained based on the first data set and the first sample weight distribution.
At block 940, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed under its sample weight distribution.
For 910 to 940 in FIG. 9, reference may be made to 710 to 740 described above in conjunction with FIG. 7, which are not repeated here for brevity.
In FIG. 9, at block 950, it is determined whether the evaluation result is greater than a preset threshold. If it is determined that the evaluation result is greater than the preset threshold, 960 is executed. If it is determined that the evaluation result is not greater than the preset threshold, 980 is executed.
At block 960, the second sample weight distribution of the second data set is updated.
As some examples, the sample weights of all irrelevant data items in the second data set may be updated, or the sample weights of only some of the irrelevant data items in the second data set may be updated.
As some examples, the second sample weight distribution may be updated based on the prediction results of the classification model at 940 for the irrelevant data items in the second data set.
Specifically, the sample weights of correctly predicted irrelevant data items in the second data set may be increased, or the sample weights of incorrectly predicted irrelevant data items in the second data set may be decreased. For example, if the sample weight of a first irrelevant data item in the second data set is 2, and the prediction result obtained by inputting that item into the classification model is consistent with its label, then its sample weight may be increased, e.g., from 2 to 3, 4, or another value. Likewise, if the sample weight of a second irrelevant data item in the second data set is 2, and the prediction result obtained by inputting that item into the classification model is inconsistent with its label, then its sample weight may be decreased, e.g., from 2 to 1.
At block 970, the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
It can be understood that the first data set after the exchange is the second data set of block 920, and the first sample weight distribution of the first data set after the exchange is the second sample weight distribution updated at block 960. The second data set after the exchange is the first data set of block 920, and the second sample weight distribution of the second data set after the exchange is the first sample weight distribution of block 920.
After block 970, the process returns to 930; that is, the classification model is retrained using the first data set obtained after the exchange at 970.
At block 980, the recommended sample weight distribution is output.
Exemplarily, the sample weight distribution under which the evaluation result is not greater than the preset threshold is used as the recommended sample weight distribution. Specifically, the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
In some embodiments of the present disclosure, the attention regions associated with the data set bias may be presented visually. Specifically, a class activation map may be obtained by inputting a target-independent data item into the trained classification model; a superposition result is then obtained by superimposing the class activation map on the target-independent data item, and the superposition result is displayed. As an example, the superposition result may be obtained by a weighted sum with the heat map. By displaying the superposition result, it becomes possible to see intuitively which regions the classification model attends to, and these attention regions are an important factor causing the bias.
In some embodiments of the present disclosure, after the recommended sample weight distribution is obtained, the method may optionally further include adjusting the data set to be processed based on the recommended sample weight distribution to obtain an unbiased data set.
Exemplarily, the unbiased data set may be constructed by adding data items to or deleting data items from the data set to be processed.
As an example, data items with large recommended sample weights may be duplicated to expand the number of data items in the data set to be processed. As another example, data items with small recommended sample weights may be deleted to reduce the number of data items in the data set to be processed.
As an example, a user's deletion instruction for some of the data items to be processed may be obtained, so that those data items are deleted. Other data items input by the user may also be obtained and added to the current data set to be processed.
For example, the user may add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution. For instance, the user may find other samples similar to data items with large recommended sample weights and add them to the data set as new data items, thereby supplementing the data set. As an example, similar samples may be other images collected by the same image acquisition device (or one of the same model) in a similar environment (e.g., under similar lighting conditions).
In this way, in the embodiments of the present disclosure, data items can be added to or deleted from the data set to be processed based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, the unbiased data set can be used to train a more robust, unbiased task-specific model.
It can be understood that, for the processes described in the embodiments of the present disclosure in conjunction with FIG. 7 to FIG. 9, reference may be made to the functions of the modules and the like described above in conjunction with FIG. 1 to FIG. 6, which are not repeated for brevity.
图10示出了根据本公开的实施例的数据处理装置1000的示意框图。装置1000可以通过软件、硬件或者两者结合的方式实现。在一些实施例中,装置1000可以为实现图1所示的系统100中的部分或全部功能的软件或硬件装置。Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure. Apparatus 1000 may be implemented by software, hardware or a combination of both. In some embodiments, the device 1000 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
如图10所示,装置1000包括构建单元1010、划分单元1020、训练单元1030和评估单元1040。As shown in FIG. 10 , the device 1000 includes a construction unit 1010 , a division unit 1020 , a training unit 1030 and an evaluation unit 1040 .
构建单元1010被配置为基于待处理数据集构建无关数据集,无关数据集包括具有标签的无关数据项,无关数据项的标签是基于待处理数据集中的待处理数据项的标签确定的。The construction unit 1010 is configured to construct an unrelated data set based on the unprocessed data set, the unrelated data set includes unrelated data items with labels, and the labels of the unrelated data items are determined based on the labels of the unprocessed data items in the unprocessed data set.
划分单元1020被配置为将无关数据集划分为第一数据集和第二数据集,第一数据集具有第一样本权重分布,第二数据集具有第二样本权重分布,第一样本权重分布和第二样本权重分布是基于待处理数据集中的待处理数据项的样本权重确定的。The division unit 1020 is configured to divide the irrelevant data set into a first data set and a second data set, the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight The distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
The training unit 1030 is configured to train a classification model based on the first data set and the first sample weight distribution.
The evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, where the evaluation result indicates the bias significance of the data set to be processed having the sample weight distribution. A sketch of this train-then-evaluate flow is given below.
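For illustration only, the following minimal sketch shows how units 1010 to 1040 could fit together, using scikit-learn. The random 50/50 split, the logistic-regression classifier, and the accuracy-minus-chance bias score are illustrative assumptions, not details fixed by the disclosure.

```python
# Minimal sketch of the train-then-evaluate flow of units 1010-1040.
# X_irrelevant: feature vectors of the irrelevant data items;
# y_labels: their labels; sample_weights: the sample weight distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def assess_bias(X_irrelevant, y_labels, sample_weights, split=0.5, seed=0):
    # Divide the irrelevant data set into a first and a second data set,
    # carrying the corresponding parts of the sample weight distribution.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_irrelevant))
    cut = int(len(idx) * split)
    first, second = idx[:cut], idx[cut:]

    # Train the classification model on the first data set with the
    # first sample weight distribution.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_irrelevant[first], y_labels[first],
              sample_weight=sample_weights[first])

    # Evaluate on the second data set with the second sample weight
    # distribution. Accuracy above chance on label-irrelevant data
    # suggests the labels are predictable from irrelevant content,
    # i.e. the data set to be processed is biased.
    score = accuracy_score(y_labels[second],
                           model.predict(X_irrelevant[second]),
                           sample_weight=sample_weights[second])
    chance = 1.0 / len(np.unique(y_labels))  # assumes balanced classes
    return score - chance  # larger value = more significant bias
```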
In some embodiments, the apparatus 1000 may further include an update unit 1050, an adjustment unit 1060, and a display unit 1070.
The update unit 1050 is configured to update the sample weight distribution of the data set to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.
As an example, the update unit 1050 may be configured to update only part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not.
In some embodiments, the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution according to a predetermined rule; updating the sample weight distribution in a random manner; obtaining a user's modification of the sample weight distribution to update the sample weight distribution; or optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution. A sketch of two of these strategies is given below.
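For illustration only, the following minimal sketch covers two of the listed strategies: random updating and a single genetic-algorithm step. The mutation scale, selection rule, and crossover scheme are illustrative assumptions.

```python
# Minimal sketch of random and genetic-algorithm weight updates.
import numpy as np

def update_random(weights, scale=0.1, rng=None):
    """Update the sample weight distribution in a random manner."""
    rng = rng or np.random.default_rng()
    perturbed = weights * np.exp(rng.normal(0.0, scale, size=weights.shape))
    return perturbed / perturbed.mean()  # keep the overall scale stable

def update_genetic(population, fitness, rng=None):
    """One genetic-algorithm step over candidate weight distributions.

    population: array of shape (n_candidates, n_samples).
    fitness: evaluation results; a lower bias significance is better.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(fitness)  # select the least-biased candidates
    parents = population[order[: max(1, len(order) // 2)]]
    children = []
    for _ in range(len(population) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        mask = rng.random(a.shape) < 0.5        # uniform crossover
        child = np.where(mask, a, b)
        children.append(update_random(child, rng=rng))  # mutation
    return np.vstack([parents, np.array(children)])
```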
In some embodiments, the update unit 1050 may be configured to take the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
The adjustment unit 1060 is configured to add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
The update unit 1050 is further configured to: obtain a class activation map by inputting a target irrelevant data item into the trained classification model; and obtain a superposition result by superimposing the class activation map on the target irrelevant data item.
The display unit 1070 is configured to display the recommended sample weight distribution and/or the superposition result. A sketch of such a class activation map superposition is given below.
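For illustration only, the following minimal sketch shows one way the superposition result could be rendered with OpenCV; the resizing, the JET colormap, and the equal blending weights are illustrative assumptions about presentation, not requirements of the disclosure.

```python
# Minimal sketch: overlay a class activation map on its source image.
import cv2
import numpy as np

def overlay_cam(image_bgr, cam):
    """Superimpose a class activation map on a target irrelevant image.

    image_bgr: uint8 image, shape (H, W, 3).
    cam: float activation map of any spatial size; normalized below.
    """
    cam = cv2.resize(cam.astype(np.float32),
                     (image_bgr.shape[1], image_bgr.shape[0]))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    heatmap = cv2.applyColorMap((cam * 255).astype(np.uint8),
                                cv2.COLORMAP_JET)
    # Blend heatmap and image to obtain the superposition result.
    return cv2.addWeighted(image_bgr, 0.5, heatmap, 0.5, 0)
```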
In some embodiments, the construction unit 1010 may be configured to: remove, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, so as to obtain the remaining part of the target data item; and construct one irrelevant data item in the irrelevant data set using the remaining part, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, and the construction unit 1010 may be configured to: perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item; and construct one irrelevant data item in the irrelevant data set using the background image.
In some embodiments, a data item to be processed in the data set to be processed is a video sequence, and the construction unit 1010 may be configured to: determine a binary image of the video sequence based on gradient information between a frame of the video sequence and its previous frame; generate a background image of the video sequence based on the binary image; and construct one irrelevant data item in the irrelevant data set using the background image of the video sequence. A sketch of this video branch follows.
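For illustration only, the following minimal sketch implements the video branch described above; the fixed difference threshold and the per-pixel averaging of still regions are illustrative assumptions about how the background image is aggregated.

```python
# Minimal sketch: frame differencing -> binary image -> background image.
import cv2
import numpy as np

def video_background(frames, thresh=25):
    """frames: list of uint8 grayscale numpy arrays of identical shape."""
    acc = np.zeros(frames[0].shape, dtype=np.float64)
    count = np.zeros(frames[0].shape, dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        # Gradient information between a frame and its previous frame.
        diff = cv2.absdiff(cur, prev)
        # Binary image: 255 marks moving (foreground) pixels.
        _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        still = binary == 0
        acc[still] += cur[still]
        count[still] += 1
    # Average the still pixels across the sequence into a background image.
    return (acc / np.maximum(count, 1)).astype(np.uint8)
```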
The division of units in the embodiments of the present disclosure is schematic and is merely a division by logical function; other division manners are possible in actual implementation. In addition, the functional units in the disclosed embodiments may be integrated into one processor, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The data processing apparatus 1000 shown in FIG. 10 can be used to implement the data processing processes described above in conjunction with FIG. 7 to FIG. 9.
The present disclosure may also be implemented as a computer program product. The computer program product may include computer-readable program instructions for carrying out various aspects of the present disclosure. The present disclosure may also be implemented as a computer-readable storage medium on which computer-readable program instructions are stored; when a processor executes the instructions, the processor is caused to perform the data processing processes described above.
A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
The computer-readable program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing state information of the computer-readable program instructions; the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored thereon comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device, so that a series of operational steps are performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer-readable program instructions.

Claims (22)

  1. A data processing method, comprising:
    constructing an irrelevant data set based on a data set to be processed, wherein the irrelevant data set comprises irrelevant data items with labels, and the labels of the irrelevant data items are determined based on labels of data items to be processed in the data set to be processed;
    dividing the irrelevant data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on sample weights of the data items to be processed in the data set to be processed;
    training a classification model based on the first data set and the first sample weight distribution; and
    evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates a bias significance of the data set to be processed having the sample weight distribution.
  2. The method according to claim 1, further comprising:
    if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and
    repeating the training and the evaluation based on the updated sample weight distribution until the evaluation result is not greater than the preset threshold.
  3. The method according to claim 2, wherein updating the sample weight distribution comprises:
    updating part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not updated.
  4. The method according to claim 2 or 3, further comprising:
    taking the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
  5. The method according to claim 4, further comprising:
    adding data items to, or deleting data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  6. The method according to any one of claims 2 to 5, wherein updating the sample weight distribution comprises at least one of the following:
    updating the sample weight distribution according to a predetermined rule,
    updating the sample weight distribution in a random manner,
    obtaining a user's modification of the sample weight distribution to update the sample weight distribution, or
    optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
  7. The method according to any one of claims 1 to 6, wherein constructing the irrelevant data set based on the data set to be processed comprises:
    removing, from a target data item to be processed in the data set to be processed, a part associated with the label of the target data item to be processed, so as to obtain a remaining part of the target data item to be processed; and
    constructing one irrelevant data item in the irrelevant data set using the remaining part, wherein the label of the one irrelevant data item corresponds to the label of the target data item to be processed.
  8. The method according to any one of claims 1 to 6, wherein the data set to be processed is an image data set, and wherein constructing the irrelevant data set based on the data set to be processed comprises:
    performing image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and
    constructing one irrelevant data item in the irrelevant data set using the background image.
  9. The method according to any one of claims 1 to 6, wherein a data item to be processed in the data set to be processed is a video sequence, and wherein constructing the irrelevant data set based on the data set to be processed comprises:
    determining a binary image of the video sequence based on gradient information between a frame of the video sequence and a previous frame of the frame;
    generating a background image of the video sequence based on the binary image; and
    constructing one irrelevant data item in the irrelevant data set using the background image of the video sequence.
  10. The method according to any one of claims 1 to 9, further comprising:
    obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model;
    obtaining a superposition result by superimposing the CAM on the target irrelevant data item; and
    displaying the superposition result.
  11. A data processing apparatus, comprising:
    a construction unit configured to construct an irrelevant data set based on a data set to be processed, wherein the irrelevant data set comprises irrelevant data items with labels, and the labels of the irrelevant data items are determined based on labels of data items to be processed in the data set to be processed;
    a division unit configured to divide the irrelevant data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on sample weights of the data items to be processed in the data set to be processed;
    a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and
    an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates a bias significance of the data set to be processed having the sample weight distribution.
  12. The apparatus according to claim 11, further comprising an update unit configured to:
    update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
  13. The apparatus according to claim 12, wherein the update unit is configured to:
    update part of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is not updated.
  14. The apparatus according to claim 12 or 13, wherein the update unit is configured to:
    take the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
  15. The apparatus according to claim 14, further comprising an adjustment unit configured to:
    add data items to, or delete data items from, the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
  16. The apparatus according to any one of claims 12 to 15, wherein the update unit is configured to update the sample weight distribution by at least one of the following:
    updating the sample weight distribution according to a predetermined rule,
    updating the sample weight distribution in a random manner,
    obtaining a user's modification of the sample weight distribution to update the sample weight distribution, or
    optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
  17. The apparatus according to any one of claims 11 to 16, wherein the construction unit is configured to:
    remove, from a target data item to be processed in the data set to be processed, a part associated with the label of the target data item to be processed, so as to obtain a remaining part of the target data item to be processed; and
    construct one irrelevant data item in the irrelevant data set using the remaining part, wherein the label of the one irrelevant data item corresponds to the label of the target data item to be processed.
  18. The apparatus according to any one of claims 11 to 16, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to:
    perform image segmentation on a target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and
    construct one irrelevant data item in the irrelevant data set using the background image.
  19. The apparatus according to any one of claims 11 to 16, wherein a data item to be processed in the data set to be processed is a video sequence, and wherein the construction unit is configured to:
    determine a binary image of the video sequence based on gradient information between a frame of the video sequence and a previous frame of the frame;
    generate a background image of the video sequence based on the binary image; and
    construct one irrelevant data item in the irrelevant data set using the background image of the video sequence.
  20. The apparatus according to any one of claims 11 to 19, further comprising:
    an update unit configured to: obtain a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; and obtain a superposition result by superimposing the CAM on the target irrelevant data item; and
    a display unit configured to display the superposition result.
  21. A computing device, comprising a processor and a memory, wherein the processor reads and executes a computer program stored in the memory, so that the computing device performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.
PCT/CN2022/083841 2021-05-25 2022-03-29 Data processing method and apparatus, computing device, and computer readable storage medium WO2022247448A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110574231.3 2021-05-25
CN202110574231.3A CN115471714A (en) 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2022247448A1 true WO2022247448A1 (en) 2022-12-01

Family

ID=84229488

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083841 WO2022247448A1 (en) 2021-05-25 2022-03-29 Data processing method and apparatus, computing device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115471714A (en)
WO (1) WO2022247448A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915450A (en) * 2012-09-28 2013-02-06 常州工学院 Online adaptive adjustment tracking method for target image regions
CN112639843A (en) * 2018-09-10 2021-04-09 谷歌有限责任公司 Suppression of deviation data using machine learning models
US20200167653A1 (en) * 2018-11-27 2020-05-28 Wipro Limited Method and device for de-prejudicing artificial intelligence based anomaly detection
US20200372406A1 (en) * 2019-05-22 2020-11-26 Oracle International Corporation Enforcing Fairness on Unlabeled Data to Improve Modeling Performance
CN112115963A (en) * 2020-07-30 2020-12-22 浙江工业大学 Method for generating unbiased deep learning model based on transfer learning
CN112508580A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Model construction method and device based on rejection inference method and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CORRELL MICHAEL; HEER JEFFREY: "Surprise! Bayesian Weighting for De-Biasing Thematic Maps", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE, USA, vol. 23, no. 1, 1 January 2017 (2017-01-01), USA, pages 651 - 660, XP011634791, ISSN: 1077-2626, DOI: 10.1109/TVCG.2016.2598618 *
JINYIN CHEN, CHEN YIPENG; CHEN YIMING; ZHENG HAIBIN; JI SHOULING; SHI JIE; CHENG YAO: "Fairness Research on Deep Learning", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, KEXUE CHUBANSHE, BEIJING, CN, vol. 58, no. 2, 8 February 2021 (2021-02-08), CN , pages 264 - 280, XP093007463, ISSN: 1000-1239, DOI: 10.7544/issn1000-1239.2021.20200758 *

Also Published As

Publication number Publication date
CN115471714A (en) 2022-12-13

Similar Documents

Publication Publication Date Title
WO2018121690A1 (en) Object attribute detection method and device, neural network training method and device, and regional detection method and device
US11416772B2 (en) Integrated bottom-up segmentation for semi-supervised image segmentation
CN111724083A (en) Training method and device for financial risk recognition model, computer equipment and medium
US20150325046A1 (en) Evaluation of Three-Dimensional Scenes Using Two-Dimensional Representations
CN109993102B (en) Similar face retrieval method, device and storage medium
US11875512B2 (en) Attributionally robust training for weakly supervised localization and segmentation
US20220261659A1 (en) Method and Apparatus for Determining Neural Network
CN111582409A (en) Training method of image label classification network, image label classification method and device
Ayyar et al. Review of white box methods for explanations of convolutional neural networks in image classification tasks
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
WO2024060416A1 (en) End-to-end weakly supervised semantic segmentation and labeling method for pathological image
Lin et al. An analysis of English classroom behavior by intelligent image recognition in IoT
Kajabad et al. YOLOv4 for urban object detection: Case of electronic inventory in St. Petersburg
Pang et al. Salient object detection via effective background prior and novel graph
WO2022247448A1 (en) Data processing method and apparatus, computing device, and computer readable storage medium
CN114255381B (en) Training method of image recognition model, image recognition method, device and medium
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN116029760A (en) Message pushing method, device, computer equipment and storage medium
CN113763313A (en) Text image quality detection method, device, medium and electronic equipment
Heidari et al. Forest roads damage detection based on deep learning algorithms
Jiao et al. A visual consistent adaptive image thresholding method
КАЛИТА Information technology of facial emotion recognition for visual safety surveillance
CN117095180B (en) Embryo development stage prediction and quality assessment method based on stage identification
Anggoro et al. Classification of Solo Batik patterns using deep learning convolutional neural networks algorithm
CN116703933A (en) Image segmentation training method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22810181

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE