WO2022247448A1 - Data processing method and apparatus, computing device, and computer-readable storage medium - Google Patents
Data processing method and apparatus, computing device, and computer-readable storage medium
- Publication number: WO2022247448A1
- Application: PCT/CN2022/083841 (CN2022083841W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data set
- weight distribution
- processed
- sample weight
- data
- Prior art date
Classifications
- G06N3/04 — Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
- G06V10/774 — Physics; Computing; Image or video recognition or understanding using pattern recognition or machine learning; Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82 — Physics; Computing; Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
Definitions
- The present disclosure relates to the field of artificial intelligence and, more particularly, to a data processing method, apparatus, computing device, and computer-readable storage medium.
- Dataset bias is a widespread problem that has a huge negative impact in machine learning, especially deep learning, and is difficult to detect and easily overlooked. Especially for scenarios with high requirements for model security, if the training is based on a biased data set, the resulting model may cause serious accidents in actual use.
- Conventionally, the bias of a data set is checked by guesswork or experience, but this consumes substantial human effort and is not only inefficient but also inaccurate, so it cannot meet actual needs.
- Exemplary embodiments of the present disclosure provide a data processing method including a scheme for assessing bias in a data set, enabling a more precise check of the bias in the data set.
- In a first aspect, a data processing method includes: constructing an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first and second sample weight distributions are determined based on the sample weights of the data items to be processed in the data set to be processed; training a classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result that indicates the significance of the bias of the data set to be processed under the sample weight distribution.
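- To make the flow concrete, the following is a minimal, self-contained Python sketch of this evaluation idea (an illustration only, not code from the disclosure): it fakes an "irrelevant" feature matrix whose labels are deliberately leaked into the features, trains a tiny weighted logistic-regression classifier on the first split, and reads the weighted accuracy on the second split as the bias-significance signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an irrelevant data set: each row is a feature vector from
# a label-irrelevant ("background") part of a data item, paired with the
# label inherited from that data item. Feature 0 leaks the label: this is
# the injected bias the method is supposed to detect.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
w = np.ones(1000)                       # initial sample weight distribution

# Divide into a first (training) set and a second (evaluation) set, 9:1.
idx = rng.permutation(1000)
tr, te = idx[:900], idx[900:]

# Minimal weighted logistic regression trained by gradient descent.
theta = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X[tr] @ theta))
    theta -= 0.5 * X[tr].T @ (w[tr] * (p - y[tr])) / w[tr].sum()

# Evaluate on the second set: weighted accuracy well above chance (0.5)
# means the "irrelevant" data predicts the labels, i.e. significant bias.
pred = (X[te] @ theta > 0).astype(int)
print(f"bias significance (accuracy): {np.average(pred == y[te], weights=w[te]):.2f}")
```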
- In this way, the significance of a data set's bias can be assessed more accurately.
- This evaluation scheme makes it convenient for users to adjust the data set and perform other processing.
- In some embodiments, the sample weight distribution of the data set to be processed can be updated based on the trained classification model to obtain a recommended sample weight distribution. This process does not require user participation and is highly efficient and automated.
- updating the sample weight distribution includes: updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
- In some embodiments, the method further includes: using the sample weight distribution for which the evaluation result is not greater than a preset threshold as the recommended sample weight distribution.
- In this way, embodiments of the present disclosure can update the sample weight distribution while iteratively training the classification model and can track how the bias of the data set changes as the weights are updated, so that the data set to be processed is checked iteratively and an effective, highly accurate recommended sample weight distribution is obtained.
- In some embodiments, the method further includes: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
- In this way, the data set to be processed can be augmented or pruned based on the recommended sample weight distribution to construct an unbiased data set. Furthermore, this unbiased data set can be used to train a more robust, unbiased task-specific model that meets actual needs.
- In some embodiments, updating the sample weight distribution includes at least one of the following: updating it according to a predetermined rule, updating it randomly, obtaining a user's modification of it, or optimizing it with a genetic algorithm.
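- As a hedged illustration of two of these update strategies (the random update and the genetic-algorithm update; all function names and parameters are invented for this sketch, not specified by the disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_random(w, scale=0.1):
    """Random strategy: multiplicatively perturb the weights, kept positive."""
    return np.clip(w * (1.0 + scale * rng.normal(size=w.shape)), 1e-6, None)

def update_genetic(population, fitness, n_keep=4, mut=0.05):
    """One genetic-algorithm step over candidate weight distributions.

    `population` is a list of weight vectors; `fitness` scores a vector
    (e.g. negative bias significance, so less-biased candidates are fitter).
    """
    ranked = sorted(population, key=fitness, reverse=True)[:n_keep]
    children = []
    while len(ranked) + len(children) < len(population):
        a, b = rng.choice(n_keep, size=2, replace=False)
        mask = rng.random(ranked[0].shape) < 0.5             # uniform crossover
        child = np.where(mask, ranked[a], ranked[b])
        child = child * (1.0 + mut * rng.normal(size=child.shape))  # mutation
        children.append(np.clip(child, 1e-6, None))
    return ranked + children

# Toy usage: evolve 8 candidate distributions toward lower variance.
pop = [np.abs(rng.normal(1.0, 0.3, size=10)) for _ in range(8)]
pop = update_genetic(pop, fitness=lambda w: -w.var())
```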
- In some embodiments, constructing the irrelevant data set based on the data set to be processed includes: removing, from a target data item to be processed, the part associated with its label to obtain the remainder of the target data item; and using the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item.
- In some embodiments, the data set to be processed is an image data set.
- In these embodiments, constructing the irrelevant data set comprises: performing image segmentation on a target data item to be processed to obtain a background image corresponding to that data item; and using the background image to construct an irrelevant data item in the irrelevant data set.
- In this way, the background image serves as a representative of bias, so that the data set can be checked for bias.
- In some embodiments, the data items to be processed in the data set to be processed are video sequences.
- In these embodiments, constructing the irrelevant data set includes: determining a binary image of a video sequence based on the gradient information between each frame image and its previous frame image; generating a background image of the video sequence based on the binary image; and using the background image to construct an irrelevant data item in the irrelevant data set.
- In this way, the background image corresponding to the video sequence can be obtained by exploiting the similarity between the frame images of a video sequence and the fact that the background in a video sequence is basically unchanged.
- In some embodiments, the method also includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; obtaining an overlay result by superimposing the CAM and the target irrelevant data item; and displaying the overlay result.
- In this way, embodiments of the present disclosure provide a scheme for quantitatively evaluating data set bias, so that the significance of the bias can be clearly characterized and the specific locations where bias occurs can be presented visually. Users can thus understand the bias of the data set more intuitively and comprehensively.
- This scheme requires little user participation, can be automated, and improves processing efficiency while ensuring the accuracy of the quantitative assessment of bias.
- In a second aspect, a data processing apparatus includes: a construction unit configured to construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed;
- a dividing unit configured to divide the irrelevant data set into a first data set and a second data set, where the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first and second sample weight distributions are determined based on the sample weights of the data items to be processed in the data set to be processed;
- a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
- an updating unit is further included, configured to: if the evaluation result is greater than a preset threshold, update the sample weight distribution of the data set to be processed.
- In some embodiments, the update unit is configured to: update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
- the update unit is configured to: use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
- In some embodiments, an adjustment unit is further included, configured to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
- In some embodiments, the update unit is configured to update the sample weight distribution by at least one of the following: applying a predetermined rule, updating it randomly, obtaining the user's modification of the sample weight distribution, or optimizing it with a genetic algorithm.
- In some embodiments, the construction unit is configured to: remove, from a target data item to be processed in the data set to be processed, the part associated with its label to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item.
- In some embodiments, the data set to be processed is an image data set.
- In these embodiments, the construction unit is configured to: perform image segmentation on a target data item to be processed to obtain a background image corresponding to that data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
- In some embodiments, the data items to be processed in the data set to be processed are video sequences.
- In these embodiments, the construction unit is configured to: determine a binary image of a video sequence based on its gradient information; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image.
- In some embodiments, the apparatus further includes an update unit configured to: obtain a CAM by inputting a target irrelevant data item into the trained classification model, and obtain an overlay result by superimposing the CAM and the target irrelevant data item; and a display unit configured to display the overlay result.
- In a third aspect, a computing device includes a processor and a memory, where the memory stores instructions executable by the processor, and the instructions, when executed by the processor, cause the computing device to: construct an irrelevant data set based on the data set to be processed, where the irrelevant data set includes irrelevant data items with labels determined based on the labels of the data items to be processed; divide the irrelevant data set into a first data set with a first sample weight distribution and a second data set with a second sample weight distribution, both distributions being determined based on the sample weights of the data items to be processed; train a classification model based on the first data set and the first sample weight distribution; and evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of the bias of the data set to be processed under the sample weight distribution.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: use the sample weight distribution for which the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to update the sample weight distribution by at least one of the following: applying a predetermined rule, updating it randomly, obtaining the user's modification of the sample weight distribution, or optimizing it with a genetic algorithm.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: remove, from a target data item to be processed in the data set to be processed, the part associated with its label to obtain the remainder of the target data item; and use the remainder to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item.
- In some embodiments, the data set to be processed is an image data set.
- In these embodiments, the instructions, when executed by the processor, cause the computing device to: perform image segmentation on a target data item to be processed to obtain a background image corresponding to that data item; and use the background image to construct an irrelevant data item in the irrelevant data set.
- In some embodiments, the data items to be processed in the data set to be processed are video sequences.
- In these embodiments, the instructions, when executed by the processor, cause the computing device to: determine a binary image of a video sequence based on the gradient information between each frame image and its previous frame image; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set using the background image.
- In some embodiments, the instructions, when executed by the processor, cause the computing device to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; obtain an overlay result by superimposing the CAM and the target irrelevant data item; and display the overlay result.
- In a fourth aspect, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the operations of the method in the first aspect or any embodiment thereof.
- In a fifth aspect, a chip or chip system includes a processing circuit configured to perform the operations of the method in the first aspect or any embodiment thereof.
- In a sixth aspect, a computer program or computer program product is provided.
- The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method in the first aspect or any of the above-mentioned embodiments.
- FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure
- FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure
- FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure
- FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure
- FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure
- FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure
- FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure
- FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure
- FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure
- FIG. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
- Artificial intelligence uses computers to simulate certain human thinking processes and intelligent behaviors.
- The history of artificial intelligence research follows a natural, clear thread: from a focus on "reasoning", to a focus on "knowledge", and then to a focus on "learning".
- Artificial intelligence has been widely applied to various industries such as security, medical care, transportation, education, and finance.
- Machine learning is a branch of artificial intelligence, which studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms during empirical learning.
- Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main feature is to use multiple nonlinear transformation structures to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
- Data bias
- Factors in the data that are correlated with the task but not causally related to it, such as sample imbalance or artificial markers in the data, can be considered data bias.
- Dataset bias refers to the presence of spurious features in a dataset that some machine learning models may learn.
- In an image data set, for example, the images may contain information related to the acquisition device model and acquisition parameters that has nothing to do with the task.
- A machine learning model may speculate based on this information and directly guess the classification result, instead of learning the image features that are really related to the target task.
- When a machine learning model is trained on an image data set with data set bias, it may fail to learn objectively and realistically for the training task as expected. As a result, the learned model struggles to complete the target task in the actual use environment, leading to serious performance degradation; even if performance does not drop, the reasons behind its errors may be unacceptable and can even lead to ethical lawsuits.
- For example, for a model that predicts whether lipstick is worn, covering the mouth hardly affects the prediction result, which shows that the model has not actually learned mouth-related features.
- Another example is a medical image recognition model that infers the collection location from markers placed by the doctor, which then affects the prediction result.
- In view of this, embodiments of the present disclosure provide a scheme for quantitatively evaluating data set bias, so that the impact of the bias can be effectively determined and the data set adjusted accordingly, ensuring that a model trained on the adjusted data set is not negatively affected by data bias.
- FIG. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure.
- As shown in FIG. 1, the system 100 includes an input/output (I/O) module 110, a data set processing module 120, and a model training module 130.
- the system 100 may further include a model storage module 140 and a data storage module 150 .
- the various modules shown in FIG. 1 can communicate with each other.
- the input/output module 110 can be used to acquire data sets to be processed. For example, a data set to be processed input by a user may be received.
- the data set to be processed may be stored in the data storage module 150 .
- the data storage module 150 may be a data storage resource corresponding to an object storage service (Object Storage Service, OBS) provided by a cloud service provider.
- the data set to be processed includes a large number of data items to be processed, and each data item to be processed has a label.
- In other words, the data set to be processed contains a plurality of data items to be processed, each marked with a label.
- Labels may be annotated manually or obtained through machine learning, which is not limited in the present disclosure.
- Labels can also be called task tags, annotation information, or other names, which are not enumerated here.
- the annotation information may be annotated by an annotator for a specific part of the data item to be processed based on experience.
- the annotation information may be annotated through an image recognition model and an annotation model.
- tags such as gender, age, whether to wear glasses, whether to wear a hat, and the size of the human face can be labeled for the human face.
- For a medical image, such as an ultrasound image, whether a lesion is present can be marked for the examined part.
- the data item to be processed may include a tag-related part and a tag-independent part.
- For example, if the label concerns the face, the face area in the image is the part related to the label, and the rest of the image is unrelated to the label.
- If the label concerns the eyes (for example, pupil color marked as "black", "brown", etc.), the eye area in the image is the part related to the label, while the other areas of the image are unrelated to the label.
- the data items to be processed in the data set to be processed may be any type of data, such as images, videos, voices, texts, and so on.
- the embodiment of the present disclosure does not limit the source of the data items to be processed.
- For example, the images may come from open-source data sets, may be captured by different image acquisition devices or by the same device at different times, may be image frames of a video sequence captured by an image acquisition device, or any combination of the above.
- the input/output module 110 may be implemented as an input module and an output module that are independent of each other, or may also be implemented as a coupling module having both input and output functions.
- For example, the input/output module 110 may provide a graphical user interface (GUI) or a command-line interface (CLI).
- the data set processing module 120 can obtain the data set to be processed from the input/output module 110 , or alternatively, can obtain the data set to be processed from the data storage module 150 . Further, the data set processing module 120 can construct an irrelevant data set based on the data set to be processed.
- The irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
- the unrelated data set may be stored in the data storage module 150 .
- As noted above, each data item to be processed has a label and includes a part related to the label and a part unrelated to the label. The label-related part can therefore be removed, and only the label-unrelated part retained as an irrelevant data item, whose label is the label of the original data item to be processed.
- This process may also be called splitting, division, separation, or other names, which is not limited in the present disclosure.
- Specifically, the part associated with the label can be removed from a target data item to be processed to obtain the remainder of the target data item. The remainder is then used to construct an irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item.
- For example, suppose the data item to be processed is a face image and the label represents the skin color of the face, such as "white". The face area can be removed from the image, and the remaining part used as the corresponding irrelevant data item, which still carries the skin-color label "white".
- the irrelevant data item can be obtained by means of image segmentation.
- For example, the part of the image associated with the label is the foreground area, and the other areas of the image are the background area; foreground-background separation can then be used to determine an irrelevant data item based only on the background area.
- image segmentation is performed on the target data item to be processed (target image) in the data set to be processed to obtain a background image corresponding to the target image, and then use the background image to construct an irrelevant data item.
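- The following toy sketch (an illustration under assumptions, not the disclosure's implementation) shows this foreground removal for a grayscale image, given a segmentation mask produced by any of the algorithms discussed below:

```python
import numpy as np

def background_item(image: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Blank out the label-associated (foreground) region of a grayscale
    image; the result serves as an irrelevant data item that inherits the
    original item's label. `fg_mask` is True where pixels are foreground."""
    out = image.copy()
    out[fg_mask] = 0
    return out

# Toy usage: a 4x4 "image" whose centre 2x2 block is the foreground.
img = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(background_item(img, mask))
```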
- the embodiment of the present disclosure does not limit the specific algorithm used for image segmentation.
- For example, one or more of the following algorithms, among others, can be used: threshold-based image segmentation, region-based segmentation, edge-detection-based segmentation, segmentation based on wavelet analysis and wavelet transform, genetic-algorithm-based segmentation, active-contour-model-based segmentation, deep-learning-based segmentation, and so on.
- Image segmentation algorithms based on deep learning include but are not limited to: feature-encoder-based segmentation, region-proposal-based segmentation, RNN-based segmentation, segmentation based on upsampling/deconvolution, segmentation based on feature-resolution enhancement, feature-enhancement-based segmentation, and segmentation using Conditional Random Fields (CRF)/Markov Random Fields (MRF).
- the data item to be processed in the data set to be processed is a video sequence.
- Different data items to be processed may have the same or different durations.
- the first data item to be processed in the data set to be processed is a first video sequence, the length of which is m1 frames, including m1 frame images.
- the second data item to be processed in the data set to be processed is a second video sequence with a length of m2 frames, including m2 frames of images. m1 and m2 may or may not be equal.
- video segmentation is performed on the target data item to be processed (target video sequence) in the data set to be processed to obtain a background image corresponding to the target video sequence, and then use the background image to construct an irrelevant data item.
- image segmentation may be performed for each frame image in the target video sequence, and the background regions after segmentation of each frame image are fused to obtain a background image corresponding to the target video sequence.
- the background image corresponding to the target video sequence may be obtained based on the gradient between two adjacent frames in the target video sequence.
- the binary image corresponding to the video sequence may be obtained based on the gradient information of the video sequence. The background image of the video sequence is then generated based on this binary image, as described below in conjunction with FIG. 2 .
- FIG. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure.
- The data set processing module 200 can serve as an implementation of the data set processing module 120 in FIG. 1 and can be used to determine an irrelevant data set based on the data set to be processed, where each data item to be processed is a video sequence and each irrelevant data item may be the background image corresponding to a video sequence.
- the dataset processing module 200 may include a gradient calculation submodule 210 , a gradient superposition submodule 220 , a thresholding submodule 230 , a morphological processing submodule 240 and a separation submodule 250 .
- the gradient calculation sub-module 210 can be used to calculate the gradient information between a frame image and the previous frame image in the target video sequence.
- For example, suppose the target video sequence includes m1 frame images: frame 0, frame 1, ..., frame m1-1. The gradient information between every two adjacent frames can then be calculated: between frame 1 and frame 0, between frame 2 and frame 1, ..., and between frame m1-1 and frame m1-2.
- the embodiments of the present disclosure do not limit the specific manner of calculating the gradient information, for example, the frame difference may be calculated.
- For example, the gradient of the feature vectors of two frame images along a specific dimension (such as the time dimension T) can be calculated, so that fixed background parts, such as image borders, can be extracted from the video sequence through motion information.
- For example, the difference between a frame image and its grayscaled version can be calculated to extract the colored parts of the frame, which prevents color marks, such as colored annotations or text added after video capture, from being treated as foreground.
- the gradient superposition sub-module 220 can be used to superimpose the gradient information obtained by the gradient calculation sub-module 210 to obtain a gradient superposition map.
- the manner of superposition by the gradient superposition sub-module 220 may include but not limited to weighted summation (such as average value), maximum value, minimum value or others.
- The thresholding sub-module 230 may be configured to threshold the gradient overlay map obtained by the gradient superposition sub-module 220 to obtain an initial binary image; for example, when the value of a pixel does not exceed the threshold, that pixel's value can be reset to 0.
- The morphological processing sub-module 240 may perform morphological processing on the initial binary image obtained by the thresholding sub-module 230 to obtain the binary image corresponding to the video sequence.
- morphological processing may include, but not limited to, morphological dilation, morphological erosion, and the like.
- For example, the morphological processing sub-module 240 may perform morphological dilation on the initial binary image several times and then perform the same number of morphological erosions to obtain the binary image.
- the separation sub-module 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing sub-module 240 .
- For example, a matting operation may be performed using the binary image to obtain the background image.
- For example, the background image can be obtained by an element-wise (matrix dot) product of the binary image and a frame image.
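- Putting sub-modules 210-250 together, here is a minimal NumPy sketch of the whole pipeline for a grayscale video of shape (T, H, W); the threshold value, the number of morphological iterations, and the 3x3 structuring element are arbitrary choices for illustration, not values from the disclosure:

```python
import numpy as np

def video_background(frames: np.ndarray, thresh: float = 0.1,
                     iters: int = 2) -> np.ndarray:
    # Gradient between each frame and its previous frame (frame difference).
    grads = np.abs(np.diff(frames, axis=0))
    # Superpose the gradients (mean here; max or min are also possible).
    overlay = grads.mean(axis=0)
    # Threshold: low-motion pixels are treated as background (mask value 1).
    bg_mask = (overlay < thresh).astype(float)

    # 3x3 max/min filter used as morphological dilation/erosion.
    def filt(m, op):
        pad = np.pad(m, 1, mode="edge")
        shifted = np.stack([pad[i:i + m.shape[0], j:j + m.shape[1]]
                            for i in range(3) for j in range(3)])
        return op(shifted, axis=0)

    for _ in range(iters):
        bg_mask = filt(bg_mask, np.max)   # dilation
    for _ in range(iters):
        bg_mask = filt(bg_mask, np.min)   # the same number of erosions
    # Separation: element-wise (dot) product keeps background pixels only.
    return frames[0] * bg_mask

frames = np.random.default_rng(0).random((10, 32, 32))
bg = video_background(frames)             # (32, 32) background image
```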
- the background image corresponding to the video sequence can be obtained by fully considering the similarity of the background among the frame images in the video sequence.
- the background image is used as a representative of bias, so that the data set can be checked for bias. Understandably, if the dataset is not biased, then the features of the background image should not have any relationship to the labels associated with the foreground regions.
- In some embodiments, the constructed irrelevant data set can be divided into two parts: a first part of irrelevant data items and a second part of irrelevant data items, where the first part can be used to train the model and the second part can be used to test the model.
- the embodiment of the present disclosure does not limit the division method.
- the irrelevant data set may be divided into the first part and the second part according to 9:1 or 1:1 or other ratios.
- the set composed of the first part of irrelevant data items may be called an irrelevant training set, and the set composed of the second part of irrelevant data items may be called an irrelevant test set.
- In some embodiments, the first part of irrelevant data items may include an irrelevant training set and an irrelevant validation set.
- For example, the irrelevant data set can be divided into an irrelevant training set, an irrelevant validation set, and an irrelevant test set at a ratio of 7:2:1.
- the set composed of the first part of irrelevant data items is called the first data set (or training set), and the set composed of the second part of irrelevant data items is called the second data set (or test set).
- the dataset processing module 120 may preprocess the dataset to be processed first, and then construct an irrelevant dataset based on the preprocessed dataset to be processed. Preprocessing includes but is not limited to: cluster analysis, data denoising, etc.
- the model training module 130 may include a training submodule 132 and an evaluation submodule 134 .
- the training sub-module 132 can be used to train the classification model.
- the classification model may be trained based on the first part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the first part.
- In some embodiments, the first part of irrelevant data items used for training may be the entire irrelevant data set, so that more data items can be used for training, making the trained classification model more robust.
- the first part of irrelevant data items used for training may be part of an irrelevant data set. As mentioned above, the irrelevant data set is divided into the first part of irrelevant data items and the second part of irrelevant data items.
- the set of the first part of irrelevant data items used for training is called a training set, and correspondingly, the first part of irrelevant data items may be training items.
- the training here may be to train an initial classification model or may be to update a previously trained classification model, wherein the initial classification model may be a classification model that has not been trained.
- the previously trained classification model may be obtained after training the initial classification model.
- the training sub-module 132 can obtain an initial classification model or a previously trained classification model from the model storage module 140 .
- The training sub-module 132 can obtain the first part of irrelevant data items and the label of each item in the first part from the data set processing module 120 or the data storage module 150. Alternatively, it can obtain the first part of irrelevant data items from the data set processing module 120 and obtain the label of each of those items from the input/output module 110.
- In some embodiments, before training, the training sub-module 132 can preprocess the training set, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, etc.
- the training data item after feature extraction can be characterized as an S-dimensional feature vector, where S is greater than 1.
- For example, the classification model can be a convolutional neural network (CNN) model, which can optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully connected layers, an output layer, and so on.
- the classification model includes a large number of parameters, which can represent the calculation formula or the weight of the calculation factor in the model, and the parameters can be updated iteratively through training.
- In some embodiments, the classification model also has hyper-parameters that guide its construction or training, such as the number of training iterations, the learning rate, the batch size, the number of model layers, and the number of neurons in each layer.
- The hyper-parameters can be obtained by tuning the model on the training set, or they can be preset; preset hyper-parameters are not updated during training.
- the process of training the classification model by the training sub-module 132 can refer to the existing training process.
- For example, the training process can be: input the training data items in the training set into the classification model, use the label corresponding to each training data item as a reference, use a loss function to compute the loss value between the output of the classification model and the corresponding label, and adjust the parameters of the classification model according to the loss value.
- The classification model is trained iteratively over the training data items in the training set, and its parameters are continuously adjusted until, for an input training data item, the model outputs a value close to the corresponding label with high accuracy, for example when the loss function is minimized or falls below a reference threshold.
- the loss function in the training process is a function used to measure the degree to which the classification model is trained (that is, to calculate the difference between the result predicted by the classification model and the true value).
- During training, because the output of the classification model should be as close as possible to the true value (that is, the corresponding label), the predicted value of the current classification model can be compared with the true value, and the parameters of the model updated according to the difference between the two.
- Each training step uses the loss function to judge the difference between the value predicted by the current classification model and the true value and updates the model's parameters accordingly; when the model can predict values very close to the true values, it is considered trained.
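- In skeletal form (a generic sketch with caller-supplied loss and gradient functions, not a specific implementation from the disclosure), the loop just described looks like:

```python
def train_until_converged(params, data, labels, loss_fn, grad_fn,
                          lr=0.1, loss_threshold=1e-3, max_iters=10_000):
    """Compute the loss between model output and label, adjust the
    parameters along the loss gradient, and stop once the loss falls
    below a reference threshold (the model is then considered trained)."""
    for _ in range(max_iters):
        if loss_fn(params, data, labels) < loss_threshold:
            break
        params = params - lr * grad_fn(params, data, labels)
    return params

# Toy usage: fit scalar w so that w * x matches y on a single point.
loss = lambda w, x, y: float((w * x - y) ** 2)
grad = lambda w, x, y: 2 * x * (w * x - y)
print(train_until_converged(3.0, 2.0, 4.0, loss, grad))   # converges to ~2.0
```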
- the "classification model” in the embodiments of the present disclosure may also be called a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may also be referred to as a "model” for short. Publicity is not limited to this.
- the trained classification model may be stored in the model storage module 140 .
- model storage module 140 may be part of model training module 130 .
- the evaluation sub-module 134 can be used to evaluate the classification model. Specifically, the evaluation result of the trained classification model may be determined based on the second part of irrelevant data items in the irrelevant data set and the label of each irrelevant data item in the second part. The evaluation results can be used to characterize the significance of data bias in the data set to be processed.
- the set of the second part of irrelevant data items may be a test set, and correspondingly, the second part of irrelevant data items may be test data items.
- the evaluation process may include: inputting a test data item into a trained classification model, obtaining a prediction result about the test data item, and determining an evaluation result based on a comparison result of the prediction result with a label of the test data item.
- For example, the evaluation result may include at least one of the following: accuracy, precision, recall, F1 score, precision-recall (P-R) curve, average precision (AP), false positive rate, false negative rate, etc.
- For example, a confusion matrix may be constructed, showing the numbers of positive and negative examples, true values, predicted values, and the like.
- Accuracy refers to the proportion of correctly classified samples among all samples. For example, if the number of test data items in the test set is N2 and the number of prediction results consistent with the labels is N21, the accuracy can be expressed as N21/N2.
- Precision refers to the proportion of samples that are actually positive among the samples predicted to be positive. For example, if N22 of the test data items are predicted to be positive and N23 of those N22 items are actually positive, the precision can be expressed as N23/N22.
- Recall refers to the proportion of the actually positive samples that are predicted to be positive. For example, if N31 test data items in the test set are labeled as positive examples and N32 of them are predicted to be positive, the recall can be expressed as N32/N31.
- The P-R curve takes recall as the horizontal axis and precision as the vertical axis.
- A point on the P-R curve corresponds to a particular threshold: the model judges results greater than the threshold as positive samples and results smaller than the threshold as negative samples, and the point reports the recall and precision at that threshold.
- The entire P-R curve is generated by shifting the threshold from high to low; the points near the origin represent the precision and recall of the model when the threshold is largest.
- The F1 index, also known as the F1 score, is the harmonic mean of precision and recall.
- For example, the F1 index can be computed as the ratio of twice the product of precision and recall to their sum: F1 = 2PR/(P+R).
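- These metrics follow directly from confusion-matrix counts; a small sketch (definitions exactly as in the text, variable names invented for illustration):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # correctly classified / all samples
    precision = tp / (tp + fp)            # actually positive / predicted positive
    recall = tp / (tp + fn)               # predicted positive / actually positive
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
```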
- In some embodiments, the evaluation result may include a positive-example characterization value, such as a first precision and/or a first recall.
- The first precision indicates the proportion of samples that are actually positive among the samples predicted to be positive.
- The first recall indicates the proportion of the actually positive samples that are predicted to be positive.
- In some embodiments, the evaluation result may include a negative-example characterization value, such as a second precision and/or a second recall.
- The second precision indicates the proportion of samples that are actually negative among the samples predicted to be negative.
- The second recall indicates the proportion of the actually negative samples that are predicted to be negative.
- the evaluation result may include a first predicted mean value and/or a second predicted mean value.
- the first predicted mean represents the average of predicted values for samples that are actually positive.
- the second predicted mean represents the average of predicted values for samples that are actually negative.
- In some embodiments, the evaluation result may include a mean difference representing the difference between the first predicted mean and the second predicted mean, expressed for example as the difference between the two means or as their ratio.
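- A minimal sketch of these predicted-mean statistics (assuming `scores` holds the model's predicted values and `labels` the true classes; both names are invented for this illustration):

```python
import numpy as np

def mean_difference(scores: np.ndarray, labels: np.ndarray) -> float:
    """Difference between the average predicted value over actually-positive
    samples and that over actually-negative samples (a ratio of the two
    means could be used instead, as the text notes)."""
    first_mean = scores[labels == 1].mean()    # actually positive
    second_mean = scores[labels == 0].mean()   # actually negative
    return first_mean - second_mean

print(mean_difference(np.array([0.9, 0.8, 0.2, 0.1]),
                      np.array([1, 1, 0, 0])))          # 0.7
```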
- the evaluation result can be presented to the user by the input/output module 110 .
- it may be presented through a graphical user interface, which is convenient for users to view.
- the bias significance of the data set can be characterized in a quantitative form.
- This quantitative evaluation scheme can provide users with a clear reference, which is convenient for users to adjust the data set and other processing.
- the input/output module 110 can also visually present representations of dataset biases through the graphical user interface.
- Specifically, a class activation map (CAM) is obtained by inputting a target irrelevant data item into the trained classification model. An overlay result is then obtained by superimposing the CAM and the target irrelevant data item, and the overlay result is displayed.
- The class activation map is also called a class activation heat map.
- Embodiments of the present disclosure can use the CAM to characterize the attention areas of the classification model, specifically which areas (i.e., the areas the model attends to) cause bias.
- CAM can be obtained by using Gradient-based CAM (Grad-CAM).
- For example, the output of the last convolutional layer of the classification model (that is, the last-layer feature maps) can be extracted, and the extracted feature maps can be weighted and summed to obtain the CAM.
- the weighted and summed results can also be used as a CAM after being processed by a Rectified Linear Unit (ReLU) activation function.
- the weights for weighted summation here can be the weights of the top fully connected layer.
- For example, the partial derivatives of the classification model's final softmax output with respect to all pixels of the last-layer feature maps can be calculated, and their global average over the width and height dimensions taken as the corresponding weights.
- Embodiments of the present disclosure do not limit the manner in which the CAM and the target-independent data item (such as the background image) are superimposed.
- weighted summation may be used for superimposition.
- For example, the weights of the CAM and the background image may be equal.
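- A minimal NumPy sketch of this Grad-CAM computation and overlay, assuming the last-layer feature maps and their gradients have already been extracted from the model (the extraction itself is framework-specific and omitted here):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """feature_maps: last conv layer output (C, H, W); grads: gradients of
    the class score w.r.t. those maps (same shape). Per the text: globally
    average the gradients over width/height to get per-channel weights,
    weight-sum the feature maps, then apply ReLU."""
    weights = grads.mean(axis=(1, 2))                       # (C,)
    cam = np.tensordot(weights, feature_maps, axes=(0, 0))  # (H, W)
    return np.maximum(cam, 0.0)                             # ReLU

def overlay(cam: np.ndarray, background: np.ndarray, alpha=0.5) -> np.ndarray:
    """Weighted-sum superposition of the normalised CAM on the irrelevant
    data item (assumed scaled to [0, 1]); alpha=0.5 gives equal weights."""
    cam_n = (cam - cam.min()) / (np.ptp(cam) + 1e-8)
    return alpha * cam_n + (1 - alpha) * background
```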
- In this way, embodiments of the present disclosure provide a scheme for quantitatively evaluating and visually presenting data set bias, so that the significance of the bias can be clearly characterized and the specific locations where bias occurs can be presented visually. Users can thus understand the bias of the data set more intuitively and comprehensively.
- This scheme requires little user participation, can be automated, and improves processing efficiency while ensuring the accuracy of the quantitative assessment of bias.
- the model training module 130 can also be used to adjust the data set to be processed based on the classification model.
- the data set to be processed may have an initial sample weight distribution, correspondingly, the first data set has a first sample weight distribution, and the second data set has a second sample weight distribution.
- For example, if the initial sample weight of a target data item to be processed is a, the sample weight of the irrelevant data item generated from that target data item is also a.
- model training module 130 can be used to obtain the weight distribution of recommended samples based on the iterative training of the classification model, as described below in conjunction with FIG. 3 .
- FIG. 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains recommended sample weights according to an embodiment of the present disclosure.
- a first data set having a first distribution of sample weights and a second data set having a second distribution of sample weights are determined.
- an unrelated data set may be constructed based on the data set to be processed, and the unrelated data set may be divided into a first data set and a second data set, as described in the above embodiments.
- the data items to be processed in the data set to be processed may have initial sample weights, that is, the data set to be processed may have an initial sample weight distribution.
- the initial sample weight may be input by the user through the input/output module 110 .
- initialization sample weights may be determined through an initialization process.
- the sample weight can be used to indicate the sampling probability of the data item to be processed. For example, assuming that the sample weight of the i-th data item to be processed is w_i, then the sampling probability of the i-th data item to be processed is w_i/∑_j w_j, that is, its weight normalized by the sum of all sample weights.
- the initial sample weight distribution may indicate that the sampling probabilities of each data item to be processed in the data set to be processed are equal. Assuming that the data set to be processed includes N data items to be processed, and the initial sample weight of each data item to be processed is 1, then the sampling probability of each data item to be processed is initialized to 1/N.
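- As a short sketch of this normalization (pure NumPy; the function name is an assumption for illustration):

```python
import numpy as np

def sampling_probabilities(weights):
    """Sample weight w_i maps to sampling probability w_i / sum(w)."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

# With N items all initialized to weight 1, each probability is 1/N.
print(sampling_probabilities([1, 1, 1, 1]))  # [0.25 0.25 0.25 0.25]
```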
- the first sample weight distribution and the second sample weight distribution can be correspondingly determined.
- the first data set is sampled based on the first sample weight distribution, and the classification model is trained iteratively.
- the classification model trained in S320 is evaluated based on the second data set, and an evaluation result is obtained.
- the evaluation result may be obtained based on the comparison of the predicted result of the trained classification model for the irrelevant data item in the second data set with the label of the irrelevant data item.
- irrelevant data items can be input into the trained classification model to obtain prediction results about the irrelevant data items, and the evaluation results are determined based on the comparison results of the prediction results of the irrelevant data items and the labels of the irrelevant data items.
- the evaluation result may include at least one of the following: accuracy rate, precision rate, recall rate, F1 index, precision rate-recall rate curve, average precision index, false positive rate, false negative rate, and the like.
- if the evaluation result indicates that the bias is not significant (for example, the evaluation result is not greater than a preset threshold), the process may proceed to 360 .
- the preset threshold can be set based on the processing accuracy and application scenarios of the data set to be processed.
- the preset threshold may be related to the specific meaning of the evaluation result. For example, the evaluation result includes a correct rate, and the preset threshold may be set to, for example, 30% or 50% or other numerical values.
- otherwise, if the evaluation result is greater than the preset threshold, the sample weight distribution is updated at 350 .
- it may return to 310 to continue execution, that is to say, rebuild the first data set and the second data set.
- an irrelevant data item may belong to the first data set in the previous cycle, but the irrelevant data item may belong to the first data set or the second data set in the next cycle.
- it may return to 320 to continue execution, that is to say, the irrelevant data items in the first data set and the second data set do not change, but the first sample weight distribution and/or the second sample weight distribution are updated.
- the first data set may be re-sampled based on the updated first sample weight distribution, and the classification model may be re-trained iteratively. And the retrained classification model is evaluated based on the second data set, and the evaluation result is obtained again.
- 310 to 350 or 320 to 350 may be iteratively performed until the evaluation result indicates that the bias is not significant (for example, the evaluation result is not greater than the preset threshold).
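- A minimal skeleton of this 320-to-350 loop might look as follows; train_fn, eval_fn and update_fn are hypothetical callables supplied by the caller (weighted training, evaluation, and weight update respectively), and the iteration cap is an illustrative assumption:

```python
def recommend_weights(first_set, w1, second_set, w2, threshold,
                      train_fn, eval_fn, update_fn, max_iters=50):
    """Iterate train/evaluate/update until the bias is no longer significant."""
    for _ in range(max_iters):
        model = train_fn(first_set, w1)          # 320: weighted training
        result = eval_fn(model, second_set, w2)  # 330: e.g. accuracy
        if result <= threshold:                  # bias not significant: stop
            break
        w1, w2 = update_fn(w1, w2, model)        # 350: update the distributions
    return w1, w2  # basis for the recommended sample weight distribution
```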
- the sample weight distribution may be updated in a random manner.
- the sample weights of some data items to be processed can be randomly updated; for example, the sample weight of one data item to be processed is updated from 1 to 2, the sample weight of another data item to be processed is updated from 1 to 3, and so on. It can be understood that the random method has uncertainty, which may make the process of obtaining the recommended sample weight distribution take a long time.
- a predetermined rule may be used to update the sample weight distribution.
- the second sample weight distribution may be updated. For example, if the evaluation result indicates that the prediction result of the classification model for an irrelevant data item in the second data set is different from the label of that irrelevant data item, then the sample weight of the irrelevant data item may be increased, for example, from a1 to a1+1, 2*a1, or another value. In this example, the first sample weight distribution may remain unchanged, or it may be updated in other ways.
- the second data set may be exchanged with the first data set before performing the next cycle. For example, in the next cycle, the classification model will be trained based on the second data set of the previous cycle and the updated second sample weight distribution.
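- A sketch of this rule and the optional exchange, where model_predict is a hypothetical callable mapping a data item to a predicted label (the increment of 1 follows the a1 → a1+1 example above):

```python
def rule_based_update(model_predict, second_set, w2):
    """Raise the weight of any item whose prediction mismatches its label."""
    for i, (item, label) in enumerate(second_set):
        if model_predict(item) != label:
            w2[i] = w2[i] + 1        # e.g. a1 -> a1 + 1 (2 * a1 also works)
    return w2

# Optional exchange before the next cycle: the model is then retrained on
# the previous second data set with its updated weights, as described above.
def swap_sets(first_set, w1, second_set, w2):
    return second_set, w2, first_set, w1
```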
- the distribution of sample weights may be optimized through a genetic algorithm to update the distribution of sample weights.
- the sample weight distribution can be used as the genetic initial value of the genetic algorithm, and the objective function can be constructed based on the evaluation result obtained at 330, so that the genetic algorithm can be used to optimize the sample weight distribution, and the optimized sample weight distribution is used as the updated sample weight distribution.
- the embodiment of the present disclosure does not limit the construction method of the objective function of the genetic algorithm.
- for example, if the evaluation result includes the mean difference and the accuracy rate for positive and negative samples, then the sum of the mean difference and the accuracy rate can be used as the objective function. It is understandable that other methods can also be used to construct the objective function, which will not be listed here.
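- A toy genetic-algorithm update is sketched below; the population size, mutation scale, and mutation-only selection scheme are illustrative assumptions (the disclosure does not mandate any particular construction), and `objective` stands for the function built from the evaluation result:

```python
import numpy as np

def genetic_update(weights, objective, pop_size=20, generations=30,
                   mutation_scale=0.1, seed=0):
    """Seed the population with the current weights and minimize `objective`."""
    rng = np.random.default_rng(seed)
    base = np.asarray(weights, dtype=float)
    pop = np.abs(base + rng.normal(0, mutation_scale, (pop_size, base.size)))
    for _ in range(generations):
        fitness = np.array([objective(ind) for ind in pop])
        parents = pop[np.argsort(fitness)[: pop_size // 2]]       # selection
        children = parents[rng.integers(0, len(parents), pop_size - len(parents))]
        children = np.abs(children + rng.normal(0, mutation_scale, children.shape))
        pop = np.vstack([parents, children])                      # mutation, no crossover
    return pop[np.argmin([objective(ind) for ind in pop])]
```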
- the embodiments of the present disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain the recommended sample weight distribution. This process does not require user participation and has a high degree of automation.
- user modifications to the sample weight distribution may be acquired to update the sample weight distribution.
- the user can empirically infer what modification to the sample weight distribution should be made by referring to the evaluation results and/or the displayed overlay results (as described above), and then input the modification through the input/output module 110 to update the sample weight distribution.
- this method can fully consider the user's needs, and update the sample weight distribution based on the user's modification, so that the obtained recommended sample weight distribution can better meet the user's expectations and improve user satisfaction.
- at 360, the sample weight distribution corresponding to the current evaluation result may be used as the recommended sample weight distribution.
- the embodiments of the present disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe the changes in the bias of the data set as the sample weight distribution is updated, so that the data set to be processed can be detected iteratively and an effective, high-quality recommended sample weight distribution with reference value can be obtained.
- the input/output module 110 can also present the recommended sample weight distribution for the user as a reference for further adjustment of the data set to be processed.
- the recommended sample weight distribution is presented visually through a graphical user interface.
- the data set processing module 120 may add or delete the data set to be processed based on the obtained recommended sample weight distribution, so as to construct an unbiased data set.
- the data set processing module 120 may copy the data items to be processed with a large recommended sample weight, so as to expand the number of data items to be processed in the data set to be processed.
- the data set processing module 120 may delete the data items to be processed whose recommended sample weights are small, so as to reduce the number of data items to be processed in the data set to be processed.
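- As an illustrative sketch of this copy/delete adjustment (the two thresholds are assumptions chosen for the example, not values from the disclosure):

```python
def adjust_dataset(items, recommended_weights, high=2.0, low=0.5):
    """Duplicate items with large recommended weights; drop items with small ones."""
    adjusted = []
    for item, w in zip(items, recommended_weights):
        if w >= high:
            adjusted.extend([item, item])   # copy to expand the data set
        elif w > low:
            adjusted.append(item)           # keep as-is
        # items with w <= low are deleted
    return adjusted
```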
- a user's deletion instruction for some data items to be processed may be obtained via the input/output module 110, so as to delete some data items to be processed.
- Other data items input by the user may be obtained via the input/output module 110 to be added to the current data set to be processed.
- users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
- other similar samples may be other images collected by the same (or the same model) image collection device in a similar environment (such as care conditions, etc.).
- the data set to be processed can be added or deleted based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased dataset can be used to train more robust and unbiased task-specific models.
- system 100 shown in FIG. 1 may be a system capable of interacting with users, and the system 100 may be a software system, a hardware system, or a system combining hardware and software.
- the system 100 can be implemented as a computing device or a part of a computing device, where the computing device includes but not limited to a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
- the system 100 shown in FIG. 1 can be implemented as an artificial intelligence platform (AI platform).
- AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users.
- Various AI models or AI sub-models for solving different problems can be built into the AI platform, and the AI platform can establish applicable AI models according to the needs input by users. That is, users only need to determine their own needs in the AI platform and follow the prompts to prepare the data set and upload it to the AI platform, and the AI platform can train for the user an AI model that can be used to realize the user's needs.
- the AI model in the embodiments of the present disclosure can be used to evaluate the data bias of the data set to be processed input by the user.
- FIG. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment according to an embodiment of the present disclosure.
- the system 100 is fully deployed in the cloud environment 410 .
- the cloud environment 410 is an entity that provides cloud services to users by using basic resources in the cloud computing mode.
- the cloud environment 410 includes a cloud data center 412 and a cloud service platform 414.
- the cloud data center 412 includes a large number of basic resources (comprising computing resources, storage resources and network resources) owned by the cloud service provider.
- the computing resources included in the cloud data center 412 can be a large number of computing devices (such as servers).
- the system 100 can be independently deployed on a server or virtual machine in the cloud data center 412; the system 100 can also be deployed in a distributed manner across multiple servers in the cloud data center 412, across multiple virtual machines in the cloud data center 412, or across both servers and virtual machines in the cloud data center 412.
- the system 100 can be abstracted by the cloud service provider into an AI development cloud service 424 on the cloud service platform 414 and provided to the user (for example, with settlement based on usage). After the user purchases the cloud service, the cloud environment 410 utilizes the system 100 deployed in the cloud data center 412 to provide the AI development cloud service 424 to the user.
- the user can upload the data set to be processed through an application program interface (application program interface, API) or GUI.
- the system 100 in the cloud environment 410 receives the data set to be processed uploaded by the user, and can perform operations such as data set processing, model training, and data set adjustment.
- the system 100 can return the evaluation result of the model, the weight distribution of recommended samples, etc. to the user through API or GUI.
- when the system 100 under the cloud environment 410 is abstracted into an AI development cloud service 424 and provided to users, it can be divided into two parts, such as a data set bias evaluation cloud service and a data set adjustment cloud service.
- the user can only purchase the data set bias evaluation cloud service.
- the cloud service platform 414 can construct an irrelevant data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return the evaluation result of the classification model to the user, so that the user is informed of the significance of bias in the data set to be processed.
- the user can also further purchase the data set adjustment cloud service on the cloud service platform 414.
- the cloud service platform 414 can iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user can add to or delete from the data set to be processed with reference to the recommended sample weight distribution to construct an unbiased data set.
- FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments according to an embodiment of the present disclosure.
- the system 100 is distributed and deployed in different environments, which may include but not limited to at least two of cloud environment 510 , edge environment 520 and terminal computing device 530 .
- System 100 may be logically divided into multiple sections, each section having a different function.
- the system 100 includes an input/output module 110 , a data set processing module 120 , a model training module 130 , a model storage module 140 and a data storage module 150 .
- Each part of the system 100 can be deployed in any two or three environments of the terminal computing device 530 , the edge environment 520 and the cloud environment 510 .
- Various parts of the system 100 deployed in different environments cooperate to provide users with various functions.
- the input/output module 110 and the data storage module 150 of the system 100 are deployed in the terminal computing device 530, the data set processing module 120 of the system 100 is deployed in the edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510.
- the user sends the data set to be processed to the input/output module 110 in the terminal computing device 530 , and the terminal computing device 530 stores the data set to be processed to the data storage module 150 .
- the data set processing module 120 in the edge computing device of the edge environment 520 constructs an irrelevant data set based on the data set to be processed from the terminal computing device 530 .
- the model training module 130 in the cloud environment 510 trains a classification model based on an unrelated dataset from the edge environment 520 .
- the cloud environment 510 may also store the trained classification model to the model storage module 140. It should be understood that this application does not limit which parts of the system 100 are deployed in which environment; during actual application, the deployment can be adapted according to the computing capability of the terminal computing device 530, the resource occupancy of the edge environment 520 and the cloud environment 510, or specific application requirements.
- the edge environment 520 is an environment including a collection of edge computing devices that are closer to the terminal computing device 530 , and the edge computing devices include but are not limited to: edge servers, edge small stations with computing capabilities, and the like. It can be understood that the system 100 may also be independently deployed on one edge server in the edge environment 520 , or may be deployed on multiple edge servers in the edge environment 520 in a distributed manner.
- the terminal computing device 530 includes, but is not limited to: a terminal server, a smart phone, a notebook computer, a tablet computer, a personal desktop computer, a smart camera, and the like. It can be understood that the system 100 may also be independently deployed on one terminal computing device 530 , or may be deployed on multiple terminal computing devices 530 in a distributed manner.
- FIG. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the present disclosure.
- the computing device 600 in FIG. 6 may be implemented as a device in the cloud environment 510 in FIG. 5 , a device in the edge environment 520 , or a terminal computing device 530 .
- the computing device 600 shown in FIG. 6 can also be regarded as a computing device cluster, that is, the computing device 600 includes one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and terminal computing devices 530.
- the computing device 600 includes a memory 610 , a processor 620 , a communication interface 630 and a bus 640 , wherein the bus 640 is used for communication between various components of the computing device 600 .
- the memory 610 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory or any combination thereof.
- the memory 610 can store programs, and when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that can be performed by the various modules in the system 100 as described above. It should be understood that the processor 620 and the communication interface 630 may also be used to execute part or all of the content in the embodiments of the data processing method described below in this specification.
- the memory can also store datasets and classification models.
- a part of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as data sets to be processed, irrelevant data sets, etc., and a part of the storage resources in the memory 610 is divided into a model storage module for storing classification models.
- the processor 620 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU) or any combination thereof.
- Processor 620 may include one or more chips.
- the processor 620 may include an accelerator, such as a Neural Processing Unit (Neural Processing Unit, NPU).
- the communication interface 630 uses a transceiver module such as a transceiver to implement communication between the computing device 600 and other devices or communication networks. For example, data may be acquired through communication interface 630 .
- Bus 640 may include pathways for communicating information between various components of computing device 600 (eg, memory 610 , processor 620 , communication interface 630 ).
- FIG. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure.
- the method 700 shown in FIG. 7 can be executed by the system 100 .
- an irrelevant data set is constructed based on the data set to be processed; the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
- the data set to be processed includes a plurality of data items to be processed, and each data item to be processed has a label.
- Data items to be processed may include label-related parts and label-independent parts.
- the part associated with the label of the target data item to be processed may be removed from the target data item to be processed in the data set to be processed, to obtain the remaining part of the target data item to be processed. The remaining part is used to construct an irrelevant data item in the irrelevant data set, and the label of the irrelevant data item corresponds to the label of the target data item to be processed.
- the data set to be processed is an image data set, that is, the data item to be processed is an image.
- image segmentation may be performed on the target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed.
- a background image is used to construct an unrelated data item in an unrelated data set.
- in the image, the part associated with the label is the foreground area, and the other areas in the image except the foreground area are the background area. Foreground-background separation can therefore be used to determine irrelevant data items based only on the background area.
- the data items to be processed in the data set to be processed are video sequences. Then a binary image of the video sequence can be determined based on the gradient information between one frame image in the video sequence and its previous frame image, and based on the binary image, the background image of the video sequence is generated. The background image of the video sequence is then used to construct an unrelated data item in the unrelated data set.
- FIG. 8 shows a schematic flowchart of a process 800 of constructing an unrelated data item according to an embodiment of the present disclosure. Specifically, what is shown in FIG. 8 is the process of constructing irrelevant data items based on the data items to be processed (video sequences).
- the gradient information between two adjacent frames of images in the target video sequence is calculated.
- the gradient of the feature vectors of the two frames of images along the time dimension may be calculated, so as to obtain gradient information.
- the static and unchanging background parts in the video sequence can be obtained, such as image borders and the like.
- a gradient overlay map is obtained based on the overlay of the gradient information.
- the gradient information obtained in step 810 may be superimposed by weighted summation, by taking the maximum value, by taking the minimum value, or the like, to obtain the gradient superposition map.
- thresholding is performed on the gradient overlay image to obtain an initial binary image.
- the initial binary image is subjected to several iterations of morphological dilation, and then the same number of iterations of morphological erosion, so as to obtain the binary image.
- a background image is obtained based on the binary image, and the background image is used as an irrelevant data item corresponding to the video sequence.
- a matting operation may be performed on the binary image, for example, an element-wise matrix (dot) product may be used to obtain the background image.
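- The whole pipeline of process 800 can be sketched as follows with OpenCV and NumPy; the threshold value, kernel size, and number of morphological iterations are illustrative assumptions, and frames are assumed to be grayscale arrays from one video sequence:

```python
import cv2
import numpy as np

def video_background(frames, thresh=25, morph_iters=3):
    """Extract the static background of a video sequence (list of uint8 frames)."""
    # 810: per-frame gradients along the time dimension (adjacent-frame differences)
    grads = [cv2.absdiff(frames[i], frames[i - 1]) for i in range(1, len(frames))]
    # superposition of the gradient information, here by summation
    overlay = np.clip(np.sum(grads, axis=0), 0, 255).astype(np.uint8)
    # thresholding gives the initial binary image
    _, binary = cv2.threshold(overlay, thresh, 255, cv2.THRESH_BINARY)
    # several morphological dilations, then the same number of erosions
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=morph_iters)
    binary = cv2.erode(binary, kernel, iterations=morph_iters)
    # matting via element-wise product keeps only the static background
    background_mask = (binary == 0).astype(frames[0].dtype)
    return frames[0] * background_mask
```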
- the background image corresponding to the video sequence can be obtained in consideration of the similarity between the frame images in the video sequence and the fact that the background in the video sequence is basically unchanged.
- the label of the irrelevant data item is determined based on the label of the data item to be processed. Specifically, if the target data item to be processed has label A, and the target irrelevant data item is obtained by processing the target data item to be processed (such as image segmentation), then the label of the target unrelated data item is also label A.
- the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
- the sample weights of the irrelevant data items are determined based on the sample weights of the data items to be processed. Specifically, the target data item to be processed has a sample weight w, and the target irrelevant data item is obtained by processing the target data item to be processed (such as image segmentation, etc.), then the sample weight of the target unrelated data item is also the sample weight w.
- the manner of dividing the first data set and the second data set is not limited. For example, it may be divided in a manner of 9:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is about 9:1. For example, it may be divided in a manner of 1:1, so that the ratio of the number of irrelevant data items in the first data set to the number of irrelevant data items in the second data set is approximately 1:1.
- the first data set can also be further divided into the first sub-data set and the second sub-data set, for example, the ratio of the number of irrelevant data items in the first sub-data set to the number of irrelevant data items in the second sub-data set is about 7:2. It can be understood that the ratios listed here are only for illustration, and are not intended to limit the embodiments of the present disclosure.
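- A simple weighted random split along these lines might look as follows (the default 9:1 ratio mirrors the example above; the function name and seed are assumptions):

```python
import numpy as np

def split_unrelated(items, weights, ratio=0.9, seed=0):
    """Random 9:1 (by default) split; each subset keeps its items' sample weights."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(items))
    cut = int(len(items) * ratio)
    first, second = idx[:cut], idx[cut:]
    return ([items[i] for i in first], [weights[i] for i in first],
            [items[i] for i in second], [weights[i] for i in second])
```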
- the classification model is trained based on the first data set and the first sample weight distribution.
- the first data set may be sampled based on the first sample weight distribution, and the classification model may be trained on the sampled data using the labels of the irrelevant data items in the first data set.
- the classification model can be trained by using the first data set as a training set.
- the first data set may be preprocessed, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, and the like.
- the embodiment of the present disclosure does not limit the specific structure of the classification model, for example, it may be a convolutional neural network, including at least a convolutional layer and a fully connected layer.
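- One possible realization of weight-based sampling during training (not mandated by the disclosure) uses PyTorch's WeightedRandomSampler, where the sample weights act as relative sampling probabilities:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_weighted_loader(dataset, weights, batch_size=32):
    """Sample a torch Dataset according to its per-item sample weights."""
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(dataset),
        replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# The classification model (e.g. a small CNN with convolutional and fully
# connected layers) is then trained on batches drawn from this loader.
```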
- the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of bias for the data set to be processed having the sample weight distribution.
- the second data set can be used as a test set to obtain an evaluation result.
- the evaluation result may be obtained based on a comparison between the prediction results of the classification model for the irrelevant data items in the second data set and the labels of those irrelevant data items.
- the evaluation result may include a first accuracy rate for positive samples in the second data set and a second accuracy rate for negative samples in the second data set.
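- These two per-class accuracies can be computed as in the following sketch (labels are assumed to be 0/1; the function name is illustrative):

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """First accuracy on positive samples, second accuracy on negative samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == 0
    acc_pos = (y_pred[pos] == 1).mean() if pos.any() else float("nan")
    acc_neg = (y_pred[neg] == 0).mean() if neg.any() else float("nan")
    return acc_pos, acc_neg
```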
- optionally, if the evaluation result is greater than a preset threshold, the sample weight distribution of the data set to be processed may be updated.
- the sample weight distribution of the data set to be processed is updated. Further, after this, the process returns to 720 to obtain the first data set and the second data set again, and 730 and 740 are repeatedly executed until the evaluation result obtained in block 740 indicates that the bias is not significant (for example, the evaluation result is not greater than a preset threshold). Subsequently, the sample weight distribution when the evaluation result is not greater than the preset threshold may be used as the recommended sample weight distribution, and the recommended sample weight distribution may be output.
- the embodiment of the present disclosure does not limit the specific method of updating the sample weight distribution.
- at least one of the following methods can be used: update the sample weight distribution using a predetermined rule, update the sample weight distribution in a random manner, acquire the user's modification to the sample weight distribution to update the sample weight distribution, or optimize the sample weight distribution by genetic algorithm to update the sample weight distribution.
- updating the sample weight distribution may update the first sample weight distribution of the first data set, so that when the process returns to 720, the first sample weight distribution of the first data set in the re-executed 720 is updated, and thus the classification model trained at 730 is also updated.
- updating the sample weight distribution may update the first sample weight distribution of the first data set and update the second sample weight distribution of the second data set.
- the distribution of sample weights for a data set to be processed can be updated, and an unrelated data set can be repartitioned.
- the sample weight distribution of the data set to be processed may be updated so as to adaptively update the first sample weight distribution and the second sample weight distribution, while the irrelevant data items in the first data set and the second data set remain unchanged. In this way, when the process returns to 720, the first data set in the re-executed 720 is updated or its first sample weight distribution is updated, and the classification model trained at 730 is updated accordingly.
- updating the sample weight distribution may update the second sample weight distribution of the second data set.
- the first sample weight distribution may remain unchanged.
- the first data set and the second data set used in the previous execution of 720 may be exchanged.
- the first data set when execution returns to 730 is then the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, so that the evaluation result of the classification model regarding the significance of the bias is more accurate.
- FIG. 9 shows a schematic diagram of a process 900 for updating sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
- an irrelevant data set is constructed based on the data set to be processed; the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
- the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution; the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
- a classification model is trained based on the first data set and the first sample weight distribution.
- the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of the bias of the data set to be processed having the sample weight distribution.
- the second sample weight distribution for the second data set is updated.
- sample weights of all irrelevant data items in the second data set may be updated, or the sample weights of some irrelevant data items in the second data set may be updated.
- the weight distribution of the second sample may be updated based on a prediction result of the classification model for irrelevant data items in the second data set at 940 .
- the sample weights of irrelevant data items in the second data set with correct predictions may be increased, or the sample weights of irrelevant data items in the second data set with wrong predictions may be decreased. For example, assuming that the sample weight of the first irrelevant data item in the second data set is 2, and the prediction result obtained by inputting the first irrelevant data item into the classification model is consistent with its label, then the sample weight of the first irrelevant data item in the second data set is increased, for example, from 2 to 3, 4, or another value.
- conversely, if the prediction result for the second irrelevant data item in the second data set is inconsistent with its label, the sample weight of the second irrelevant data item can be reduced, for example, from 2 to 1.
- the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
- the first data set after the exchange is the second data set in block 920, and the first sample weight distribution of the first data set after the exchange is the second sample weight distribution updated in block 960.
- the second data set after the exchange is the first data set in block 920, and the second sample weight distribution of the second data set after the exchange is the first sample weight distribution in block 920.
- execution returns to 930 . That is, the classification model is retrained using the first data set after the exchange in 970 .
- the recommended sample weight distribution is output.
- the sample weight distribution when the evaluation result is not greater than the preset threshold is used as the recommended sample weight distribution.
- the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
- the focus area of data set bias can be presented in a visual manner, specifically, a class activation map can be obtained by inputting target-independent data items into a trained classification model. Then the overlay result is obtained by overlaying the class activation map with the target-independent data item, and the overlay result is displayed.
- the overlay result can be obtained by weighted summation of the class activation heat map and the target-independent data item, so that by displaying the overlay result, it is possible to visually see which areas are the attention areas of the classification model; these attention areas are important factors that cause bias.
- after obtaining the recommended sample weight distribution, the method may optionally further include adjusting the data set to be processed based on the recommended sample weight distribution to obtain an unbiased data set.
- an unbiased data set can be constructed by adding or deleting the data set to be processed.
- data items to be processed with a large recommended sample weight may be copied to expand the number of data items to be processed in the data set to be processed.
- data items to be processed with small recommended sample weights may be deleted, so as to reduce the number of data items to be processed in the data set to be processed.
- a user's deletion instruction for some data items to be processed may be obtained, so as to delete some data items to be processed.
- Other data items entered by the user can be obtained to be added to the current pending data set.
- users can add or delete data sets to be processed based on the weight distribution of recommended samples. For example, the user can find other samples that are similar to the data item to be processed with a large weight of the recommended sample, and add them to the data set as new data items, thereby realizing data supplementation to the data set.
- other similar samples may be other images collected by the same (or the same model) image collection device in a similar environment (such as care conditions, etc.).
- the data set to be processed can be added or deleted based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Furthermore, this unbiased dataset can be used to train more robust and unbiased task-specific models.
- Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
- Apparatus 1000 may be implemented by software, hardware or a combination of both.
- the device 1000 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
- the device 1000 includes a construction unit 1010 , a division unit 1020 , a training unit 1030 and an evaluation unit 1040 .
- the construction unit 1010 is configured to construct an irrelevant data set based on the data set to be processed; the irrelevant data set includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the data set to be processed.
- the division unit 1020 is configured to divide the irrelevant data set into a first data set and a second data set, the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight The distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed.
- the training unit 1030 is configured to train the classification model based on the first data set and the first sample weight distribution.
- the evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of bias in the data set to be processed with the sample weight distribution.
- the device 1000 may further include an update unit 1050 , an adjustment unit 1060 and a display unit 1070 .
- the update unit 1050 is configured to update the sample weight distribution of the data set to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.
- the updating unit 1050 may be configured to update a part of the sample weight distribution, so that the second sample weight distribution is updated without updating the first sample weight distribution.
- the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: update the sample weight distribution using a predetermined rule, update the sample weight distribution in a random manner, and acquire user modification to the sample weight distribution to update the sample weight distribution, or optimize the sample weight distribution by genetic algorithm to update the sample weight distribution.
- the updating unit 1050 may be configured to use the sample weight distribution when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
- the adjustment unit 1060 is configured to add or delete the data set to be processed based on the weight distribution of the recommended samples, so as to construct an unbiased data set.
- the update unit 1050 is further configured to: obtain a class activation map by inputting a target-independent data item into the trained classification model; and obtain a superposition result by superimposing the class activation map and the target-independent data item.
- the display unit 1070 is configured to display the recommended sample weight distribution and/or the superposition result.
- the construction unit 1010 may be configured to remove the part associated with the label of the target data item to be processed from the target data item to be processed in the data set to be processed, so as to obtain the remaining part of the target data item to be processed; and to use the remaining part to construct an irrelevant data item in the irrelevant data set, where the label of the irrelevant data item corresponds to the label of the target data item to be processed.
- the data set to be processed is an image data set
- the construction unit 1010 may be configured to perform image segmentation on the target data item to be processed in the data set to be processed, so as to obtain a background image corresponding to the target data item to be processed; and to use the background image to construct an irrelevant data item in the irrelevant data set.
- the data item to be processed in the data set to be processed is a video sequence
- the construction unit 1010 may be configured to determine a binary image of the video sequence based on the gradient information between one frame image in the video sequence and its previous frame image; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant data set by using the background image of the video sequence.
- the division of units in the embodiments of the present disclosure is schematic, and it is only a logical function division. In actual implementation, there may be other division methods.
- the functional units in the disclosed embodiments can be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
- the data processing device 1000 shown in FIG. 10 can be used to implement the above data processing process shown in conjunction with FIGS. 7 to 9 .
- the present disclosure can also be implemented as a computer program product.
- a computer program product may include computer readable program instructions for carrying out various aspects of the present disclosure.
- the present disclosure may be implemented as a computer-readable storage medium, on which computer-readable program instructions are stored, and when a processor executes the instructions, the processor is made to execute the above-mentioned data processing process.
- a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
- a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
- computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
- Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
- Computer-readable program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
- Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
- in some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can execute computer-readable program instructions, thereby implementing various aspects of the present disclosure.
- These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for realizing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer-readable program instructions.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a data processing method and apparatus, a computing device, and a computer-readable storage medium. In the method, an irrelevant data set carrying labels is constructed based on a data set to be processed; the irrelevant data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on the sample weights of the data items to be processed in the data set to be processed; a classification model is trained based on the first data set and the first sample weight distribution; and the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the significance of the bias of the data set to be processed having the sample weight distributions.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110574231.3A CN115471714A (zh) | 2021-05-25 | 2021-05-25 | Data processing method and apparatus, computing device, and computer-readable storage medium |
CN202110574231.3 | 2021-05-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022247448A1 (fr) | 2022-12-01 |
Family
ID=84229488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/083841 WO2022247448A1 (fr) | 2021-05-25 | 2022-03-29 | Procédé et appareil de traitement de données, dispositif informatique et support de stockage lisible par ordinateur |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115471714A (fr) |
WO (1) | WO2022247448A1 (fr) |
- 2021-05-25: CN application CN202110574231.3A filed; published as CN115471714A (status: active, pending)
- 2022-03-29: PCT application PCT/CN2022/083841 filed; published as WO2022247448A1 (status: active, application filing)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915450A (zh) * | 2012-09-28 | 2013-02-06 | 常州工学院 | Online adaptively adjusted target image region tracking method |
CN112639843A (zh) * | 2018-09-10 | 2021-04-09 | 谷歌有限责任公司 | Suppressing biased data using machine learning models |
US20200167653A1 (en) * | 2018-11-27 | 2020-05-28 | Wipro Limited | Method and device for de-prejudicing artificial intelligence based anomaly detection |
US20200372406A1 (en) * | 2019-05-22 | 2020-11-26 | Oracle International Corporation | Enforcing Fairness on Unlabeled Data to Improve Modeling Performance |
CN112115963A (zh) * | 2020-07-30 | 2020-12-22 | 浙江工业大学 | Method for generating an unbiased deep learning model based on transfer learning |
CN112508580A (zh) * | 2021-02-03 | 2021-03-16 | 北京淇瑀信息科技有限公司 | Model construction method, apparatus, and electronic device based on a reject inference method |
Non-Patent Citations (2)
Title |
---|
CORRELL MICHAEL; HEER JEFFREY: "Surprise! Bayesian Weighting for De-Biasing Thematic Maps", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE, USA, vol. 23, no. 1, 1 January 2017 (2017-01-01), USA, pages 651 - 660, XP011634791, ISSN: 1077-2626, DOI: 10.1109/TVCG.2016.2598618 * |
JINYIN CHEN, CHEN YIPENG; CHEN YIMING; ZHENG HAIBIN; JI SHOULING; SHI JIE; CHENG YAO: "Fairness Research on Deep Learning", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, KEXUE CHUBANSHE, BEIJING, CN, vol. 58, no. 2, 8 February 2021 (2021-02-08), CN , pages 264 - 280, XP093007463, ISSN: 1000-1239, DOI: 10.7544/issn1000-1239.2021.20200758 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118135065A (zh) * | 2024-05-07 | 2024-06-04 | 山东汉鑫科技股份有限公司 | Dynamic grayscale image generation method and system for tunnels, storage medium, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN115471714A (zh) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11416772B2 (en) | Integrated bottom-up segmentation for semi-supervised image segmentation | |
WO2018121690A1 (fr) | Object attribute detection method and device, neural network training method and device, and region detection method and device | |
CN111724083A (zh) | Training method and apparatus for a financial risk identification model, computer device, and medium | |
CN109993102B (zh) | Similar face retrieval method, apparatus, and storage medium | |
US11875512B2 (en) | Attributionally robust training for weakly supervised localization and segmentation | |
CN111582409A (zh) | Training method for an image label classification network, image label classification method, and device | |
WO2024060416A1 (fr) | End-to-end weakly supervised semantic segmentation and labeling method for pathological images | |
Ayyar et al. | Review of white box methods for explanations of convolutional neural networks in image classification tasks | |
Li et al. | Localizing and quantifying infrastructure damage using class activation mapping approaches | |
CN117095180B (zh) | Embryo development stage prediction and quality assessment method based on stage identification | |
CN116258937A (zh) | Few-shot segmentation method, apparatus, terminal, and medium based on an attention mechanism | |
Lin et al. | An analysis of English classroom behavior by intelligent image recognition in IoT | |
WO2022247448A1 (fr) | Data processing method and apparatus, computing device, and computer-readable storage medium | |
Kajabad et al. | YOLOv4 for urban object detection: Case of electronic inventory in St. Petersburg | |
Anggoro et al. | Classification of Solo Batik patterns using deep learning convolutional neural networks algorithm | |
CN116245157A (zh) | Facial expression representation model training method, facial expression recognition method, and apparatus | |
CN116029760A (zh) | Message pushing method, apparatus, computer device, and storage medium | |
CN114724174A (zh) | Pedestrian attribute recognition model training method and apparatus based on incremental learning | |
CN114170625A (zh) | Context-aware, noise-robust pedestrian search method | |
CN113763313A (zh) | Text image quality detection method, apparatus, medium, and electronic device | |
КАЛИТА | Information technology of facial emotion recognition for visual safety surveillance | |
Wang et al. | Framework for facial recognition and reconstruction for enhanced security and surveillance monitoring using 3D computer vision | |
Yang et al. | [Retracted] Optimization Algorithm of Moving Object Detection Using Multiscale Pyramid Convolutional Neural Networks | |
CN116703933A (zh) | Image segmentation training method, apparatus, electronic device, and storage medium | |
Tsai et al. | Real-time salient object detection based on accuracy background and salient path source selection |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22810181; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22810181; Country of ref document: EP; Kind code of ref document: A1