WO2024131408A1 - A model building method, apparatus, electronic device, and computer-readable medium - Google Patents
A model building method, apparatus, electronic device, and computer-readable medium
- Publication number
- WO2024131408A1 (international application PCT/CN2023/132631)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- image data
- network
- data
- region
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Definitions
- the present disclosure relates to the field of image processing technology, and in particular to a model building method, device, electronic device, and computer-readable medium.
- In these image processing fields, machine learning models can be used to implement the image processing tasks involved (for example, target detection tasks, semantic segmentation tasks, or key point detection tasks).
- the present disclosure provides a model building method, device, electronic device, and computer-readable medium, which can achieve the purpose of building and processing a machine learning model in a certain image processing field.
- the present disclosure provides a model building method, the method comprising:
- using a first data set, a model to be processed is trained to obtain a first model;
- the first data set includes at least one first image data;
- the first model includes a backbone network;
- according to the backbone network in the first model, a second model is constructed;
- the second model includes the backbone network and a first processing network, and the first processing network refers to all or part of the networks in the second model except the backbone network;
- the second model is trained using a second data set to obtain a model to be used;
- the model to be used includes the backbone network and a second processing network, the network parameters of the backbone network in the second model remain unchanged during the training process for the second model, and the second processing network refers to the training result of the first processing network in the second model;
- the second data set includes at least one second image data.
- the first processing network is used to process output data of the backbone network to obtain an output result of the second model.
- the first image data belongs to single-object image data
- At least two objects exist in the second image data.
- the method further includes: using the second model, initializing an online model and a momentum model;
- the step of training the second model using the second data set to obtain a model to be used includes:
- the model to be used is determined according to the second data set, the online model and the momentum model.
- the process of determining the model to be used includes:
- if a preset stop condition is not reached, the online model and the momentum model are updated, and the step of selecting the image data to be processed from the at least one second image data is repeated until the preset stop condition is reached; the model to be used is then determined according to the online model.
- the at least two image data to be used include at least one third image data and at least one fourth image data;
- the object region prediction result corresponding to the third image data is determined using the online model
- the object region prediction result corresponding to the fourth image data is determined using the momentum model.
- updating the online model and the momentum model according to the object region prediction results corresponding to the at least two image data to be used and the object region labels corresponding to the at least two image data to be used includes:
- the momentum model is updated according to the updated online model.
- updating the online model and the momentum model according to the object region prediction results corresponding to the at least two image data to be used and the object region labels corresponding to the at least two image data to be used includes:
- the network parameters of the first processing network in the momentum model are updated according to the updated network parameters of the first processing network in the online model.
- updating the network parameters of the first processing network in the momentum model according to the updated network parameters of the first processing network in the online model includes:
- a weighted sum of the network parameters of the first processing network in the momentum model before updating and the network parameters of the first processing network in the online model after updating is computed, to obtain the network parameters of the first processing network in the momentum model after updating.
- the object region label includes at least one target region representation data;
- the object region prediction result includes at least one prediction region feature;
- the method further comprises:
- the determining, according to the object region prediction result corresponding to the at least one third image data and the object region prediction result corresponding to the at least one fourth image data, the contrast loss corresponding to the online model comprises:
- the contrast loss corresponding to the online model is determined according to at least one prediction region feature corresponding to the at least one third image data, and positive samples and negative samples of each prediction region feature corresponding to the at least one third image data.
- the object region prediction result further includes prediction region representation data corresponding to each of the prediction region features
- the at least one predicted region feature corresponding to the third image data includes a to-be-used region feature
- the target region representation data corresponding to the positive sample is determined according to the size of the overlapping region between the prediction region representation data corresponding to the positive sample and each target region representation data corresponding to the fourth image data to which the positive sample belongs;
- the target region representation data corresponding to the to-be-used region feature is determined according to the size of the overlapping region between the predicted region representation data corresponding to the to-be-used region feature and each target region representation data corresponding to the third image data to which the to-be-used region feature belongs;
- the target region representation data corresponding to the negative sample is determined according to the size of an overlapping region between the prediction region representation data corresponding to the negative sample and each target region representation data corresponding to the fourth image data to which the negative sample belongs.
- the process of acquiring the object region label corresponding to the image data to be processed includes:
- the object area label corresponding to the image data to be processed is searched for in a pre-constructed mapping relationship; the mapping relationship includes the correspondence between each second image data and the object area label corresponding to that second image data; the object area label corresponding to a second image data is determined by performing object area search processing on the second image data using a selective search algorithm.
- the output result of the second model is a target detection result, a semantic segmentation result, or a key point detection result.
- the using the first data set to train the model to be processed to obtain the first model includes:
- the first data set is used to perform self-supervisory training on the model to be processed to obtain a first model.
- the method further includes:
- the model to be used is fine-tuned using a preset image data set to obtain an image processing model;
- the image processing model includes a target detection model, a semantic segmentation model or a key point detection model.
- the present disclosure provides a model building device, comprising:
- a first training unit is used to train the model to be processed using a first data set to obtain a first model;
- the first data set includes at least one first image data;
- the first model includes a backbone network;
- a model building unit configured to build a second model according to the backbone network in the first model;
- the second model includes the backbone network and a first processing network, and the first processing network refers to all or part of the networks in the second model except the backbone network;
- the second training unit is used to train the second model using a second data set to obtain a model to be used;
- the model to be used includes the backbone network and a second processing network, the network parameters of the backbone network in the second model remain unchanged during the training process for the second model, and the second processing network refers to the training result of the first processing network in the second model;
- the second data set includes at least one second image data.
- the present disclosure provides an electronic device, the device comprising: a processor and a memory;
- the memory is used to store instructions or computer programs
- the processor is used to execute the instructions or computer programs in the memory so that the electronic device executes the model building method provided by the present disclosure.
- the present disclosure provides a computer-readable medium in which instructions or a computer program are stored;
- when the instructions or computer program are run on a device, the device executes the model building method provided by the present disclosure.
- the present disclosure provides a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, wherein the computer program contains program codes for executing the model building method provided by the present disclosure.
- FIG. 1 is a flow chart of a model building method provided by the present disclosure;
- FIG. 2 is a schematic diagram of a pre-training process for a backbone network provided by the present disclosure;
- FIG. 3 is a schematic diagram of a pre-training process for networks other than the backbone network in a model provided by the present disclosure;
- FIG. 4 is a flow chart of another model building method provided by the present disclosure;
- FIG. 5 is a schematic diagram of the structure of a model building device provided by an embodiment of the present disclosure;
- FIG. 6 is a schematic diagram of the structure of another model building device provided in an embodiment of the present disclosure;
- FIG. 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure.
- the image processing models used in these image processing fields can usually be constructed and processed using a method of pre-training + fine-tuning.
- However, this implementation scheme suffers from an inconsistency in training objects. The specific reasons are as follows: the pre-training process usually only trains the backbone network in the image processing model (for example, a target detection model), whereas the fine-tuning process trains all networks in the image processing model. As a result, the objects trained in the pre-training process differ from the objects trained in the fine-tuning process, which leads to a mismatch in training objects between the two stages.
- To this end, the present disclosure provides a model construction method that can be applied to certain image processing fields (for example, target detection, semantic segmentation, or key point detection). For a machine learning model used in such a field (for example, a target detection model, a semantic segmentation model, or a key point detection model), the method first uses a first data set (for example, a large amount of single-object image data) to train the model to be processed to obtain a first model, so that the backbone network in the first model has a good image feature extraction capability, thereby realizing the pre-training of the backbone network in the machine learning model. Then, a second model is constructed according to the backbone network in the first model, so that the image processing function implemented by the second model is consistent with the image processing function the machine learning model is required to implement. Then, a second data set (for example, some multi-object image data) is used to train the second model while the network parameters of the backbone network in the second model remain unchanged, so that the networks in the second model other than the backbone network are also pre-trained.
- In this way, both the backbone network in the above image processing model (for example, a target detection model) and the other networks in the image processing model except the backbone network (for example, a detection head network) are pre-trained, so that all networks in the final pre-trained model have relatively good data processing performance.
- This can effectively avoid the adverse effects caused by pre-training only the backbone network, thereby effectively improving the image processing effect (for example, target detection effect) of the finally constructed image processing model.
- In addition, the above model building method not only utilizes single-object image data in model pre-training, but also utilizes multi-object image data, so that the final pre-trained model has better image processing functions for multi-object image data.
- This can effectively avoid the adverse effects caused by using only single-object image data for model pre-training processing, thereby effectively improving the image processing effect (for example, target detection effect) of the final constructed image processing model.
- Moreover, the model building method provided in the present disclosure focuses not only on classification tasks but also on regression tasks, so that the final pre-trained model has better image processing performance. This can effectively avoid the adverse effects caused by pre-training that focuses only on classification tasks, thereby effectively improving the image processing effect (for example, the target detection effect) of the finally constructed image processing model.
- the present disclosure does not limit the execution subject of the above model building method.
- the model building method provided in the embodiment of the present disclosure can be applied to a device with data processing function such as a terminal device or a server.
- the model building method provided in the embodiment of the present disclosure can also be implemented with the help of the data communication process between the terminal device and the server.
- the terminal device can be a smart phone, a computer, a personal digital assistant (PDA) or a tablet computer.
- the server can be an independent server, a cluster server or a cloud server.
- As shown in the flow chart of FIG. 1, the model building method provided by the present disclosure includes the following S101-S103.
- S101 Using a first data set, training a model to be processed to obtain a first model; the first data set includes at least one first image data; the first model includes a backbone network.
- the first data set refers to the image data set required for pre-training the backbone network (Backbone) in the image processing model for the target application field.
- the target application field refers to the application field of the model building method provided by the present disclosure; and the present disclosure does not limit the target application field, for example, it can be the field of target detection, the field of image segmentation, or the field of key point detection.
- the present disclosure does not limit the implementation method of the first data set mentioned above.
- it can be implemented by any existing or future image data set that can be used for pre-training processing of the backbone network (for example, the ImageNet image data set).
- the first data set may include at least one first image data.
- the first image data refers to the image data used in the pre-training process for the backbone network; and the present disclosure does not limit the first image data.
- the first image data may belong to single-object image data (for example, the single-object image data of image 1 shown in FIG. 2), so that there is only one object in the first image data (for example, there is only one object, a cat, in the image 1).
- the model to be processed refers to the model used when pre-training the backbone network; and the model to be processed may at least include the backbone network.
- the present disclosure does not limit the implementation method of the above-mentioned model to be processed. For ease of understanding, two situations are described below.
- the above model to be processed can be a classification model
- the training process for the model to be processed can be specifically as follows: using the above at least one first image data and the classification label corresponding to the at least one first image data, the model to be processed is subjected to fully supervised training (for example, the training process shown in the "fully supervised pre-training" part in Figure 2), and the trained model to be processed is determined as the first model.
- the "classification label corresponding to the first image data” is used to indicate the category to which the first image data actually belongs; and the present disclosure does not limit the acquisition process of the "classification label corresponding to the first image data", for example, it can be implemented by means of manual labeling.
- the present disclosure does not limit the implementation method of the "classification model" in the above paragraph.
- the classification model may include a backbone network and a fully connected (FC) layer; the input data of the FC layer includes the output data of the backbone network.
- the present disclosure does not limit the implementation method of the step "performing a fully supervised training process on the model to be processed” in the above paragraph.
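- For ease of understanding, a minimal sketch of Case 1 is given below. It assumes a PyTorch-style setup; the backbone choice (ResNet-50), the hyperparameters, and all names are illustrative rather than prescribed by the present disclosure.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Identity()  # keep only the feature extractor

class ModelToBeProcessed(nn.Module):
    """Backbone followed by a fully connected (FC) classification layer."""
    def __init__(self, backbone, feat_dim=2048, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(feat_dim, num_classes)  # FC input = backbone output

    def forward(self, x):
        return self.fc(self.backbone(x))

model = ModelToBeProcessed(backbone)
criterion = nn.CrossEntropyLoss()  # classification labels supervise training
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(images, labels):
    """One fully supervised pre-training step on a batch of first image data."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```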
- Case 2 In some application scenarios, self-supervised pre-training can be performed on the backbone network.
- the above model to be processed may include the backbone network and the prediction layer (Predictor), and the input data of the prediction layer includes the output data of the backbone network.
- the training process for the model to be processed may specifically be: using the above at least one first image data, the model to be processed is subjected to self-supervised training (for example, the training process shown in the "self-supervised pre-training" part in FIG. 2), and the trained model to be processed is determined as the first model.
- the present disclosure does not limit the implementation method of the "prediction layer” in the above paragraph.
- the present disclosure does not limit the implementation method of the step "performing self-supervisory training processing on the model to be processed” in the above paragraph.
- both Image 2 and Image 3 shown in FIG2 are obtained by performing data enhancement processing on the same image data (for example, Image 1 shown in FIG2 ), but the data enhancement parameters used in generating Image 2 are different from the data enhancement parameters used in generating Image 3, so that there is a difference between Image 2 and Image 3 in at least one aspect (for example, color, aspect ratio, size, image information, etc.).
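- As a minimal sketch of how two differing views (such as image 2 and image 3) can be produced from one source image (such as image 1), assuming torchvision transforms; the exact augmentations and the file name are illustrative:

```python
import torchvision.transforms as T
from PIL import Image

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),  # varies size / aspect ratio
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),           # varies color
    T.ToTensor(),
])

image_1 = Image.open("image_1.jpg").convert("RGB")  # hypothetical file
image_2 = augment(image_1)  # one random draw of enhancement parameters
image_3 = augment(image_1)  # a different random draw, so the two views differ
```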
- the "first model” above refers to the training result of the model to be processed above, and the backbone network in the first model refers to the result obtained by training the backbone network in the model to be processed above, so that the backbone network in the first model is used to represent the pre-trained backbone network, thereby making the backbone network in the first model have better image representation performance.
- the present disclosure does not limit the determination process of the above “first model”.
- the determination process of the “first model” may specifically be: using the first data set to perform full-supervision training on the model to be processed (for example, the training process shown in the “full-supervision pre-training” section in FIG. 2 ) to obtain the first model.
- the determination process of the “first model” may specifically be: using the first data set to perform self-supervision training on the model to be processed (for example, the training process shown in the “self-supervision pre-training” section in FIG. 2 ) to obtain the first model.
- It should be noted that the first data set may be large-scale image data (for example, large-scale single-object image data).
- S102 constructing a second model according to the backbone network in the first model; the second model includes the backbone network and a first processing network, and the first processing network refers to all or part of the network in the second model except the backbone network.
- the second model refers to a model constructed using the backbone network in the first model above, which can realize the image processing function (for example, target detection function, image segmentation function or key point detection function) required to be realized in the target application field above.
- the second model may refer to a model with target detection function constructed using the backbone network in the first model.
- the target application field is the image segmentation field
- the second model may refer to a model with image segmentation function constructed using the backbone network in the first model.
- the target application field is the key point detection field
- the second model may refer to a model with key point detection function constructed using the backbone network in the first model.
- the second model may include a first processing network and a backbone network in the first model above.
- the first processing network refers to all or part of the network in the second model except the backbone network.
- the first processing network may be a network located after the backbone network in the second model (e.g., a detection head network, etc.), so that the input data of the first processing network includes the output data of the backbone network, so that the first processing network can be used to process the output data of the backbone network to obtain the output result of the second model (e.g., target detection result, image segmentation result, or key point detection result, etc.).
- the present disclosure does not limit the implementation method of the above "first processing network".
- it may include: other parts or all networks except the backbone network in the image processing model under the above target application field.
- the first processing network may refer to a network existing in the image processing model and used to process the output data of the backbone network in the image processing model. It can be seen that in one possible implementation method, when the target application field is the target detection field, the first processing network may be a detection head network.
- the present disclosure does not limit the implementation of the above “detection head network”.
- the detection head network may include two networks, Neck and Head.
- the detection head network may only include one network, Head.
- the pre-trained backbone network can be used to construct an image processing model under the target application field, so that the image processing model includes the pre-trained backbone network, so that the image processing model can be used to subsequently achieve the purpose of pre-training all networks other than the backbone network in the image processing model.
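- For ease of understanding, a minimal sketch of S102 is given below, assuming the target application field is target detection and reusing the `backbone` from the earlier sketch; the placeholder head and all names are illustrative.

```python
import torch.nn as nn

class SecondModel(nn.Module):
    """Pre-trained backbone plus a first processing network (e.g., a detection head)."""
    def __init__(self, pretrained_backbone, processing_network):
        super().__init__()
        self.backbone = pretrained_backbone       # from the first model (S101)
        self.processing_network = processing_network

    def forward(self, x):
        feats = self.backbone(x)                  # head consumes backbone output
        return self.processing_network(feats)     # e.g., boxes + region features

# Placeholder head; a real detection head (e.g., Neck + Head) would go here.
head = nn.Linear(2048, 256)
second_model = SecondModel(backbone, head)

# Keep the backbone's network parameters unchanged during S103.
for p in second_model.backbone.parameters():
    p.requires_grad = False
```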
- S103: using a second data set, the second model is trained to obtain a model to be used;
- the model to be used includes a backbone network and a second processing network, the network parameters of the backbone network in the second model remain unchanged during the training process of the second model, and the second processing network refers to the training result of the first processing network in the second model;
- the second data set includes at least one second image data.
- the second data set refers to the image data set required for pre-training of parts or all of the networks other than the backbone network in the image processing model for the target application field mentioned above.
- the second data set may include at least one second image data.
- the second image data refers to the image data required for pre-training processing of other parts or all networks except the backbone network in the image processing model under the target application field mentioned above; and the present disclosure does not limit the second image data.
- the second image data may belong to multi-object image data (for example, the multi-object image data of image 4 shown in FIG. 3), so that there are at least two objects in the second image data (for example, there are two objects, a cat and a dog, in the image 4).
- the "model to be used” mentioned above refers to the training result of the second model mentioned above; and the model to be used includes a backbone network and a second processing network. Among them, because the network parameters of the backbone network in the second model remain unchanged during the training process of the second model, the backbone network in the model to be used is the "backbone network in the first model" mentioned above (that is, the backbone network pre-trained by S101 above).
- the second processing network in the model to be used refers to the training result of the first processing network in the second model, so that the second processing network can better cooperate with the backbone network to complete the image processing tasks under the above target application field.
- the present disclosure also provides a determination process of the above-mentioned "model to be used", which may specifically include the following steps 11 and 12.
- Step 11 Using the second model above, initialize the online model and momentum model.
- the online model refers to an image processing model that is required to be referenced when pre-training the other parts or all networks except the backbone network in the image processing model for the target application field described above.
- the online model may refer to the online model shown in FIG3 .
- the momentum model refers to another image processing model that is required to be referenced when pre-training the other parts or all networks except the backbone network in the image processing model for the target application field above.
- the momentum model may refer to the momentum model shown in FIG3 .
- the present disclosure does not limit the association relationship between the above online model and the above momentum model.
- For example, the network parameters in the momentum model are determined as the exponential moving average of the network parameters in the online model (for example, the result shown in formula (1) below).
- V_t = α × V_(t-1) + (1 − α) × D_t    (1)
- where V_t represents the parameter value of a network parameter in the momentum model when executing the t-th round of the training process;
- V_(t-1) represents the parameter value of the network parameter in the momentum model when executing the (t-1)-th round of the training process;
- D_t represents the parameter value of the corresponding network parameter in the online model when executing the t-th round of the training process;
- D_1 refers to the parameter value of the network parameter in the above second model; and α represents a preset momentum coefficient.
- For example, the second model can be directly determined as the initial value of the online model, so that the parameter values of the network parameters in the initialized online model are consistent with those in the second model; the exponential moving average of the initialized online model is then determined as the initial value of the momentum model, so that the parameter values of the network parameters in the initialized momentum model are the exponential moving average of those in the initialized online model (for example, the result shown in formula (1) above). In this way, the purpose of initializing the online model and the momentum model is achieved.
- this step can be used to initialize the above online model and the above momentum model to the same network architecture as the second model, and the initialization process of the network parameters of the momentum model can be performed according to the above formula (1).
- the backbone network parameters in the momentum model and the online model should be the same as the backbone network parameters in the second model, and only the network parts other than the backbone need to be initialized.
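- A minimal sketch of step 11 and of the update in formula (1) is given below, assuming PyTorch modules and the `second_model` from the earlier sketch; the momentum coefficient value is illustrative.

```python
import copy
import torch

online_model = copy.deepcopy(second_model)    # online model initialized from the second model
momentum_model = copy.deepcopy(online_model)  # same architecture and parameter values

@torch.no_grad()
def ema_update(momentum_model, online_model, alpha=0.99):
    """V_t = alpha * V_(t-1) + (1 - alpha) * D_t, applied per parameter.

    Since the backbone is frozen and identical in both models, only the
    network parts other than the backbone actually change over training.
    """
    for v, d in zip(momentum_model.parameters(), online_model.parameters()):
        v.mul_(alpha).add_(d, alpha=1 - alpha)
```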
- Step 12 Determine the model to be used based on the second data set, the online model initialized above, and the momentum model initialized above.
- step 12 may specifically include the following steps 121 to 127.
- Step 121 Select image data to be processed from at least one second image data.
- the image data to be processed refers to any image data that exists in the above at least one second image data and has not yet participated in the model training process.
- the present disclosure does not limit the determination process of the above-mentioned image data to be processed.
- it may specifically be: first, all image data that have not participated in the model training process are screened out from at least one second image data above; then, one image data is randomly selected from all the screened image data, and determined as the image data to be processed, so as to perform some data processing on the image data to be processed during the current round of training (for example, the processing process shown in steps 122-123 below, etc.).
- Step 122 Obtain the object region label corresponding to the above image data to be processed.
- the object region label is used to indicate the region occupied by each object in the above-mentioned image data to be processed in the image data to be processed.
- the present disclosure does not limit the implementation methods of the above object region labels.
- In the field of target detection, the object region labels can be implemented with the help of object boxes (for example, box 1 and box 2 shown in FIG. 3);
- in the field of image segmentation, the object region labels can be implemented with the help of masks;
- in the field of key point detection, the object region labels can be implemented with the help of key point position identification boxes.
- the present disclosure does not limit the method for obtaining the above object region label.
- two cases are described below.
- Case 1 In some application scenarios (for example, scenarios with sufficient storage resources), the object area label corresponding to each second image data can be determined in advance, and the object area labels corresponding to these second image data can be stored in a certain storage space, so that in each subsequent round of training, the object area label corresponding to a certain second image data can be directly read from the storage space.
- the above step 122 can be specifically: searching for the object region label corresponding to the above image data to be processed from the pre-built mapping relationship.
- the mapping relationship includes the corresponding relationship between each second image data and the object region label corresponding to each second image data; and the embodiment of the present disclosure does not limit the mapping relationship, for example, it can be implemented using a database.
- the present disclosure does not limit the determination process of the object area label corresponding to the i-th second image data recorded in the above mapping relationship.
- it can be implemented by means of manual labeling.
- the automatic determination process of the object area label corresponding to the i-th second image data can specifically be: using a selective search algorithm (Selective Search), object area search processing is performed on the i-th second image data (for example, image 4 shown in FIG. 3) to obtain the object area label corresponding to the i-th second image data (for example, {box 1, box 2} shown in FIG. 3).
- the selective search algorithm is an unsupervised algorithm; i is a positive integer, i ≤ I, and I represents the number of images in the above "at least one second image data".
- the object area label corresponding to each second image data can be pre-determined through offline mode, and the object area labels corresponding to all the second image data can be stored in a certain storage space in a certain manner (for example, a key-value pair manner), so that the correspondence between each second image data and the object area label corresponding to each second image data is stored in the storage space in the above mapping relationship manner, so that in each subsequent round of training, the object area label corresponding to a certain second image data can be directly read from the storage space, which can effectively save the resources required for real-time determination of the object area label corresponding to each second image data, thereby helping to improve the network training effect.
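- A minimal sketch of the offline label generation in Case 1 is given below, assuming the opencv-contrib-python package (whose selective search implementation is unsupervised); the file paths and the box cap are illustrative.

```python
import cv2

def selective_search_boxes(image_path, max_boxes=100):
    """Return (x, y, w, h) region proposals for one image via selective search."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()   # unsupervised region proposal mode
    rects = ss.process()               # one (x, y, w, h) per proposed region
    return [tuple(int(v) for v in r) for r in rects[:max_boxes]]

# Mapping relationship: each second image data -> its object area label,
# stored as key-value pairs for direct lookup in later training rounds.
mapping = {path: selective_search_boxes(path)
           for path in ["img_0001.jpg", "img_0002.jpg"]}  # hypothetical paths
```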
- Case 2 In some application scenarios (e.g., scenarios with limited storage resources), the object region labels corresponding to the above image data to be processed may be determined in real time during each round of training.
- the above step 122 may specifically be: using the above selective search algorithm, performing object region search processing on the above image data to be processed, and obtaining the object region label corresponding to the image data to be processed.
- Based on the relevant content of step 122 above, it can be seen that, for the current round of the training process, after the image data to be processed is obtained, the object area label corresponding to the image data to be processed can be acquired, so that the object area label can later be used as supervision information.
- Step 123 Determine at least two image data to be used and the object region labels corresponding to the at least two image data to be used according to the image data to be processed and the object region labels corresponding to the image data to be processed.
- the image data to be used refers to the image data determined by performing data enhancement processing on the above image data to be processed.
- each image data to be used refers to the data enhancement processing result of the above image data to be processed, but because the enhancement parameters used in generating each image data to be used are different, any two image data among these image data to be used are different in at least one aspect (for example, color, aspect ratio, size, image information, etc.), so that these image data to be used can represent the same object with the help of different pixel information (for example, image 5 and image 6 shown in Figure 3 can represent two objects, a cat and a dog, with the help of different pixel information, etc.).
- the present disclosure does not limit the implementation method of the above “at least two image data to be used”.
- the “at least two image data to be used” may include image 5 and image 6 shown in Figure 3.
- the present disclosure does not limit the number of image data in the above “at least two image data to be used”; for example, it may include N image data, where N is a positive integer and N ≥ 2.
- the object region label corresponding to the n-th image data to be used is used to indicate the region occupied by each object in the n-th image data to be used, where n is a positive integer, n ≤ N.
- the present disclosure does not limit the method of obtaining the above-mentioned "object area label corresponding to the nth image data to be used".
- it can be implemented by any existing or future method that can perform object area determination processing on an image data (for example, manual labeling or the above-mentioned selective search algorithm).
- the present disclosure also provides a possible implementation method of the above-mentioned "object area label corresponding to the nth image data to be used".
- the determination process of the "object area label corresponding to the nth image data to be used" can be specifically: according to the enhancement parameter, the object area label corresponding to the image data to be processed is data enhanced to obtain the object area label corresponding to the nth image data to be used, so that the "object area label corresponding to the nth image data to be used" can represent the area occupied by each object in the nth image data to be used.
- the present disclosure does not limit the determination process of the information "enhancement parameters used when generating the nth image data to be used" in the above paragraph, for example, it can be determined randomly or preset.
- Based on the relevant content of step 123 above, it can be seen that after the above image data to be processed (for example, image 4 shown in FIG. 3) and the object area label corresponding to the image data to be processed (for example, {box 1, box 2} shown in FIG. 3) are obtained, N different data enhancement processes can be performed on the image data to be processed, and each data enhancement result is respectively determined as one image data to be used (for example, image 5 or image 6 shown in FIG. 3); at the same time, the object area label corresponding to the image data to be processed changes accordingly with each data enhancement process, yielding the object area label corresponding to each image data to be used (for example, {box 3, box 4} or {box 5, box 6} shown in FIG. 3), so that the current round of the training process can continue based on these image data to be used and their corresponding object area labels.
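- A minimal sketch of step 123 for a single enhancement is given below: a horizontal flip applied to the image together with the corresponding change to its object area label. Real pipelines would also cover crops, resizes, color jitter, and so on; the stand-in image and boxes are illustrative.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """image: HxWxC array; boxes: list of (x, y, w, h) region labels."""
    flipped = image[:, ::-1].copy()
    width = image.shape[1]
    # Each box's x-origin mirrors around the image width; y, w, h are unchanged.
    new_boxes = [(width - x - w, y, w, h) for (x, y, w, h) in boxes]
    return flipped, new_boxes

image_4 = np.zeros((480, 640, 3), dtype=np.uint8)      # stand-in for image 4
boxes_4 = [(50, 60, 100, 80), (300, 200, 120, 90)]     # stand-in for {box 1, box 2}
image_5, boxes_5 = hflip_with_boxes(image_4, boxes_4)  # one image data to be used
```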
- Step 124 Determine object region prediction results corresponding to at least two image data to be used by using the online model and the momentum model.
- the object region prediction result corresponding to the n-th image data to be used refers to the result determined by the model performing object region prediction processing on the n-th image data to be used, where n is a positive integer, n ≤ N.
- the present disclosure does not limit the implementation methods of the above object region prediction results.
- the above “object region prediction results corresponding to the nth image data to be used” may include at least one prediction region representation data (for example, each object box in box set 1 shown in FIG. 3) and the prediction region features corresponding to the at least one prediction region representation data (for example, each box feature in box feature set 1 shown in FIG. 3).
- the e-th prediction region representation data is used to represent the area occupied by the e-th object in the n-th image data to be used in the n-th image data to be used.
- the prediction region features corresponding to the e-th prediction region representation data are used to characterize the features presented by the e-th prediction region representation data.
- e is a positive integer, e ≤ E, and E represents the number of data in the “at least one prediction region representation data”.
- step 124 may specifically include the following steps 1241 - 1242 .
- Step 1241 using the above online model, determine the object region prediction result corresponding to each third image data.
- the third image data refers to the image data to be used for object region prediction processing by the online model above.
- the third image data may refer to the image 5 shown in FIG. 3 .
- the object region prediction result corresponding to the j-th third image data refers to the result determined by the above online model performing object region prediction processing on the j-th third image data.
- j is a positive integer, j ≤ J, and J represents the number of image data in the above “at least one third image data”.
- the present disclosure does not limit the determination process of the above "object area prediction result corresponding to the j-th third image data".
- it can specifically be: inputting the j-th third image data (for example, image 5 shown in FIG. 3) into the above online model to obtain the object area prediction result corresponding to the j-th third image data output by the online model (for example, box set 1 and box feature set 1 shown in FIG. 3).
- Step 1242 using the above momentum model, determine the object region prediction result corresponding to each fourth image data.
- the fourth image data refers to the image data to be used that needs to be processed by the momentum model for object region prediction.
- the fourth image data may refer to the image 6 shown in FIG. 3 .
- the object region prediction result corresponding to the m-th fourth image data refers to the result determined by the above momentum model performing object region prediction processing on the m-th fourth image data.
- m is a positive integer, m ≤ M, and M represents the number of image data in the above “at least one fourth image data”.
- the present disclosure does not limit the determination process of the above-mentioned "object area prediction result corresponding to the m-th fourth image data".
- it may specifically be: inputting the m-th fourth image data (for example, image 6 shown in FIG. 3) into the above momentum model to obtain the object area prediction result corresponding to the m-th fourth image data output by the momentum model (for example, box set 2 and box feature set 2 shown in FIG. 3).
- These image data to be used can be divided into two parts: one part of the image data (for example, image 5 shown in FIG. 3) is sent to the above online model to obtain the prediction result that the online model outputs for it, while the other part (for example, image 6 shown in FIG. 3) is sent to the above momentum model to obtain the prediction result that the momentum model outputs for it, so that object area prediction processing on these image data to be used can be achieved with the help of the online model and the momentum model.
- the present disclosure does not limit the determination process of the image data (that is, the J third image data) sent to the above online model. For example, it can be specifically as follows: after obtaining the above "at least two image data to be used", J image data are randomly selected from these image data to be used, and these selected image data are all regarded as third image data, so that these selected image data can be subsequently sent to the online model.
- Similarly, the present disclosure does not limit the determination process of the image data sent to the above momentum model (that is, the M fourth image data). For example, it can specifically be as follows: after the J image data are randomly selected from these image data to be used, the remaining image data are all regarded as fourth image data and are sent to the momentum model, as sketched below.
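- A minimal sketch of this routing, assuming the online and momentum models from the earlier sketches and stand-in tensors for the views; the split J = 1 is illustrative.

```python
import random
import torch

views = [torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)]  # stand-in views
random.shuffle(views)
J = 1                                       # illustrative; M = len(views) - J
third_images, fourth_images = views[:J], views[J:]

online_preds = [online_model(v) for v in third_images]    # gradients flow here
with torch.no_grad():                                     # momentum branch only predicts
    momentum_preds = [momentum_model(v) for v in fourth_images]
```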
- each image data to be used can be sent to its corresponding model (for example, an online model or a momentum model) respectively, so that the model can obtain the prediction result predicted for the image data to be used (for example, the object area prediction result corresponding to the image data to be used), so that the model prediction performance of the online model can be determined with the help of these prediction results.
- Step 125 Determine whether the preset stop condition is reached; if so, execute the following step 127; if not, execute the following step 126.
- the preset stop condition refers to the training stop condition required to be referred to when pre-training the other parts or all of the network except the backbone network in the image processing model for the above target application field; and the present disclosure does not limit the preset stop condition, for example, it may include: the number of iterations of the training process reaches a preset number threshold.
- the preset stop condition may include: the model loss of the above online model is lower than a preset loss threshold.
- the preset stop condition may include: the rate of change of the model loss of the online model is lower than a preset rate of change threshold (that is, the online model tends to converge).
- model loss of the online model is used to characterize the model prediction performance of the online model; and the present disclosure does not limit the determination process of the “model loss of the online model”.
- the present disclosure also provides a possible implementation method of the above-mentioned "model loss of the online model”.
- the "model loss of the online model” determination process can specifically include the following steps 21-23.
- Step 21 Determine the regression loss corresponding to the online model according to the object area prediction result corresponding to the at least one third image data and the object area label corresponding to the at least one third image data.
- the object region label corresponding to the j-th third image data is used to indicate the region occupied by each object in the j-th third image data.
- j is a positive integer, j ≤ J.
- regression loss corresponding to the online model is used to represent the regression characteristics of the online model under the regression task during the current round of training.
- the regression task is: after an image data is input into the online model, the object area prediction result output by the online model for the image data should be as consistent as possible with the object area label corresponding to the image data.
- the “regression loss corresponding to the online model” can be the regression loss shown in FIG3.
- the present disclosure does not limit the determination process of the above “regression loss corresponding to the online model”.
- the determination process of the “regression loss corresponding to the online model” can specifically be as follows: according to a preset regression loss calculation formula, a regression loss calculation process is performed on the at least one prediction area representation data corresponding to the above at least one third image data and the object area labels corresponding to the at least one third image data, to obtain the regression loss corresponding to the online model, so that the regression loss can represent the regression characteristics of the online model.
- the present disclosure does not limit the implementation method of the above regression loss calculation formula.
- it can be implemented by any existing or future regression loss calculation method.
- it can be implemented by a regression loss calculation method set according to the actual application scenario.
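- As one common choice under these constraints, the sketch below computes a smooth-L1 regression loss between prediction area representation data and the corresponding object area labels; the concrete formula and the example boxes are illustrative, since the disclosure leaves the calculation method open.

```python
import torch
import torch.nn.functional as F

def regression_loss(pred_boxes, label_boxes):
    """pred_boxes, label_boxes: tensors of shape (num_regions, 4)."""
    return F.smooth_l1_loss(pred_boxes, label_boxes)

pred = torch.tensor([[48.0, 58.0, 102.0, 82.0]])   # predicted region for one object
label = torch.tensor([[50.0, 60.0, 100.0, 80.0]])  # its object area label
loss = regression_loss(pred, label)
```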
- Step 22 Determine the contrast loss corresponding to the online model according to the object region prediction result corresponding to the at least one third image data and the object region prediction result corresponding to the at least one fourth image data.
- the contrast loss corresponding to the online model (for example, the contrast loss shown in FIG3 ) is used to represent the classification characteristics of the online model under the classification task in the current round of training.
- the classification task is a self-supervised classification task; and the classification task can be implemented with the help of contrastive learning.
- the present disclosure does not limit the determination process of the above “contrast loss corresponding to the online model”.
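- As one illustrative realization, the sketch below computes an InfoNCE-style contrast loss for a single to-be-used region feature, given one positive and several negative region features (for example, drawn from the momentum branch); the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def contrast_loss(query, positive, negatives, tau=0.07):
    """query, positive: (dim,) region features; negatives: (num_neg, dim)."""
    q = F.normalize(query, dim=0)
    pos = F.normalize(positive, dim=0)
    negs = F.normalize(negatives, dim=1)
    logits = torch.cat([(q @ pos).view(1), negs @ q]) / tau  # positive at index 0
    labels = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), labels)

loss = contrast_loss(torch.rand(256), torch.rand(256), torch.rand(8, 256))
```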
- In a possible implementation, where the above object area label includes at least one target area representation data, and the above object area prediction result includes at least one prediction area feature (for example, box feature set 1 or box feature set 2 shown in FIG. 3) and the prediction area representation data corresponding to the at least one prediction area feature (for example, box set 1 or box set 2 shown in FIG. 3), the determination process of the “contrast loss corresponding to the online model” may specifically include the following steps 31-33.
- Step 31 Obtain a correspondence between at least one target region representation data corresponding to the jth third image data and at least one target region representation data corresponding to the mth fourth image data, wherein j is a positive integer, j ≤ J, and m is a positive integer, m ≤ M.
- the kth target region representation data corresponding to the jth third image data is used to represent the region occupied by the kth object in the jth third image data, so that the “kth target region representation data corresponding to the jth third image data” can represent the region label of the kth object.
- k is a positive integer, k ≤ K; K is a positive integer;
- K represents the number of data in the above "at least one target region representation data corresponding to the jth third image data".
- the present disclosure is not limited to the above “at least one target area representation data corresponding to the j-th third image data”.
- the “at least one target area representation data corresponding to the j-th third image data” may include frame 3 and frame 4 shown in Figure 3.
- the h-th target region representation data corresponding to the m-th fourth image data is used to represent the region occupied by the h-th object in the m-th fourth image data, so that the “h-th target region representation data corresponding to the m-th fourth image data” can represent the region label of the h-th object.
- h is a positive integer, h ≤ H; H is a positive integer;
- H represents the number of data in the above "at least one target region representation data corresponding to the m-th fourth image data".
- the present disclosure is not limited to the above “at least one target area representation data corresponding to the mth fourth image data”.
- the “at least one target area representation data corresponding to the mth fourth image data” may include frame 5 and frame 6 shown in Figure 3.
- the present disclosure does not limit the implementation method of the above step 31.
- it can specifically be: reading the correspondence between at least one target area representation data corresponding to the j-th third image data and at least one target area representation data corresponding to the m-th fourth image data from a preset storage space.
- the above step 31 may specifically include the following steps 311 to 313.
- Step 311 obtaining a correspondence between at least one target region representation data corresponding to the j-th third image data and at least one target region representation data corresponding to the image data to be processed as a first correspondence.
- the d-th target region representation data corresponding to the image data to be processed is used to represent the region occupied by the d-th object in the image data to be processed, so that the "d-th target region representation data corresponding to the image data to be processed" can represent the region label corresponding to the d-th object.
- d is a positive integer, d ≤ D; D is a positive integer;
- D represents the number of data in the above "at least one target region representation data corresponding to the image data to be processed".
- the present disclosure is not limited to the above “at least one target area representation data corresponding to the image data to be processed”.
- the “at least one target area representation data corresponding to the image data to be processed” may include frame 1 and frame 2 shown in Figure 3.
- the present disclosure does not limit the implementation of step 311.
- it can specifically be as follows: if the above "k-th target region representation data corresponding to the j-th third image data" is obtained by transforming the above "d-th target region representation data corresponding to the image data to be processed" (for example, through the data augmentation applied to that image), it can be determined that there is a correspondence between the two; otherwise, it can be determined that there is no correspondence between the two.
- k is a positive integer, k ≤ K; d is a positive integer, d ≤ D.
- Step 312 Obtain a correspondence between at least one target region representation data corresponding to the m-th fourth image data and at least one target region representation data corresponding to the image data to be processed as a second correspondence.
- the implementation of step 312 is similar to that of step 311 above; for example, if the above "h-th target region representation data corresponding to the m-th fourth image data" is obtained by transforming the above "d-th target region representation data corresponding to the image data to be processed", it can be determined that there is a correspondence between the two; otherwise, it can be determined that there is no correspondence between the two.
- h is a positive integer, h ≤ H; d is a positive integer, d ≤ D.
- Step 313 Determine a correspondence between at least one target region representation data corresponding to the j-th third image data and at least one target region representation data corresponding to the m-th fourth image data according to the first correspondence and the second correspondence.
- the present disclosure does not limit the implementation method of the above step 313.
- the step 313 can specifically be as follows: if the above first correspondence indicates that there is a correspondence between the above "kth target area representation data corresponding to the jth third image data" and the above "dth target area representation data corresponding to the image data to be processed", and the above second correspondence indicates that there is a correspondence between the above "hth target area representation data corresponding to the mth fourth image data" and the same "dth target area representation data corresponding to the image data to be processed", then it can be determined that the "kth target area representation data corresponding to the jth third image data" and the "hth target area representation data corresponding to the mth fourth image data" correspond to the same object in the image data to be processed, so it can be determined that there is a correspondence between the two.
- Based on the relevant content of step 31 above, it can be known that after obtaining the above at least one third image data and at least one fourth image data, the correspondence between the at least one target area representation data corresponding to each third image data (for example, box 3 and box 4 shown in Figure 3) and the at least one target area representation data corresponding to each fourth image data (for example, box 5 and box 6 shown in Figure 3) can be determined, so that the contrast loss between the prediction results of the at least one third image data and the prediction results of the at least one fourth image data can be determined based on that correspondence.
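- For illustration only, a minimal sketch of steps 311-313, under the assumption that each augmented view records, for every box label, the index of the original box it was transformed from (the src_ids_* names are hypothetical):

    def box_correspondence(src_ids_view_a, src_ids_view_b):
        # First correspondence: view-A box k traces back to original box src_a;
        # second correspondence: view-B box h traces back to original box src_b.
        # Boxes in the two views correspond iff they trace back to the same
        # original object (step 313).
        pairs = []
        for k, src_a in enumerate(src_ids_view_a):
            for h, src_b in enumerate(src_ids_view_b):
                if src_a == src_b:
                    pairs.append((k, h))
        return pairs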
- Step 32 According to the above correspondence, determine the positive samples and negative samples of each prediction region feature corresponding to the above at least one third image data from the at least one prediction region feature corresponding to the above at least one fourth image data.
- the h-th predicted region representation data corresponding to the m-th fourth image data is used to represent the region predicted for the h-th object in the m-th fourth image data.
- h is a positive integer, h ≤ H.
- the h-th prediction region feature corresponding to the m-th fourth image data is used to characterize the features possessed by the above “h-th prediction region characterization data corresponding to the m-th fourth image data”.
- h is a positive integer, h ≤ H.
- the k-th predicted region representation data corresponding to the j-th third image data is used to represent the region predicted for the k-th object in the j-th third image data.
- k is a positive integer, k ≤ K.
- the k-th prediction region feature corresponding to the j-th third image data is used to characterize the features possessed by the above “k-th prediction region characterization data corresponding to the j-th third image data”.
- k is a positive integer, k ≤ K.
- the positive sample of the kth prediction region feature corresponding to the jth third image data refers to a prediction region feature that exists in the object region prediction result of any fourth image data and whose represented prediction region has a corresponding relationship with the prediction region represented by the kth prediction region feature.
- k is a positive integer, k ≤ K.
- the negative sample of the kth prediction region feature corresponding to the jth third image data refers to a prediction region feature that exists in the object region prediction result of any fourth image data and whose represented prediction region has no corresponding relationship with the prediction region represented by the kth prediction region feature.
- k is a positive integer, k ≤ K.
- the present disclosure does not limit the implementation of the above step 32.
- the step 32 may specifically include the following steps 321 and 322.
- Step 321 If the above correspondence relationship indicates that there is a correspondence between the above “h-th target region representation data corresponding to the m-th fourth image data” and the above “k-th target region representation data corresponding to the j-th third image data”, then the “h-th prediction region feature corresponding to the m-th fourth image data” having a correspondence relationship with the “h-th target region representation data corresponding to the m-th fourth image data” is determined as a positive sample of the above “k-th prediction region feature corresponding to the j-th third image data”.
- h is a positive integer, h ≤ H;
- k is a positive integer, k ≤ K.
- when the above correspondence relationship indicates that there is a correspondence between the above "h-th target region representation data corresponding to the m-th fourth image data" and the above "k-th target region representation data corresponding to the j-th third image data",
- it can be determined that the prediction result corresponding to the "h-th target region representation data corresponding to the m-th fourth image data" and the prediction result corresponding to the "k-th target region representation data corresponding to the j-th third image data" are both predicted for the same object, and thus the former prediction result is a positive sample of the latter prediction result. Therefore, the prediction region feature in the former prediction result (that is, the above "h-th prediction region feature corresponding to the m-th fourth image data") can be determined as a positive sample of the prediction region feature in the latter prediction result (that is, the above "k-th prediction region feature corresponding to the j-th third image data").
- Step 322 If the above correspondence indicates that there is no correspondence between the above “h-th target region representation data corresponding to the m-th fourth image data” and the above “k-th target region representation data corresponding to the j-th third image data”, then the “h-th prediction region feature corresponding to the m-th fourth image data” having a correspondence with the “h-th target region representation data corresponding to the m-th fourth image data” is determined as a negative sample of the above “k-th prediction region feature corresponding to the j-th third image data”.
- h is a positive integer, h ≤ H;
- k is a positive integer, k ≤ K.
- when the above correspondence relationship indicates that there is no correspondence between the above "h-th target region representation data corresponding to the m-th fourth image data" and the above "k-th target region representation data corresponding to the j-th third image data",
- it can be determined that the "h-th target region representation data corresponding to the m-th fourth image data" and the "k-th target region representation data corresponding to the j-th third image data" correspond to different objects in the above image data to be processed,
- so the prediction result corresponding to the "h-th target region representation data corresponding to the m-th fourth image data" and the prediction result corresponding to the "k-th target region representation data corresponding to the j-th third image data" are predicted for different objects, and thus the former prediction result is a negative sample of the latter prediction result. Therefore, the prediction region feature in the former prediction result (that is, the above "h-th prediction region feature corresponding to the m-th fourth image data") can be determined as a negative sample of the prediction region feature in the latter prediction result (that is, the above "k-th prediction region feature corresponding to the j-th third image data").
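- A minimal sketch of steps 321-322, assuming the correspondence pairs from the earlier sketch: view-B features with a correspondence become positives of the view-A feature, all remaining view-B features become negatives.

    def split_pos_neg(pairs, num_features_view_b):
        # pairs: (k, h) index pairs having a correspondence (step 321 -> positive);
        # every remaining view-B feature index is a negative for k (step 322).
        positives = {}
        for k, h in pairs:
            positives.setdefault(k, []).append(h)
        negatives = {
            k: [h for h in range(num_features_view_b) if h not in pos]
            for k, pos in positives.items()
        }
        return positives, negatives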
- for any third image data, when the at least one prediction region feature corresponding to the third image data includes a region feature to be used (for example, the above "kth prediction region feature corresponding to the jth third image data"), where the region feature to be used represents any prediction region feature corresponding to the third image data, the positive and negative samples of the region feature to be used respectively satisfy the conditions shown in (1) and (2) below.
- (1) the target region representation data corresponding to the positive sample of the region feature to be used refers to the region label of the object corresponding to the positive sample in the fourth image data above.
- when the region feature to be used is the above "kth predicted region feature corresponding to the jth third image data"
- and the positive sample of the region feature to be used is the above "h-th prediction region feature corresponding to the m-th fourth image data",
- the "h-th target region representation data corresponding to the m-th fourth image data" is the "target region representation data corresponding to the positive sample of the region feature to be used".
- the present disclosure does not limit the determination process of the above "target region representation data corresponding to the positive sample of the region feature to be used". For example, it can specifically be: according to the size of the overlapping area between the predicted region representation data corresponding to the positive sample and each target region representation data corresponding to the fourth image data to which the positive sample belongs, determine the target region representation data corresponding to the positive sample, so that the overlapping area between the predicted region representation data corresponding to the positive sample and the target region representation data corresponding to the positive sample is maximized.
- the “prediction region representation data corresponding to the positive sample” refers to the regional prediction result of the object corresponding to the positive sample in the fourth image data above (for example, the “h-th prediction region representation data corresponding to the m-th fourth image data” above).
- the target region representation data corresponding to the region feature to be used refers to the region label of the object corresponding to the region feature to be used in the above third image data.
- the region feature to be used is the above “kth prediction region feature corresponding to the jth third image data”
- the “target region representation data corresponding to the region feature to be used” refers to the above “kth target region representation data corresponding to the jth third image data”.
- the acquisition process of the above "target region representation data corresponding to the region feature to be used" can specifically be: according to the size of the overlapping area between the predicted region representation data corresponding to the region feature to be used and each target region representation data corresponding to the third image data to which the region feature to be used belongs, determine the target region representation data corresponding to the region feature to be used, so that the overlapping area between the two is maximized.
- the "predicted region representation data corresponding to the region feature to be used" refers to the regional prediction result of the object corresponding to the region feature to be used in the third image data above.
- (2) the target region representation data corresponding to the negative sample of the region feature to be used refers to the region label of the object corresponding to the negative sample in the fourth image data above.
- the region feature to be used is the above “kth predicted region feature corresponding to the jth third image data”
- the negative sample of the region feature to be used includes the above “hth predicted region feature corresponding to the mth fourth image data”
- the “target region representation data corresponding to the negative sample of the region feature to be used” refers to the above “hth target region representation data corresponding to the mth fourth image data”.
- the acquisition process of the "target region representation data corresponding to the negative sample of the region feature to be used" can specifically be: determining the target region representation data corresponding to the negative sample according to the size of the overlapping area between the predicted region representation data corresponding to the negative sample and each target region representation data corresponding to the fourth image data to which the negative sample belongs, so that the overlapping area between the predicted region representation data corresponding to the negative sample and the target region representation data corresponding to the negative sample is maximized.
- the "prediction area representation data corresponding to the negative sample” refers to the area prediction result of the object corresponding to the negative sample in the fourth image data above.
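- The overlap-maximization rule in conditions (1)-(2) above can be sketched as follows; the box tensors and the intersection-area criterion are assumptions for illustration.

    import torch

    def assign_by_max_overlap(pred_boxes, target_boxes):
        # pred_boxes: (P, 4), target_boxes: (T, 4), both (x1, y1, x2, y2).
        # Returns, for each predicted region, the index of the target region
        # representation data whose overlapping area with it is largest.
        x1 = torch.max(pred_boxes[:, None, 0], target_boxes[None, :, 0])
        y1 = torch.max(pred_boxes[:, None, 1], target_boxes[None, :, 1])
        x2 = torch.min(pred_boxes[:, None, 2], target_boxes[None, :, 2])
        y2 = torch.min(pred_boxes[:, None, 3], target_boxes[None, :, 3])
        overlap = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)  # (P, T) areas
        return overlap.argmax(dim=1)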
- Based on the relevant content of step 32 above, it can be known that after obtaining the correspondence between the at least one target region representation data corresponding to the j-th third image data and the at least one target region representation data corresponding to the m-th fourth image data, the prediction results (for example, predicted region features) of the target region representation data corresponding to the same object in the two image data are determined as positive samples of each other, and the prediction results of the target region representation data corresponding to different objects in the two image data are determined as negative samples, so that the contrast loss between the prediction results of the two image data can be determined with the help of these positive samples and negative samples.
- j is a positive integer, j ≤ J;
- m is a positive integer, m ≤ M.
- Step 33 Determine the contrast loss corresponding to the above online model based on at least one prediction region feature corresponding to the above at least one third image data, and the positive samples and negative samples of each prediction region feature corresponding to the at least one third image data.
- the present disclosure does not limit the implementation of step 33.
- it can be implemented by using any existing or future method for determining contrast loss.
- the contrast loss corresponding to the online model can be determined with the help of a contrast learning method, so that the contrast loss can represent the classification performance of the online model.
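- Since the disclosure leaves the exact contrastive formulation open, one widely used option is an InfoNCE-style loss over the positive and negative region features; the sketch below assumes L2-normalized feature vectors and at least one positive per anchor.

    import torch
    import torch.nn.functional as F

    def info_nce(anchor, positives, negatives, tau=0.1):
        # anchor: (D,) prediction region feature from the third image data;
        # positives: (P, D) and negatives: (N, D) features from the fourth
        # image data, selected as in steps 321-322. tau is an assumed
        # temperature hyperparameter.
        logits = torch.cat([positives @ anchor, negatives @ anchor]) / tau
        log_prob = F.log_softmax(logits, dim=0)
        return -log_prob[: positives.shape[0]].mean()  # pull positives, push negatives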
- Step 23 Determine the model loss of the online model based on the regression loss and the contrast loss.
- the present disclosure does not limit the implementation method of step 23.
- it can be implemented by any existing or future method that can integrate the two losses (for example, weighted summation, aggregation, etc.).
- the regression loss and contrast loss of the online model can be determined with the help of these object area prediction results respectively; and then based on these two losses, the model loss of the online model is determined so that the model loss can better represent the prediction performance of the online model (for example, the prediction performance of the area occupied by the object, the classification performance, etc.).
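- As a sketch of the weighted-summation option named above (the weights are illustrative assumptions):

    def model_loss(reg_loss, con_loss, w_reg=1.0, w_con=1.0):
        # Weighted summation of the regression loss and the contrast loss.
        return w_reg * reg_loss + w_con * con_loss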
- Based on the relevant content of step 125 above, it can be known that, in a possible implementation, for the current round of training, after obtaining the object area prediction results corresponding to the at least one third image data output by the above online model and the object area prediction results corresponding to the at least one fourth image data output by the above momentum model, these object area prediction results can be used to determine the model loss of the online model, so that the model loss can represent the prediction performance of the online model; it is then determined whether the model loss reaches the preset loss condition.
- the preset loss condition is set in advance; for example, it can specifically include: the model loss is lower than a preset loss threshold; or, the change rate of the model loss is lower than a preset change rate threshold.
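- The two example conditions can be checked as sketched below; the thresholds are illustrative assumptions.

    def reached_loss_condition(loss, prev_loss, loss_thr=0.05, rate_thr=1e-3):
        # Condition A: the model loss is below a preset loss threshold.
        below_threshold = loss < loss_thr
        # Condition B: the change rate of the model loss is below a threshold.
        slow_change = (prev_loss is not None
                       and abs(prev_loss - loss) / max(abs(prev_loss), 1e-12) < rate_thr)
        return below_threshold or slow_change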
- Step 126 When it is determined that the preset stop condition is not met, update the online model and the momentum model according to the object area prediction results corresponding to the at least two image data to be used and the object area labels corresponding to the at least two image data to be used, and continue to execute the above step 121 and its subsequent steps.
- the present disclosure does not limit the updating process of the above online model.
- the updating process of the online model may include the following steps 41 to 43.
- Step 41 Determine the regression loss corresponding to the above online model according to the object region prediction result corresponding to the at least one third image data and the object region label corresponding to the at least one third image data.
- the relevant content of step 41 can be found in step 21 above, and for the sake of brevity, it will not be repeated here.
- Step 42 Determine the contrast loss corresponding to the above online model according to the object region prediction result corresponding to the above at least one third image data and the object region prediction result corresponding to the above at least one fourth image data.
- the relevant content of step 42 can be found in step 22 above, and for the sake of brevity, it will not be repeated here.
- Step 43 Update the above online model according to the above regression loss and the above contrast loss.
- step 43 may specifically be: updating the network parameters of the first processing network in the online model according to the above regression loss and the above contrast loss, so as to achieve the purpose of fixing the network parameters of the backbone network and updating the network parameters of other networks in the online model except the backbone network.
- the present disclosure does not limit the updating method of the "network parameters" in the previous paragraph.
- it can be implemented by any existing or future method that can update the network parameters based on model loss (for example, gradient update, etc.).
- the model loss of the above online model can be determined based on the object area prediction results corresponding to the above at least two image data to be used and the object area labels corresponding to the at least two image data to be used; then, using the model loss, the network parameters of all networks in the online model other than the backbone network are updated by gradient descent to obtain an updated online model, so that the network parameters of the backbone network in the updated online model are consistent with the network parameters of the backbone network in the online model before the update, thereby achieving the purpose of updating only the networks in the online model other than the backbone network.
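- A minimal PyTorch-style sketch of this update, under the assumption that online_model exposes a backbone submodule (the optimizer choice is illustrative):

    import torch

    # Fix the backbone and optimize only the remaining networks
    # (e.g., the detection head network).
    for p in online_model.backbone.parameters():
        p.requires_grad = False
    trainable = [p for p in online_model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01)

    loss = reg_loss + con_loss          # model loss from steps 41-42
    optimizer.zero_grad()
    loss.backward()                     # gradients reach only non-backbone parameters
    optimizer.step()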
- the present disclosure does not limit the updating process of the above momentum model.
- it can be specifically: using the updated online model to update the momentum model.
- the exponential moving average of the updated online model (for example, the result shown in formula (1) above) can be determined as the updated momentum model.
- the present disclosure also provides a possible implementation of the above step of "using the updated online model to update the momentum model", which can specifically be: according to the network parameters of the first processing network in the updated online model, update the network parameters of the first processing network in the momentum model (for example, determine the exponential moving average of the network parameters of the first processing network in the updated online model as the network parameters of the first processing network in the updated momentum model), so as to achieve the purpose of updating the networks in the momentum model other than the backbone network.
- the network parameters of the first processing network in the momentum model before the update and the network parameters of the first processing network in the online model after the update can be weighted and summed to obtain the network parameters of the first processing network in the updated momentum model.
- the relevant content of the weights involved in the weighted summation process can be found in the relevant content of the weights involved in formula (1) above, and for the sake of brevity, it will not be repeated here.
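- The weighted summation described above is an exponential moving average; a sketch follows, where the smoothing coefficient alpha is an assumption.

    import torch

    @torch.no_grad()
    def ema_update(momentum_model, online_model, alpha=0.99):
        # new_momentum_param = alpha * old_momentum_param + (1 - alpha) * online_param.
        # Backbone parameters are identical in both models and remain unchanged.
        for m_p, o_p in zip(momentum_model.parameters(), online_model.parameters()):
            m_p.mul_(alpha).add_(o_p, alpha=1.0 - alpha)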
- Based on the relevant content of step 126 above, it can be known that, for the current round of training, when it is determined that the preset stop condition has not been reached, it can be determined that the prediction performance of the above online model still needs further improvement. The online model and momentum model can therefore be updated according to the object area prediction results corresponding to the above at least two image data to be used and the object area labels corresponding to the at least two image data to be used, so that the updated online model and updated momentum model have better prediction performance; then, using the updated models, execution returns to step 121 and its subsequent steps to start the next round of training, iterating in this way until the preset stop condition is reached.
- Step 127 When it is determined that the preset stop condition is reached, the model to be used is determined according to the above online model.
- the model to be used can be determined directly based on the online model (for example, the online model used in the last round of training process can be directly determined as the model to be used), so that the model to be used has better prediction performance, thereby achieving the purpose of pre-training the image processing model for the target application field.
- a part of these enhanced images is sent to the online model, and the other part is sent to the momentum model, to obtain the model prediction results of these enhanced images; then, the model loss of the online model is determined according to the model prediction results of these enhanced images and the target boxes of these enhanced images; subsequently, the model loss is used to perform a gradient update on the network parameters of the networks in the online model other than the backbone network, and the momentum model is updated using the exponential moving average of the updated online model, so that the next round of training can continue based on the updated online model and momentum model.
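- Pulling the pieces together, one training round can be sketched as below; augment, step_online, contrast_loss, and the other helpers are the hypothetical functions from the sketches above, not names from the disclosure.

    prev_loss = None
    for _ in range(max_rounds):
        views, box_labels = augment(image)            # at least two enhanced views
        pred_online = online_model(views[0])          # third image data
        with torch.no_grad():
            pred_momentum = momentum_model(views[1])  # fourth image data
        loss = (regression_loss(pred_online, box_labels[0])
                + contrast_loss(pred_online, pred_momentum))
        if reached_loss_condition(loss.item(), prev_loss):
            break                                     # preset stop condition reached
        step_online(loss)                             # gradient update, backbone frozen
        ema_update(momentum_model, online_model)
        prev_loss = loss.item()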
- the present disclosure can determine the classification characteristics and regression characteristics presented by the above online model on these enhanced images based on the model prediction results of these enhanced images and the target boxes of these enhanced images. Therefore, the present disclosure can construct a self-supervised classification task based on the classification characteristics; in this classification task, the prediction results corresponding to the same target box can usually be regarded as positive samples of each other, and the prediction results corresponding to different target boxes can be regarded as negative samples, so as to construct contrastive learning.
- the present disclosure can also construct a regression task, the purpose of which is to make the coordinates of the prediction box predicted for the enhanced image consistent with the target box of the enhanced image, so as to achieve the regression purpose.
- the present disclosure can realize unsupervised pre-training of the networks of a target detection model other than the Backbone,
- so that, when the above Backbone is pre-trained in a self-supervised manner, the purpose of relatively complete pre-training of all networks of any target detection model can be achieved in an unsupervised manner.
- because both the backbone network in the above-mentioned image processing model (for example, the target detection model) and the networks in the image processing model other than the backbone network (for example, the detection head network) are pre-trained,
- all networks in the final pre-trained model have relatively good data processing performance.
- This can effectively avoid the adverse effects caused by pre-training only the backbone network, thereby effectively improving the image processing effect (for example, target detection effect) of the finally constructed image processing model.
- for the model building method, it not only utilizes single-object image data to participate in model pre-training, but also utilizes multi-object image data to participate in the model pre-training, so that the final pre-trained model has better image processing functions for multi-object image data.
- This can effectively avoid the adverse effects caused by using only single-object image data for model pre-training processing, thereby effectively improving the image processing effect (for example, target detection effect) of the final constructed image processing model.
- the model building method provided in the present disclosure not only focuses on classification tasks, but also focuses on regression tasks, so that the final pre-trained model has better image processing performance. This can effectively avoid the adverse effects caused by focusing only on classification tasks during pre-training, thereby effectively improving the image processing effect (for example, target detection effect) of the finally constructed image processing model.
- the present disclosure also provides another model building method, which is described below in conjunction with the accompanying drawings for ease of understanding.
- the model building method may also include S104 below.
- the execution time of S104 is later than the execution time of S103;
- Figure 4 is a flow chart of another model building method provided by the present disclosure.
- S104 Using a preset image data set, fine-tune the model to be used to obtain an image processing model; the image processing model includes a target detection model, a semantic segmentation model, or a key point detection model.
- the preset image data set refers to an image data set used when fine-tuning the image processing model in the above target application field; and each image data in the preset image data set belongs to multi-object image data.
- the present disclosure does not limit the implementation methods of the above preset image datasets.
- the preset image dataset refers to the image dataset used when fine-tuning the target detection model (for example, a multi-object image dataset).
- the preset image dataset refers to the image dataset used when fine-tuning the image segmentation model.
- the preset image dataset refers to the image dataset used when fine-tuning the key point detection model.
- the present disclosure does not limit the implementation of the above S104.
- it can be implemented by using any existing or future method suitable for fine-tuning the image processing model in the above target application field.
- the present disclosure does not limit the "image processing model" in S104 above.
- the image processing model is the target detection model.
- the image processing model is the image segmentation model.
- the image processing model is the key point detection model.
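- A generic fine-tuning sketch for S104 is given below; the data loader, loss function, and optimizer settings are assumptions, and any standard recipe for the chosen task would do.

    import torch

    optimizer = torch.optim.AdamW(model_to_be_used.parameters(), lr=1e-4)
    for images, targets in preset_data_loader:        # multi-object image data
        preds = model_to_be_used(images)
        loss = task_loss(preds, targets)              # detection / segmentation / keypoints
        optimizer.zero_grad()
        loss.backward()                               # all networks, backbone included
        optimizer.step()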
- the model building method can be applied to multiple image processing fields such as target detection, image segmentation or key point detection; and the model building method can specifically be: first, with the help of the two-stage model building method provided in the present disclosure (for example, the two-stage pre-training process shown in Figures 2-3), pre-train all networks in the image processing model for the target application field to obtain a pre-trained image processing model, so that all networks in the pre-trained image processing model have relatively good data processing performance; then, fine-tune the pre-trained image processing model to obtain a fine-tuned image processing model, so that the fine-tuned image processing model has better image processing performance in the target application field and can better complete the image processing tasks in that field (for example, target detection tasks, image segmentation tasks or key point detection tasks), which is conducive to improving the image processing effect in the target application field.
- the pre-training process and fine-tuning process involved in the model building method both use multi-object image data, so that the pre-training process and the fine-tuning process can reach consistency in image data, thereby effectively avoiding the adverse effects caused when there are differences in image data between the pre-training process and the fine-tuning process, thereby making the image processing model constructed based on the model building method have better image processing performance.
- the pre-training process and fine-tuning process involved in the model building method both need to train all networks in the image processing model, so that the pre-training process and the fine-tuning process can reach consistency in the training objects, thereby effectively avoiding the adverse effects caused when there are differences in the training objects between the pre-training process and the fine-tuning process, thereby making the image processing model constructed based on the model building method have better image processing performance.
- the pre-training process and the fine-tuning process involved in the model building method both focus on the classification task and the regression task at the same time,
- so that the two processes can reach consistency in the learning tasks, thereby effectively avoiding the adverse effects caused by differences in the learning tasks between the pre-training process and the fine-tuning process, thereby making the image processing model constructed based on the model building method have better image processing performance.
- the present disclosure does not limit the execution subject of the above model building method.
- the model building method provided in the embodiment of the present disclosure can be applied to a device with data processing function such as a terminal device or a server.
- the model building method provided in the embodiment of the present disclosure can also be implemented by means of the data communication process between the terminal device and the server.
- Figure 5 is a structural schematic diagram of a model building device provided in the embodiment of the present disclosure. It should be noted that for the technical details of the model building device provided in the embodiment of the present disclosure, please refer to the relevant content of the model building method above.
- the model building device 500 provided in the embodiment of the present disclosure includes:
- a first training unit 501 is used to train a model to be processed using a first data set to obtain a first model; the first data set includes at least one first image data; the first model includes a backbone network;
- a model building unit 502 is used to build a second model according to the backbone network in the first model;
- the second model includes the backbone network and a first processing network, and the first processing network refers to all or part of the networks in the second model except the backbone network;
- the second training unit 503 is used to train the second model using a second data set to obtain a model to be used;
- the model to be used includes the backbone network and the second processing network, the network parameters of the backbone network in the second model remain unchanged during the training process of the second model, and the second processing network refers to the training result of the first processing network in the second model;
- the second data set includes at least one second image data.
- the first processing network is used to process output data of the backbone network to obtain an output result of the second model.
- the first image data belongs to single-object image data
- At least two objects exist in the second image data.
- the model building device 500 further includes:
- An initialization unit used to initialize the online model and the momentum model using the second model
- the second training unit 503 is specifically used to determine the model to be used based on the second data set, the online model and the momentum model.
- the second training unit 503 includes:
- An image selection subunit configured to select image data to be processed from the at least one second image data
- a first acquisition subunit is used to acquire at least two image data to be used and object region labels corresponding to the at least two image data to be used; the image data to be used is determined based on the image data to be processed; the object region labels corresponding to the image data to be used are determined based on the object region labels corresponding to the image data to be processed;
- a first determining subunit is used to determine the object region prediction results corresponding to the at least two image data to be used by using the online model and the momentum model;
- a first updating subunit configured to update the online model and the momentum model according to the object region prediction results corresponding to the at least two image data to be used and the object region labels corresponding to the at least two image data to be used, and return to the image selection subunit to continue to perform the step of selecting the image data to be processed from the at least one second image data;
- the second determining subunit is used to determine the model to be used according to the online model when a preset stop condition is reached.
- the at least two image data to be used include at least one third image data and at least one fourth image data;
- the object region prediction result corresponding to the third image data is determined using the online model
- the object region prediction result corresponding to the fourth image data is determined using the momentum model.
- the first updating subunit includes:
- a third determining subunit configured to determine a regression loss corresponding to the online model according to a prediction result of an object region corresponding to the at least one third image data and an object region label corresponding to the at least one third image data;
- a fourth determining subunit configured to determine a contrast loss corresponding to the online model according to an object region prediction result corresponding to the at least one third image data and an object region prediction result corresponding to the at least one fourth image data;
- a second updating subunit used for updating the online model according to the regression loss and the contrast loss
- the third updating subunit is used to update the momentum model according to the updated online model.
- the second updating subunit is specifically used to: update the network parameters of the first processing network in the online model according to the regression loss and the contrast loss;
- the third updating subunit is specifically used to update the network parameters of the first processing network in the momentum model according to the updated network parameters of the first processing network in the online model.
- the third updating subunit is specifically used to perform weighted sum processing on the network parameters of the first processing network in the momentum model before updating and the network parameters of the first processing network in the online model after updating to obtain the network parameters of the first processing network in the updated momentum model.
- the object region label includes at least one target region representation data;
- the object region prediction result includes at least one prediction region feature;
- the first updating subunit further includes:
- a fifth determining subunit configured to determine, based on a correspondence between at least one target region representation data corresponding to the third image data and at least one target region representation data corresponding to the fourth image data, positive samples and negative samples of each prediction region feature corresponding to the at least one third image data from at least one prediction region feature corresponding to the at least one fourth image data;
- the fourth determination subunit is specifically used to determine the contrast loss corresponding to the online model based on at least one prediction region feature corresponding to the at least one third image data, and positive samples and negative samples of each prediction region feature corresponding to the at least one third image data.
- the object region prediction result further includes prediction region representation data corresponding to each of the prediction region features
- the at least one predicted region feature corresponding to the third image data includes a to-be-used region feature
- the target region representation data corresponding to the positive sample is determined according to the size of the overlapping region between the prediction region representation data corresponding to the positive sample and each target region representation data corresponding to the fourth image data to which the positive sample belongs;
- the target region representation data corresponding to the to-be-used region feature is determined according to the size of the overlapping region between the predicted region representation data corresponding to the to-be-used region feature and each target region representation data corresponding to the third image data to which the to-be-used region feature belongs;
- the target region representation data corresponding to the negative sample is determined according to the size of an overlapping region between the prediction region representation data corresponding to the negative sample and each target region representation data corresponding to the fourth image data to which the negative sample belongs.
- the process of acquiring the object region label corresponding to the image data to be processed includes: using a selective search algorithm to perform object region search processing on the image data to be processed to obtain the object region label corresponding to the image data to be processed;
- the process of acquiring the object area label corresponding to the image data to be processed includes: searching for the object area label corresponding to the image data to be processed from a pre-constructed mapping relationship; the mapping relationship includes the correspondence between each second image data and the object area label corresponding to each second image data; the object area label corresponding to the second image data is determined by performing object area search processing on the second image data using a selective search algorithm.
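- For reference, OpenCV's contrib module provides a selective search implementation; a sketch of generating object-region pseudo labels with it follows (the proposal cap is an assumption).

    import cv2  # requires opencv-contrib-python

    def pseudo_box_labels(image_bgr, top_n=50):
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(image_bgr)
        ss.switchToSelectiveSearchFast()
        rects = ss.process()        # proposals as (x, y, w, h)
        return rects[:top_n]        # keep the first top_n as object region labels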
- the output result of the second model is a target detection result, a semantic segmentation result, or a key point detection result.
- the first training unit 501 is specifically used to: perform fully-supervised training on the model to be processed using the first data set to obtain a first model;
- or, perform self-supervised training on the model to be processed using the first data set to obtain a first model.
- the model building device 500 further includes:
- the fine-tuning unit 504 is used to use a preset image data set to fine-tune the model to be used to obtain an image processing model;
- the image processing model includes a target detection model, a semantic segmentation model or a key point detection model.
- the first data set (for example, a large amount of single-object image data) is first used to train the model to be processed to obtain the first model, so that the backbone network in the first model has a better image feature extraction function, so as to realize the pre-training process of the backbone network in the machine learning model under a certain image processing field; then, according to the backbone network in the first model, a second model is constructed, so that the image processing function realized by the second model is consistent with the image processing function required to be realized by the machine learning model; then, the second data set (for example, some multi-object image data) is used to train the second model, and it is ensured that the network parameters of the backbone network in the second model remain unchanged during the training process for the second model, so that when the trained second model is determined as the model to be used, the backbone network in the model to be used is consistent with the backbone network in the first model.
- In addition, the second processing network in the model to be used refers to the training result of the first processing network in the second model, so that the purpose of pre-training the other networks in the machine learning model can be achieved under the premise of a fixed backbone network; a constructed image processing model (for example, a target detection model) can then be obtained by fine-tuning the model to be used, so that the image processing model has better image processing performance, thereby achieving the purpose of building machine learning models for these image processing fields.
- because both the backbone network in the above-mentioned image processing model (for example, the target detection model) and the networks in the image processing model other than the backbone network (for example, the detection head network) are pre-trained,
- all networks in the final pre-trained model have relatively good data processing performance.
- This can effectively avoid the adverse effects caused by pre-training only the backbone network, thereby effectively improving the image processing effect (for example, target detection effect) of the finally constructed image processing model.
- for the model building method, it not only utilizes single-object image data to participate in model pre-training, but also utilizes multi-object image data to participate in the model pre-training, so that the final pre-trained model has better image processing functions for multi-object image data.
- This can effectively avoid the adverse effects caused by using only single-object image data for model pre-training processing, thereby effectively improving the image processing effect (for example, target detection effect) of the final constructed image processing model.
- the model building method provided in the present disclosure not only focuses on classification tasks, but also focuses on regression tasks, so that the final pre-trained model has better image processing performance. This can effectively avoid the adverse effects caused by focusing only on classification tasks during pre-training, thereby effectively improving the image processing effect (for example, target detection effect) of the finally constructed image processing model.
- an embodiment of the present disclosure also provides an electronic device, which includes a processor and a memory: the memory is used to store instructions or computer programs; the processor is used to execute the instructions or computer programs in the memory, so that the electronic device executes any implementation of the model building method provided in the embodiment of the present disclosure.
- the terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
- the electronic device shown in FIG. 7 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present disclosure.
- the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 to a random access memory (RAM) 703.
- In the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored.
- the processing device 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704.
- An input/output (I/O) interface 705 is also connected to the bus 704.
- the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 708 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 709.
- the communication device 709 may allow the electronic device 700 to communicate with other devices wirelessly or by wire to exchange data.
- Although FIG. 7 shows the electronic device 700 with various devices, it should be understood that it is not required to implement or possess all the devices shown; more or fewer devices may be implemented or possessed instead.
- an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
- the computer program can be downloaded and installed from the network through the communication device 709, or installed from the storage device 708, or installed from the ROM 702.
- when the computer program is executed by the processing device 701, the above-mentioned functions defined in the method of the embodiment of the present disclosure are executed.
- the electronic device provided by the embodiment of the present disclosure and the method provided by the above embodiment belong to the same inventive concept.
- the technical details not fully described in this embodiment can be referred to the above embodiment, and this embodiment has the same beneficial effects as the above embodiment.
- the present disclosure also provides a computer-readable medium, in which instructions or computer programs are stored.
- when the instructions or computer programs are run on a device, the device executes any implementation of the model building method provided in the present disclosure.
- the computer-readable medium disclosed above may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
- Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, device or device.
- a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried.
- This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
- the computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
- the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
- the client and server may communicate using any currently known or future developed network protocol such as HTTP (Hyper Text Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network).
- Examples of communication networks include a local area network ("LAN”), a wide area network ("WAN”), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
- the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
- the computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device can execute the method.
- Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" or similar programming languages.
- the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
- each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code, and the module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function.
- the functions noted in the blocks may also occur in an order different from that noted in the accompanying drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
- each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments described in the present disclosure may be implemented by software or hardware, wherein the name of a unit/module does not, in some cases, constitute a limitation on the unit itself.
- exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or equipment, or any suitable combination of the foregoing.
- a more specific example of a machine-readable storage medium may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- “At least one (item)” means one or more, and “multiple” means two or more.
- “And/or” is used to describe the association relationship between associated objects, indicating that three relationships may exist.
- “A and/or B” can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
- the character “/” generally indicates an “or” relationship between the objects before and after it.
- “At least one of the following” or similar expressions refer to any combination of these items, including any combination of single or plural items.
- “At least one of a, b or c” can mean: a, b, c, “a and b”, “a and c”, “b and c”, or “a and b and c”, where a, b and c can each be single or multiple.
- the steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two.
- the software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
本公开公开了一种模型构建方法、装置、电子设备、计算机可读介质,该方法包括:先利用第一数据集对待处理模型进行训练,得到第一模型,以实现针对骨干网络进行预训练的目的;再根据该第一模型中骨干网络,构建第二模型;然后,利用第二数据集,对该第二模型进行训练,并保证该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中始终保持不变,以得到待使用模型,以实现在固定骨干网络的前提下针对模型中除骨干网络以外的其它网络进行预训练的目的,以便后续能够借助针对该待使用模型的微调处理得到一个构建好的图像处理模型,以使该图像处理模型具有较好的图像处理性能,如此实现了针对一些图像处理领域中机器学习模型进行构建处理的目的。
Description
本申请要求于2022年12月19日提交中国专利局、申请号为202211634668.2、发明名称为“一种模型构建方法、装置、电子设备、计算机可读介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本公开涉及图像处理技术领域,尤其涉及一种模型构建方法、装置、电子设备、计算机可读介质。
对于一些图像处理领域(例如,目标检测、语义分割或者关键点检测等领域)来说,这些图像处理领域均可以借助机器学习模型实现该图像处理领域下所涉及的图像处理任务(例如,目标检测任务、语义分割任务或者关键点检测任务等)。
然而,如何构建上文机器学习模型是一项亟待解决的技术问题。
发明内容
本公开提供了一种模型构建方法、装置、电子设备、计算机可读介质,能够实现针对某一图像处理领域下机器学习模型进行构建处理的目的。
为了实现上述目的,本公开提供的技术方案如下:
本公开提供一种模型构建方法,所述方法包括:
利用第一数据集,对待处理模型进行训练,得到第一模型;所述第一数据集包括至少一个第一图像数据;所述第一模型包括骨干网络;
根据所述第一模型中骨干网络,构建第二模型;所述第二模型包括所述骨干网络和第一处理网络,所述第一处理网络是指所述第二模型中除了所述骨干网络以外的其他全部或者部分网络;
利用第二数据集,对所述第二模型进行训练,得到待使用模型;所述待使用模型包括所述骨干网络和第二处理网络,所述第二模型中骨干网络的网络参数在针对所述第二模型的训练过程中保持不变,所述第二处理网络是指针对所述第二模型中第一处理网络的训练结果;所述第二数据集包括至少一个第二图像数据。
在一种可能的实施方式中,所述第一处理网络用于针对所述骨干网络的输出数据进行处理,以得到所述第二模型的输出结果。
在一种可能的实施方式中,所述第一图像数据属于单对象图像数据;
和/或,
所述第二图像数据中存在至少两个对象。
在一种可能的实施方式中,所述方法还包括:
利用所述第二模型,初始化在线模型和动量模型;
所述利用第二数据集,对所述第二模型进行训练,得到待使用模型,包括:
依据所述第二数据集、所述在线模型和所述动量模型,确定所述待使用模型。
在一种可能的实施方式中,所述待使用模型的确定过程,包括:
从所述至少一个第二图像数据中选择待处理图像数据;
获取至少两个待使用图像数据和所述至少两个待使用图像数据对应的对象区域标签;所述待使用图像数据是依据所述待处理图像数据所确定的;所述待使用图像数据对应的对象区域标签是依据所述待处理图像数据对应的对象区域标签所确定的;
利用所述在线模型和所述动量模型,确定所述至少两个待使用图像数据对应的对象区域预测结果;
根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,并继续执行所述从所述至少一个第二图像数据中选择待处理图像数据的步骤,直至在达到预设停止条件时,根据所述在线模型,确定所述待使用模型。
在一种可能的实施方式中,所述至少两个待使用图像数据包括至少一个第三图像数据和至少一个第四图像数据;
所述第三图像数据对应的对象区域预测结果是利用所述在线模型确定的;
所述第四图像数据对应的对象区域预测结果是利用所述动量模型确定的。
在一种可能的实施方式中,所述根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,包括:
根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第三图像数据对应的对象区域标签,确定所述在线模型对应的回归损失;
根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第四图像数据对应的对象区域预测结果,确定所述在线模型对应的对比损失;
根据所述回归损失和所述对比损失,更新所述在线模型;
根据更新后的在线模型,更新所述动量模型。
在一种可能的实施方式中,所述根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,包括:
根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,确定所述在线模型的模型损失;
根据所述模型损失,更新所述在线模型中第一处理网络的网络参数;
根据更新后的在线模型中第一处理网络的网络参数,更新所述动量模型中第一处理网络的网络参数。
在一种可能的实施方式中,所述根据更新后的在线模型中第一处理网络的网络参数,更新所述动量模型中第一处理网络的网络参数,包括:
将更新前的动量模型中第一处理网络的网络参数与更新后的在线模型中第一处理网络的网络参数进行加权求和处理,得到更新后的动量模型中第一处理网络的网络参数。
在一种可能的实施方式中,所述对象区域标签包括至少一个目标区域表征数据;所述对象区域预测结果包括至少一个预测区域特征;
所述方法还包括:
依据所述第三图像数据对应的至少一个目标区域表征数据与所述第四图像数据对应的至少一个目标区域表征数据之间的对应关系,从所述至少一个第四图像数据对应的至少一个预测区域特征中,确定所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本;
所述根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第四图像数据对应的对象区域预测结果,确定所述在线模型对应的对比损失,包括:
根据所述至少一个第三图像数据对应的至少一个预测区域特征、以及所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本,确定所述在线模型对应的对比损失。
在一种可能的实施方式中,所述对象区域预测结果还包括各所述预测区域特征对应的预测区域表征数据;
所述第三图像数据对应的至少一个预测区域特征包括待使用区域特征;
所述待使用区域特征的正样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间存在对应关系;
所述待使用区域特征的负样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间不存在对应关系;
所述正样本对应的目标区域表征数据是根据所述正样本对应的预测区域表征数据与所述正样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的;
所述待使用区域特征对应的目标区域表征数据是根据所述待使用区域特征对应的预测区域表征数据与所述待使用区域特征所属的第三图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的;
所述负样本对应的目标区域表征数据是根据所述负样本对应的预测区域表征数据与所述负样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的。
在一种可能的实施方式中,所述待处理图像数据对应的对象区域标签的获取过程,包括:
利用选择性搜索算法,对所述待处理图像数据进行对象区域搜索处理,得到所述待处理图像数据对应的对象区域标签;
或者,
所述待处理图像数据对应的对象区域标签的获取过程,包括:
从预先构建的映射关系中查找所述待处理图像数据对应的对象区域标签;所述映射关系包括各所述第二图像数据与各所述第二图像数据对应的对象区域标签之间的对应关系;所述第二图像数据对应的对象区域标签是利用选择性搜索算法针对所述第二图像数据进行对象区域搜索处理所确定的。
在一种可能的实施方式中,所述第二模型的输出结果为目标检测结果、语义分割结果或者关键点检测结果。
在一种可能的实施方式中,所述利用第一数据集,对待处理模型进行训练,得到第一模型,包括:
利用第一数据集,对待处理模型进行全监督训练,得到第一模型;
或者
利用第一数据集,对待处理模型进行自监督训练,得到第一模型。
在一种可能的实施方式中,所述方法还包括:
利用预设图像数据集,对所述待使用模型进行微调处理,得到图像处理模型;所述图像处理模型包括目标检测模型、语义分割模型或者关键点检测模型。
本公开提供了一种模型构建装置,包括:
第一训练单元,用于利用第一数据集,对待处理模型进行训练,得到第一模型;所述第一数据集包括至少一个第一图像数据;所述第一模型包括骨干网络;
模型构建单元,用于根据所述第一模型中骨干网络,构建第二模型;所述第二模型包括所述骨干网络和第一处理网络,所述第一处理网络是指所述第二模型中除了所述骨干网络以外的其他全部或者部分网络;
第二训练单元,用于利用第二数据集,对所述第二模型进行训练,得到待使用模型;所述待使用模型包括所述骨干网络和第二处理网络,所述第二模型中骨干网络的网络参数在针对所述第二模型的训练过程中保持不变,所述第二处理网络是指针对所述第二模型中第一处理网络的训练结果;所述第二数据集包括至少一个第二图像数据。
本公开提供了一种电子设备,所述设备包括:处理器和存储器;
所述存储器,用于存储指令或计算机程序;
所述处理器,用于执行所述存储器中的所述指令或计算机程序,以使得所述电子设备执行本公开提供的模型构建方法。
本公开提供了一种计算机可读介质,所述计算机可读介质中存储有指令或计算机程序,当所述指令或计算机程序在设备上运行时,使得所述设备执行本公开提供的模型构建方法。
本公开提供了一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行本公开提供的模型构建方法的程序代码。
为了更清楚地说明本公开实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。
图1为本公开提供的一种模型构建方法的流程图;
图2为本公开提供的一种针对骨干网络的预训练过程的示意图;
图3为本公开提供的一种针对模型中除了骨干网络以外的其他网络的预训练过程的示意图;
图4为本公开提供的另一种模型构建方法的流程图;
图5为本公开实施例提供的一种模型构建装置的结构示意图;
图6为本公开实施例提供的另一种模型构建装置的结构示意图;
图7为本公开实施例提供的一种电子设备的结构示意图。
经研究发现,对于一些图像处理领域(例如,目标检测等领域)来说,该图像处理领域所使用的图像处理模型(例如,目标检测模型)通常可以借助预训练+微调这一种方式进行构建处理。
经研究还发现,对于上文预训练+微调这一种方式的一些实现方案来说,这些实现方案中所涉及的预训练过程与微调过程之间存在下文①-③所示的不一致性,以使这些不一致性对利用这些实现方案所构建的图像处理模型的图像处理效果造成了一些不良影响,从而使得利用这些实现方案所构建的图像处理模型的图像处理效果不太理想。
①在训练对象方面上所呈现的不一致性,该不一致性的产生原因具体为:在上述实现方案中,预训练过程通常只针对图像处理模型(例如,目标检测模型等)中骨干网络进行训练处理,但是在微调过程需要针对该图像处理模型中所有网络均进行训练处理,如此导致在预训练过程中所需训练的对象不同于在微调过程中所需训练的对象,从而导致了该预训练过程与该微调过程在训练对象方面上存在差异这一现象。
②在图像数据方面上所呈现的不一致性,该不一致性的产生原因具体为:在上述实现方案中,预训练过程通常只利用单物体图像数据进行预训练处理,但是在微调过程需要利用多物体图像数据进行微调处理,如此导致在预训练过程中所使用的图像数据的类型不同于在微调过程中所使用的图像数据的类型,从而导致了该预训练过程与该微调过程在图像数据方面上存在差异这一现象。
③在学习任务方面上所呈现的不一致性,该不一致性的产生原因具体为:在上述实现方案中,预训练过程通常只聚焦于分类任务,但是在微调过程需要同时聚焦于分类任务以及回归任务,如此导致在预训练过程中聚焦的学习任务少于在微调过程中聚焦的学习任务,从而导致了该预训练过程与该微调过程在学习任务方面上存在差异这一现象。
基于上文发现,本公开提供了一种可以应用于某些图像处理领域(例如,目标检测、语义分割或者关键点检测等领域)的模型构建方法,该方法包括:对于这些图像处理领域中所使用的机器学习模型(例如,目标检测模型、语义分割模型或者关键点检测模型等)来说,先利用第一数据集(例如,大量单对象图像数据),对待处理模型进行训练,得到第一模型,以使该第一模型中骨干网络具有较好的图像特征提取功能,以实现针对该机器学习模型中骨干网络的预训练过程;再根据该第一模型中骨干网络,构建第二模型,以使该第二模型所实现的图像处理功能与该机器学习模型所需实现的图像处理功能保持一致;然后,利用第二数据集(例如,一些多对象图像数据),对该第二模型进行训练,并保证该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中始终保持不变,以便在将训练好的第二模型确定为待使用模型时,该待使用模型中骨干网络与该第一模型中骨干网络保持一致,而且该待使用模型中第二处理网络是指针对该第二模型中第一处理网络的训练结果,如此能够实现在固定骨干网络的前提下针对该机器学习模型中其它网络进行预训练的目的,以便后续能够借助针对该待使用模型的微调处理得到一个构建好的图像处理模型(例如,目标检测模型),以使该图像处理模型具有较好的图像处理性能,如此实现了针对这些图像处理领域中机器学习模型进行构建处理的目的。
另外,对于本公开所提供的模型构建方法来说,不仅会针对上文图像处理模型(例如,目标检测模型等)中骨干网络进行预训练,还会针对该图像处理模型中除了该骨干网络以外的其他网络(例如,检测头网络)也进行预训练,以使最终预训练后的模型中所有网络均具有比较好的数据处理性能,如此能够有效地避免只针对骨干网络进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
此外,对于本公开所提供的模型构建方法来说,其不仅利用单对象图像数据参与了模型预训练,还利用多对象图像数据参与了该模型预训练,以使最终预训练后的模型针对多对象图像数据具有较好的图像处理功能,如此能够有效地避免只利用单对象图像数据进行模型预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
还有,对于本公开所提供的模型构建方法来说,其不仅聚焦于分类任务,还聚焦于回归任务,以使最终预训练后的模型具有较好的图像处理性能,如此能够有效地避免只聚焦于分类任务进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
再者,本公开不限定上文模型构建方法的执行主体,例如,本公开实施例提供的模型构建方法可以应用于终端设备或服务器等具有数据处理功能的设备。又如,本公开实施例提供的模型构建方法也可以借助终端设备与服务器之间的数据通信过程进行实现。其中,终端设备可以为智能手机、计算机、个人数字助理(Personal Digital Assistant,PDA)或平板电脑等。服务器可以为独立服务器、集群服务器或云服务器。
为了使本技术领域的人员更好地理解本公开方案,下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。
为了更好地理解本公开所提供的技术方案,下面先结合一些附图对本公开提供的模型构建方法进行说明。如图1所示,本公开实施例提供的模型构建方法,包括下文S101-S103。其中,该图1为本公开提供的一种模型构建方法的流程图。
S101:利用第一数据集,对待处理模型进行训练,得到第一模型;该第一数据集包括至少一个第一图像数据;该第一模型包括骨干网络。
其中,第一数据集是指在针对目标应用领域下图像处理模型中骨干网络(Backbone)进行预训练处理时所需使用的图像数据集。该目标应用领域是指本公开提供的模型构建方法的应用领域;而且本公开不限定该目标应用领域,例如,其可以是目标检测领域、图像分割领域或者关键点检测领域。
另外,本公开不限定上文第一数据集的实施方式,例如,其可以是现有的或者未来出现的任意一种能够在针对骨干网络进行预训练处理时所使用的图像数据集(例如,ImageNet这一图像数据集)进行实施。
实际上,对于上文第一数据集来说,该第一数据集可以包括至少一个第一图像数据。其中,该第一图像数据是指在针对骨干网络进行预训练处理时所使用的图像数据;而且本公开不限定该第一图像数据,例如,在一些应用场景下,该第一图像数据可以属于单对象图像数据(例如,图2所示的图像1这一单物体图像数据),以使该第一图像数据中只存在一个对象(例如,该图像1中只存在猫这一个对象)。
待处理模型是指在针对骨干网络进行预训练处理时所使用的模型;而且该待处理模型可以至少包括骨干网络。
另外,本公开不限定上文待处理模型的实施方式,为了便于理解,下面结合两种情况进行说明。
情况1,在一些应用场景下,可以针对骨干网络进行全监督的预训练处理。
基于上述情况1可知,如果针对骨干网络进行全监督的预训练处理,则上文待处理模型可以是一种分类模型,而且针对该待处理模型的训练过程具体可以为:利用上文至少一个第一图像数据以及该至少一个第一图像数据对应的分类标签,对该待处理模型进行全监督的训练处理(例如,图2中“全监督的预训练”部分所示的训练过程),并将训练好的待处理模型,确定为第一模型。其中,该“第一图像数据对应的分类标签”用于表示该第一图像数据实际所属类别;而且本公开不限定该“第一图像数据对应的分类标签”的获取过程,例如,可以借助人工打标方式进行实施。
需要说明的是,本公开不限定上段中“分类模型”的实施方式,例如,当上文目标应用领域为目标检测领域时,如图2所示,该分类模型可以包括骨干网络和全连接(Fully Connected,FC)层;该FC层的输入数据包括该骨干网络的输出数据。另外,本公开也不限定上段中步骤“对该待处理模型进行全监督的训练处理”的实施方式。
基于上文情况1以及图2所示的“全监督的预训练”的相关内容可知,在一些应用场景下,可以利用大规模图像数据及其对应的分类标签,针对骨干网络进行全监督的预训练处理,以使预训练后的骨干网络具有较好的图像特征提取性能。可见,在一种可能的实施方式下,上文待处理模型可以是一个分类模型。
情况2,在一些应用场景下,可以针对骨干网络进行自监督的预训练处理。
基于上述情况2可知,如果针对骨干网络进行自监督的预训练处理,则上文待处理模型可以包括骨干网络和预测层(Predictor),而且该预测层的输入数据包括该骨干网络的输出数据。另外,针对该待处理模型的训练过程具体可以为:利用上文至少一个第一图像数据,对该待处理模型进行自监督的训练处理(例如,图2中“自监督的预训练”部分所示的训练过程),并将训练好的待处理模型,确定为第一模型。
需要说明的是,本公开不限定上段中“预测层”的实施方式。另外,本公开也不限定上段中步骤“对该待处理模型进行自监督的训练处理”的实施方式。
基于上文情况2以及图2所示的“自监督的预训练”的相关内容可知,在一些应用场景下,可以利用大规模图像数据,针对骨干网络进行自监督的预训练处理,以使预训练后的骨干网络具有较好的图像特征提取性能。可见,在一种可能的实施方式下,上文待处理模型可以包括骨干网络和Predictor,而且该Predictor的输入数据包括该骨干网络的输出数据。
需要说明的是,对于在图2所示的图像2以及图像3来说,该图像2以及图像3均是通过针对同一张图像数据(例如,该图2所示的图像1)进行数据增强处理所得的,但是在生成该图像2时所使用的数据增强参数不同于在生成该图像3时所使用的数据增强参数,以使该图像2与该图像3之间在至少一个方面(例如,颜色、长宽比、尺寸、图像信息等方面)存在差异。
上文“第一模型”是指上文待处理模型的训练结果,而且该第一模型中骨干网络是指针对上文待处理模型中骨干网络训练所得的结果,以使该第一模型中骨干网络用于表示预训练后的骨干网络,从而使得该第一模型中骨干网络具有较好的图像表征性能。
另外,本公开不限定上文“第一模型”的确定过程,例如,在一些应用场景下,该“第一模型”的确定过程具体可以为:利用第一数据集,对待处理模型进行全监督训练(例如,图2中“全监督的预训练”部分所示的训练过程),得到第一模型。又如,在另一些应用场景下,该“第一模型”的确定过程具体可以为:利用第一数据集,对待处理模型进行自监督训练(例如,图2中“自监督的预训练”部分所示的训练过程),得到第一模型。
基于上文S101的相关内容可知,对目标应用领域(例如,目标检测领域等)来说,在一种可能的实施方式下,可以利用大规模的图像数据(例如,大规模的单物体图像数据),对该目标应用领域下图像处理模型中骨干网络进行全监督或者自监督的预训练处理,以使该骨干网络能够充分地学习到比较好的图像表征性能,从而使得预训练后的骨干网络具有较好的图像表征性能。
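为便于理解上述预训练过程,下面给出一个仅作示意的Python(PyTorch风格)代码草图,用于说明“骨干网络+FC层”这一形式的待处理模型;其中backbone、feat_dim、num_classes等均为假设的示例参数,并非本公开所限定的实现。

```python
import torch.nn as nn

class PendingModel(nn.Module):
    """待处理模型的一种示意:骨干网络 + FC层(对应图2所示的全监督的预训练)。"""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                     # 骨干网络,用于图像特征提取
        self.fc = nn.Linear(feat_dim, num_classes)   # FC层,其输入数据为骨干网络的输出数据

    def forward(self, x):
        # 假设骨干网络输出形状为(批大小, feat_dim)的特征向量
        return self.fc(self.backbone(x))
```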
S102:根据第一模型中骨干网络,构建第二模型;该第二模型包括该骨干网络和第一处理网络,该第一处理网络是指该第二模型中除了该骨干网络以外的其他全部或者部分网络。
其中,第二模型是指利用上文第一模型中骨干网络所构建的、能够实现上文目标应用领域下所需实现的图像处理功能(例如,目标检测功能、图像分割功能或者关键点检测功能)的模型。例如,当该目标应用领域为目标检测领域时,该第二模型可以是指利用该第一模型中骨干网络所构建的、具有目标检测功能的模型。又如,当该目标应用领域为图像分割领域时,该第二模型可以是指利用该第一模型中骨干网络所构建的、具有图像分割功能的模型。还如,当该目标应用领域为关键点检测领域时,该第二模型可以是指利用该第一模型中骨干网络所构建的、具有关键点检测功能的模型。
实际上,对于上文第二模型来说,该第二模型可以包括第一处理网络以及上文第一模型中骨干网络。其中,该第一处理网络是指该第二模型中除了该骨干网络以外的其他全部或者部分网络。例如,在一种可能的实施方式下,该第一处理网络可以是该第二模型中位于该骨干网络之后的网络(例如,检测头网络等),以使该第一处理网络的输入数据包括该骨干网络的输出数据,从而使得该第一处理网络可以用于针对该骨干网络的输出数据进行处理,以得到该第二模型的输出结果(例如,目标检测结果、图像分割结果、或者关键点检测结果等)。
另外,本公开不限定上文“第一处理网络”的实施方式,例如,其可以包括:上文目标应用领域下图像处理模型中除了骨干网络以外的其他部分或者全部网络。又如,该第一处理网络可以是指在该图像处理模型中存在的、用于针对该图像处理模型中骨干网络的输出数据进行处理的网络。可见,在一种可能的实施方式下,当该目标应用领域为目标检测领域时,该第一处理网络可以是检测头网络。
需要说明的是,本公开不限定上段“检测头网络”,例如,在一些应用场景下,该检测头网络可以包括Neck和Head这两个网络。又如,在另一些应用场景下,该检测头网络可以只包括Head这一个网络。
基于上文S102的相关内容可知,对于目标应用领域(例如,目标检测领域、图像分割领域或者关键点检测领域)来说,在完成针对骨干网络的预训练处理之后,可以利用预训练后的骨干网络,构建该目标应用领域下的一个图像处理模型,以使该图像处理模型包括预训练后的骨干网络,以便后续能够借助该图像处理模型,实现针对该图像处理模型中除了该骨干网络以外的其他所有网络进行预训练处理的目的。
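下面给出第二模型结构的一个示意性代码草图(仅为便于理解的假设实现;检测头网络等具体结构以实际应用场景为准):

```python
import torch.nn as nn

class SecondModel(nn.Module):
    """第二模型的一种示意:预训练后的骨干网络 + 第一处理网络(例如检测头网络)。"""

    def __init__(self, pretrained_backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = pretrained_backbone   # 来自第一模型的骨干网络
        self.head = head                      # 第一处理网络,用于处理骨干网络的输出数据

    def forward(self, x):
        # 输出结果可以是目标检测结果、语义分割结果或者关键点检测结果
        return self.head(self.backbone(x))
```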
S103:利用第二数据集,对第二模型进行训练,得到待使用模型;该待使用模型包括骨干网络和第二处理网络,该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中保持不变,该第二处理网络是指针对该第二模型中第一处理网络的训练结果;该第二数据集包括至少一个第二图像数据。
其中,第二数据集是指在针对上文目标应用领域下图像处理模型中除了骨干网络以外的其他部分或者全部网络进行预训练处理时所需使用的图像数据集。
实际上,对于上文第二数据集来说,该第二数据集可以包括至少一个第二图像数据。其中,该第二图像数据是指在针对上文目标应用领域下图像处理模型中除了骨干网络以外的其他部分或者全部网络进行预训练处理时所需使用的图像数据;而且本公开不限定该第二图像数据,例如,为了更好地提高预训练效果,该第二图像数据可以属于多对象图像数据(例如,图3所示的图像4这一多物体图像数据),以使该第二图像数据中存在至少两个对象(例如,在该图像4中存在猫和狗这两个对象)。
上文“待使用模型”是指上文第二模型的训练结果;而且该待使用模型包括骨干网络和第二处理网络。其中,因该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中始终保持不变,使得该待使用模型中骨干网络就是上文“第一模型中骨干网络”(也就是,经由上文S101预训练后的骨干网络)。又因该第二模型中除了骨干网络以外其他所有网络的网络参数在针对该第二模型的训练过程中均会进行迭代更新处理,使得该待使用模型中第二处理网络是指针对该第二模型中第一处理网络的训练结果,从而使得该第二处理网络能够更好地配合该骨干网络完成上文目标应用领域下的图像处理任务。
实际上,为了更好地提高预训练效果,本公开还提供了上文“待使用模型”的一种确定过程,其具体可以包括下文步骤11-步骤12。
步骤11:利用上文第二模型,初始化在线模型和动量模型。
其中,在线模型是指在针对上文目标应用领域下图像处理模型中除了骨干网络以外其他部分或者全部网络进行预训练处理时所需参考的一种图像处理模型。例如,该在线模型可以是指图3所示的在线模型。
动量模型是指在针对上文目标应用领域下图像处理模型中除了骨干网络以外其他部分或者全部网络进行预训练处理时所需参考的另一种图像处理模型。例如,该动量模型可以是指图3所示的动量模型。
另外,本公开不限定上文在线模型与上文动量模型之间的关联关系,例如,该动量模型中网络参数是利用该在线模型的移动指数平均处理结果(例如,下文公式(1)所示的结果)所确定的。
V_t = β × V_{t-1} + (1 − β) × D_t    (1)
式中,V_t表示在执行第t轮训练过程时动量模型中网络参数所具有的参数值;V_{t-1}表示在执行第t-1轮训练过程时动量模型中网络参数所具有的参数值,而且V_0是预先设定的数值,例如,V_0=0;D_t表示在执行第t轮训练过程时在线模型中网络参数所具有的参数值,而且该D_1就是指上文第二模型中网络参数所具有的参数值;β表示预先设定的系数值,例如,β=0.004,而且1-β=0.996。
基于上文步骤11的相关内容可知,在获取到上文第二模型之后,可以直接将该第二模型,确定为在线模型的初始值,以使初始化后的在线模型中网络参数的参数值与该第二模型中网络参数的参数值保持一致;再将该初始化后的在线模型的移动指数平均处理结果,确定为动量模型的初始值,以使初始化后的动量模型中网络参数的参数值是该初始化后的在线模型中网络参数的参数值的移动指数平均处理结果(例如,上文公式(1)所示的结果),如此能够实现针对该在线模型以及该动量模型进行初始化的目的。
需要说明的是,对于上文步骤11来说,该步骤可以用于将上文在线模型以及上文动量模型初始化为与第二模型相同的网络架构,而且该动量模型的网络参数的初始化过程可以按照上文公式(1)进行。另外,在初始化网络参数的时候,动量模型和在线模型中的骨干网络参数应该与第二模型中骨干网络参数保持一致,只需对骨干网络以外的网络部分进行初始化即可。
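结合上文公式(1),下面给出在线模型与动量模型的初始化及动量更新过程的一个示意性代码草图(PyTorch风格;beta的默认取值沿用上文示例,属于假设值):

```python
import copy
import torch

def init_online_and_momentum(second_model: torch.nn.Module):
    """以第二模型初始化在线模型与动量模型,使二者与第二模型具有相同的网络架构与参数。"""
    online = copy.deepcopy(second_model)
    momentum = copy.deepcopy(second_model)
    for p in momentum.parameters():
        p.requires_grad = False   # 动量模型不参与梯度更新
    return online, momentum

@torch.no_grad()
def ema_update(momentum: torch.nn.Module, online: torch.nn.Module, beta: float = 0.004):
    """对应公式(1):V_t = beta × V_{t-1} + (1 − beta) × D_t。"""
    for v, d in zip(momentum.parameters(), online.parameters()):
        v.mul_(beta).add_(d, alpha=1.0 - beta)
```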
步骤12:依据上文第二数据集、上文初始化后的在线模型、以及上文初始化后的动量模型,确定待使用模型。
作为示例,步骤12具体可以包括下文步骤121-步骤127。
步骤121:从至少一个第二图像数据中选择待处理图像数据。
其中,待处理图像数据是指在上文至少一个第二图像数据中存在的、任意一个仍未参与过模型训练过程的图像数据。
另外,本公开不限定上文待处理图像数据的确定过程,例如,其具体可以为:先从上文至少一个第二图像数据中筛选出仍未参与过模型训练过程的所有图像数据;再从筛选所得的所有图像数据中随机抽取一个图像数据,确定为该待处理图像数据,以便在当前轮训练过程中针对该待处理图像数据进行一些数据处理(例如,下文步骤122-步骤123所示的处理过程等)。
步骤122:获取上文待处理图像数据对应的对象区域标签。
其中,对象区域标签用于表示上文待处理图像数据中各个对象在该待处理图像数据内所占区域。
另外,本公开不限定上文对象区域标签的实施方式,例如,在目标检测领域下,该对象区域标签可以借助物体框(例如,图3所示的框1和框2)进行实施。又如,在图像分割领域下,该对象区域标签可以借助掩码(mask)进行实施。还如,在关键点检测领域下,该对象区域标签可以借助关键点的位置标识框进行实施。
此外,本公开不限定上文对象区域标签的获取方式,为了便于理解,下面结合两种情况进行说明。
情况1,在一些应用场景(例如,存储资源比较充足的场景)下,可以预先确定出每个第二图像数据对应的对象区域标签,并将这些第二图像数据对应的对象区域标签存储至某个存储空间,以便后续在每一轮训练过程中可以从该存储空间中直接读取某个第二图像数据对应的对象区域标签。
基于上段情况1可知,在一种可能的实施方式下,上文步骤122具体可以为:从预先构建的映射关系中查找上文待处理图像数据对应的对象区域标签。其中,该映射关系包括各第二图像数据与各第二图像数据对应的对象区域标签之间的对应关系;而且本公开实施例不限定该映射关系,例如,其可以采用数据库进行实施。
另外,本公开不限定上文映射关系中所记录的第i个第二图像数据对应的对象区域标签的确定过程,例如,其可以借助人工打标的方式进行实施。又如,为了更好地减少资源消耗,该第i个第二图像数据对应的对象区域标签的自动确定过程,具体可以为:利用选择性搜索算法(Selective Search),针对该第i个第二图像数据(例如,图3所示的图像4)进行对象区域搜索处理,得到该第i个第二图像数据对应的对象区域标签(例如,图3所示的{框1,框2})。其中,该选择性搜索算法是一种无监督算法。i为正整数,i≤I,I为正整数,I表示上文“至少一个第二图像数据”中的图像个数。
基于上文情况1的相关内容可知,在一些应用场景下,可以通过离线模式,预先确定出各个第二图像数据对应的对象区域标签,并将所有第二图像数据对应的对象区域标签按照一定方式(例如,键值对方式)存储到某一存储空间中,以使在该存储空间中以上文映射关系方式存储有各第二图像数据与各第二图像数据对应的对象区域标签之间的对应关系,以便后续在每一轮训练过程中可以从该存储空间中直接读取某个第二图像数据对应的对象区域标签,如此能够有效地节省当实时确定每个第二图像数据对应的对象区域标签所需占用的资源,从而有利于提高网络训练效果。
情况2,在一些应用场景(例如,存储资源有限的场景)下,可以在每一轮训练过程中实时地确定上文待处理图像数据对应的对象区域标签。
基于上段情况2可知,在一种可能的实施方式下,上文步骤122具体可以为:利用上文选择性搜索算法,对上文待处理图像数据进行对象区域搜索处理,得到该待处理图像数据对应的对象区域标签。
基于上文步骤122的相关内容可知,对于当前轮训练过程来说,在获取到待处理图像数据之后,可以获取该待处理图像数据对应的对象区域标签,以便后续能够将该对象区域标签作为监督信息进行使用。
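作为示意,下面给出利用OpenCV(需安装opencv-contrib-python)中的选择性搜索实现对象区域搜索的一个代码草图;其中max_boxes为假设的截断参数:

```python
import cv2

def selective_search_boxes(image_bgr, max_boxes: int = 100):
    """对图像进行选择性搜索,返回候选对象框 (x, y, w, h),可作为对象区域标签(伪标签)使用。"""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()    # 快速模式;也可改用 switchToSelectiveSearchQuality()
    rects = ss.process()                # 无监督地产生候选对象框
    return rects[:max_boxes]
```

对于上文情况1所述的离线模式,可将各第二图像数据的标识与上述候选框以键值对方式存入映射关系(例如字典或数据库),训练时直接查表读取即可。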
步骤123:依据待处理图像数据与该待处理图像数据对应的对象区域标签,确定至少两个待使用图像数据和该至少两个待使用图像数据对应的对象区域标签。
其中,待使用图像数据是指针对上文待处理图像数据进行数据增强处理所确定的图像数据。
另外,对于上文“至少两个待使用图像数据”来说,各个待使用图像数据均是指针对上文待处理图像数据的数据增强处理结果,但是因在生成各个待使用图像数据时所使用的增强参数不同,使得这些待使用图像数据中任意两个图像数据在至少一个方面(例如,颜色、长宽比、尺寸、图像信息等方面)存在差别,从而使得这些待使用图像数据能够借助不同像素信息表示出相同对象(例如,图3所示的图像5和图像6能够借助不同的像素信息表示出猫和狗这两个物体等)。
此外,本公开不限定上文“至少两个待使用图像数据”的实施方式,例如,当上文待处理图像数据为图3所示的图像4时,该“至少两个待使用图像数据”可以包括图3所示的图像5和图像6。
还有,本公开不限定上文“至少两个待使用图像数据”中的图像数据个数,例如,其可以包括N个。其中,N为正整数,N≥2。
第n个待使用图像数据对应的对象区域标签用于表示该第n个待使用图像数据中各个对象在该第n个待使用图像数据内所占区域。其中,n为正整数,n≤N。
另外,本公开不限定上文“第n个待使用图像数据对应的对象区域标签”的获取方式,例如,其可以采用现有的或者未来出现的任意一种能够针对一个图像数据进行对象区域确定处理(例如,人工打标或者上文选择性搜索算法)的方法进行实施。
实际上,为了更好地提高模型训练效果,本公开还提供了上文“第n个待使用图像数据对应的对象区域标签”的确定过程的一种可能的实施方式,在该实施方式中,当该第n个待使用图像数据是通过按照某一增强参数,对上文待处理图像数据进行数据增强处理所确定的时,该“第n个待使用图像数据对应的对象区域标签”的确定过程,具体可以为:按照该增强参数,对该待处理图像数据对应的对象区域标签进行数据增强处理,得到该第n个待使用图像数据对应的对象区域标签,以使该“第n个待使用图像数据对应的对象区域标签”能够表示出该第n个待使用图像数据中各个对象在该第n个待使用图像数据内所占区域。
需要说明的是,本公开不限定上段中“在生成第n个待使用图像数据时所使用的增强参数”这一信息的确定过程,例如,可以随机确定。又如,也可以预先设定。
基于上文步骤123的相关内容可知,在获取到上文待处理图像数据(例如,图3所示的图像4)以及该待处理图像数据对应的对象区域标签(例如,图3所示的{框1,框2})之后,可以针对该待处理图像数据进行N次不同的数据增强处理,并将各次数据增强处理分别确定为待使用图像数据(例如,图3所示的图像5或者图像6);同时,该待处理图像数据对应的对象区域标签也会随着各次数据增强处理进行相应地改变,以得到各待使用图像数据对应的对象区域标签(例如,图3所示的{框3,框4}或者{框5,框6}),以便后续能够依据这些待使用图像数据及其对应的对象区域标签,继续执行当前轮训练过程。
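下面给出“数据增强并同步变换对象区域标签”的一个示意性代码草图(仅演示随机水平翻转与缩放两种增强方式;框格式假设为 (x1, y1, x2, y2)):

```python
import random
import cv2

def augment_with_boxes(image, boxes, out_w: int, out_h: int):
    """按同一组增强参数变换图像及其物体框,得到待使用图像数据及其对应的对象区域标签。"""
    h, w = image.shape[:2]
    flip = random.random() < 0.5
    if flip:
        image = cv2.flip(image, 1)                  # 水平翻转
    sx, sy = out_w / w, out_h / h
    image = cv2.resize(image, (out_w, out_h))       # 缩放
    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        if flip:
            x1, x2 = w - x2, w - x1                 # 翻转后框的横坐标随之改变
        new_boxes.append((x1 * sx, y1 * sy, x2 * sx, y2 * sy))
    return image, new_boxes
```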
步骤124:利用在线模型和动量模型,确定至少两个待使用图像数据对应的对象区域预测结果。
其中,第n个待使用图像数据对应的对象区域预测结果是指由模型针对该第n个待使用图像数据进行对象区域预测处理所确定的结果。其中,n为正整数,n≤N。
另外,本公开不限定上文对象区域预测结果的实施方式,例如,上文“第n个待使用图像数据对应的对象区域预测结果”可以包括至少一个预测区域表征数据(例如,图3所示的框集1中的各个物体框等)以及该至少一个预测区域表征数据对应的预测区域特征(例如,图3所示的框特征集1中的各个框特征等)。其中,第e个预测区域表征数据用于表示该第n个待使用图像数据中第e个对象在该第n个待使用图像数据中所占区域。该第e个预测区域表征数据对应的预测区域特征用于表征该第e个预测区域表征数据所呈现的特征。e为正整数,e≤E,E表示该“至少一个预测区域表征数据”中的数据个数。
此外,本公开不限定上文步骤124的实施方式,为了便于理解,下面结合示例进行说明。
作为示例,当上文“至少两个待使用图像数据”包括至少一个第三图像数据和至少一个第四图像数据时,步骤124具体可以包括下文步骤1241-步骤1242。
步骤1241:利用上文在线模型,确定各个第三图像数据对应的对象区域预测结果。
其中,第三图像数据是指需要由上文在线模型进行对象区域预测处理的待使用图像数据。例如,该第三图像数据可以是指图3中所示的图像5。
第j个第三图像数据对应的对象区域预测结果是指由上文在线模型针对该第j个第三图像数据进行对象区域预测处理所确定的结果。其中,j为正整数,j≤J,J为正整数,J表示上文“至少一个第三图像数据”中的图像数据个数。
另外,本公开不限定上文“第j个第三图像数据对应的对象区域预测结果”的确定过程,例如,其具体可以为:将该第j个第三图像数据(例如,图3所示的图像5)输入上文在线模型,得到该在线模型输出的该第j个第三图像数据对应的对象区域预测结果(例如,图3所示的框集1和框特征集1)。
步骤1242:利用上文动量模型,确定各个第四图像数据对应的对象区域预测结果。
其中,第四图像数据是指需要由上文动量模型进行对象区域预测处理的待使用图像数据。例如,该第四图像数据可以是指图3中所示的图像6。
第m个第四图像数据对应的对象区域预测结果是指由上文动量模型针对该第m个第四图像数据进行对象区域预测处理所确定的结果。其中,m为正整数,m≤M,M为正整数,M表示上文“至少一个第四图像数据”中的图像数据个数。需要说明的是,上文N=M+J。
另外,本公开不限定上文“第m个第四图像数据对应的对象区域预测结果”的确定过程,例如,其具体可以为:将该第m个第四图像数据(例如,图3所示的图像6)输入上文动量模型,得到该动量模型输出的该第m个第四图像数据对应的对象区域预测结果(例如,图3所示的框集2和框特征集2)。
基于上文步骤1241至步骤1242的相关内容可知,对于上文至少两个待使用图像数据来说,可以将这些待使用图像数据划分成两部分,一部分图像数据(例如,图3中所示的图像5)会被送入上文在线模型,以得到由该在线模型针对其所输出的预测结果;但是,另一部分图像数据(例如,图3中所示的图像6)会被送入上文动量模型,以得到由该动量模型针对其所输出的预测结果,如此能够实现借助在线模型和动量模型完成针对这些待使用图像数据进行对象区域预测处理的目的。
需要说明的是,本公开不限定被送入上文在线模型的图像数据(也就是,上文J个第三图像数据)的确定过程,例如,其具体可以为:在获取到上文“至少两个待使用图像数据”之后,从这些待使用图像数据中随机挑选J个图像数据,并将挑选出的这些图像数据均视为第三图像数据,以便后续将挑选出的这些图像数据送入该在线模型。另外,本公开也不限定被送入上文动量模型的图像数据(也就是,上文M个第四图像数据)的确定过程,例如,其具体可以为:在从这些待使用图像数据中随机挑选出J个图像数据之后,将剩余各个图像数据均视为第四图像数据,并将剩余的这些图像数据送入该动量模型。
基于上文步骤124的相关内容可知,在获取到各个待使用图像数据之后,可以将各个待使用图像数据分别送入其所对应的模型(例如,在线模型或者动量模型)中,以使该模型能够得到针对该待使用图像数据预测所得的预测结果(例如,该待使用图像数据对应的对象区域预测结果),以便后续能够借助这些预测结果,确定该在线模型的模型预测性能。
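下面给出将各待使用图像数据分别送入在线模型与动量模型的一个示意性代码草图(假设二者的输入为图像张量、输出为对象区域预测结果):

```python
import random
import torch

def predict_regions(online, momentum, views, num_online: int):
    """随机挑选num_online个视图作为第三图像数据送入在线模型,其余视为第四图像数据送入动量模型。"""
    random.shuffle(views)
    third, fourth = views[:num_online], views[num_online:]
    online_preds = [online(v) for v in third]             # 第三图像数据对应的对象区域预测结果
    with torch.no_grad():                                 # 动量模型不回传梯度
        momentum_preds = [momentum(v) for v in fourth]    # 第四图像数据对应的对象区域预测结果
    return third, fourth, online_preds, momentum_preds
```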
步骤125:判断是否达到预设停止条件,若是,则执行下文步骤127;若否,则执行下文步骤126。
其中,预设停止条件是指在针对上文目标应用领域下图像处理模型中除了骨干网络以外其他部分或者全部网络进行预训练处理时所需参考的训练停止条件;而且本公开不限定该预设停止条件,例如,其可以包括:训练过程的迭代次数达到预设次数阈值。又如,该预设停止条件可以包括:上文在线模型的模型损失低于预先设定的损失阈值。还如,该预设停止条件可以包括:该在线模型的模型损失的变化率低于预先设定的变化率阈值(也就是,该在线模型趋向于收敛)。
上文“在线模型的模型损失”用于表征该在线模型的模型预测性能;而且本公开不限定该“在线模型的模型损失”的确定过程。
实际上,为了更好地提高模型训练效果,本公开还提供了上文“在线模型的模型损失”的确定过程的一种可能的实施方式,在该实施方式中,当上文“至少两个待使用图像数据”包括至少一个第三图像数据和至少一个第四图像数据时,该“在线模型的模型损失”的确定过程具体可以包括下文步骤21-步骤23。
步骤21:根据上文至少一个第三图像数据对应的对象区域预测结果与该至少一个第三图像数据对应的对象区域标签,确定在线模型对应的回归损失。
其中,第j个第三图像数据对应的对象区域标签用于表示该第j个第三图像数据中各个对象在该第j个第三图像数据内所占区域。其中,j为正整数,j≤J。
上文“在线模型对应的回归损失”用于表示在当前轮训练过程中该在线模型在回归任务下所具有的回归特征。其中,该回归任务具体为:在将一个图像数据输入该在线模型之后,由该在线模型针对该图像数据所输出的对象区域预测结果应该尽可能地与该图像数据所对应的对象区域标签保持一致。例如,该“在线模型对应的回归损失”可以是图3所示的回归损失。
另外,本公开不限定上文“在线模型对应的回归损失”的确定过程,例如,当上文“对象区域预测结果”包括至少一个预测区域表征数据(例如,图3所示的框集1中的各个物体框等)时,该“在线模型对应的回归损失”的确定过程具体可以为:按照预先设定的回归损失计算公式,对上文至少一个第三图像数据对应的至少一个预测区域表征数据与该至少一个第三图像数据对应的对象区域标签进行回归损失计算处理,得到该在线模型对应的回归损失,以使该回归损失能够表示出该在线模型所具有的回归特征。
需要说明的是,本公开不限定上段回归损失计算公式的实施方式,例如,其可以采用现有的或者未来出现的任意一种回归损失计算方法进行实施。又如,其可以采用依据实际应用场景所设定的回归损失计算方法进行实施。
步骤22:根据上文至少一个第三图像数据对应的对象区域预测结果与上文至少一个第四图像数据对应的对象区域预测结果,确定在线模型对应的对比损失。
其中,在线模型对应的对比损失(例如,图3所示的对比损失)用于表示在当前轮训练过程中该在线模型在分类任务下所具有的分类特征。其中,该分类任务是一种自监督的分类任务;而且该分类任务可以借助对比学习进行实现。
另外,本公开不限定上文“在线模型对应的对比损失”的确定过程,例如,在一种可能的实施方式下,当上文对象区域标签包括至少一个目标区域表征数据,而且上文对象区域预测结果包括至少一个预测区域特征(例如,图3所示的框特征集1或者框特征集2)以及该至少一个预测区域特征对应的预测区域表征数据(例如,图3所示的框集1或者框集2)时,该“在线模型对应的对比损失”的确定过程具体可以包括下文步骤31-步骤33。
步骤31:获取第j个第三图像数据对应的至少一个目标区域表征数据与第m个第四图像数据对应的至少一个目标区域表征数据之间的对应关系。其中,j为正整数,j≤J,m为正整数,m≤M。
其中,第j个第三图像数据对应的第k个目标区域表征数据用于表示该第j个第三图像数据中第k个对象在该第j个第三图像数据中所占区域,以使该“第j个第三图像数据对应的第k个目标区域表征数据”能够表示出该第k个对象的区域标签。k为正整数,k≤K,K为正整数,K表示上文“第j个第三图像数据对应的至少一个目标区域表征数据”中的数据个数。
另外,本公开不限定上文“第j个第三图像数据对应的至少一个目标区域表征数据”,例如,当该第j个第三图像数据为图3所示的图像5时,该“第j个第三图像数据对应的至少一个目标区域表征数据”可以包括图3所示的框3和框4。
第m个第四图像数据对应的第h个目标区域表征数据用于表示该第m个第四图像数据中第h个对象在该第m个第四图像数据中所占区域,以使该“第m个第四图像数据对应的第h个目标区域表征数据”能够表示出该第h个对象的区域标签。h为正整数,h≤H,H为正整数,H表示上文“第m个第四图像数据对应的至少一个目标区域表征数据”中的数据个数。
另外,本公开不限定上文“第m个第四图像数据对应的至少一个目标区域表征数据”,例如,当该第m个第四图像数据为图3所示的图像6时,该“第m个第四图像数据对应的至少一个目标区域表征数据”可以包括图3所示的框5和框6。
此外,本公开不限定上文步骤31的实施方式,例如,其具体可以为:从预设存储空间中读取第j个第三图像数据对应的至少一个目标区域表征数据与第m个第四图像数据对应的至少一个目标区域表征数据之间的对应关系。
又如,在一种可能的实施方式下,上文步骤31具体可以包括下文步骤311-步骤313。
步骤311:获取上文第j个第三图像数据对应的至少一个目标区域表征数据与上文待处理图像数据对应的至少一个目标区域表征数据之间的对应关系,作为第一对应关系。
其中,待处理图像数据对应的第d个目标区域表征数据用于表示该待处理图像数据中第d个对象在该待处理图像数据中所占区域,以使该“待处理图像数据对应的第d个目标区域表征数据”能够表示出该第d个对象对应的区域标签。d为正整数,d≤D,D为正整数,D表示上文“待处理图像数据对应的至少一个目标区域表征数据”中的数据个数。
另外,本公开不限定上文“待处理图像数据对应的至少一个目标区域表征数据”,例如,当该待处理图像数据为图3所示的图像4时,该“待处理图像数据对应的至少一个目标区域表征数据”可以包括图3所示的框1和框2。
此外,本公开不限定步骤311的实施方式,例如,其具体可以为:如果上文“第j个第三图像数据对应的第k个目标区域表征数据”是由上文“待处理图像数据对应的第d个目标区域表征数据”进行一定改变所确定的,则可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间存在对应关系;如果该“第j个第三图像数据对应的第k个目标区域表征数据”不是由该“待处理图像数据对应的第d个目标区域表征数据”进行一定改变所确定的,则可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间不存在对应关系。其中,k为正整数,k≤K;d为正整数,d≤D。
步骤312:获取上文第m个第四图像数据对应的至少一个目标区域表征数据与上文待处理图像数据对应的至少一个目标区域表征数据之间的对应关系,作为第二对应关系。
需要说明的是,步骤312的实施方式类似于上文步骤311的实施方式,例如,其具体可以为:如果上文“第m个第四图像数据对应的第h个目标区域表征数据”是由上文“待处理图像数据对应的第d个目标区域表征数据”进行一定改变所确定的,则可以确定该“第m个第四图像数据对应的第h个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间存在对应关系;如果该“第m个第四图像数据对应的第h个目标区域表征数据”不是由该“待处理图像数据对应的第d个目标区域表征数据”进行一定改变所确定的,则可以确定该“第m个第四图像数据对应的第h个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间不存在对应关系。其中,h为正整数,h≤H;d为正整数,d≤D。
步骤313:依据上文第一对应关系和上文第二对应关系,确定上文第j个第三图像数据对应的至少一个目标区域表征数据与上文第m个第四图像数据对应的至少一个目标区域表征数据之间的对应关系。
需要说明的是,本公开不限定上文步骤313的实施方式,例如,其可以借助对应关系传递过程进行实施。可见,在一种可能的实施方式下,该步骤313具体可以为:如果上文第一对应关系表示上文“第j个第三图像数据对应的第k个目标区域表征数据”与上文“待处理图像数据对应的第d个目标区域表征数据”之间存在对应关系,而且上文第二对应关系表示上文“第m个第四图像数据对应的第h个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间存在对应关系,则可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“第m个第四图像数据对应的第h个目标区域表征数据”对应于该待处理图像数据中同一个对象,故可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“第m个第四图像数据对应的第h个目标区域表征数据”之间存在对应关系。
然而,如果上文第一对应关系表示上文“第j个第三图像数据对应的第k个目标区域表征数据”与上文“待处理图像数据对应的第d个目标区域表征数据”之间存在对应关系,但是上文第二对应关系表示上文“第m个第四图像数据对应的第h个目标区域表征数据”与该“待处理图像数据对应的第d个目标区域表征数据”之间不存在对应关系,则可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“第m个第四图像数据对应的第h个目标区域表征数据”对应于该待处理图像数据中不同对象,故可以确定该“第j个第三图像数据对应的第k个目标区域表征数据”与该“第m个第四图像数据对应的第h个目标区域表征数据”之间不存在对应关系。
基于上文步骤31的相关内容可知,在获取到上文至少一个第三图像数据以及至少一个第四图像数据之后,可以确定各个第三图像数据对应的至少一个目标区域表征数据(例如,图3所示的框3和框4)与各个第四图像数据对应的至少一个目标区域表征数据(例如,图3所示的框5和框6)之间的对应关系,以便后续能够基于该对应关系,确定出该至少一个第三图像数据的预测结果与该至少一个第四图像数据的预测结果之间的对比损失。
步骤32:依据上文对应关系,从上文至少一个第四图像数据对应的至少一个预测区域特征中,确定上文至少一个第三图像数据对应的各预测区域特征的正样本以及负样本。
其中,第m个第四图像数据对应的第h个预测区域表征数据用于表示针对该第m个第四图像数据中第h个对象预测所得的区域。h为正整数,h≤H。
第m个第四图像数据对应的第h个预测区域特征用于表征上文“第m个第四图像数据对应的第h个预测区域表征数据”所具有的特征。h为正整数,h≤H。
第j个第三图像数据对应的第k个预测区域表征数据用于表示针对该第j个第三图像数据中第k个对象预测所得的区域。k为正整数,k≤K。
第j个第三图像数据对应的第k个预测区域特征用于表征上文“第j个第三图像数据对应的第k个预测区域表征数据”所具有的特征。k为正整数,k≤K。
第j个第三图像数据对应的第k个预测区域特征的正样本是指任意一个第四图像数据的对象区域预测结果中存在的、与该预测区域特征所表征的预测区域之间存在对应关系的预测区域特征。k为正整数,k≤K。
第j个第三图像数据对应的第k个预测区域特征的负样本是指任意一个第四图像数据的对象区域预测结果中存在的、与该预测区域特征所表征的预测区域之间不存在对应关系的预测区域特征。k为正整数,k≤K。
另外,本公开不限定上文步骤32的实施方式,例如,该步骤32具体可以包括下文步骤321-步骤322。
步骤321:如果上文对应关系表示出上文“第m个第四图像数据对应的第h个目标区域表征数据”与上文“第j个第三图像数据对应的第k个目标区域表征数据”之间存在对应关系,则将与该“第m个第四图像数据对应的第h个目标区域表征数据”具有对应关系的“第m个第四图像数据对应的第h个预测区域特征”,确定为上文“第j个第三图像数据对应的第k个预测区域特征”的正样本。其中,h为正整数,h≤H,k为正整数,k≤K。
本公开中,如果上文对应关系表示出上文“第m个第四图像数据对应的第h个目标区域表征数据”与上文“第j个第三图像数据对应的第k个目标区域表征数据”之间存在对应关系,则可以确定“第m个第四图像数据对应的第h个目标区域表征数据”与该“第j个第三图像数据对应的第k个目标区域表征数据”对应于上文待处理图像数据中同一个对象,从而可以确定与该“第m个第四图像数据对应的第h个目标区域表征数据”相对应的预测结果(例如,预测区域表征数据及其对应的预测区域特征)、以及与该“第j个第三图像数据对应的第k个目标区域表征数据”相对应的预测结果均是针对同一个对象预测所得的,进而可以确定前一个预测结果就是后一个预测结果的正样本,故可以将前一个预测结果中的预测区域特征(也就是,上文“第m个第四图像数据对应的第h个预测区域特征”),确定为后一个预测结果中的预测区域特征(也就是,上文“第j个第三图像数据对应的第k个预测区域特征”)的正样本。
步骤322:如果上文对应关系表示出上文“第m个第四图像数据对应的第h个目标区域表征数据”与上文“第j个第三图像数据对应的第k个目标区域表征数据”之间不存在对应关系,则将与该“第m个第四图像数据对应的第h个目标区域表征数据”具有对应关系的“第m个第四图像数据对应的第h个预测区域特征”,确定为上文“第j个第三图像数据对应的第k个预测区域特征”的负样本。其中,h为正整数,h≤H,k为正整数,k≤K。
本公开中,如果上文对应关系表示出上文“第m个第四图像数据对应的第h个目标区域表征数据”与上文“第j个第三图像数据对应的第k个目标区域表征数据”之间不存在对应关系,则可以确定“第m个第四图像数据对应的第h个目标区域表征数据”与该“第j个第三图像数据对应的第k个目标区域表征数据”对应于上文待处理图像数据中不同对象,从而可以确定与该“第m个第四图像数据对应的第h个目标区域表征数据”相对应的预测结果(例如,预测区域表征数据及其对应的预测区域特征)、以及与该“第j个第三图像数据对应的第k个目标区域表征数据”相对应的预测结果均是针对不同对象预测所得的,进而可以确定前一个预测结果就是后一个预测结果的负样本,故可以将前一个预测结果中的预测区域特征(也就是,上文“第m个第四图像数据对应的第h个预测区域特征”),确定为后一个预测结果中的预测区域特征(也就是,上文“第j个第三图像数据对应的第k个预测区域特征”)的负样本。
基于上文步骤321至步骤322的相关内容可知,在一种可能的实施方式下,对于任意一个第三图像数据来说,当该第三图像数据对应的至少一个预测区域特征包括待使用区域特征(例如,上文“第j个第三图像数据对应的第k个预测区域特征”),而且该待使用区域特征用于代表该第三图像数据所对应的任意一个预测区域特征时,该待使用区域特征的正负样本分别满足下文①-②所示的条件。
①上文待使用区域特征的正样本对应的目标区域表征数据与该待使用区域特征对应的目标区域表征数据之间存在对应关系。
上文“待使用区域特征的正样本对应的目标区域表征数据”是指该正样本在上文第四图像数据中所对应的对象的区域标签。例如,当该待使用区域特征为上文“第j个第三图像数据对应的第k个预测区域特征”,而且该待使用区域特征的正样本为上文“第m个第四图像数据对应的第h个预测区域特征”时,该“待使用区域特征的正样本对应的目标区域表征数据”就是指上文“第m个第四图像数据对应的第h个目标区域表征数据”。
另外,本公开不限定上文“待使用区域特征的正样本对应的目标区域表征数据”的确定过程,例如,其具体可以为:根据该正样本对应的预测区域表征数据与该正样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸,确定该正样本对应的目标区域表征数据,以使该正样本对应的预测区域表征数据与该正样本对应的目标区域表征数据之间的重叠区域尺寸达到最大。其中,该“正样本对应的预测区域表征数据”是指该正样本在上文第四图像数据中所对应的对象的区域预测结果(例如,上文“第m个第四图像数据对应的第h个预测区域表征数据”)。
上文“待使用区域特征对应的目标区域表征数据”是指该待使用区域特征在上文第三图像数据中所对应的对象的区域标签。例如,当该待使用区域特征为上文“第j个第三图像数据对应的第k个预测区域特征”时,该“待使用区域特征对应的目标区域表征数据”就是指上文“第j个第三图像数据对应的第k个目标区域表征数据”。
需要说明的是,上文“待使用区域特征对应的目标区域表征数据”的获取过程类似于上文“待使用区域特征的正样本对应的目标区域表征数据”的获取过程,为了简要起见,在此不再赘述。
可见,在一种可能的实施方式下,上文“待使用区域特征对应的目标区域表征数据”的获取过程,具体可以为:根据该待使用区域特征对应的预测区域表征数据与该待使用区域特征所属的第三图像数据对应的各个目标区域表征数据之间的重叠区域尺寸,确定该待使用区域特征对应的目标区域表征数据,以使该待使用区域特征对应的预测区域表征数据与该待使用区域特征对应的目标区域表征数据之间的重叠区域尺寸达到最大。其中,该“待使用区域特征对应的预测区域表征数据”是指该待使用区域特征在上文第三图像数据中所对应的对象的区域预测结果。
②上文待使用区域特征的负样本对应的目标区域表征数据与该待使用区域特征对应的目标区域表征数据之间不存在对应关系。
上文“待使用区域特征的负样本对应的目标区域表征数据”是指该负样本在上文第四图像数据中所对应的对象的区域标签。例如,当该待使用区域特征为上文“第j个第三图像数据对应的第k个预测区域特征”,而且该待使用区域特征的负样本包括上文“第m个第四图像数据对应的第h个预测区域特征”时,该“待使用区域特征的负样本对应的目标区域表征数据”就是指上文“第m个第四图像数据对应的第h个目标区域表征数据”。
需要说明的是,上文“待使用区域特征的负样本对应的目标区域表征数据”的获取过程类似于上文“待使用区域特征的正样本对应的目标区域表征数据”的获取过程,为了简要起见,在此不再赘述。
可见,在一种可能的实施方式下,该“待使用区域特征的负样本对应的目标区域表征数据”的获取过程,具体可以为:根据该负样本对应的预测区域表征数据与该负样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸,确定该负样本对应的目标区域表征数据,以使该负样本对应的预测区域表征数据与该负样本对应的目标区域表征数据之间的重叠区域尺寸达到最大。其中,该“负样本对应的预测区域表征数据”是指该负样本在上文第四图像数据中所对应的对象的区域预测结果。
基于上文步骤32的相关内容可知,在获取到上文第j个第三图像数据对应的至少一个目标区域表征数据与上文第m个第四图像数据对应的至少一个目标区域表征数据之间的对应关系之后,将这两个图像数据中对应于同一个对象的目标区域表征数据的预测结果(例如,预测区域特征),确定为正样本,并将这两个图像数据中对应于不同对象的目标区域表征数据的预测结果,确定为负样本,以便后续能够借助这些正样本以及这些负样本,确定出这两个图像数据的预测结果之间的对比损失。其中,j为正整数,j≤J,m为正整数,m≤M。
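文中“重叠区域尺寸”未限定具体度量方式;下面以交并比(IoU)作为假设的度量,给出“为某一预测区域表征数据选取重叠最大的目标区域表征数据”的示意性代码草图:

```python
def iou(a, b):
    """计算两个框 (x1, y1, x2, y2) 之间的交并比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_target(pred_box, target_boxes):
    """返回与该预测区域表征数据重叠区域尺寸最大的目标区域表征数据的下标。"""
    return max(range(len(target_boxes)), key=lambda i: iou(pred_box, target_boxes[i]))
```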
步骤33:根据上文至少一个第三图像数据对应的至少一个预测区域特征、以及该至少一个第三图像数据对应的各预测区域特征的正样本以及负样本,确定上文在线模型对应的对比损失。
需要说明的是,本公开不限定步骤33的实施方式,例如,其可以采用现有的或者未来出现的任意一种对比损失的确定方法进行实施。
基于上文步骤31至步骤33的相关内容可知,在一种可能的实施方式下,在获取到由上文在线模型输出的至少一个第三图像数据对应的对象区域预测结果(例如,图3所示的框集1和框特征集1)、以及由上文动量模型输出的至少一个第四图像数据对应的对象区域预测结果(例如,图3所示的框集2和框特征集2)之后,可以借助对比学习方式,确定该在线模型对应的对比损失,以使该对比损失能够表示出该在线模型的分类性能。
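步骤33可采用任意一种对比损失;下面以InfoNCE形式给出一个示意性代码草图(其中温度系数tau为假设值):

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(query, positive, negatives, tau: float = 0.07):
    """query与positive为形状(C,)的区域特征,negatives为形状(K, C)的负样本特征。"""
    q = F.normalize(query, dim=0)
    pos = torch.dot(q, F.normalize(positive, dim=0)) / tau   # 与正样本的相似度
    neg = F.normalize(negatives, dim=1) @ q / tau            # 与各负样本的相似度
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0) # 正样本置于第0类
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```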
步骤23:根据上文回归损失和上文对比损失,确定上文在线模型的模型损失。
需要说明的是,本公开不限定步骤23的实施方式,例如,其可以采用现有的或者未来出现的任意一种能够将两种损失进行整合处理的方式(例如,加权求和、集合等处理方式)进行实施。
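以加权求和方式整合两种损失的示意如下(权重为假设的超参数):

```python
def model_loss(reg_loss, con_loss, w_reg: float = 1.0, w_con: float = 1.0):
    """以加权求和方式整合回归损失与对比损失,得到在线模型的模型损失。"""
    return w_reg * reg_loss + w_con * con_loss
```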
基于上文步骤21至步骤23的相关内容可知,在获取到由上文在线模型输出的至少一个第三图像数据对应的对象区域预测结果、以及由上文动量模型输出的至少一个第四图像数据对应的对象区域预测结果之后,可以先分别借助这些对象区域预测结果,确定出该在线模型的回归损失和对比损失;再基于这两种损失,确定出该在线模型的模型损失,以使该模型损失能够更好地表示出该在线模型的预测性能(例如,对象所占区域的预测性能、分类性能等)。
基于上文步骤125的相关内容可知,在一种可能的实施方式下,对于当前轮训练过程来说,在获取到由上文在线模型输出的至少一个第三图像数据对应的对象区域预测结果、以及由上文动量模型输出的至少一个第四图像数据对应的对象区域预测结果之后,可以先利用这些对象区域预测结果,确定出该在线模型的模型损失,以使该模型损失能够表示出该在线模型的预测性能;再判断该模型损失是否达到预设损失条件,若达到预设损失条件,则可以确定该在线模型具有较好的预测性能,故可以确定已达到上文预设停止条件,从而可以继续执行下文步骤127;若未达到预设损失条件,则可以确定该在线模型的预测性能不太好,故可以确定未达到上文预设停止条件,从而可以继续执行下文步骤126。其中,该预设损失条件是预先设定的,例如,其具体可以包括:模型损失低于预先设定的损失阈值。又如,其也可以包括:模型损失的变化率低于预先设定的变化率阈值。
步骤126:在确定未达到预设停止条件时,根据上文至少两个待使用图像数据对应的对象区域预测结果、以及该至少两个待使用图像数据对应的对象区域标签,更新在线模型和动量模型,并继续执行上文步骤121及其后续步骤。
需要说明的是,本公开不限定上文在线模型的更新过程,例如,当上文“至少两个待使用图像数据”包括至少一个第三图像数据和至少一个第四图像数据时,该在线模型的更新过程可以包括下文步骤41-步骤43。
步骤41:根据上文至少一个第三图像数据对应的对象区域预测结果与该至少一个第三图像数据对应的对象区域标签,确定上文在线模型对应的回归损失。
需要说明的是,步骤41的相关内容请参见上文步骤21,为了简要起见,在此不再赘述。
步骤42:根据上文至少一个第三图像数据对应的对象区域预测结果与上文至少一个第四图像数据对应的对象区域预测结果,确定上文在线模型对应的对比损失。
需要说明的是,步骤42的相关内容请参见上文步骤22,为了简要起见,在此不再赘述。
步骤43:根据上文回归损失和上文对比损失,更新上文在线模型。
需要说明的是,本公开不限定步骤43的实施方式,例如,当上文在线模型包括骨干网络和第一处理网络时,该步骤43具体可以为:根据上文回归损失和上文对比损失,更新该在线模型中第一处理网络的网络参数,以实现固定骨干网络的网络参数、以及针对该在线模型中除了该骨干网络以外其他网络的网络参数进行更新的目的。
还需要说明的是,本公开不限定上段中“网络参数”的更新方式,例如,其可以采用现有的或者未来出现的任意一种能够基于模型损失进行网络参数更新处理的方法(例如,梯度更新等)进行实施。
基于上文步骤41至步骤43的相关内容可知,在一种可能的实施方式下,在确定未达到预设停止条件时,可以根据上文至少两个待使用图像数据对应的对象区域预测结果、以及该至少两个待使用图像数据对应的对象区域标签,确定上文在线模型的模型损失;再利用该模型损失,对该在线模型中除了骨干网络以外的其他所有网络的网络参数进行梯度更新,得到更新后的在线模型,以使更新后的在线模型中骨干网络的网络参数与更新前的在线模型中骨干网络的网络参数保持一致,如此能够实现针对该在线模型中除了该骨干网络以外其他网络的网络参数进行更新的目的。
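下面给出“固定骨干网络、仅更新第一处理网络”的一个示意性代码草图(假设在线模型以属性名backbone暴露骨干网络;优化器类型与学习率均为示例):

```python
import torch

def build_head_optimizer(online: torch.nn.Module, lr: float = 0.01):
    """冻结骨干网络参数,仅对在线模型中骨干网络以外的网络参数做梯度更新。"""
    for p in online.backbone.parameters():
        p.requires_grad = False
    head_params = [p for n, p in online.named_parameters() if not n.startswith("backbone")]
    return torch.optim.SGD(head_params, lr=lr)
```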
另外,本公开不限定上文动量模型的更新过程,例如,其具体可以为:利用更新后的在线模型,更新动量模型。可见,在一种可能的实施方式下,对于当前轮训练过程来说,在获取到更新后的在线模型之后,可以将该更新后的在线模型的移动指数平均处理结果(例如,上文公式(1)所示的结果),确定为更新后的动量模型。
实际上,为了更好地提高模型训练效果,本公开还提供了上文步骤“利用更新后的在线模型,更新动量模型”的一种可能的实施方式,其具体可以为:根据更新后的在线模型中第一处理网络的网络参数,更新该动量模型中第一处理网络的网络参数(例如,将更新后的在线模型中第一处理网络的网络参数的移动指数平均处理结果,确定为更新后的动量模型中第一处理网络的网络参数等),如此能够实现针对该动量模型中除了该骨干网络以外其他网络的网络参数进行更新的目的。
基于上段内容以及上文公式(1)可知,在一种可能的实施方式下,在获取到上文更新后的在线模型之后,可以将更新前的动量模型中第一处理网络的网络参数与该更新后的在线模型中第一处理网络的网络参数进行加权求和处理,得到更新后的动量模型中第一处理网络的网络参数。需要说明的是,该加权求和处理中所涉及的权重的相关内容请参见上文公式(1)中所涉及的权重的相关内容,为了简要起见,在此不再赘述。
基于上文步骤126的相关内容可知,对于当前轮训练过程来说,在确定未达到预设停止条件时,可以确定上文在线模型的预测性能仍然需要继续完善,故可以先根据上文至少两个待使用图像数据对应的对象区域预测结果、以及该至少两个待使用图像数据对应的对象区域标签,更新在线模型和动量模型,以得到更新后的在线模型和更新后的动量模型,以使这两个模型具有更好的预测性能;再利用更新后的在线模型和更新后的动量模型,返回继续执行上文步骤121及其后续步骤,以开启下一轮训练过程,如此迭代循环,直至达到预设停止条件。
步骤127:在确定达到预设停止条件时,根据上文在线模型,确定待使用模型。
本公开中,对于当前轮训练过程来说,在确定达到预设停止条件时,可以确定上文在线模型具有较好的预测性能,故可以直接根据该在线模型,确定待使用模型(例如,可以直接将最后一轮训练过程中所使用的在线模型,确定为该待使用模型),以使该待使用模型具有较好的预测性能,如此能够实现针对目标应用领域下图像处理模型的预训练目的。
基于上文步骤121至步骤127的相关内容可知,在一种可能的实施方式下,当上文目标应用领域为目标检测领域时,上文第二数据集可以包括若干多物体图像数据,而且对于任意一个多物体图像数据(例如,图3所示的图像4)来说,可以先通过选择性搜索算法确定出该多物体图像数据的目标框(例如,图3所示的框1和框2);再通过N(例如,图3所示的N=2)次不同的数据增强得到该多物体图像数据的N个增强后图像(例如,图3所示的图像5和图像6),而且该多物体图像数据的目标框的坐标也会随着该数据增强过程发生相应地改变,以得到这些增强后图像的目标框,以便后续将这些目标框被作为这些增强后图像的伪标签;其次,将这些增强后图像中一部分送入在线模型,并将另一部分送入动量模型,得到这些增强后图像的模型预测结果;然后,依据这些增强后图像的模型预测结果以及这些增强后图像的目标框,确定该在线模型的模型损失;随后,利用该模型损失针对该在线模型中除了骨干网络以外的其他网络的网络参数进行梯度更新,并利用更新后的在线模型的移动指数平均处理结果,更新该动量模型,以便后续能够基于更新后的在线模型以及动量模型继续执行下一轮训练过程。
另外,本公开可以根据这些增强后图像的模型预测结果以及这些增强后图像的目标框,确定出上文在线模型在这些增强后图像上所呈现的分类特征和回归特征,故本公开可以基于该分类特征构建自监督的分类任务,而且在该分类任务中通常可以将对应同一目标框的预测结果视为正样本,以及对应不同目标框的预测结果视为负样本,以此构建对比学习。同时,本公开也可以构建回归任务,而且该回归任务的目的是确保针对该增强后图像预测所得的预测框的坐标和该增强后图像的目标框保持一致,以达到回归目的。可见,基于这两个任务,本公开可以实现采用无监督方式预训练一个目标检测模型中除了Backbone以外的其他网络的目的,以使当上文Backbone是采用自监督方式进行预训练时,可以实现以无监督方式对任意一个目标检测模型所有网络进行比较完整的预训练处理的目的。
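将上述各环节串联起来,单轮训练的流程可示意如下;该草图仅用于说明整体顺序,其中dataset2.sample、regression_loss、contrastive_loss_over_regions等均为假设存在的辅助函数,其余函数沿用前文示意实现:

```python
def train_one_round(online, momentum, optimizer, dataset2, n_views: int = 2, num_online: int = 1):
    image, label_boxes = dataset2.sample()                       # 步骤121-122:选图并获取对象区域标签(假设接口)
    views = [augment_with_boxes(image, label_boxes, 224, 224)
             for _ in range(n_views)]                            # 步骤123:N次不同的数据增强
    imgs = [v[0] for v in views]
    third, fourth, on_preds, mo_preds = predict_regions(
        online, momentum, imgs, num_online)                      # 步骤124
    reg = regression_loss(on_preds, views)                       # 步骤21(假设已实现)
    con = contrastive_loss_over_regions(on_preds, mo_preds)      # 步骤22(假设已实现)
    loss = model_loss(reg, con)                                  # 步骤23
    optimizer.zero_grad()
    loss.backward()                                              # 仅更新第一处理网络的网络参数
    optimizer.step()
    ema_update(momentum, online)                                 # 按公式(1)更新动量模型
    return loss
```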
基于上文S101至S103的相关内容可知,对于一些图像处理领域中所使用的机器学习模型(例如,目标检测模型、语义分割模型或者关键点检测模型等)来说,先利用第一数据集(例如,大量单对象图像数据),对待处理模型进行训练,得到第一模型,以使该第一模型中骨干网络具有较好的图像特征提取功能,以实现针对该机器学习模型中骨干网络的预训练过程;再根据该第一模型中骨干网络,构建第二模型,以使该第二模型所实现的图像处理功能与该机器学习模型所需实现的图像处理功能保持一致;然后,利用第二数据集(例如,一些多对象图像数据),对该第二模型进行训练,并保证该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中始终保持不变,以便在将训练好的第二模型确定为待使用模型时,该待使用模型中骨干网络与该第一模型中骨干网络保持一致,而且该待使用模型中第二处理网络是指针对该第二模型中第一处理网络的训练结果,如此能够实现在固定骨干网络的前提下针对该机器学习模型中其它网络进行预训练的目的,以便后续能够借助针对该待使用模型的微调处理得到一个构建好的图像处理模型(例如,目标检测模型),以使该图像处理模型具有较好的图像处理性能,如此实现了针对这些图像处理领域中机器学习模型进行构建处理的目的。
另外,对于本公开所提供的模型构建方法来说,不仅会针对上文图像处理模型(例如,目标检测模型等)中骨干网络进行预训练,还会针对该图像处理模型中除了该骨干网络以外的其他网络(例如,检测头网络)也进行预训练,以使最终预训练后的模型中所有网络均具有比较好的数据处理性能,如此能够有效地避免只针对骨干网络进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
此外,对于本公开所提供的模型构建方法来说,其不仅利用单对象图像数据参与了模型预训练,还利用多对象图像数据参与了该模型预训练,以使最终预训练后的模型针对多对象图像数据具有较好的图像处理功能,如此能够有效地避免只利用单对象图像数据进行模型预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
还有,对于本公开所提供的模型构建方法来说,其不仅聚焦于分类任务,还聚焦于回归任务,以使最终预训练后的模型具有较好的图像处理性能,如此能够有效地避免只聚焦于分类任务进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
实际上,基于上文模型构建方法的相关内容可知,上文S101至S103提供了一种预训练过程,故为了更好地提高图像处理效果,本公开还提供了另一种模型构建方法,为了便于理解,下面结合附图进行说明。如图4所示,本公开实施例提供的模型构建方法的另一种可能的实施方式,在该实施方式中,该模型构建方法除了包括上文S101-S103以外,还可以包括下文S104。其中,该S104的执行时间晚于该S103的执行时间;该图4为本公开提供的另一种模型构建方法的流程图。
S104:利用预设图像数据集,对待使用模型进行微调处理,得到图像处理模型;该图像处理模型包括目标检测模型、语义分割模型或者关键点检测模型。
其中,预设图像数据集是指在针对上文目标应用领域下图像处理模型进行微调时所使用的图像数据集;而且该预设图像数据集中各个图像数据均属于多对象图像数据。
另外,本公开不限定上文预设图像数据集的实施方式,例如,当该目标应用领域为目标检测领域时,该预设图像数据集是指在针对目标检测模型进行微调处理时所使用的图像数据集(例如,多物体图像数据集)。又如,当该目标应用领域为图像分割领域时,该预设图像数据集是指在针对图像分割模型进行微调处理时所使用的图像数据集。还如,当该目标应用领域为关键点检测领域时,该预设图像数据集是指在针对关键点检测模型进行微调处理时所使用的图像数据集。
此外,本公开不限定上文S104的实施方式,例如,其可以采用现有的或者未来出现的适用于针对上文目标应用领域下图像处理模型进行微调处理的任一方法进行实施。
还有,本公开不限定上文S104中“图像处理模型”,例如,当上文目标应用领域为目标检测领域时,该图像处理模型为目标检测模型。又如,当上文目标应用领域为图像分割领域时,该图像处理模型为图像分割模型。还如,当上文目标应用领域为关键点检测领域时,该图像处理模型为关键点检测模型。
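与预训练阶段不同,微调阶段需要对待使用模型中所有网络(含骨干网络)进行训练;一个仅作示意的代码草图如下:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module, lr: float = 0.001):
    """微调时解冻全部网络参数,对待使用模型整体进行训练。"""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.SGD(model.parameters(), lr=lr)
```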
基于上文S101至S104的相关内容可知,对于本公开实施例提供的模型构建方法来说,该模型构建方法可以应用于目标检测领域、图像分割领域或者关键点检测领域等多个图像处理领域;而且该模型构建方法具体可以为:先借助本公开提供的基于两阶段的模型构建方法(例如,由图2-图3所示的两阶段的预训练过程),针对该目标应用领域下图像处理模型中所有网络进行预训练处理,得到预训练后的图像处理模型,以使该预训练后的图像处理模型中所有网络都具有比较好的数据处理性能;再针对该预训练后的图像处理模型进行微调处理,得到微调后的图像处理模型,以使该微调后的图像处理模型在该目标应用领域下具有较好的图像处理性能,从而使得该微调后的图像处理模型能够更好地完成该目标应用领域下图像处理任务(例如,目标检测任务、图像分割任务或者关键点检测任务等),如此有利于提高该目标应用领域下的图像处理效果。
另外,对于本公开提供的模型构建方法来说,该模型构建方法中所涉及的预训练过程以及微调过程均使用了多对象图像数据,以使该预训练过程以及微调过程在图像数据方面达成一致,从而能够有效地避免当预训练过程以及微调过程在图像数据方面存在差异时所造成的不良影响,进而使得基于该模型构建方法所构建的图像处理模型具有较好的图像处理性能。
此外,对于本公开提供的模型构建方法来说,该模型构建方法中所涉及的预训练过程以及微调过程均需要针对图像处理模型中所有网络进行训练处理,以使该预训练过程以及微调过程在训练对象方面达成一致,从而能够有效地避免当预训练过程以及微调过程在训练对象方面存在差异时所造成的不良影响,进而使得基于该模型构建方法所构建的图像处理模型具有较好的图像处理性能。
还有,对于本公开提供的模型构建方法来说,该模型构建方法中所涉及的预训练过程以及微调过程均同时聚焦于分类任务和回归任务,以使该预训练过程以及微调过程在学习任务方面达成一致,从而能够有效地避免当预训练过程以及微调过程在学习任务方面存在差异时所造成的不良影响,进而使得基于该模型构建方法所构建的图像处理模型具有较好的图像处理性能。
再者,本公开不限定上文模型构建方法的执行主体,例如,本公开实施例提供的模型构建方法可以应用于终端设备或服务器等具有数据处理功能的设备。又如,本公开实施例提供的模型构建方法也可以借助终端设备与服务器之间的数据通信过程进行实现。
基于本公开实施例提供的模型构建方法,本公开实施例还提供了一种模型构建装置,下面结合图5进行解释和说明。其中,图5为本公开实施例提供的一种模型构建装置的结构示意图。需要说明的是,本公开实施例提供的模型构建装置的技术详情,请参照上文模型构建方法的相关内容。
如图5所示,本公开实施例提供的模型构建装置500,包括:
第一训练单元501,用于利用第一数据集,对待处理模型进行训练,得到第一模型;所述第一数据集包括至少一个第一图像数据;所述第一模型包括骨干网络;
模型构建单元502,用于根据所述第一模型中骨干网络,构建第二模型;所述第二模型包括所述骨干网络和第一处理网络,所述第一处理网络是指所述第二模型中除了所述骨干网络以外的其他全部或者部分网络;
第二训练单元503,用于利用第二数据集,对所述第二模型进行训练,得到待使用模型;所述待使用模型包括所述骨干网络和第二处理网络,所述第二模型中骨干网络的网络参数在针对所述第二模型的训练过程中保持不变,所述第二处理网络是指针对所述第二模型中第一处理网络的训练结果;所述第二数据集包括至少一个第二图像数据。
在一种可能的实施方式下,所述第一处理网络用于针对所述骨干网络的输出数据进行处理,以得到所述第二模型的输出结果。
在一种可能的实施方式下,所述第一图像数据属于单对象图像数据;
和/或,
所述第二图像数据中存在至少两个对象。
在一种可能的实施方式下,所述模型构建装置500还包括:
初始化单元,用于利用所述第二模型,初始化在线模型和动量模型;
所述第二训练单元503,具体用于:依据所述第二数据集、所述在线模型和所述动量模型,确定所述待使用模型。
在一种可能的实施方式下,所述第二训练单元503,包括:
图像选择子单元,用于从所述至少一个第二图像数据中选择待处理图像数据;
第一获取子单元,用于获取至少两个待使用图像数据和所述至少两个待使用图像数据对应的对象区域标签;所述待使用图像数据是依据所述待处理图像数据所确定的;所述待使用图像数据对应的对象区域标签是依据所述待处理图像数据对应的对象区域标签所确定的;
第一确定子单元,用于利用所述在线模型和所述动量模型,确定所述至少两个待使用图像数据对应的对象区域预测结果;
第一更新子单元,用于根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,并返回所述图像选择子单元继续执行所述从所述至少一个第二图像数据中选择待处理图像数据的步骤;
第二确定子单元,用于在达到预设停止条件时,根据所述在线模型,确定所述待使用模型。
在一种可能的实施方式下,所述至少两个待使用图像数据包括至少一个第三图像数据和至少一个第四图像数据;
所述第三图像数据对应的对象区域预测结果是利用所述在线模型确定的;
所述第四图像数据对应的对象区域预测结果是利用所述动量模型确定的。
在一种可能的实施方式下,所述第一更新子单元,包括:
第三确定子单元,用于根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第三图像数据对应的对象区域标签,确定所述在线模型对应的回归损失;
第四确定子单元,用于根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第四图像数据对应的对象区域预测结果,确定所述在线模型对应的对比损失;
第二更新子单元,用于根据所述回归损失和所述对比损失,更新所述在线模型;
第三更新子单元,用于根据更新后的在线模型,更新所述动量模型。
在一种可能的实施方式下,所述第二更新子单元,具体用于:根据所述回归损失和所述对比损失,更新所述在线模型中第一处理网络的网络参数;
和/或,
所述第三更新子单元,具体用于:根据更新后的在线模型中第一处理网络的网络参数,更新所述动量模型中第一处理网络的网络参数。
在一种可能的实施方式下,所述第三更新子单元,具体用于:将更新前的动量模型中第一处理网络的网络参数与更新后的在线模型中第一处理网络的网络参数进行加权求和处理,得到更新后的动量模型中第一处理网络的网络参数。
在一种可能的实施方式下,所述对象区域标签包括至少一个目标区域表征数据;所述对象区域预测结果包括至少一个预测区域特征;
所述第一更新子单元还包括:
第五确定子单元,用于依据所述第三图像数据对应的至少一个目标区域表征数据与所述第四图像数据对应的至少一个目标区域表征数据之间的对应关系,从所述至少一个第四图像数据对应的至少一个预测区域特征中,确定所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本;
所述第四确定子单元,具体用于:根据所述至少一个第三图像数据对应的至少一个预测区域特征、以及所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本,确定所述在线模型对应的对比损失。
在一种可能的实施方式下,所述对象区域预测结果还包括各所述预测区域特征对应的预测区域表征数据;
所述第三图像数据对应的至少一个预测区域特征包括待使用区域特征;
所述待使用区域特征的正样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间存在对应关系,所述待使用区域特征的负样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间不存在对应关系;
所述正样本对应的目标区域表征数据是根据所述正样本对应的预测区域表征数据与所述正样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的;
所述待使用区域特征对应的目标区域表征数据是根据所述待使用区域特征对应的预测区域表征数据与所述待使用区域特征所属的第三图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的;
所述负样本对应的目标区域表征数据是根据所述负样本对应的预测区域表征数据与所述负样本所属的第四图像数据对应的各个目标区域表征数据之间的重叠区域尺寸确定的。
在一种可能的实施方式下,所述待处理图像数据对应的对象区域标签的获取过程,包括:利用选择性搜索算法,对所述待处理图像数据进行对象区域搜索处理,得到所述待处理图像数据对应的对象区域标签;
或者,
所述待处理图像数据对应的对象区域标签的获取过程,包括:从预先构建的映射关系中查找所述待处理图像数据对应的对象区域标签;所述映射关系包括各所述第二图像数据与各所述第二图像数据对应的对象区域标签之间的对应关系;所述第二图像数据对应的对象区域标签是利用选择性搜索算法针对所述第二图像数据进行对象区域搜索处理所确定的。
在一种可能的实施方式下,所述第二模型的输出结果为目标检测结果、语义分割结果或者关键点检测结果。
在一种可能的实施方式下,所述第一训练单元501,具体用于:利用第一数据集,对待处理模型进行全监督训练,得到第一模型;
或者
利用第一数据集,对待处理模型进行自监督训练,得到第一模型。
在一种可能的实施方式下,如图6所示,所述模型构建装置500,还包括:
微调单元504,用于利用预设图像数据集,对所述待使用模型进行微调处理,得到图像处理模型;所述图像处理模型包括目标检测模型、语义分割模型或者关键点检测模型。
基于上述模型构建装置500的相关内容可知,对于该模型构建装置500来说,先利用第一数据集(例如,大量单对象图像数据),对待处理模型进行训练,得到第一模型,以使该第一模型中骨干网络具有较好的图像特征提取功能,以实现针对某个图像处理领域下机器学习模型中骨干网络的预训练过程;再根据该第一模型中骨干网络,构建第二模型,以使该第二模型所实现的图像处理功能与该机器学习模型所需实现的图像处理功能保持一致;然后,利用第二数据集(例如,一些多对象图像数据),对该第二模型进行训练,并保证该第二模型中骨干网络的网络参数在针对该第二模型的训练过程中始终保持不变,以便在将训练好的第二模型确定为待使用模型时,该待使用模型中骨干网络与该第一模型中骨干网络保持一致,而且该待使用模型中第二处理网络是指针对该第二模型中第一处理网络的训练结果,如此能够实现在固定骨干网络的前提下针对该机器学习模型中其它网络进行预训练的目的,以便后续能够借助针对该待使用模型的微调处理得到一个构建好的图像处理模型(例如,目标检测模型),以使该图像处理模型具有较好的图像处理性能,如此实现了针对这些图像处理领域中机器学习模型进行构建处理的目的。
另外,对于本公开所提供的模型构建方法来说,不仅会针对上文图像处理模型(例如,目标检测模型等)中骨干网络进行预训练,还会针对该图像处理模型中除了该骨干网络以外的其他网络(例如,检测头网络)也进行预训练,以使最终预训练后的模型中所有网络均具有比较好的数据处理性能,如此能够有效地避免只针对骨干网络进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
此外,对于本公开所提供的模型构建方法来说,其不仅利用单对象图像数据参与了模型预训练,还利用多对象图像数据参与了该模型预训练,以使最终预训练后的模型针对多对象图像数据具有较好的图像处理功能,如此能够有效地避免只利用单对象图像数据进行模型预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
还有,对于本公开所提供的模型构建方法来说,其不仅聚焦于分类任务,还聚焦于回归任务,以使最终预训练后的模型具有较好的图像处理性能,如此能够有效地避免只聚焦于分类任务进行预训练处理所导致的不良影响,从而能够有效地提高最终构建好的图像处理模型的图像处理效果(例如,目标检测效果)。
再者,本公开实施例还提供了一种电子设备,所述设备包括处理器以及存储器:所述存储器,用于存储指令或计算机程序;所述处理器,用于执行所述存储器中的所述指令或计算机程序,以使得所述电子设备执行本公开实施例提供的模型构建方法的任一实施方式。
参见图7,其示出了适于用来实现本公开实施例的电子设备700的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图7示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图7所示,电子设备700可以包括处理装置(例如中央处理器、图形处理器等)701,其可以根据存储在只读存储器(ROM)702中的程序或者从存储装置708加载到随机访问存储器(RAM)703中的程序而执行各种适当的动作和处理。在RAM 703中,还存储有电子设备700操作所需的各种程序和数据。处理装置701、ROM 702以及RAM 703通过总线704彼此相连。输入/输出(I/O)接口705也连接至总线704。
通常,以下装置可以连接至I/O接口705:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置706;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置707;包括例如磁带、硬盘等的存储装置708;以及通信装置709。通信装置709可以允许电子设备700与其他设备进行无线或有线通信以交换数据。虽然图7示出了具有各种装置的电子设备700,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置709从网络上被下载和安装,或者从存储装置708被安装,或者从ROM 702被安装。在该计算机程序被处理装置701执行时,执行本公开实施例的方法中限定的上述功能。
本公开实施例提供的电子设备与上述实施例提供的方法属于同一发明构思,未在本实施例中详尽描述的技术细节可参见上述实施例,并且本实施例与上述实施例具有相同的有益效果。
本公开实施例还提供了一种计算机可读介质,所述计算机可读介质中存储有指令或计算机程序,当所述指令或计算机程序在设备上运行时,使得所述设备执行本公开实施例提供的模型构建方法的任一实施方式。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(Hyper Text Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备可以执行上述方法。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元/模块的名称在某种情况下并不构成对该单元本身的限定。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
需要说明的是,本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的系统或装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
应当理解,在本公开中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本公开。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本公开的精神或范围的情况下,在其它实施例中实现。因此,本公开将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。
Claims (18)
- 一种模型构建方法,其中,所述方法包括:利用第一数据集,对待处理模型进行训练,得到第一模型;所述第一数据集包括至少一个第一图像数据;所述第一模型包括骨干网络;根据所述第一模型中骨干网络,构建第二模型;所述第二模型包括所述骨干网络和第一处理网络,所述第一处理网络是指所述第二模型中除了所述骨干网络以外的其他全部或者部分网络;利用第二数据集,对所述第二模型进行训练,得到待使用模型;所述待使用模型包括所述骨干网络和第二处理网络,所述第二模型中骨干网络的网络参数在针对所述第二模型的训练过程中保持不变,所述第二处理网络是指针对所述第二模型中第一处理网络的训练结果;所述第二数据集包括至少一个第二图像数据。
- 根据权利要求1所述的方法,其中,所述第一处理网络用于针对所述骨干网络的输出数据进行处理,以得到所述第二模型的输出结果。
- 根据权利要求1所述的方法,其中,所述第一图像数据属于单对象图像数据;和/或,所述第二图像数据中存在至少两个对象。
- 根据权利要求1所述的方法,其中,所述方法还包括:利用所述第二模型,初始化在线模型和动量模型;所述利用第二数据集,对所述第二模型进行训练,得到待使用模型,包括:依据所述第二数据集、所述在线模型和所述动量模型,确定所述待使用模型。
- 根据权利要求4所述的方法,其中,所述待使用模型的确定过程,包括:从所述至少一个第二图像数据中选择待处理图像数据;获取至少两个待使用图像数据和所述至少两个待使用图像数据对应的对象区域标签;所述待使用图像数据是依据所述待处理图像数据所确定的;所述待使用图像数据对应的对象区域标签是依据所述待处理图像数据对应的对象区域标签所确定的;利用所述在线模型和所述动量模型,确定所述至少两个待使用图像数据对应的对象区域预测结果;根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,并继续执行所述从所述至少一个第二图像数据中选择待处理图像数据的步骤,直至在达到预设停止条件时,根据所述在线模型,确定所述待使用模型。
- 根据权利要求5所述的方法,其中,所述至少两个待使用图像数据包括至少一个第三图像数据和至少一个第四图像数据;所述第三图像数据对应的对象区域预测结果是利用所述在线模型确定的;所述第四图像数据对应的对象区域预测结果是利用所述动量模型确定的。
- 根据权利要求6所述的方法,其中,所述根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,包括:根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第三图像数据对应的对象区域标签,确定所述在线模型对应的回归损失;根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第四图像数据对应的对象区域预测结果,确定所述在线模型对应的对比损失;根据所述回归损失和所述对比损失,更新所述在线模型;根据更新后的在线模型,更新所述动量模型。
- 根据权利要求5所述的方法,其中,所述根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,更新所述在线模型和所述动量模型,包括:根据所述至少两个待使用图像数据对应的对象区域预测结果、以及所述至少两个待使用图像数据对应的对象区域标签,确定所述在线模型的模型损失;根据所述模型损失,更新所述在线模型中第一处理网络的网络参数;根据更新后的在线模型中第一处理网络的网络参数,更新所述动量模型中第一处理网络的网络参数。
- 根据权利要求8所述的方法,其中,所述根据更新后的在线模型中第一处理网络的网络参数,更新所述动量模型中第一处理网络的网络参数,包括:将更新前的动量模型中第一处理网络的网络参数与更新后的在线模型中第一处理网络的网络参数进行加权求和处理,得到更新后的动量模型中第一处理网络的网络参数。
- 根据权利要求7所述的方法,其中,所述对象区域标签包括至少一个目标区域表征数据;所述对象区域预测结果包括至少一个预测区域特征;所述方法还包括:依据所述第三图像数据对应的至少一个目标区域表征数据与所述第四图像数据对应的至少一个目标区域表征数据之间的对应关系,从所述至少一个第四图像数据对应的至少一个预测区域特征中,确定所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本;所述根据所述至少一个第三图像数据对应的对象区域预测结果与所述至少一个第四图像数据对应的对象区域预测结果,确定所述在线模型对应的对比损失,包括:根据所述至少一个第三图像数据对应的至少一个预测区域特征、以及所述至少一个第三图像数据对应的各预测区域特征的正样本以及负样本,确定所述在线模型对应的对比损失。
- 根据权利要求10所述的方法,其中,所述第三图像数据对应的至少一个预测区域特征包括待使用区域特征;所述待使用区域特征的正样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间存在对应关系;所述待使用区域特征的负样本对应的目标区域表征数据与所述待使用区域特征对应的目标区域表征数据之间不存在对应关系。
- 根据权利要求5所述的方法,其中,所述待处理图像数据对应的对象区域标签的获取过程,包括:利用选择性搜索算法,对所述待处理图像数据进行对象区域搜索处理,得到所述待处理图像数据对应的对象区域标签;或者,所述待处理图像数据对应的对象区域标签的获取过程,包括:从预先构建的映射关系中查找所述待处理图像数据对应的对象区域标签;所述映射关系包括各所述第二图像数据与各所述第二图像数据对应的对象区域标签之间的对应关系;所述第二图像数据对应的对象区域标签是利用选择性搜索算法针对所述第二图像数据进行对象区域搜索处理所确定的。
- 根据权利要求2所述的方法,其中,所述第二模型的输出结果为目标检测结果、语义分割结果或者关键点检测结果。
- 根据权利要求1所述的方法,其中,所述利用第一数据集,对待处理模型进行训练,得到第一模型,包括:利用第一数据集,对待处理模型进行全监督训练,得到第一模型;或者利用第一数据集,对待处理模型进行自监督训练,得到第一模型。
- 根据权利要求1-14任一项所述的方法,其中,所述方法还包括:利用预设图像数据集,对所述待使用模型进行微调处理,得到图像处理模型;所述图像处理模型包括目标检测模型、语义分割模型或者关键点检测模型。
- 一种模型构建装置,其中,所述装置包括:第一训练单元,用于利用第一数据集,对待处理模型进行训练,得到第一模型;所述第一数据集包括至少一个第一图像数据;所述第一模型包括骨干网络;模型构建单元,用于根据所述第一模型中骨干网络,构建第二模型;所述第二模型包括所述骨干网络和第一处理网络,所述第一处理网络是指所述第二模型中除了所述骨干网络以外的其他全部或者部分网络;第二训练单元,用于利用第二数据集,对所述第二模型进行训练,得到待使用模型;所述待使用模型包括所述骨干网络和第二处理网络,所述第二模型中骨干网络的网络参数在针对所述第二模型的训练过程中保持不变,所述第二处理网络是指针对所述第二模型中第一处理网络的训练结果;所述第二数据集包括至少一个第二图像数据。
- 一种电子设备,其中,所述设备包括:处理器和存储器;所述存储器,用于存储指令或计算机程序;所述处理器,用于执行所述存储器中的所述指令或计算机程序,以使得所述电子设备执行权利要求1-15任一项所述的方法。
- 一种计算机可读介质,其中,所述计算机可读介质中存储有指令或计算机程序,当所述指令或计算机程序在设备上运行时,使得所述设备执行权利要求1-15任一项所述的方法。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211634668.2 | 2022-12-19 | ||
CN202211634668.2A CN118230015A (zh) | 2022-12-19 | 2022-12-19 | 一种模型构建方法、装置、电子设备、计算机可读介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024131408A1 true WO2024131408A1 (zh) | 2024-06-27 |
Family
ID=91508971
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/132631 WO2024131408A1 (zh) | 2022-12-19 | 2023-11-20 | 一种模型构建方法、装置、电子设备、计算机可读介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118230015A (zh) |
WO (1) | WO2024131408A1 (zh) |
-
2022
- 2022-12-19 CN CN202211634668.2A patent/CN118230015A/zh active Pending
-
2023
- 2023-11-20 WO PCT/CN2023/132631 patent/WO2024131408A1/zh unknown
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20210068707A (ko) * | 2019-12-02 | 2021-06-10 | 주식회사 수아랩 | 신경망을 학습시키는 방법 |
CN112561076A (zh) * | 2020-12-10 | 2021-03-26 | 支付宝(杭州)信息技术有限公司 | 模型处理方法和装置 |
CN113780461A (zh) * | 2021-09-23 | 2021-12-10 | 中国人民解放军国防科技大学 | 基于特征匹配的鲁棒神经网络训练方法 |
CN113962951A (zh) * | 2021-10-15 | 2022-01-21 | 杭州研极微电子有限公司 | 检测分割模型的训练方法及装置、目标检测方法及装置 |
CN114549904A (zh) * | 2022-02-25 | 2022-05-27 | 北京百度网讯科技有限公司 | 视觉处理及模型训练方法、设备、存储介质及程序产品 |
Also Published As
Publication number | Publication date |
---|---|
CN118230015A (zh) | 2024-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7331171B2 (ja) | 画像認識モデルをトレーニングするための方法および装置、画像を認識するための方法および装置、電子機器、記憶媒体、並びにコンピュータプログラム | |
KR20200109230A (ko) | 뉴럴 네트워크 생성 방법 및 장치 | |
US20240127795A1 (en) | Model training method, speech recognition method, device, medium, and apparatus | |
CN110826567B (zh) | 光学字符识别方法、装置、设备及存储介质 | |
WO2022247562A1 (zh) | 多模态数据检索方法、装置、介质及电子设备 | |
WO2023143016A1 (zh) | 特征提取模型的生成方法、图像特征提取方法和装置 | |
CN113140012B (zh) | 图像处理方法、装置、介质及电子设备 | |
CN112364829B (zh) | 一种人脸识别方法、装置、设备及存储介质 | |
CN112883968A (zh) | 图像字符识别方法、装置、介质及电子设备 | |
CN113033682B (zh) | 视频分类方法、装置、可读介质、电子设备 | |
US20240233358A9 (en) | Image classification method, model training method, device, storage medium, and computer program | |
WO2023202543A1 (zh) | 文字处理方法、装置、电子设备及存储介质 | |
CN115578570A (zh) | 图像处理方法、装置、可读介质及电子设备 | |
WO2024199349A1 (zh) | 对象推荐方法、装置、介质及电子设备 | |
CN111444335B (zh) | 中心词的提取方法及装置 | |
WO2023174075A1 (zh) | 内容检测模型的训练方法、内容检测方法及装置 | |
CN111275089B (zh) | 一种分类模型训练方法及装置、存储介质 | |
WO2024060587A1 (zh) | 自监督学习模型的生成方法和转化率预估模型的生成方法 | |
CN117150122A (zh) | 终端推荐模型的联邦训练方法、装置和存储介质 | |
WO2024131408A1 (zh) | 一种模型构建方法、装置、电子设备、计算机可读介质 | |
CN116363431A (zh) | 物品分类方法、装置、电子设备和计算机可读介质 | |
CN110414527A (zh) | 字符识别方法、装置、存储介质及电子设备 | |
CN116244431A (zh) | 文本分类方法、装置、介质及电子设备 | |
CN113051400B (zh) | 标注数据确定方法、装置、可读介质及电子设备 | |
CN111898658B (zh) | 图像分类方法、装置和电子设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23905577 Country of ref document: EP Kind code of ref document: A1 |