WO2024055530A1 - Image target detection method, system, device and storage medium - Google Patents

Image target detection method, system, device and storage medium

Info

Publication number
WO2024055530A1
WO2024055530A1, PCT/CN2023/078490, CN2023078490W
Authority
WO
WIPO (PCT)
Prior art keywords
image
target detection
training
model
images
Prior art date
Application number
PCT/CN2023/078490
Other languages
English (en)
French (fr)
Inventor
赵冰
李军
朱红
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024055530A1

Classifications

    • G06V 10/774: Image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/778: Image or video recognition or understanding using pattern recognition or machine learning; active pattern-learning, e.g. online learning of image or video features
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Definitions

  • The present application relates to the field of machine learning technology, and in particular to an image target detection method, system, device and storage medium.
  • Self-supervised learning is a direction that attracts a lot of attention. Unlike traditional supervised learning methods, which require manual labeling of data, self-supervised learning aims to automatically generate labels for unlabeled data by designing agent tasks, thereby learning from the data.
  • Agent tasks in self-supervised learning fall into two main categories: image transformation and contrastive learning.
  • Image-transformation agent tasks include image denoising, inpainting, color conversion, etc.; supervisory information is constructed from these image-transformation-related tasks to guide model learning.
  • Contrastive agent tasks apply data augmentation strategies such as cropping and color adjustment to samples. The two augmented samples generated from the same image are regarded as positive samples, while augmented samples generated from different images are regarded as negative samples.
  • Features are extracted from the augmented samples through an autoencoder, and the feature vectors are further reduced in dimension to obtain low-dimensional vectors.
  • A loss function is used to pull the similarity between positive samples closer and push the similarity between negative samples farther apart.
  • Figure 1 is a schematic diagram of the principle of contrastive learning.
  • The core of contrastive learning is to learn better image representations by learning the similarities between different images.
  • When the model can learn the similarity difference between positive and negative samples, it means that the model has extracted good features.
  • The best-performing methods in the field of self-supervision are currently all based on contrastive tasks.
  • Upstream pre-training plus downstream fine-tuning is a classic paradigm of machine learning.
  • In supervised learning, this paradigm refers to classification pre-training with labeled images on a large-scale image classification data set; in downstream tasks such as image target detection and semantic segmentation, the trained model's parameters are frozen and a small amount of labeled data is used for fine-tuning on the downstream task.
  • Self-supervised learning also follows this paradigm, with the difference that self-supervised learning does not rely on data labels in upstream pre-training.
  • One current contrastive self-supervised learning method completes contrastive pre-training on an image classification data set and uses the trained CNN (Convolutional Neural Network) weights for a target detection task in a downstream transmission line scenario: the weights serve as the feature extraction network, and a Cascade R-CNN target detection network is trained separately on top of them.
  • This method represents the current mainstream way of applying contrastive learning to downstream tasks such as target detection; its upstream pre-training and downstream fine-tuning are completely separated. During upstream pre-training, the agent task is to distinguish image similarity, which is strongly related to image classification but only weakly related to downstream target detection. Only the feature extraction network is trained during pre-training; the remaining components of the target detection network still have to be trained from scratch on the detection task. As a result, such methods have low performance on target detection tasks and insufficient detection accuracy.
  • The purpose of this application is to provide an image target detection method, system, device and storage medium to effectively perform image target detection and improve detection accuracy.
  • An image target detection method, including:
  • determining a pre-training data set, and taking the images in the pre-training data set in turn as pre-training images;
  • after selecting any one pre-training image, determining a search box from the pre-training image;
  • cropping the image in the search box, pasting it onto n different background images according to preset rules, and moving the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from a target detection data set;
  • inputting each border-shifted image into a contrastive learning model, and training the contrastive learning model through contrastive learning;
  • inputting the images in the target detection data set in turn as training images into a target detection model for training, to obtain a trained target detection model;
  • inputting an image to be tested into the trained target detection model, to obtain the target detection result output by the target detection model for the image to be tested;
  • wherein the contrastive learning model is provided with a feature image representation algorithm used to represent the target at the feature level, which is the same algorithm as the feature image representation algorithm used in the target detection model; and the contrastive learning model is provided with a feature vector representation algorithm used to represent the target at the vector level, which is the same algorithm as the feature vector representation algorithm used in the target detection model.
  • In some embodiments, determining the search box from the pre-training image includes: automatically generating multiple rectangular boxes on the pre-training image, and randomly selecting one of them as the determined search box.
  • In some embodiments, the multiple rectangular boxes are automatically generated on the pre-training image through a random search algorithm.
  • In some embodiments, after the rectangular boxes are generated, each rectangular box whose aspect ratio exceeds a preset range is filtered out, and the search box is randomly selected from the rectangular boxes remaining after filtering.
  • In some embodiments, cropping and pasting includes: cropping the image in the search box, randomly adjusting the cropped image n times to obtain n adjusted images, and pasting the n adjusted images onto n different background images respectively.
  • In some embodiments, in any one of the n random adjustments, the image size is adjusted by changing the length and/or the width.
  • In some embodiments, after pasting, the border of the pasted image is moved by perturbing the border position, such that the area intersection-over-union (IoU) between the border after the move and the border before the move is greater than a preset IoU threshold.
  • In some embodiments, the feature image representation algorithm used by both the target detection model and the contrastive learning model is the ROI Align algorithm, the contrastive learning model using ROI Align to represent the target in the input image at the feature level; and the feature vector representation algorithm used by both is the R-CNN head algorithm, the contrastive learning model using the R-CNN head to represent the target in the input image at the vector level.
  • In some embodiments, the target detection model and the contrastive learning model both adopt convolutional neural networks with the same structure.
  • In some embodiments, both adopt a convolutional neural network with multi-layer outputs, and the contrastive loss function of the contrastive learning model is calculated from the multi-layer outputs of the convolutional neural network.
  • In some embodiments, both adopt a convolutional neural network with an FPN structure.
  • In some embodiments, after training the contrastive learning model through contrastive learning, the method further includes: inputting the images in the target detection data set in turn as training images into a semantic segmentation model for training, to obtain a trained semantic segmentation model; and inputting the image to be tested into the trained semantic segmentation model, to obtain the semantic segmentation result output by the semantic segmentation model for the image to be tested.
  • An image target detection system, including:
  • a pre-training data set determination module, used to determine the pre-training data set and take the images in the pre-training data set in turn as pre-training images;
  • a search box selection module, used to determine the search box from the pre-training image after any one pre-training image is selected;
  • a cut-and-paste perturbation module, used to crop the image in the search box, paste it onto n different background images according to preset rules, and move the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from the target detection data set;
  • a contrastive learning model training module, used to input each border-shifted image into the contrastive learning model and train the contrastive learning model through contrastive learning;
  • a target detection model training module, used to input the images in the target detection data set in turn as training images into the target detection model for training, to obtain the trained target detection model;
  • a target detection result determination module, used to input the image to be tested into the trained target detection model, to obtain the target detection result output by the target detection model for the image to be tested;
  • wherein the contrastive learning model is provided with a feature image representation algorithm used to represent the target at the feature level, which is the same algorithm as the feature image representation algorithm used in the target detection model; and the contrastive learning model is provided with a feature vector representation algorithm used to represent the target at the vector level, which is the same algorithm as the feature vector representation algorithm used in the target detection model.
  • An image target detection device, including: a memory used to store a computer program; and a processor configured to execute the computer program to implement the steps of the above image target detection method.
  • A non-volatile computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above image target detection method are implemented.
  • In the solution of this application, the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; likewise, the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model. That is to say, the feature image representation algorithm and the feature vector representation algorithm set in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
  • Furthermore, this application considers that the position modeling ability required by the target detection model can be improved in the pre-training stage, starting from background invariance.
  • Background invariance means that the model can identify the target relatively accurately on different background images.
  • When the model has background invariance, it means the model has learned the concept of a "target" and has the ability to locate the target.
  • In the solution of this application, after any pre-training image is selected, a search box is determined from it, the image in the search box is cropped and pasted onto n different background images according to preset rules, and the border of the pasted image is moved after pasting. Every background image comes from the target detection data set, so the moved border can include both the target cropped from the pre-training image and part of a background image from the target detection data set.
  • After training the contrastive learning model on this basis, the target detection model, which reuses the feature image representation algorithm and the feature vector representation algorithm of the contrastive learning model, can learn the ability to model the position of targets on different backgrounds. This helps the target detection model identify targets more accurately and improves its background invariance.
  • In summary, the solution of the present application can effectively perform target detection in images and improves the detection performance, that is, the detection accuracy, of the target detection model.
  • Figure 1 is a schematic diagram of the principle of contrastive learning
  • Figure 2 is an implementation flow chart of an image target detection method in this application.
  • Figure 3 is a schematic block diagram of an image target detection method in some embodiments of the present application.
  • Figure 4 is a schematic structural diagram of an image target detection system in this application.
  • Figure 5 is a schematic structural diagram of an image target detection device in this application.
  • The core of this application is to provide an image target detection method that can effectively perform image target detection and improves the detection performance, that is, the detection accuracy, of the target detection model.
  • Figure 2 is an implementation flow chart of an image target detection method in this application.
  • The image target detection method may include the following steps:
  • Step S201: Determine a pre-training data set, and take the images in the pre-training data set in turn as pre-training images.
  • In some embodiments, the pre-training data set may include a large number of images used for pre-training; in practical applications it may usually contain tens of millions of images or even more. Since the solution of this application performs pre-training based on contrastive self-supervised learning, there is no need to set labels for these images. Moreover, since the pre-training data set contains many images, training can usually be performed in batches; for example, in one case every 50 images form one training batch.
  • Step S202: After selecting any one pre-training image, determine the search box from the pre-training image.
  • The images in the pre-training data set can be taken in turn as pre-training images; after any one pre-training image is selected, the search box can be determined from it.
  • Figure 3 is a schematic block diagram of the image target detection method in some embodiments; the giant panda image in Figure 3 is a pre-training image selected from the pre-training data set, used to train the contrastive learning model.
  • There are several ways to determine the search box from a pre-training image. Considering that an image in the pre-training data set usually contains a single target, which may be located at any position in the image, the search box can be determined from the pre-training image by random selection. The shape of the search box is usually set to a rectangle, so that the extent of the search box in the image can be determined by the coordinates of two points.
  • That is, determining the search box from the pre-training image described in step S202 may specifically include: automatically generating multiple rectangular boxes on the pre-training image, and randomly selecting one of them as the determined search box.
  • The specific way of generating the rectangular boxes can be set and chosen according to actual needs. For example, rectangular boxes can be automatically generated at multiple specified positions. Alternatively, considering that the target may be located anywhere in the image and that one box is randomly selected afterwards in any case, the multiple rectangular boxes can be automatically generated on the pre-training image through a random search algorithm, which is simple and convenient.
  • After the search box is determined, it may or may not contain the target.
  • Further, after automatically generating multiple rectangular boxes on the pre-training image, the method may also include: filtering out each rectangular box whose aspect ratio exceeds a preset range. For example, in one case, a rectangular box is filtered out when its aspect ratio is >3 or <1/3. Correspondingly, randomly selecting one rectangular box as the determined search box then means randomly selecting one of the rectangular boxes remaining after filtering.
  • Step S203: Crop the image in the search box, paste it onto n different background images according to preset rules, and move the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from the target detection data set.
  • After the search box is determined from the pre-training image, the image in the search box can be cropped and pasted onto n different background images according to the preset rules. A simple way, for example, is to paste the cropped image directly onto the n different background images.
  • In some embodiments, in order to improve the recognition ability of the model, i.e. the training effect, the cropped search-box image can be adjusted before being pasted onto the n different background images. That is, the cropping and pasting described in step S203 may specifically include:
  • Step 1: Crop the image in the search box, and randomly adjust the cropped image n times to obtain n adjusted images;
  • Step 2: Paste the n adjusted images onto the n different background images respectively.
  • There can be multiple adjustment methods, such as image rotation, resolution adjustment, length adjustment, width adjustment, etc.
  • In some embodiments, considering that adjusting the length and/or width is simple and does not easily destroy the image information in the search box, each of the n random adjustments adjusts the image size by changing the length and/or the width; of course, this also changes the resolution of the image.
  • For example, in some embodiments the adjustment strategy can be expressed as: (w, h) = λ3*(λ1*w1, λ2*h1), where w is the length in the new resolution, h is the width in the new resolution, w1 is the length in the original resolution, h1 is the width in the original resolution, λ1 and λ2 are variation coefficients set separately for the length and the width, and λ3 is an overall variation coefficient.
  • Since the cropped image has to be randomly adjusted n times, the λ1, λ2 and λ3 used in each of the n adjustments can be selected at random, within preset allowable value ranges for λ1, λ2 and λ3.
  • In the example of Figure 3, n = 2, and of the two background images from the target detection data set one is a street scene and the other is a stadium. When the cropped search-box image is pasted into the street scene, both its length and width have been reduced; when it is pasted into the stadium image, its length has been increased while its width has been reduced.
  • After the cropped search-box image has been pasted onto the n different background images according to the preset rules, the border of the pasted image needs to be moved.
  • If the cropped search-box image is pasted directly, the border of the pasted image is the same size as the search box; if, as in the above embodiments, the length and/or width have been adjusted, the sizes of the pasted image's border and the search box differ.
  • The way the border is moved can be selected as needed; for example, it can be moved randomly. Every background image comes from the target detection data set, i.e. this application introduces the target detection data set as the background during pre-training, so that during contrastive learning the relevant components of the target detection model can learn the ability to model the position of targets on different backgrounds, i.e. background invariance. If the border of the pasted image were not moved, the training effect would be poor. In some embodiments, the moved border includes part of the original pasted image and also covers part of the background image information.
  • In some embodiments, moving the border of the pasted image described in step S203 may specifically include: after pasting, moving the border of the pasted image by perturbing the border position, such that the area intersection-over-union between the border after the move and the border before the move is greater than a preset IoU threshold, for example a threshold of 0.6.
  • IoU (Intersection over Union), also called the area intersection-over-union ratio, reflects the overlap of the areas of two rectangular boxes, i.e. the ratio of their intersection to their union. When the boxes overlap completely, the IoU reaches its maximum value of 1.
  • Step S204: Input each border-shifted image into the contrastive learning model, and train the contrastive learning model through contrastive learning.
  • This application considers that in the pre-training stage of contrastive self-supervised learning, more alignment with the downstream target detection task can be achieved, thereby improving downstream target detection performance. That is to say, more components of the target detection model can be introduced in the pre-training stage, so that after pre-training is completed these components can be reused in the fine-tuning of the target detection model and provide more suitable initial weights for it, which helps improve the fine-tuning performance of the target detection model.
  • Accordingly, the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used in the target detection model; at the same time, it is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used in the target detection model.
  • That is to say, the feature image representation algorithm and feature vector representation algorithm set in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
  • Contrastive learning models usually adopt the structure of a query network and a key network; in Figure 3, n = 2, so there is one key network. When n is set to a larger value, the number of key networks increases accordingly.
  • There are multiple specific types of convolutional neural networks that the contrastive learning model can use; for example, in the embodiment of Figure 3, a convolutional neural network with an FPN structure can be used.
  • The feature image representation algorithm is used to represent the target at the feature level, and the feature vector representation algorithm is used to represent the target at the vector level; their specific types can be selected as needed. For example, considering that ROI Align and the R-CNN head are common components of target detection models, in some embodiments the feature image representation algorithm used by both the target detection model and the contrastive learning model is the ROI Align algorithm, the contrastive learning model using ROI Align to represent the target in the input image at the feature level; and the feature vector representation algorithm used by both is the R-CNN head algorithm, the contrastive learning model using the R-CNN head to represent the target in the input image at the vector level.
  • Representing the target in the input image at the feature level can be expressed as: v_q = RoIAlign(f_q(I_q), bb_q); v_ki = RoIAlign(f_k(I_ki), bb_ki).
  • The functions f_q and f_k refer to the query network and the key network respectively. The query network and key network are the two learning branches of contrastive self-supervised learning; their model structures are exactly the same, but their specific parameters differ, and in general they can be encoder structures.
  • I_q denotes the border image input to the query network; the border of this border image is the border obtained after moving the border of the pasted image in step S203. bb_q denotes the position of the border image in the background image (in the example of Figure 3, its position in the street-scene image), which can be represented, for example, by the upper-left and lower-right coordinate points.
  • Correspondingly, I_ki denotes the border image input to a key network, where i indicates the i-th of the n-1 key networks; in the example of Figure 3, n = 2, so there is only one key network. bb_ki denotes the position of that border image in its background image (in the example of Figure 3, its position in the stadium image).
  • The function of ROI Align is to map the position of the target in the original image to the corresponding positions in different feature maps. In the formulas above, v_q is the output of the ROI Align corresponding to the query network, and v_ki is the output of the ROI Align corresponding to the i-th of the n-1 key networks; the output of ROI Align reflects, at the two-dimensional level, the information of the border images in the different feature maps.
  • Representing the target in the input image at the vector level can be expressed as: e_q = f_R-H(v_q); e_ki = f_R-H(v_ki), where f_R-H refers to the R-CNN head algorithm. The function of the R-CNN head algorithm is to let the model, after analysis, output bounding boxes that may contain the target. Here e_q is the output of the R-CNN head corresponding to the query network, and e_ki is the output of the R-CNN head corresponding to the i-th of the n-1 key networks; the output of the R-CNN head reflects, at the vector level, the feature information of the border images.
  • In some embodiments, the target detection model and the contrastive learning model both adopt convolutional neural networks with the same structure. This takes into account that target detection models usually use a convolutional neural network; to further increase the reuse rate of components, the contrastive learning model uses a convolutional neural network whose structure is the same as that of the target detection model, which further helps improve the performance of the trained target detection model.
  • In some embodiments, both the target detection model and the contrastive learning model adopt a convolutional neural network with multi-layer outputs, and the contrastive loss function of the contrastive learning model is calculated based on the multi-layer outputs of the convolutional neural network.
  • This takes into account that traditional contrastive learning usually uses only the outputs of the query network and key network to calculate the contrastive loss, while the intermediate layers of a convolutional neural network also carry information, and target detection models can usually also adopt a convolutional neural network with multi-layer outputs. Setting the contrastive learning model to use such a network allows it to perform hierarchical contrastive learning and improves the learning effect; of course, to keep the reuse rate high, the target detection model also needs to adopt this convolutional neural network.
  • In Figure 3, the contrastive learning model uses a convolutional neural network with an FPN structure; for example, the levels P2, P3, P4 and P5 of its multi-layer outputs can be selected for calculating the contrastive loss function.
  • The contrastive loss for a single level among P2, P3, P4 and P5 can be expressed as: L_q-ki = -log( exp(e_q·e_ki/τ) / Σ_{j=1..N} exp(e_q·v_ej/τ) ), where L_q-ki denotes the single-level contrastive loss, N denotes the number of images in a single training batch (for example, 50 in the example above), the values of e_q and e_ki output by the R-CNN head differ for the different levels P2, P3, P4 and P5, v_ej denotes the vector representation of the j-th sample in the batch (the two augmented samples of one image are positive samples of each other, so the positive term v_ei appears in the sum), and τ is a hyperparameter.
  • Step S205: Input the images in the target detection data set in turn as training images into the target detection model for training, to obtain the trained target detection model.
  • Using the images of the pre-training data set, pre-training can be carried out, i.e. the contrastive learning model is trained through contrastive learning; once the contrastive learning model has been trained, training of the target detection model can begin.
  • As described above, in order for the target detection model to perform well, the target detection model should reuse components with the contrastive learning model, and the reuse rate should be as high as possible. For example, in the above embodiments the contrastive learning model is set up with a convolutional neural network with an FPN structure and uses ROI Align and an R-CNN head; the target detection model selected in this application can then also use a convolutional neural network with an FPN structure, with ROI Align and an R-CNN head as components of the target detection model.
  • The images in the target detection data set are input in turn as training images into the target detection model for training; when the recognition rate of the target detection model meets the requirements, training is complete and the trained target detection model is obtained.
  • The target detection model of this application can perform image recognition, and there can be a variety of specific recognition objects. For example, in one scenario, the target detection model of this application is applied to a highway scene, recognizing and detecting targets such as vehicles, obstacles, road signs and people in the collected pictures.
  • Step S206: Input the image to be tested into the trained target detection model, and obtain the target detection result output by the target detection model for the image to be tested.
  • After the trained target detection model is obtained, the image to be tested can be input into it, thereby obtaining the target detection result output by the target detection model for the image to be tested. For example, the target detection model determines the position of each "person" in the image to be tested and labels it as a person, and determines the position of each "car" in the image to be tested and labels it as a car.
  • Further, in some embodiments, after step S204 the method may also include:
  • inputting the images in the target detection data set in turn as training images into the semantic segmentation model for training, to obtain the trained semantic segmentation model;
  • inputting the image to be tested into the trained semantic segmentation model, to obtain the semantic segmentation result output by the semantic segmentation model for the image to be tested.
  • This implementation takes into account that, besides target detection, the semantic segmentation model is also a commonly used downstream model, and that training a semantic segmentation model also requires inputting the position and label of the target, i.e. the semantic segmentation model also pays close attention to the position of the target. Therefore, after upstream pre-training with the solution of this application, the images in the target detection data set can be input in turn as training images into the semantic segmentation model to complete its training. The relevant components of the semantic segmentation model should also be kept as identical as possible to those of the contrastive learning model, i.e. the component reuse rate should be raised as much as possible, in order to improve the performance of the trained semantic segmentation model.
  • Applying the technical solution provided by the embodiments of this application, the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; at the same time, the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model. That is to say, the feature image representation algorithm and feature vector representation algorithm set in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
  • On the other hand, this application considers that the position modeling ability required by the target detection model can be improved in the pre-training stage. This application starts from background invariance: background invariance means that the model can identify the target relatively accurately on different background images. When the model has background invariance, it means the model has learned the concept of a "target" and has the ability to locate the target.
  • In the solution of this application, after any pre-training image is selected, a search box is determined from it, the image in the search box is cropped and pasted onto n different background images according to preset rules, and the border of the pasted image is moved after pasting. Every background image comes from the target detection data set, so the moved border can include both the target cropped from the pre-training image and a background image from the target detection data set.
  • After training the contrastive learning model on this basis, the target detection model, which reuses the feature image representation algorithm and feature vector representation algorithm of the contrastive learning model, can learn the ability to model the position of targets on different backgrounds, which helps the target detection model identify targets more accurately and improves its background invariance.
  • In summary, the solution of the present application can effectively perform target detection in images and improves the detection performance, that is, the detection accuracy, of the target detection model.
  • embodiments of the present application also provide an image target detection system, which can be mutually referenced with the above.
  • FIG. 4 is a schematic structural diagram of an image target detection system in this application, including:
  • the pre-training data set determination module 401 is used to determine the pre-training data set and use the images in the pre-training data set as pre-training images in turn;
  • the search box selection module 402 is used to determine the search box from the pre-training image after selecting any one pre-training image
  • the cut-and-paste perturbation module 403 is used to cut the image in the search box and paste it onto n different background images according to preset rules, and move the border of the pasted image after pasting; where n is not less than 2 Positive integer, any background image comes from the target detection data set;
  • the contrast learning model training module 404 is used to input each image with the frame moved into the contrast learning model, and train the contrast learning model through contrast learning;
  • the target detection model training module 405 is used to input the images in the target detection data set as training images into the target detection model for training, and obtain the trained target detection model;
  • the target detection result determination module 406 is used to input the image to be tested into the trained target detection model, and obtain the target detection result output by the target detection model for the image to be tested;
  • wherein the contrastive learning model is provided with a feature image representation algorithm used to represent the target at the feature level, which is the same algorithm as the feature image representation algorithm used in the target detection model; and the contrastive learning model is provided with a feature vector representation algorithm used to represent the target at the vector level, which is the same algorithm as the feature vector representation algorithm used in the target detection model.
  • In some embodiments, the search box selection module 402 is specifically used to: after any one pre-training image is selected, automatically generate multiple rectangular boxes on the pre-training image, and randomly select one of them as the determined search box.
  • In some embodiments, the search box selection module 402 automatically generates the multiple rectangular boxes on the pre-training image through a random search algorithm.
  • In some embodiments, the search box selection module 402 is also used to: after automatically generating multiple rectangular boxes on the pre-training image, filter out each rectangular box whose aspect ratio exceeds the preset range; correspondingly, randomly selecting one rectangular box as the determined search box includes: randomly selecting one of the rectangular boxes remaining after filtering as the determined search box.
  • In some embodiments, the cut-and-paste perturbation module 403, in cropping the image in the search box and pasting it onto n different background images according to preset rules, is specifically used to: crop the image in the search box, randomly adjust the cropped image n times to obtain n adjusted images, and paste the n adjusted images onto the n different background images respectively.
  • In some embodiments, the cut-and-paste perturbation module 403, in randomly adjusting the cropped image n times, is specifically used to: randomly adjust the cropped image n times, where in any one adjustment the image size is adjusted by changing the length and/or the width.
  • In some embodiments, moving the border of the pasted image after pasting includes: after pasting, moving the border of the pasted image by perturbing the border position, such that the area intersection-over-union between the border after the move and the border before the move is greater than the preset IoU threshold.
  • In some embodiments, the feature image representation algorithm used by both the target detection model and the contrastive learning model is the ROI Align algorithm, the contrastive learning model using ROI Align to represent the target in the input image at the feature level; the feature vector representation algorithm used by both is the R-CNN head algorithm, the contrastive learning model using the R-CNN head to represent the target in the input image at the vector level.
  • In some embodiments, the target detection model and the contrastive learning model both adopt convolutional neural networks with the same structure.
  • In some embodiments, both the target detection model and the contrastive learning model adopt a convolutional neural network with multi-layer outputs, and the contrastive loss function of the contrastive learning model is a contrastive loss function calculated based on the multi-layer outputs of the convolutional neural network.
  • In some embodiments, both the target detection model and the contrastive learning model adopt a convolutional neural network with an FPN structure.
  • the semantic segmentation model training module is used to input the images in the target detection data set as training images into the semantic segmentation model for training, and obtain the trained semantic segmentation model;
  • the semantic segmentation result determination module is used to input the image to be tested into the trained semantic segmentation model and obtain the semantic segmentation result for the image to be tested output by the semantic segmentation model.
  • embodiments of the present application also provide an image target detection device and a non-volatile computer-readable storage medium, which may be mutually referenced with the above.
  • a computer program is stored on the non-volatile computer-readable storage medium.
  • the steps of the image target detection method in any of the above embodiments are implemented.
  • the non-volatile computer-readable storage media mentioned here include random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROM, or any other form of storage media known in the technical field.
  • Referring to Figure 5, the image target detection device may include:
  • a memory 501, used to store a computer program;
  • a processor 502, configured to execute the computer program to implement the steps of the image target detection method in any of the above embodiments.


Abstract

This application discloses an image target detection method, system, device and storage medium, applied in the field of machine learning technology, including: after selecting any one pre-training image from a pre-training data set, determining a search box from it; cropping the image in the search box, pasting it onto n different background images according to preset rules, and then moving the border of the pasted image; every background image comes from a target detection data set; inputting each border-shifted image into a contrastive learning model, and training the contrastive learning model through contrastive learning; inputting an image to be tested into the trained target detection model to obtain a target detection result; the contrastive learning model and the target detection model use the same feature image representation algorithm and the same feature vector representation algorithm. With the solution of this application, image target detection can be performed effectively, and the detection performance, that is, the detection accuracy, of the target detection model is improved.

Description

Image target detection method, system, device and storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on September 15, 2022, with application number 202211118927.6 and entitled "Image target detection method, system, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of machine learning technology, and in particular to an image target detection method, system, device and storage medium.
Background
At present, self-supervised learning is a direction of very high interest in computer vision. Unlike traditional supervised learning methods, which require manual annotation of data, self-supervised learning hopes to automatically generate labels for unlabeled data by designing agent tasks, thereby completing learning from the data.
Agent tasks in self-supervised learning fall mainly into two categories: image transformation and contrastive learning. Image-transformation agent tasks include image denoising, inpainting, color conversion and the like; supervisory information is constructed from these image-transformation-related tasks to guide model learning. The contrastive type refers to contrast tasks: data augmentation strategies such as cropping and color adjustment are applied to samples, the two augmented samples generated from the same image are regarded as positive samples, and augmented samples generated from different images are regarded as negative samples. Features are extracted from the augmented samples by an autoencoder, the feature vectors are further reduced in dimension to low-dimensional vectors, and a loss function pulls the similarity between positive samples closer while pushing the similarity between negative samples apart.
Figure 1 is a schematic diagram of the principle of contrastive learning. The core of contrastive learning is to learn better image representations by learning the similarity between different images. When the model can learn the similarity difference between positive and negative samples, it shows that the features extracted by the model are good. At present, the best-performing methods in the self-supervised field are all based on contrast tasks.
Upstream pre-training plus downstream fine-tuning is a classic paradigm of machine learning. In supervised learning, this paradigm refers to classification pre-training with labeled images on a large-scale image classification data set; in downstream tasks such as image target detection and semantic segmentation, the trained model's parameters are frozen, and a small amount of labeled data is used for fine-tuning on the downstream task. Self-supervised learning also follows this paradigm, the difference being that self-supervised learning does not rely on data labels in upstream pre-training.
At present, there is little research on applying contrastive self-supervised learning to downstream target detection tasks, and the upstream and downstream stages are disconnected. For example, one current contrastive self-supervised learning method completes contrastive pre-training on an image classification data set and then, for a target detection task in a downstream power transmission line scenario, uses the trained CNN (Convolutional Neural Network) weights as the feature extraction network and separately trains a Cascade R-CNN target detection network.
This method represents the current mainstream way of applying contrastive learning to downstream tasks such as target detection; its upstream pre-training and downstream fine-tuning are completely separated. In upstream pre-training, the agent task is to distinguish image similarity, which is strongly related to image classification but weakly related to downstream target detection. Only the feature extraction network is trained during pre-training; the remaining components of the target detection network still have to be trained from scratch on the detection task. As a result, such methods have low performance on target detection tasks and insufficient detection accuracy.
In summary, how to effectively perform image target detection and improve detection accuracy is a technical problem that urgently needs to be solved by those skilled in the art.
Summary
The purpose of this application is to provide an image target detection method, system, device and storage medium, so as to effectively perform image target detection and improve detection accuracy.
To solve the above technical problem, this application provides the following technical solutions:
An image target detection method, including:
determining a pre-training data set, and taking the images in the pre-training data set in turn as pre-training images;
after selecting any one pre-training image, determining a search box from the pre-training image;
cropping the image in the search box, pasting it onto n different background images according to preset rules, and moving the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from a target detection data set;
inputting each border-shifted image into a contrastive learning model, and training the contrastive learning model through contrastive learning;
inputting the images in the target detection data set in turn as training images into a target detection model for training, to obtain a trained target detection model;
inputting an image to be tested into the trained target detection model, to obtain a target detection result output by the target detection model for the image to be tested;
wherein the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; and the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model.
In some embodiments of this application, determining the search box from the pre-training image includes:
automatically generating multiple rectangular boxes on the pre-training image, and randomly selecting one of them as the determined search box.
In some embodiments of this application, automatically generating multiple rectangular boxes on the pre-training image includes:
automatically generating multiple rectangular boxes on the pre-training image through a random search algorithm.
In some embodiments of this application, after automatically generating multiple rectangular boxes on the pre-training image, the method further includes:
filtering out each rectangular box whose aspect ratio exceeds a preset range;
correspondingly, randomly selecting one rectangular box as the determined search box includes:
randomly selecting one of the rectangular boxes remaining after filtering as the determined search box.
In some embodiments of this application, cropping the image in the search box and pasting it onto n different background images according to preset rules includes:
cropping the image in the search box, and randomly adjusting the cropped image n times to obtain n adjusted images;
pasting the n adjusted images onto the n different background images respectively.
In some embodiments of this application, randomly adjusting the cropped image n times includes:
randomly adjusting the cropped image n times, where in any one adjustment the image size is adjusted by changing the length and/or the width.
In some embodiments of this application, moving the border of the pasted image after pasting includes:
after pasting, moving the border of the pasted image by perturbing the border position, such that the area intersection-over-union between the border after the move and the border before the move is greater than a preset IoU threshold.
In some embodiments of this application, the feature image representation algorithm used by both the target detection model and the contrastive learning model is the ROI Align algorithm, the contrastive learning model representing the target in the input image at the feature level through the ROI Align algorithm;
the feature vector representation algorithm used by both the target detection model and the contrastive learning model is the R-CNN head algorithm, the contrastive learning model representing the target in the input image at the vector level through the R-CNN head algorithm.
In some embodiments of this application, the target detection model and the contrastive learning model both adopt convolutional neural networks with the same structure.
In some embodiments of this application, the target detection model and the contrastive learning model both adopt a convolutional neural network with multi-layer outputs, and the contrastive loss function of the contrastive learning model is a contrastive loss function calculated based on the multi-layer outputs of the convolutional neural network.
In some embodiments of this application, the target detection model and the contrastive learning model both adopt a convolutional neural network with an FPN structure.
In some embodiments of this application, after training the contrastive learning model through contrastive learning, the method further includes:
inputting the images in the target detection data set in turn as training images into a semantic segmentation model for training, to obtain a trained semantic segmentation model;
inputting the image to be tested into the trained semantic segmentation model, to obtain a semantic segmentation result output by the semantic segmentation model for the image to be tested.
An image target detection system, including:
a pre-training data set determination module, configured to determine the pre-training data set and take the images in the pre-training data set in turn as pre-training images;
a search box selection module, configured to determine the search box from the pre-training image after any one pre-training image is selected;
a cut-and-paste perturbation module, configured to crop the image in the search box, paste it onto n different background images according to preset rules, and move the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from the target detection data set;
a contrastive learning model training module, configured to input each border-shifted image into the contrastive learning model and train the contrastive learning model through contrastive learning;
a target detection model training module, configured to input the images in the target detection data set in turn as training images into the target detection model for training, to obtain the trained target detection model;
a target detection result determination module, configured to input the image to be tested into the trained target detection model, to obtain the target detection result output by the target detection model for the image to be tested;
wherein the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; and the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model.
An image target detection device, including:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the steps of the image target detection method described above.
A non-volatile computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the steps of the image target detection method described above are implemented.
Applying the technical solution provided by the embodiments of this application, it is considered that in the pre-training stage of contrastive self-supervised learning, more alignment with the downstream target detection task can be achieved, thereby improving downstream target detection performance. In the solution of this application, on the one hand, more target detection components are introduced in the pre-training stage, so that after pre-training is completed these components can be reused in the fine-tuning of the target detection model and provide more suitable initial weights for it, which helps improve the fine-tuning performance of the target detection model. Specifically, in the solution of this application, the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; at the same time, the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model. That is to say, the feature image representation algorithm and the feature vector representation algorithm set in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
On the other hand, this application considers that the position modeling ability required by the target detection model can be improved in the pre-training stage. Specifically, this application starts from background invariance: background invariance means that the model can identify the target relatively accurately on different background images. When the model has background invariance, it shows that the model has learned the concept of a "target" and has the ability to locate the target.
In the solution of this application, after any one pre-training image is selected, a search box is determined from the pre-training image, the image in the search box is cropped and pasted onto n different background images according to preset rules, and the border of the pasted image is moved after pasting. Every background image comes from the target detection data set, so the moved border can include both the target cropped from the pre-training image and a background image from the target detection data set. After training the contrastive learning model on this basis, the target detection model, which reuses the feature image representation algorithm and feature vector representation algorithm of the contrastive learning model, can learn the ability to model the position of targets on different backgrounds, which helps the target detection model identify targets more accurately and improves its background invariance.
In summary, the solution of this application can effectively perform image target detection and improves the detection performance, that is, the detection accuracy, of the target detection model.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic diagram of the principle of contrastive learning;
Figure 2 is an implementation flow chart of an image target detection method in this application;
Figure 3 is a schematic block diagram of the image target detection method in some embodiments of this application;
Figure 4 is a schematic structural diagram of an image target detection system in this application;
Figure 5 is a schematic structural diagram of an image target detection device in this application.
Detailed description
The core of this application is to provide an image target detection method that can effectively perform image target detection and improves the detection performance, that is, the detection accuracy, of the target detection model.
To enable those skilled in the art to better understand the solution of this application, this application is further described in detail below with reference to the drawings and specific implementations. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
Please refer to Figure 2, an implementation flow chart of an image target detection method in this application. The image target detection method may include the following steps:
Step S201: Determine a pre-training data set, and take the images in the pre-training data set in turn as pre-training images.
In some embodiments of this application, the pre-training data set may include a large number of images used for pre-training; in practical applications it may usually contain tens of millions of images or even more. Since the solution of this application performs pre-training based on contrastive self-supervised learning, there is no need to set labels for these images. Moreover, since the pre-training data set contains many images, training can usually be performed in batches; for example, in one case, every 50 images form one training batch.
Step S202: After selecting any one pre-training image, determine a search box from the pre-training image.
In some embodiments of this application, the images in the pre-training data set can be taken in turn as pre-training images; after any one pre-training image is selected, the search box can be determined from it.
For example, Figure 3 is a schematic block diagram of the image target detection method in some embodiments; the giant panda image in Figure 3 is a pre-training image selected from the pre-training data set and is used to train the contrastive learning model.
There are several ways to determine the search box from the pre-training image. Considering that an image in the pre-training data set usually contains a single target, which may be located at any position in the image, the search box can be determined from the pre-training image by random selection. The shape of the search box is usually set to a rectangle, so that the extent of the search box in the image can be determined by the coordinates of two points.
That is, in some embodiments of this application, determining the search box from the pre-training image described in step S202 may specifically include: automatically generating multiple rectangular boxes on the pre-training image, and randomly selecting one of them as the determined search box.
When automatically generating multiple rectangular boxes on the pre-training image, the specific way can be set and chosen according to actual needs. For example, rectangular boxes can be automatically generated at multiple specified positions, giving multiple automatically generated rectangular boxes. Alternatively, considering that the target may be located anywhere in the image and that one box will be randomly selected afterwards in any case, the multiple rectangular boxes can be automatically generated on the pre-training image through a random search algorithm, which is a simple and convenient way of generating them.
It should also be understood that after the search box is determined from the pre-training image, the search box may or may not contain the target.
Further, in some embodiments of this application, after automatically generating multiple rectangular boxes on the pre-training image, the method may also include: filtering out each rectangular box whose aspect ratio exceeds a preset range; correspondingly, randomly selecting one rectangular box as the determined search box may specifically include: randomly selecting one of the rectangular boxes remaining after filtering as the determined search box.
This implementation takes into account that, for convenience, the rectangular boxes are usually generated randomly, for example through the random search algorithm of the above embodiment, so some randomly generated rectangular boxes may have an aspect ratio that is too large or too small. Such undesirable rectangular boxes are not conducive to subsequent training. Therefore, in this implementation, each rectangular box whose aspect ratio exceeds the preset range is filtered out; for example, in one case, a rectangular box is filtered out when its aspect ratio is >3 or <1/3. Correspondingly, when determining the search box, one of the rectangular boxes remaining after filtering is randomly selected as the determined search box.
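As an illustration of this embodiment, the following is a minimal Python sketch of random search-box generation with aspect-ratio filtering. The function name, the number of candidate boxes and the sampling scheme are illustrative assumptions rather than part of the claimed method; only the >3 or <1/3 aspect-ratio filter comes from the example above.

    import random

    def generate_search_box(img_w, img_h, num_boxes=20, min_ratio=1/3, max_ratio=3):
        """Randomly generate candidate rectangles, filter by aspect ratio, pick one.

        Returns a box as (x1, y1, x2, y2), or None if every candidate was filtered out.
        """
        candidates = []
        for _ in range(num_boxes):
            # Random search: sample two corner points anywhere in the image.
            x1, x2 = sorted(random.uniform(0, img_w) for _ in range(2))
            y1, y2 = sorted(random.uniform(0, img_h) for _ in range(2))
            w, h = x2 - x1, y2 - y1
            if w < 1 or h < 1:
                continue
            # Filter rectangles whose aspect ratio falls outside the preset range,
            # e.g. length/width > 3 or < 1/3.
            ratio = w / h
            if ratio > max_ratio or ratio < min_ratio:
                continue
            candidates.append((x1, y1, x2, y2))
        # Randomly select one of the remaining rectangles as the search box.
        return random.choice(candidates) if candidates else None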
Step S203: Crop the image in the search box, paste it onto n different background images according to preset rules, and move the border of the pasted image after pasting; where n is a positive integer not less than 2, and every background image comes from the target detection data set.
After the search box is determined from the pre-training image, the image in the search box can be cropped and pasted onto n different background images according to the preset rules; a simple way, for example, is to paste the search-box image directly onto the n different background images.
In some embodiments of this application, in order to improve the recognition ability of the model, i.e. the training effect, the cropped search-box image can be adjusted before being pasted onto the n different background images. That is, cropping the image in the search box and pasting it onto n different background images according to preset rules described in step S203 may specifically include:
Step 1: Crop the image in the search box, and randomly adjust the cropped image n times to obtain n adjusted images;
Step 2: Paste the n adjusted images onto the n different background images respectively.
In this specific implementation, to improve the recognition ability of the model, i.e. the training effect, the cropped image is randomly adjusted n times, giving n adjusted images. Of course, in some embodiments there can be multiple adjustment methods, such as image rotation, resolution adjustment, length adjustment, width adjustment and so on.
In some embodiments of this application, considering that adjusting the length and/or width is relatively simple and does not easily destroy the image information in the search box, the n random adjustments described in step 1 may specifically include: randomly adjusting the cropped image n times, where in any one adjustment the image size is adjusted by changing the length and/or the width; of course, this also changes the resolution of the image.
For example, in some embodiments the adjustment strategy can be expressed as:
(w, h) = λ3*(λ1*w1, λ2*h1),
where w is the length in the new resolution, h is the width in the new resolution, w1 is the length in the original resolution, h1 is the width in the original resolution, λ1 and λ2 are variation coefficients set separately for the length and the width, and λ3 is an overall variation coefficient.
It should also be understood that, taking the adjustment strategy of this implementation as an example, the cropped image needs to be randomly adjusted n times in the solution of this application; in each of the n adjustments, the λ1, λ2 and λ3 used can be selected at random, and of course the allowable value ranges of λ1, λ2 and λ3 can be set in advance.
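The adjustment rule above can be sketched as follows; the coefficient range is an illustrative assumption, since the application only requires that λ1, λ2 and λ3 be drawn at random from preset allowable ranges.

    import random
    from PIL import Image

    def random_resize(img: Image.Image, lam_range=(0.5, 1.5)) -> Image.Image:
        """One random adjustment implementing (w, h) = λ3 * (λ1 * w1, λ2 * h1)."""
        w1, h1 = img.size
        lam1 = random.uniform(*lam_range)  # variation coefficient for the length
        lam2 = random.uniform(*lam_range)  # variation coefficient for the width
        lam3 = random.uniform(*lam_range)  # overall variation coefficient
        w = max(1, round(lam3 * lam1 * w1))
        h = max(1, round(lam3 * lam2 * h1))
        return img.resize((w, h))

    # n adjusted copies of the cropped search-box image, one per background image:
    # adjusted = [random_resize(cropped) for _ in range(n)]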
For example, in the example of Figure 3, n = 2, and of the two background images from the target detection data set one is a street scene and the other is a stadium image. When the cropped search-box image is pasted into the street-scene image, it can be seen that both its length and width have been reduced; when it is pasted into the stadium image, its length has been increased while its width has been reduced.
After the cropped search-box image has been pasted onto the n different background images according to the preset rules, the border of the pasted image needs to be moved.
It should be understood that if the cropped search-box image is pasted directly onto the n different background images, the border of the pasted image is the same size as the search box; if, as in the above implementation, the length and/or width have been adjusted, the sizes of the pasted image's border and the search box are no longer the same.
When moving the border of the pasted image, in some embodiments of this application the way of moving can be chosen as needed; for example, the border can be moved randomly. It should also be noted that in the solution of this application every background image comes from the target detection data set, i.e. this application introduces the target detection data set as the background during pre-training, with the aim that, during contrastive learning, the relevant components of the target detection model can learn the ability to model the position of targets on different backgrounds, i.e. specifically the ability of background invariance. Therefore, if the border of the pasted image were not moved, the training effect would be poor. In some implementations of this application, the moved border includes part of the original pasted image and also covers part of the background image information.
In some embodiments of this application, moving the border of the pasted image after pasting described in step S203 may specifically include: after pasting, moving the border of the pasted image by perturbing the border position, such that the area intersection-over-union between the border after the move and the border before the move is greater than a preset IoU threshold.
In this implementation, the position of the pasted image's border is perturbed to realize the movement of the border, and the IoU between the border after the move and the border before the move is required to be greater than the IoU threshold; for example, the IoU threshold is set to 0.6.
IoU (Intersection over Union), also called the area intersection-over-union ratio, reflects the overlap of the areas of two rectangular boxes, i.e. the ratio of their intersection to their union; when the boxes overlap completely, the IoU reaches its maximum value of 1.
With the settings of this implementation, the position of the border after the move does not differ too much from that before the move, i.e. it never deviates completely from the original border. For ease of understanding, Figure 3 of this application also marks the original border and the moved border.
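A minimal sketch of border perturbation under the IoU constraint follows. The maximum pixel offset and the rejection-sampling loop are illustrative assumptions; the 0.6 threshold comes from the example above, and for simplicity the sketch only translates the border without changing its size.

    import random

    def iou(a, b):
        """Area intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def perturb_border(box, max_offset=30, iou_thresh=0.6):
        """Randomly shift the pasted image's border until IoU(old, new) > threshold."""
        while True:
            dx = random.uniform(-max_offset, max_offset)
            dy = random.uniform(-max_offset, max_offset)
            moved = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
            if iou(box, moved) > iou_thresh:
                return moved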
Step S204: Input each border-shifted image into the contrastive learning model, and train the contrastive learning model through contrastive learning.
This application considers that in the pre-training stage of contrastive self-supervised learning, more alignment with the downstream target detection task can be achieved, thereby improving downstream target detection performance. That is to say, more components of the target detection model can be introduced in the pre-training stage, so that after pre-training is completed these components can be reused in the fine-tuning of the target detection model and provide more suitable initial weights for it, which helps improve the fine-tuning performance of the target detection model.
Therefore, when this application sets up the contrastive learning model for contrastive self-supervised learning, the contrastive learning model is provided with a feature image representation algorithm for representing the target at the feature level, which is the same algorithm as the feature image representation algorithm used by the target detection model; at the same time, the contrastive learning model is provided with a feature vector representation algorithm for representing the target at the vector level, which is the same algorithm as the feature vector representation algorithm used by the target detection model.
That is to say, the feature image representation algorithm and feature vector representation algorithm set in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
Contrastive learning models usually adopt the structure of a query network and a key network; for example, Figure 3 adopts such a structure, and since n = 2 in Figure 3, the number of key networks is 1. When n is set to a larger value, the number of key networks increases accordingly.
There are multiple specific types of convolutional neural networks that the contrastive learning model can use; for example, the implementation of Figure 3 can use a convolutional neural network with an FPN structure.
The feature image representation algorithm is used to represent the target at the feature level, and the feature vector representation algorithm is used to represent the target at the vector level; their specific types can be selected as needed. For example, considering that ROI Align and the R-CNN head are common components in target detection models, in some embodiments of this application the feature image representation algorithm used by both the target detection model and the contrastive learning model is the ROI Align algorithm, the contrastive learning model representing the target in the input image at the feature level through ROI Align; and the feature vector representation algorithm used by both is the R-CNN head algorithm, the contrastive learning model representing the target in the input image at the vector level through the R-CNN head.
Representing the target in the input image at the feature level can be expressed as:
v_q = RoI Align(f_q(I_q), bb_q);
v_ki = RoI Align(f_k(I_ki), bb_ki).
Here the functions f_q and f_k refer to the query network and the key network respectively. The query network and key network are the two learning branches of contrastive self-supervised learning; the two model structures are exactly the same, but their specific parameters differ, and in general they can be encoder structures.
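The two branches can be sketched as follows. The text only states that the query and key networks share a structure but not parameters; the momentum update shown here is one common way of keeping the key parameters close to, yet different from, the query parameters (as in MoCo-style methods) and is an assumption rather than something the text prescribes.

    import copy
    import torch

    def make_branches(encoder: torch.nn.Module, num_keys: int):
        """f_q and the n-1 key networks f_k: identical structure, separate parameters."""
        f_q = encoder
        f_ks = [copy.deepcopy(encoder) for _ in range(num_keys)]
        for f_k in f_ks:
            for p in f_k.parameters():
                p.requires_grad = False  # key branches are not updated by the optimizer
        return f_q, f_ks

    @torch.no_grad()
    def momentum_update(f_q, f_k, m=0.999):
        """One possible update rule for the key parameters (assumed, MoCo-style)."""
        for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)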
I_q denotes the border image input to the query network; it should be understood that the border of the border image described here is the border obtained after moving the border of the pasted image in step S203. bb_q denotes the position of this border image in the background image; in the example of Figure 3 it is the position of the border image in the street-scene image, and the position can be represented, for example, by the upper-left and lower-right coordinate points.
Correspondingly, I_ki denotes the border image input to a key network, where i denotes the i-th of the n-1 key networks; of course, in the example of Figure 3, n = 2, so there is only one key network. bb_ki denotes the position of this border image in the background image; in the example of Figure 3 it is the position of the border image in the stadium image.
The function of ROI Align is to map the position of the target in the original image to the corresponding positions in different feature maps. In the above formulas, v_q denotes the output of the ROI Align corresponding to the query network, and v_ki denotes the output of the ROI Align corresponding to the i-th of the n-1 key networks; the output of ROI Align can reflect, at the two-dimensional level, the information of the above border images in the different feature maps.
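At the feature level, v_q = RoI Align(f_q(I_q), bb_q) can be sketched with the off-the-shelf ROI Align operator from torchvision; the feature-map stride and the output size are illustrative assumptions.

    import torch
    from torchvision.ops import roi_align

    # feat: one feature map from the encoder, shape (1, C, H/stride, W/stride).
    # bb:   the border position in the background image as (x1, y1, x2, y2) pixels.
    def roi_feature(feat: torch.Tensor, bb, stride: int = 16, out_size: int = 7):
        boxes = torch.tensor([[0.0, *bb]])  # first column is the batch index
        # spatial_scale maps original-image coordinates onto this feature map,
        # so the target's position in the original image corresponds to the
        # matching region of each feature map.
        return roi_align(feat, boxes, output_size=out_size,
                         spatial_scale=1.0 / stride, aligned=True)

    # v_q = roi_feature(f_q(I_q), bb_q); v_ki = roi_feature(f_k(I_ki), bb_ki)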
Representing the target in the input image at the vector level can be expressed as:
e_q = f_R-H(v_q);
e_ki = f_R-H(v_ki),
where f_R-H refers to the R-CNN head algorithm. The function of the R-CNN head algorithm is to let the model, after analysis, output bounding boxes that may contain the target. In the above formulas, e_q denotes the output of the R-CNN head corresponding to the query network, and e_ki denotes the output of the R-CNN head corresponding to the i-th of the n-1 key networks; the output of the R-CNN head can reflect, at the vector level, the feature information of the above border images.
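At the vector level, e = f_R-H(v) flattens the pooled feature and maps it to a vector. The two fully connected layers below mirror the common two-fc R-CNN box head used in, for example, Faster R-CNN implementations; the layer sizes are illustrative assumptions, not values given by the text.

    import torch

    class RCNNHead(torch.nn.Module):
        """f_R-H: maps a pooled ROI feature v (C x 7 x 7) to a vector representation e."""
        def __init__(self, in_channels=256, pool_size=7, dim=1024):
            super().__init__()
            self.fc1 = torch.nn.Linear(in_channels * pool_size * pool_size, dim)
            self.fc2 = torch.nn.Linear(dim, dim)

        def forward(self, v):
            x = v.flatten(start_dim=1)        # (N, C*7*7)
            x = torch.relu(self.fc1(x))
            return torch.relu(self.fc2(x))    # e_q or e_ki, one vector per ROI

    # e_q = head(v_q); e_ki = head(v_ki). The same head weights can later serve
    # as the initial weights of the detection model's R-CNN head.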
In some embodiments of this application, the target detection model and the contrastive learning model both adopt convolutional neural networks with the same structure.
This implementation takes into account that a convolutional neural network structure is usually used in target detection models; therefore, to further increase the reuse rate of components, in this implementation the contrastive learning model uses a convolutional neural network whose structure is the same as that of the target detection model, which further helps improve the performance of the trained target detection model.
Further, in some embodiments of this application, the target detection model and the contrastive learning model both adopt a convolutional neural network with multi-layer outputs, and the contrastive loss function of the contrastive learning model is a contrastive loss function calculated based on the multi-layer outputs of the convolutional neural network.
This implementation takes into account that traditional contrastive learning usually uses only the outputs of the query network and key network to calculate the contrastive loss, while the intermediate layers of a convolutional neural network also carry information, and target detection models can usually also adopt a convolutional neural network with multi-layer outputs. Therefore, in this implementation the contrastive learning model is set to use a convolutional neural network with multi-layer outputs, so that the contrastive learning model can perform hierarchical contrastive learning and improve the learning effect; that is, the contrastive loss function of the contrastive learning model is calculated based on the multi-layer outputs of the convolutional neural network. Of course, to increase the reuse rate of components, the target detection model also needs to adopt this convolutional neural network.
There are multiple specific structures of convolutional neural networks with multi-layer outputs; for example, considering that FPN is a common component in target detection models, in some embodiments of this application the target detection model and the contrastive learning model both adopt a convolutional neural network with an FPN structure.
In Figure 3 of this application, the contrastive learning model uses a convolutional neural network with an FPN structure; for example, the levels P2, P3, P4 and P5 of its multi-layer outputs can be selected for calculating the contrastive loss function. The contrastive loss for a single level among P2, P3, P4 and P5 can be expressed as:
L_q-ki = -log( exp(e_q·e_ki/τ) / Σ_{j=1..N} exp(e_q·v_ej/τ) ),
where L_q-ki denotes the single-level contrastive loss function, and N denotes the number of images in a single training batch; for example, in the example above, the number of images in a single training batch is 50. For the different levels P2, P3, P4 and P5, the values of e_q and e_ki output by the R-CNN head algorithm differ. v_ei in the formula is the vector representation of a positive sample (the two augmented samples of one image are called positive samples of each other), and the sum runs over the vector representations of the samples in the batch. τ is a hyperparameter.
When the contrastive learning model performs hierarchical contrastive learning, the contrastive loss function is calculated based on the multi-layer outputs of the convolutional neural network; that is, the contrastive losses of all levels are summed to give the final contrastive loss function, which can be expressed as L = Σ L_q-ki. It should also be understood that when n = 2, the final loss function sums the contrastive losses of 4 levels; when n = 3, for example, it sums the contrastive losses of 8 levels, i.e. each key network can perform contrastive learning with the query network separately.
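A minimal sketch of the single-level loss and the summation over levels follows; it assumes L2-normalized vectors and dot-product similarity, which the text does not spell out, and takes the batch's key-branch vectors as the contrast set.

    import torch
    import torch.nn.functional as F

    def level_loss(e_q, e_k, tau=0.2):
        """Single-level contrastive loss L_q-ki for one FPN level (P2..P5).

        e_q, e_k: (N, D) query/key vectors; row i of e_k is the positive for
        row i of e_q, and the other rows of e_k act as negatives.
        """
        e_q = F.normalize(e_q, dim=1)
        e_k = F.normalize(e_k, dim=1)
        logits = e_q @ e_k.t() / tau            # (N, N) similarity matrix
        labels = torch.arange(e_q.size(0))      # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

    def total_loss(per_level_pairs, tau=0.2):
        """L = sum of L_q-ki over the levels (and over the n-1 key networks)."""
        return sum(level_loss(q, k, tau) for q, k in per_level_pairs)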
Step S205: Input the images in the target detection data set in turn as training images into the target detection model for training, to obtain the trained target detection model.
With the images of the pre-training data set, pre-training can be carried out, i.e. the contrastive learning model is trained through contrastive learning; when the contrastive learning model has been trained, training of the target detection model can begin.
Also, as described above, in order for the target detection model to have good performance, the target detection model should reuse components with the contrastive learning model, and the reuse rate can be as high as possible. For example, in the foregoing implementation the contrastive learning model is set up with a convolutional neural network with an FPN structure and uses ROI Align and an R-CNN head; the target detection model selected in this application can then also use a convolutional neural network with an FPN structure, with ROI Align and an R-CNN head as components of the target detection model, as sketched below.
The images in the target detection data set are input in turn as training images into the target detection model for training; when the recognition rate of the target detection model meets the requirements, training is complete and the trained target detection model is obtained.
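Component reuse at the start of detection training can be sketched as follows; the checkpoint file name and state-dict keys are hypothetical and depend on how the pre-training stage saved its weights, and the torchvision detector is used only as one example of an FPN-based detector with ROI Align and an R-CNN box head.

    import torch
    import torchvision

    # An FPN-based detector whose ROI components match the pre-training setup.
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)

    # Hypothetical checkpoint layout from the contrastive pre-training stage.
    pretrained = torch.load("contrastive_pretrain.pth")  # assumed file name
    detector.backbone.load_state_dict(pretrained["backbone_fpn"])
    detector.roi_heads.box_head.load_state_dict(pretrained["rcnn_head"])
    # Only the remaining components (e.g. the RPN and the final classification
    # layer) start from scratch; all weights are then fine-tuned on the target
    # detection data set.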
The target detection model of this application can perform image recognition, and the specific recognition objects can be of many kinds. For example, in one scenario, the target detection model of this application is applied to a highway scene to recognize and detect targets such as vehicles, obstacles, road signs and people in the collected pictures.
Step S206: Input the image to be tested into the trained target detection model, and obtain the target detection result output by the target detection model for the image to be tested.
After the trained target detection model is obtained, the image to be tested can be input into the trained target detection model, thereby obtaining the target detection result output by the target detection model for the image to be tested. For example, after the image to be tested is input into the trained target detection model, the target detection model determines the position of each "person" in the image to be tested and labels it as a person, and determines the position of each "car" in the image to be tested and labels it as a car.
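Inference with the trained model can then be sketched as follows, assuming a torchvision-style detector that returns boxes, labels and scores; the weight file and test image names are illustrative.

    import torch
    import torchvision
    from torchvision.io import read_image
    from torchvision.transforms.functional import convert_image_dtype

    # Assumed setup: the detector trained in step S205 was saved to this file.
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
    detector.load_state_dict(torch.load("trained_detector.pth"))
    detector.eval()

    img = convert_image_dtype(read_image("test.jpg"), torch.float)  # (3, H, W) in [0, 1]
    with torch.no_grad():
        result = detector([img])[0]  # dict with "boxes" (x1, y1, x2, y2), "labels", "scores"
    for box, label, score in zip(result["boxes"], result["labels"], result["scores"]):
        if score > 0.5:  # illustrative confidence threshold
            print(label.item(), box.tolist())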
Further, in some embodiments of the present application, after step S204, the method may further include:
inputting the images in the target detection dataset successively as training images into a semantic segmentation model for training, to obtain a trained semantic segmentation model;
inputting the to-be-detected image into the trained semantic segmentation model, to obtain the semantic segmentation result output by the semantic segmentation model for the to-be-detected image.
This implementation takes into account that, in addition to target detection, semantic segmentation models are also commonly used downstream models, and that a semantic segmentation model also requires the positions and labels of targets as training input, i.e., semantic segmentation models likewise pay close attention to target positions. Therefore, after the upstream pre-training has been performed with the solution of the present application, the images in the target detection dataset can be successively input as training images into the semantic segmentation model to complete its training. It can further be understood that the relevant components of the semantic segmentation model should also be made the same as those of the contrastive learning model as far as possible, i.e., the component reuse rate should be raised as far as possible, so as to improve the performance of the trained semantic segmentation model.
The technical solution provided by the embodiments of the present application takes into account that, in the pre-training stage of contrastive self-supervised learning, closer alignment with the downstream target detection task can be achieved, thereby improving downstream target detection performance. In the solution of the present application, on the one hand, more target detection components are introduced in the pre-training stage, so that after pre-training is completed these target detection components can be reused in the fine-tuning of the target detection model, providing more suitable initial weights for the fine-tuning and thus helping to improve the fine-tuning performance of the target detection model. In some embodiments of the present application, the contrastive learning model is provided with a feature-image representation algorithm for representing targets at the feature level, which is the same algorithm as the feature-image representation algorithm adopted by the target detection model; at the same time, the contrastive learning model is provided with a feature-vector representation algorithm for representing targets at the vector level, which is the same algorithm as the feature-vector representation algorithm adopted by the target detection model. That is, the feature-image representation algorithm and feature-vector representation algorithm configured in the contrastive learning model are reused in the target detection model, thereby effectively improving the fine-tuning performance of the target detection model.
On the other hand, the present application takes into account that the position-modeling capability required by the target detection model can be improved in the pre-training stage. In some embodiments, the present application starts from background invariance, which means that the model can fairly accurately recognize a target against different background images. When the model possesses background invariance, it shows that the model has learned the concept of "target" and has acquired the ability to localize targets.
In the solution of the present application, after any 1 pre-training image is selected, a search box is determined from the pre-training image; the image within the search box is then cropped and pasted onto n different background images according to preset rules, and after pasting, the bounding box of the pasted image is shifted. Any 1 background image originates from the target detection dataset, so the shifted bounding box can include both the target cropped from the pre-training image and the background image from the target detection dataset. After the contrastive learning model is trained on this basis, the target detection model that reuses the feature-image representation algorithm and feature-vector representation algorithm of the contrastive learning model can learn the capability of modeling target positions on different backgrounds, i.e., the target detection model is helped to recognize targets more accurately, and its background-invariance capability is improved.
In summary, the solution of the present application can effectively perform target detection on images, improving the detection performance of the target detection model, i.e., improving its detection accuracy.
Corresponding to the above method embodiments, embodiments of the present application further provide an image target detection system, which may be cross-referenced with the description above.
Referring to FIG. 4, which is a schematic structural diagram of an image target detection system in the present application, the system includes:
a pre-training dataset determination module 401, configured to determine a pre-training dataset, and use the images in the pre-training dataset successively as pre-training images;
a search box selection module 402, configured to determine, after any 1 pre-training image is selected, a search box from the pre-training image;
a crop-paste-perturbation module 403, configured to crop the image within the search box, paste it onto n different background images according to preset rules, and shift the bounding box of the pasted image after pasting; where n is a positive integer not less than 2, and any 1 background image originates from the target detection dataset;
a contrastive learning model training module 404, configured to input each image that has undergone bounding-box shifting into the contrastive learning model, and train the contrastive learning model by means of contrastive learning;
a target detection model training module 405, configured to input the images in the target detection dataset successively as training images into the target detection model for training, to obtain the trained target detection model;
a target detection result determination module 406, configured to input the to-be-detected image into the trained target detection model, to obtain the target detection result output by the target detection model for the to-be-detected image;
wherein the contrastive learning model is provided with a feature-image representation algorithm for representing targets at the feature level, which is the same algorithm as the feature-image representation algorithm adopted by the target detection model; and the contrastive learning model is provided with a feature-vector representation algorithm for representing targets at the vector level, which is the same algorithm as the feature-vector representation algorithm adopted by the target detection model.
In some embodiments of the present application, the search box selection module 402 is specifically configured to:
after any 1 pre-training image is selected, automatically generate multiple rectangular boxes on the pre-training image, and randomly select 1 of the rectangular boxes as the determined search box.
In some embodiments of the present application, the search box selection module 402 automatically generating multiple rectangular boxes on the pre-training image includes:
automatically generating multiple rectangular boxes on the pre-training image through a random search algorithm.
In some embodiments of the present application, the search box selection module 402 is further configured to:
after automatically generating multiple rectangular boxes on the pre-training image, filter out each rectangular box whose aspect ratio is outside a preset range;
correspondingly, randomly selecting 1 of the rectangular boxes as the determined search box includes:
randomly selecting 1 of the rectangular boxes remaining after filtering as the determined search box.
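As a minimal sketch of this search-box selection, assuming random-search generation followed by aspect-ratio filtering; the box count, size bounds and the (0.5, 2.0) aspect-ratio range are illustrative assumptions, not preset values from the present application:

    import random

    def pick_search_box(img_w, img_h, num_boxes=20, ratio_range=(0.5, 2.0)):
        boxes = []
        for _ in range(num_boxes):
            # Random-search generation of candidate rectangular boxes.
            w = random.randint(max(1, img_w // 8), img_w // 2)
            h = random.randint(max(1, img_h // 8), img_h // 2)
            x = random.randint(0, img_w - w)
            y = random.randint(0, img_h - h)
            boxes.append((x, y, x + w, y + h))
        # Filter out boxes whose aspect ratio is outside the preset range.
        kept = [b for b in boxes
                if ratio_range[0] <= (b[2] - b[0]) / (b[3] - b[1]) <= ratio_range[1]]
        # Randomly select 1 of the remaining boxes as the search box.
        return random.choice(kept if kept else boxes)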
In some embodiments of the present application, the crop-paste-perturbation module 403, in cropping the image within the search box and pasting it onto n different background images according to preset rules, is specifically configured to:
crop the image within the search box, and perform n random adjustments on the cropped image respectively, to obtain n adjusted images;
paste the n adjusted images onto n different background images respectively.
In some embodiments of the present application, the crop-paste-perturbation module 403, in performing n random adjustments on the cropped image respectively, is specifically configured to:
perform n random adjustments on the cropped image respectively, and, in any 1 adjustment of the cropped image, adjust the image size by adjusting the length and/or the width.
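As a minimal sketch of this crop-adjust-paste step, following the adjustment strategy (w, h) = λ3*(λ1*w1, λ2*h1) recited later in the claims; the sampling ranges for λ1, λ2 and λ3 and the random paste position are illustrative assumptions:

    import random
    from PIL import Image

    def crop_adjust_paste(pretrain_img, search_box, backgrounds):
        crop = pretrain_img.crop(search_box)          # image inside the search box
        w1, h1 = crop.size
        results = []
        for bg in backgrounds:                        # n different background images
            lam1 = random.uniform(0.8, 1.2)           # length coefficient λ1
            lam2 = random.uniform(0.8, 1.2)           # width coefficient λ2
            lam3 = random.uniform(0.5, 1.5)           # overall coefficient λ3
            w = max(1, int(lam3 * lam1 * w1))
            h = max(1, int(lam3 * lam2 * h1))
            resized = crop.resize((w, h))
            canvas = bg.copy()
            x = random.randint(0, max(0, canvas.width - w))
            y = random.randint(0, max(0, canvas.height - h))
            canvas.paste(resized, (x, y))
            results.append((canvas, (x, y, x + w, y + h)))  # image + pasted box
        return results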
In some embodiments of the present application, shifting the bounding box of the pasted image after pasting includes:
after pasting, shifting the bounding box of the pasted image by means of bounding-box position perturbation, wherein the area intersection-over-union between the shifted bounding box and the pre-shift bounding box is greater than a preset area intersection-over-union threshold.
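As a minimal sketch of bounding-box position perturbation under this IoU constraint: resample a random shift until the intersection-over-union between the shifted and original boxes exceeds the preset threshold. The jitter magnitude, threshold value and retry count are assumptions.

    import random

    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def perturb_box(box, iou_thresh=0.5, max_offset=0.2, tries=100):
        w, h = box[2] - box[0], box[3] - box[1]
        for _ in range(tries):
            dx = random.uniform(-max_offset, max_offset) * w
            dy = random.uniform(-max_offset, max_offset) * h
            moved = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
            if box_iou(box, moved) > iou_thresh:  # keep IoU above the threshold
                return moved
        return box                                # fall back to the original box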
In some embodiments of the present application, the feature-image representation algorithm adopted by both the target detection model and the contrastive learning model is the ROI Align algorithm, wherein the contrastive learning model represents targets in the input images at the feature level through the ROI Align algorithm;
the feature-vector representation algorithm adopted by both the target detection model and the contrastive learning model is the R-CNN head algorithm, wherein the contrastive learning model represents targets in the input images at the vector level through the R-CNN head algorithm.
In some embodiments of the present application, the target detection model and the contrastive learning model both adopt convolutional neural networks of the same structure.
In some embodiments of the present application, the target detection model and the contrastive learning model both adopt a convolutional neural network with multi-level outputs, and the contrastive loss function of the contrastive learning model is a contrastive loss function computed on the basis of the multi-level outputs of the convolutional neural network.
In some embodiments of the present application, the target detection model and the contrastive learning model both adopt a convolutional neural network with an FPN structure.
In some embodiments of the present application, the system further includes:
a semantic segmentation model training module, configured to input the images in the target detection dataset successively as training images into a semantic segmentation model for training, to obtain a trained semantic segmentation model;
a semantic segmentation result determination module, configured to input the to-be-detected image into the trained semantic segmentation model, to obtain the semantic segmentation result output by the semantic segmentation model for the to-be-detected image.
Corresponding to the above method and system embodiments, embodiments of the present application further provide an image target detection device and a non-volatile computer-readable storage medium, which may be cross-referenced with the description above. The non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the image target detection method of any of the above embodiments. The non-volatile computer-readable storage medium referred to here includes random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Referring to FIG. 5, the image target detection device may include:
a memory 501, configured to store a computer program;
a processor 502, configured to execute the computer program to implement the steps of the image target detection method of any of the above embodiments.
It should further be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include", or any other variant thereof, are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Professionals may further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally according to function. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. Professionals may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
Specific examples have been used herein to explain the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the technical solution of the present application and its core ideas. It should be noted that, for those of ordinary skill in the art, several improvements and modifications can be made to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the present application.

Claims (20)

  1. An image target detection method, comprising:
    determining a pre-training dataset, and using the images in the pre-training dataset successively as pre-training images;
    after any 1 pre-training image is selected, determining a search box from the pre-training image;
    cropping the image within the search box, pasting it onto n different background images according to preset rules, and shifting the bounding box of the pasted image after pasting; wherein n is a positive integer not less than 2, and any 1 of the background images originates from a target detection dataset;
    inputting each image that has undergone bounding-box shifting into a contrastive learning model, and training the contrastive learning model by means of contrastive learning;
    inputting the images in the target detection dataset successively as training images into a target detection model for training, to obtain the trained target detection model;
    inputting a to-be-detected image into the trained target detection model, to obtain a target detection result output by the target detection model for the to-be-detected image;
    wherein the contrastive learning model is provided with a feature-image representation algorithm for representing targets at the feature level, which is the same algorithm as the feature-image representation algorithm adopted by the target detection model; and the contrastive learning model is provided with a feature-vector representation algorithm for representing targets at the vector level, which is the same algorithm as the feature-vector representation algorithm adopted by the target detection model.
  2. The image target detection method according to claim 1, wherein determining a search box from the pre-training image comprises:
    automatically generating multiple rectangular boxes on the pre-training image, and randomly selecting 1 of the rectangular boxes as the determined search box.
  3. The image target detection method according to claim 2, wherein automatically generating multiple rectangular boxes on the pre-training image comprises:
    automatically generating multiple rectangular boxes on the pre-training image through a random search algorithm.
  4. The image target detection method according to claim 2, further comprising, after automatically generating multiple rectangular boxes on the pre-training image:
    filtering out each rectangular box whose aspect ratio is outside a preset range;
    correspondingly, randomly selecting 1 of the rectangular boxes as the determined search box comprises:
    randomly selecting 1 of the rectangular boxes remaining after filtering as the determined search box.
  5. The image target detection method according to claim 1, wherein cropping the image within the search box and pasting it onto n different background images according to preset rules comprises:
    cropping the image within the search box, and performing n random adjustments on the cropped image respectively, to obtain n adjusted images;
    pasting the n adjusted images onto n different background images respectively.
  6. The image target detection method according to claim 5, wherein performing n random adjustments on the cropped image respectively comprises:
    performing n random adjustments on the cropped image respectively, and, in any 1 adjustment of the cropped image, adjusting the image size by adjusting the length and/or the width.
  7. The image target detection method according to claim 1, wherein shifting the bounding box of the pasted image after pasting comprises:
    after pasting, shifting the bounding box of the pasted image by means of bounding-box position perturbation, wherein the area intersection-over-union between the shifted bounding box and the pre-shift bounding box is greater than a preset area intersection-over-union threshold.
  8. The image target detection method according to claim 1, wherein the feature-image representation algorithm adopted by both the target detection model and the contrastive learning model is the ROI Align algorithm, wherein the contrastive learning model represents targets in the input images at the feature level through the ROI Align algorithm;
    the feature-vector representation algorithm adopted by both the target detection model and the contrastive learning model is the R-CNN head algorithm, wherein the contrastive learning model represents targets in the input images at the vector level through the R-CNN head algorithm.
  9. The image target detection method according to claim 1, wherein the target detection model and the contrastive learning model both adopt convolutional neural networks of the same structure.
  10. The image target detection method according to claim 9, wherein the target detection model and the contrastive learning model both adopt a convolutional neural network with multi-level outputs, and the contrastive loss function of the contrastive learning model is a contrastive loss function computed on the basis of the multi-level outputs of the convolutional neural network.
  11. The image target detection method according to claim 9, wherein the target detection model and the contrastive learning model both adopt a convolutional neural network with an FPN structure.
  12. The image target detection method according to any one of claims 1 to 11, further comprising, after training the contrastive learning model by means of contrastive learning:
    inputting the images in the target detection dataset successively as training images into a semantic segmentation model for training, to obtain a trained semantic segmentation model;
    inputting a to-be-detected image into the trained semantic segmentation model, to obtain a semantic segmentation result output by the semantic segmentation model for the to-be-detected image.
  13. The image target detection method according to claim 2, wherein automatically generating multiple rectangular boxes on the pre-training image comprises:
    automatically generating rectangular boxes at multiple designated positions.
  14. The image target detection method according to claim 5, wherein the manner of adjustment comprises one or more of the following:
    image rotation, resolution adjustment, length adjustment, and width adjustment.
  15. The image target detection method according to claim 5, wherein the adjustment strategy is expressed as:
    (w, h) = λ3 * (λ1*w1, λ2*h1);
    wherein w denotes the length in the new resolution, h denotes the width in the new resolution, w1 denotes the length in the original resolution, h1 denotes the width in the original resolution, λ1 and λ2 are variation coefficients set for the length and the width respectively, and λ3 is an overall variation coefficient.
  16. The image target detection method according to claim 8, wherein the contrastive learning model adopts a query network and key network structure.
  17. The image target detection method according to claim 16, wherein the feature-level representation of a target in the input images is expressed as:
    vq = RoI Align(fq(Iq), bbq);
    vki = RoI Align(fk(Iki), bbki);
    wherein the functions fq and fk denote the query network and the key network, respectively.
  18. An image target detection system, comprising:
    a pre-training dataset determination module, configured to determine a pre-training dataset, and use the images in the pre-training dataset successively as pre-training images;
    a search box selection module, configured to determine, after any 1 pre-training image is selected, a search box from the pre-training image;
    a crop-paste-perturbation module, configured to crop the image within the search box, paste it onto n different background images according to preset rules, and shift the bounding box of the pasted image after pasting; wherein n is a positive integer not less than 2, and any 1 of the background images originates from a target detection dataset;
    a contrastive learning model training module, configured to input each image that has undergone bounding-box shifting into a contrastive learning model, and train the contrastive learning model by means of contrastive learning;
    a target detection model training module, configured to input the images in the target detection dataset successively as training images into a target detection model for training, to obtain the trained target detection model;
    a target detection result determination module, configured to input a to-be-detected image into the trained target detection model, to obtain a target detection result output by the target detection model for the to-be-detected image;
    wherein the contrastive learning model is provided with a feature-image representation algorithm for representing targets at the feature level, which is the same algorithm as the feature-image representation algorithm adopted by the target detection model; and the contrastive learning model is provided with a feature-vector representation algorithm for representing targets at the vector level, which is the same algorithm as the feature-vector representation algorithm adopted by the target detection model.
  19. An image target detection device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program to implement the steps of the image target detection method according to any one of claims 1 to 17.
  20. A non-volatile computer-readable storage medium, wherein the non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the image target detection method according to any one of claims 1 to 17.
PCT/CN2023/078490 2022-09-15 2023-02-27 Image target detection method, system, device and storage medium WO2024055530A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211118927.6 2022-09-15
CN202211118927.6A CN115205636B (zh) 2022-09-15 2022-09-15 Image target detection method, system, device and storage medium

Publications (1)

Publication Number Publication Date
WO2024055530A1 true WO2024055530A1 (zh) 2024-03-21

Family

ID=83572781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078490 WO2024055530A1 (zh) 2022-09-15 2023-02-27 Image target detection method, system, device and storage medium

Country Status (2)

Country Link
CN (1) CN115205636B (zh)
WO (1) WO2024055530A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205636B (zh) * 2022-09-15 2023-04-07 苏州浪潮智能科技有限公司 一种图像的目标检测方法、系统、设备及存储介质
CN116596878B (zh) * 2023-05-15 2024-04-16 湖北纽睿德防务科技有限公司 一种带钢表面缺陷检测方法、系统、电子设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180260628A1 (en) * 2017-03-13 2018-09-13 Fanuc Corporation Apparatus and method for image processing to calculate likelihood of image of target object detected from input image
CN108648233A (zh) * 2018-03-24 2018-10-12 北京工业大学 Target recognition and grasping positioning method based on deep learning
CN114898111A (zh) * 2022-04-26 2022-08-12 北京百度网讯科技有限公司 Pre-training model generation method and apparatus, and target detection method and apparatus
CN115205636A (zh) * 2022-09-15 2022-10-18 苏州浪潮智能科技有限公司 Image target detection method, system, device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016614B (zh) * 2020-08-27 2022-10-11 北京理工大学 Construction method of an optical image target detection model, and target detection method and apparatus


Also Published As

Publication number Publication date
CN115205636B (zh) 2023-04-07
CN115205636A (zh) 2022-10-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 23864273
    Country of ref document: EP
    Kind code of ref document: A1