CN111325204B - Target detection method, target detection device, electronic equipment and storage medium


Info

Publication number: CN111325204B (grant); application published as CN111325204A
Application number: CN202010070961.5A
Authority: CN (China)
Prior art keywords: target, image, area, reference object, detected
Original language: Chinese (zh)
Inventor: Huang Chao (黄超)
Original and current assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Active (granted)

Classifications

    • G06V 10/464: Salient features, e.g. scale-invariant feature transforms (SIFT), using a plurality of salient features, e.g. bag-of-words (BoW) representations
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06V 10/25: Image preprocessing; determination of region of interest (ROI) or a volume of interest (VOI)
    • G06V 10/40: Extraction of image or video features
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The embodiment of the invention discloses a target detection method, a target detection device, an electronic device and a storage medium. The target detection method comprises: collecting an image to be detected; extracting image features of the image to be detected at a plurality of scales, and acquiring a reference object set corresponding to each scale; predicting the area where the target object is located in the image to be detected according to the image features to obtain a prediction area; selecting a reference object matched with the target object from the reference object set to obtain a target reference object; performing feature fusion on the image features at the plurality of scales to obtain fused image features; and detecting the target object based on the prediction area, the target reference object and the fused image features to obtain a detection result.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a target detection method, a target detection device, an electronic device, and a storage medium.
Background
Target detection is the basis of many computer vision tasks: its task is to find all target objects of interest in an image and determine their positions and sizes, and it is one of the core problems in the field of machine vision. Because objects vary in appearance, shape and pose, and imaging is further subject to interference from factors such as illumination and/or occlusion, target detection has long been among the most challenging problems in the field of machine vision.
Existing target detection technology generally detects targets one by one. When the targets are dense, some individual targets among them are easily misjudged as background areas, or background areas are misjudged as individual targets, which reduces the accuracy of target detection.
Disclosure of Invention
The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium, which can improve the accuracy of target detection.
The embodiment of the invention provides a target detection method, which comprises the following steps:
collecting an image to be detected;
extracting image features of the image to be detected under a plurality of scales, and acquiring a reference object set corresponding to each scale;
predicting the region where the target object is located in the image to be detected according to the image characteristics to obtain a predicted region;
selecting a reference object matched with the target object from the reference object set to obtain a target reference object;
carrying out feature fusion on the image features under the plurality of scales to obtain fused image features;
and detecting the target object based on the prediction area, the target reference object and the fused image characteristics to obtain a detection result.
Correspondingly, the embodiment of the invention also provides a target detection device, which comprises:
the collection module is used for collecting an image to be detected;
the extraction module is used for extracting image features of the image to be detected under a plurality of scales;
the acquisition module is used for acquiring a reference object set corresponding to each scale;
the prediction module is used for predicting the region where the target object is located in the image to be detected according to the image characteristics to obtain a prediction region;
the selection module is used for selecting a reference object matched with the target object from the reference object set to obtain a target reference object;
the fusion module is used for carrying out feature fusion on the image features under the multiple scales to obtain fused image features;
and the detection module is used for detecting the target object based on the prediction area, the target reference object and the fused image characteristics to obtain a detection result.
Optionally, in some embodiments of the present invention, the detection module includes:
the acquisition unit is used for acquiring the region where the target reference object is located in the image to be detected to obtain a reference region;
and the detection unit is used for detecting the target object based on the prediction area, the reference area and the fused image characteristics to obtain a detection result.
Optionally, in some embodiments of the present invention, the detection unit includes:
an adjusting subunit, configured to adjust the position of the prediction area according to the reference area, so as to obtain an adjusted area;
and the detection subunit is used for detecting the target object based on the adjusted region and the fused image characteristics to obtain a detection result.
Optionally, in some embodiments of the present invention, the adjusting subunit is specifically configured to:
calculating the position offset between the predicted area and the reference area;
and adjusting the position of the prediction area based on the position offset to obtain an adjusted area.
Optionally, in some embodiments of the present invention, the detection subunit is specifically configured to:
adjusting the position of the adjusted region in the image to be detected according to the fused image characteristics to obtain a target region;
And detecting the target object based on the target area to obtain the category to which the target object belongs.
Optionally, in some embodiments of the present invention, the fusion module is specifically configured to:
extracting depth information corresponding to each image feature;
and carrying out feature fusion on the image features under the multiple scales based on the depth information to obtain fused image features.
Optionally, in some embodiments of the present invention, the prediction module is specifically configured to:
acquiring a trained target detection model, wherein the target detection model is obtained by training a plurality of sample images;
and predicting the region where the target object is located in the image to be detected based on the target detection model and the image characteristics to obtain a prediction region.
Optionally, in some embodiments of the present invention, the apparatus further includes a training module, where the training module is specifically configured to:
collecting a plurality of sample images marked with region attributes;
determining a sample image which currently needs to be trained from the collected plurality of sample images, to obtain a current processing object;
importing the current processing object into a preset initial detection model for training, to obtain a predicted object corresponding to the current processing object;
converging the predicted object of the current processing object towards the reference object corresponding to the current processing object, so as to adjust the parameters of the preset initial detection model;
and returning to the step of determining, from the collected plurality of sample images, the sample image which currently needs to be trained, until the plurality of sample images have all been trained.
Optionally, in some embodiments of the present invention, the selecting module is specifically configured to:
and selecting a reference object which accords with the image characteristics of the target object from the reference object set to obtain the target reference object.
According to the embodiments of the invention, an image to be detected is collected; image features of the image to be detected are extracted at a plurality of scales, and a reference object set corresponding to each scale is acquired; the area where the target object is located in the image to be detected is predicted according to the image features to obtain a prediction area; a reference object matched with the target object is selected from the reference object set to obtain a target reference object; feature fusion is performed on the image features at the plurality of scales to obtain fused image features; and the target object is detected based on the prediction area, the target reference object and the fused image features to obtain a detection result. Therefore, the scheme can effectively improve the accuracy of target detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a scenario of a target detection method according to an embodiment of the present invention;
FIG. 1b is a schematic flow chart of a target detection method according to an embodiment of the present invention;
FIG. 2a is a schematic flow chart of a target detection method according to an embodiment of the present invention;
fig. 2b is a schematic diagram of another scenario of the target detection method according to the embodiment of the present invention;
FIG. 2c is a schematic diagram of sample labeling in the target detection method according to the embodiment of the present invention;
FIG. 2d is a schematic diagram of a feature conversion module according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of a target detection apparatus according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of another structure of the object detection device according to the embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium.
The object detection device may be integrated in a terminal, which may include a mobile phone, a tablet computer or a personal computer (PC, Personal Computer).
For example, referring to fig. 1a, the target detection device is integrated in a personal computer. When the personal computer receives a target detection request, the request includes an image to be detected; for example, the image to be detected is a game interface diagram, and the target detection request indicates that a control in the game interface diagram is to be detected. The personal computer may extract image features of the image to be detected at a plurality of scales and acquire a reference object set corresponding to each scale; predict the area where the target object in the image to be detected is located according to the image features to obtain a prediction area; select a reference object matched with the target object from the reference object set to obtain a target reference object; perform feature fusion on the image features at the plurality of scales to obtain fused image features; and finally detect the target object based on the prediction area, the target reference object and the fused image features to obtain a detection result. For example, the personal computer may detect that the game interface diagram contains a control, and may also detect the position of the control in the game interface diagram.
Compared with existing target detection schemes, the fused image features in this target detection scheme are obtained by feature fusion of the image features at a plurality of scales, so the feature expression capability can be improved.
Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described does not limit the preferred order of the embodiments.
A method of detecting an object, comprising: collecting an image to be detected; extracting image features of the image to be detected at a plurality of scales, and acquiring a reference object set corresponding to each scale; predicting the area where the target object in the image to be detected is located according to the image features to obtain a prediction area; selecting a reference object matched with the target object from the reference object set to obtain a target reference object; performing feature fusion on the image features at the plurality of scales to obtain fused image features; and detecting the target object based on the prediction area, the target reference object and the fused image features to obtain a detection result.
Referring to fig. 1b, fig. 1b is a flow chart of a target detection method according to an embodiment of the invention. The specific flow of the target detection method can be as follows:
101. and collecting an image to be detected.
The image to be detected may be an image shot in real time by a camera, an image captured from a video stream, or a game interface image. It may be acquired in various ways; for example, the image to be detected may be acquired from the Internet and/or a designated database, as determined by the requirements of the practical application.
102. And extracting image characteristics of the image to be detected under a plurality of scales, and acquiring a reference object set corresponding to each scale.
For example, specifically, feature extraction may be performed on the image to be detected by the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model to obtain image features of the image to be detected at a plurality of scales; at the same time, a reference object set corresponding to each scale is acquired. The reference object set may include a plurality of reference objects; a reference object may be a specific object, person or animal, or may be a prior region. For example, if the target detection request indicates that a specific region in the image to be detected is to be detected, such as the region where a control is located in a game interface image, the reference objects may be prior regions. It should be noted that the reference objects may be constructed from the samples used when training the target detection model.
It should be noted that, in the embodiment of the present invention, extracting image features of the image to be detected at a plurality of scales refers to extracting image features of the image to be detected at different sizes; for example, image features of the image to be detected are extracted at the three scales 40×40, 20×20 and 10×10, to obtain image features corresponding to the three image sizes.
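As a non-authoritative illustration of what "image features at a plurality of scales" can look like, the following minimal PyTorch sketch produces feature maps at 40×40, 20×20 and 10×10 from a 320×320 input; the layer widths and the use of plain convolutions in place of a full backbone are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Illustrative backbone: each stage halves the spatial size, so a
    320x320 input yields feature maps at the 40x40, 20x20 and 10x10 scales."""
    def __init__(self):
        super().__init__()
        def stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.ReLU(inplace=True))
        self.stem = stage(3, 64)     # 320 -> 160
        self.s1 = stage(64, 128)     # 160 -> 80
        self.s2 = stage(128, 256)    # 80  -> 40 (shallow: large grid)
        self.s3 = stage(256, 256)    # 40  -> 20
        self.s4 = stage(256, 256)    # 20  -> 10 (deep: small grid)

    def forward(self, x):
        x = self.s1(self.stem(x))
        f40 = self.s2(x)
        f20 = self.s3(f40)
        f10 = self.s4(f20)
        return [f40, f20, f10]       # image features at three scales

feats = MultiScaleBackbone()(torch.randn(1, 3, 320, 320))
print([tuple(f.shape[-2:]) for f in feats])  # [(40, 40), (20, 20), (10, 10)]
```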
103. And predicting the region where the target object is located in the image to be detected according to the image characteristics to obtain a predicted region.
For example, specifically, the area where the target object is located in the image to be detected may be predicted by the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model. That is, optionally, in some embodiments, the step of "predicting the area where the target object is located in the image to be detected according to the image features to obtain the prediction area" may specifically include:
(11) Acquiring a trained target detection model;
(12) And predicting the region where the target object is located in the image to be detected based on the target detection model and the image characteristics to obtain a prediction region.
The target detection model may be a refined detection network model (RefineDet), and may include an anchor refinement module (Anchor Refinement Module, ARM), a target detection module (Object Detection Module, ODM) and a feature conversion module (Transfer Connection Block, TCB); the area where the target object is located in the image to be detected may be predicted based on the ARM and the image features. It should be noted that the target detection model may be pre-constructed; that is, in some embodiments, before the step of "obtaining a trained target detection model", the method may specifically further include:
(21) Collecting a plurality of sample images marked with region attributes;
(22) Determining a sample image which currently needs to be trained from the collected plurality of sample images, to obtain a current processing object;
(23) Importing the current processing object into a preset initial detection model for training, to obtain a predicted object corresponding to the current processing object;
(24) Converging the predicted object of the current processing object towards the reference object corresponding to the current processing object, so as to adjust the parameters of the preset initial detection model;
(25) Returning to the step of determining, from the collected plurality of sample images, the sample image which currently needs to be trained, until the plurality of sample images have all been trained.
Taking the detection of controls in game interface images as an example: first, a large number of game interface images may be collected to obtain a sample image set; sample images in the initial sample image set whose similarity to others is greater than a preset threshold are then found, and the redundant near-duplicates are deleted. Next, the region attributes of the controls in the remaining sample images are marked; the region attribute of a control includes the position of the control in the sample image and the type of the control, where the control types may be preset, for example "return", "attribute", "close" and "other", set according to the specific situation. In this way a plurality of sample images (game interface images) marked with region attributes are collected. Then a sample image which currently needs to be trained is determined from the collected sample images to obtain a current processing object; the current processing object is imported into a preset initial detection model for training to obtain a predicted object corresponding to the current processing object; the predicted object of the current processing object is converged towards the corresponding reference object so as to adjust the parameters of the preset initial detection model; and the procedure returns to the next sample image to be trained until all of the sample images have been trained.
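The "region attribute" label described above (the position of the control in the sample image plus its type) could be represented as in the following sketch; the field names and coordinate convention are hypothetical and are only meant to make the annotation format concrete.

```python
from dataclasses import dataclass

# Control types named in the embodiment; "other" catches everything else.
CONTROL_TYPES = ("return", "attribute", "close", "other")

@dataclass
class RegionAttribute:
    """One labelled region: position of the graphic frame (abscissa,
    ordinate, width, height) and the control type it contains."""
    x: float
    y: float
    width: float
    height: float
    control_type: str

    def __post_init__(self):
        if self.control_type not in CONTROL_TYPES:
            raise ValueError(f"unknown control type: {self.control_type}")

# A sample image would carry a list of such labels.
sample_labels = [
    RegionAttribute(x=12, y=8, width=48, height=48, control_type="return"),
    RegionAttribute(x=250, y=8, width=40, height=40, control_type="close"),
]
```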
It should be noted that the reference objects may be constructed from the plurality of sample images marked with region attributes. Specifically, k types of regions may be randomly selected from the plurality of sample images, where k is a positive integer; the k types of regions may lie on the same sample image or on different sample images. For example, 4 types of regions may be randomly selected from the plurality of sample images as cluster centers, where the 4 types may be: "return", "attribute", "close" and "other". Then the distance between each region in each sample image and these 4 cluster centers is calculated, and each region in each sample image is assigned to the closest cluster center. After the regions have been assigned, each cluster center is recalculated from the samples in its cluster, and this repeats until the clustering meets a preset condition, namely that no sample is reassigned to a different cluster and the cluster centers no longer change; the cluster centers at termination are taken as the reference objects (a code sketch follows the list below). That is, in some embodiments, the method may specifically further include:
(31) Randomly selecting k types of regions from the plurality of sample images to obtain k cluster centers;
(32) Calculating the distance between each region in each sample image and the k cluster centers;
(33) Updating the k cluster centers based on the distances, and returning to the calculation of the distance between each region in each sample image and the k cluster centers, until the cluster centers obtained by the nth calculation are the same as those obtained by the (n-1)th calculation, where n is a positive integer.
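A minimal sketch of steps (31) to (33) is given below, clustering labelled regions by their (width, height) pairs; using the Euclidean distance is an assumption, since the text above only speaks of "distance".

```python
import numpy as np

def cluster_reference_objects(regions, k=4, seed=0, max_iter=100):
    """regions: (N, 2) array of (width, height) pairs of labelled regions.
    Returns k cluster centers to be used as prior frames (reference objects)."""
    rng = np.random.default_rng(seed)
    # step (31): randomly pick k regions as the initial cluster centers
    centers = regions[rng.choice(len(regions), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # step (32): distance of every region to each of the k centers
        dists = np.linalg.norm(regions[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step (33): recompute centers; stop when they no longer change
        new_centers = np.array([
            regions[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

region_sizes = np.array([[48, 48], [40, 40], [44, 46], [120, 32], [118, 30]])
print(cluster_reference_objects(region_sizes, k=2))
```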
104. And selecting a reference object matched with the target object from the reference object set to obtain the target reference object.
Wherein, the reference object meeting the image characteristics of the target object can be selected from the reference object set to obtain the target reference object, that is, optionally, in some embodiments, the step of selecting the reference object matched with the target object from the reference object set to obtain the target reference object may specifically include: and selecting a reference object which accords with the image characteristics of the target object from the reference object set to obtain the target reference object.
For example, if the image to be detected is a face image, a reference object conforming to the image features of the face may be selected from the reference object set to obtain a target reference object, which may be specifically set according to the actual situation.
105. And carrying out feature fusion on the image features under a plurality of scales to obtain fused image features.
In order to improve the feature expression capability, feature fusion may be performed on the image features at the plurality of scales. For example, suppose that in step 102 image features of the image to be detected are extracted at 4 scales, denoted image feature A, image feature B, image feature C and image feature D. Image feature B may be deconvolved, the deconvolved image feature B added to image feature A, and feature extraction performed on the sum to obtain a fused image feature. In the same way, image features B, C and D may each be deconvolved and each result added to the image feature one scale shallower, with feature extraction performed on each sum, to obtain three fused image features; the details are not repeated herein.
It should be noted that, in the embodiment of the present invention, performing feature fusion on the image features at the plurality of scales means performing a deconvolution operation on the deeper image features among the extracted image features and then adding the deconvolved image features to the shallower image features. For example, with image features A, B, C and D extracted at 40×40, 20×20, 10×10 and 5×5 respectively, image feature A is shallower than image feature B, so image feature B may be deconvolved and added to image feature A. In other words, in the embodiment of the present invention, feature fusion is performed on the image features at the plurality of scales according to the depth information of the image features. That is, optionally, in some embodiments, the step of "performing feature fusion on the image features at the plurality of scales to obtain fused image features" may specifically include:
(41) Extracting depth information corresponding to each image feature;
(42) And carrying out feature fusion on the image features under a plurality of scales based on the depth information to obtain fused image features.
It should also be noted that, in the embodiment of the present invention, the depth of an image feature refers to the grid size of that feature in the depth network: the larger the grid, the shallower the feature, and the smaller the grid, the deeper the feature.
106. And detecting the target object based on the prediction area, the target reference object and the fused image characteristics to obtain a detection result.
For example, when detecting a control of a game interface, the area where the target object (i.e., the control in the game interface) is located in the image to be detected is predicted according to the image features to obtain a prediction area, which may be represented in the form of a graphic frame; the target reference object may likewise be represented as a graphic frame. It can be understood that the target reference object is a prior result for the target object, so the area where the target reference object is located in the image to be detected may be acquired to obtain a reference area, and the target object is then detected based on the prediction area, the reference area and the fused image features. That is, optionally, the step of "detecting the target object based on the prediction area, the target reference object and the fused image features to obtain a detection result" may specifically include:
(51) Acquiring an area where a target reference object is located in an image to be detected, and obtaining a reference area;
(52) And detecting the target object based on the prediction area, the reference area and the fused image characteristics to obtain a detection result.
For example, the position of the prediction area may be adjusted according to the reference area to obtain an adjusted area, and then the target object is detected based on the adjusted area and the fused image feature, that is, optionally, in some embodiments, the step of detecting the target object based on the prediction area, the reference area and the fused image feature to obtain a detection result may specifically include:
(61) Adjusting the position of the prediction area according to the reference area to obtain an adjusted area;
(62) And detecting the target object based on the adjusted region and the fused image characteristics to obtain a detection result.
For example, the position information of the reference area and of the prediction area on the image to be detected is acquired respectively; the position information may be represented in the form of vertex coordinates. For example, the vertex coordinates of the reference area are (1, 1), (1, 3), (2, 3) and (2, 1), and the vertex coordinates of the prediction area are (1, 1.5), (1, 3.7), (2.8, 3) and (2, 1). The offsets between the corresponding coordinates may then be calculated, average coordinates computed from these offsets, and an adjusted area generated based on the average coordinates. Finally, the target object is detected according to the adjusted area and the fused image features to obtain a detection result. That is, optionally, in some embodiments, the step of "adjusting the position of the prediction area according to the reference area to obtain an adjusted area" may specifically include (a code sketch follows the list):
(71) Calculating the position offset between the predicted area and the reference area;
(72) And adjusting the position of the predicted area based on the position offset to obtain an adjusted area.
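Steps (71) and (72) might look as follows on vertex coordinates; taking the midpoint of the predicted and reference vertices (i.e. applying half the offset), as the coordinate example above suggests, is an assumption rather than the patent's prescribed rule.

```python
import numpy as np

def adjust_prediction(pred_box, ref_box):
    """pred_box, ref_box: (4, 2) arrays of vertex coordinates.
    Step (71): position offset between the prediction and reference areas.
    Step (72): shift the prediction by (half) the offset."""
    offset = ref_box - pred_box          # per-vertex position offset
    return pred_box + 0.5 * offset       # midpoint of the two areas

ref = np.array([[1, 1], [1, 3], [2, 3], [2, 1]], dtype=float)
pred = np.array([[1, 1.5], [1, 3.7], [2.8, 3], [2, 1]], dtype=float)
print(adjust_prediction(pred, ref))
```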
Further, the position of the adjusted region may be adjusted according to the fused image feature, that is, optionally, in some embodiments, the step of detecting the target object based on the adjusted region and the fused image feature to obtain a detection result may specifically include:
(81) Adjusting the position of the adjusted region in the image to be detected according to the fused image characteristics to obtain a target region;
(82) And detecting the target object based on the target area to obtain the category to which the target object belongs.
Specifically, the position of the adjusted area in the image to be detected may be adjusted a second time according to the fused image features to obtain a target area, and the target object is then detected with the ODM based on the target area to obtain the category to which the target object belongs.
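A hedged sketch of a per-scale head in the spirit of the ODM described above follows; the channel count, the number of prior frames per grid cell and the class count (the four control types plus background) are assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """For each of num_anchors prior frames at every grid cell, predict a
    second position refinement (4 offsets) and num_classes category scores."""
    def __init__(self, in_channels=256, num_anchors=4, num_classes=5):
        super().__init__()
        self.loc = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)

    def forward(self, fused_feature):
        return self.loc(fused_feature), self.cls(fused_feature)

loc, cls = DetectionHead()(torch.randn(1, 256, 40, 40))
print(loc.shape, cls.shape)  # per-cell offsets and category scores
```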
According to the embodiment of the invention, after the image to be detected is collected, image features of the image to be detected are extracted at a plurality of scales and a reference object set corresponding to each scale is acquired; the area where the target object is located in the image to be detected is then predicted according to the image features to obtain a prediction area; a reference object matched with the target object is selected from the reference object set to obtain a target reference object; feature fusion is performed on the image features at the plurality of scales to obtain fused image features; and finally the target object is detected based on the prediction area, the target reference object and the fused image features to obtain a detection result.
The method according to the embodiment will be described in further detail by way of example.
In this embodiment, a case where the object detection device is specifically integrated in the terminal will be described as an example.
Referring to fig. 2a, a specific process of the target detection method may be as follows:
201. the terminal collects an image to be detected.
The image to be detected may be an image shot in real time by a camera, an image captured from a video stream, or a game interface image. It may be obtained in various ways; for example, the terminal may collect the image to be detected from the Internet and/or a designated database, as determined by the requirements of the practical application.
202. The terminal extracts image features of the image to be detected under a plurality of scales, and acquires a reference object set corresponding to each scale.
For example, specifically, the terminal may perform feature extraction on the image to be detected through the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model to obtain image features of the image to be detected at a plurality of scales; at the same time, the terminal acquires a reference object set corresponding to each scale. The reference object set may include a plurality of reference objects; a reference object may be a specific object, person or animal, or may be a prior region. For example, if the target detection request indicates that a specific region in the image to be detected is to be detected, such as the region where a control is located in a game interface image, the reference objects may be prior regions. It should be noted that the reference objects may be constructed from the samples used when training the target detection model.
203. And the terminal predicts the region where the target object is located in the image to be detected according to the image characteristics to obtain a predicted region.
For example, specifically, the area where the target object is located in the image to be detected may be predicted by the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model. The target detection model may be a refined detection network model (RefineDet) and may include the anchor refinement module (ARM), a target detection module (Object Detection Module, ODM) and a feature conversion module (Transfer Connection Block, TCB); the target detection model may be pre-constructed, for which reference is made to the previous embodiments, and details are not repeated herein.
In order to prevent overfitting, RefineDet extracts the convolution features of the image using a ResNet-101 pre-trained on a large-scale database, where pre-training refers to training the ResNet-101 model on the image classification task of the ImageNet database. The ImageNet database has 1000 image categories, rich image features and over one million image samples; training on this database yields a robust depth model. The weights of this depth model are used as the initialization weights for feature extraction in RefineDet, and fine-tuning the model can greatly accelerate its convergence. At the same time, the images are randomly cropped, flipped and expanded to increase the number of samples, thereby improving the robustness of the model.
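The pre-training and augmentation described above might be set up as in the following torchvision sketch; the specific weight name and augmentation parameters are assumptions (torchvision 0.13 or later is assumed for the weights argument).

```python
import torchvision
from torchvision import transforms

# Initialize the feature extractor from ImageNet-pretrained ResNet-101
# weights, then fine-tune on the detection task to speed up convergence.
backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")

# Random cropping and flipping to enlarge the sample set, as described above.
augment = transforms.Compose([
    transforms.RandomResizedCrop(320, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```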
204. And the terminal selects a reference object matched with the target object from the reference object set to obtain the target reference object.
The terminal can select a reference object which accords with the image characteristics of the target object from the reference object set to obtain the target reference object.
205. And the terminal performs feature fusion on the image features under a plurality of scales to obtain fused image features.
In order to improve the expression capability of the features, feature fusion may be performed on the image features at the plurality of scales: a deconvolution operation is performed on the deeper image features among the extracted image features, and the deconvolved image features are then added to the shallower image features. The terminal may use the feature conversion module to perform this fusion on the image features at the plurality of scales to obtain the fused image features. During training, the feature conversion module adds the deeper features to the transferred features so as to inherit the large-scale contextual features of the image; in order to match the dimensions of the deeper features to those of the shallower features, a deconvolution layer is used to increase the size of the deeper convolution features, which are added to the shallower feature spectrum, after which a further convolution layer is applied to improve the discriminability of the depth features. Deconvolution is used to increase the feature scale; it performs up-sampling using an interpolation algorithm and then applies a convolution layer, and can be expressed by the following formula:
x_out = f(x_1) + x_2

where x_out refers to the fused feature, x_1 represents the deeper feature, x_2 represents the shallower feature, and f(x_1) represents applying the deconvolution layer to x_1 to increase its size.
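One possible realization of x_out = f(x_1) + x_2 as a feature conversion block is sketched below, using the deconvolution kernel size 4 and stride 2 stated later in this description; the channel width and the shape of the trailing convolution are assumptions.

```python
import torch
import torch.nn as nn

class FeatureConversionBlock(nn.Module):
    """x_out = f(x_1) + x_2: f upsamples the deeper feature x_1 with a
    deconvolution (kernel 4, stride 2) so it matches the shallower feature
    x_2; the sum is then convolved again to sharpen the fused feature."""
    def __init__(self, channels=256):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(channels, channels,
                                         kernel_size=4, stride=2, padding=1)
        self.post = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, deep, shallow):
        return self.post(self.deconv(deep) + shallow)

tcb = FeatureConversionBlock()
deep, shallow = torch.randn(1, 256, 10, 10), torch.randn(1, 256, 20, 20)
print(tcb(deep, shallow).shape)  # torch.Size([1, 256, 20, 20])
```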
206. And the terminal detects the target object based on the prediction area, the target reference object and the fused image characteristics to obtain a detection result.
For example, specifically, the terminal may obtain an area where the target reference object is located in the image to be detected, obtain a reference area, then adjust the position of the predicted area according to the position information of the reference area and the position information of the predicted area to obtain an adjusted area, then adjust the position of the adjusted area in the image to be detected according to the fused image feature to obtain a target area, and finally detect the target object based on the target area to obtain a category to which the target object belongs.
To facilitate understanding of the object detection method provided by the embodiment of the present invention, take a scene in which the object is a game control as an example, and refer to fig. 2b; the object detection device is integrated in a terminal. First, in the training stage, the terminal may collect images of 352 games, with 100 scene images per game; specifically, the images may be obtained by network download and/or by recording game video. The similarity between the images must not be higher than a preset threshold: if it is, the target detection model is prone to overfitting. After the sample image set is obtained, the position and type of each game control may be marked, for example by selecting the game control with a graphic frame, with different types of game controls represented by graphic frames of different colors. The game control types may include "return", "attribute", "close" and "other", and the position consists of the abscissa, the ordinate, the width and the height of the graphic frame, as shown in fig. 2c. After the marked sample image set is obtained, a k-means clustering algorithm is used to obtain the cluster centers corresponding to each type (see the previous embodiment for details, which are not repeated herein), so that 4 cluster centers may be obtained as the prior frames (i.e. reference objects) of RefineDet. The marked sample image set and the prior frames may then be imported into the target detection model. With an input image size of 320×320, the anchor refinement module predicts the positions of the targets from convolution feature spectra at four scales, namely 40×40, 20×20, 10×10 and 5×5, and game buttons of different sizes correspond to different scales: the 40×40 convolution features mainly detect buttons of a first size in the image, the 20×20 and 10×10 convolution features mainly detect buttons of a second size, and the 5×5 convolution features mainly fit buttons of a third size. Each target prior frame is associated with a unit of the convolution feature spectrum; for example, when predicting target positions based on the 40×40 convolution features there are 1600 grids in total, each grid feature has a corresponding prior frame, the center of the prior frame is the center of the grid, and the width and height of the prior frame are those obtained by the earlier clustering. The initial position of each target frame relative to its corresponding unit is fixed. After the anchor refinement module outputs the target frames, the convolution feature spectrum is converted by the feature conversion module to generate the features required by the target detection module. The specific structure of the feature conversion module is shown in fig. 2d: features of shallower depth are fused with features of deeper depth, improving the expression capability of the features. Shallower features refer to convolution features with larger grid sizes in the depth network, and deeper features refer to convolution features with smaller grid sizes.
For example, the convolution features with grid size 20×20 are extracted from the 40×40 convolution features through a number of convolution layers; the 40×40 convolution features are therefore shallower, and the 20×20 convolution features deeper.
The fusion process is as follows: the deeper features are enlarged in width and height by a deconvolution layer with kernel size 4 and stride 2, added to the shallower features, and further convolution features are then extracted from the sum; this increases the depth of the network and improves the discriminability of the features. The inputs of a feature conversion module are the convolution features of the target frames detected by the anchor refinement module and the output of the following feature conversion module, and its output is the convolution features used by the target detection module for target detection. Based on the fused features, the target detection module performs a second adjustment on the target frame positions obtained by the target frame generation module and simultaneously outputs the predicted values of all categories in each detection frame.
Specifically, the category cross-entropy loss and the position L1 loss may be used as the loss function of the model, and the model parameters may be optimized using a gradient back-propagation algorithm. The category cross-entropy loss is:

L_cls = -∑_{c=1}^{M} y_c · log(p_c)

where M is the total number of categories, y_c represents the true category, and p_c is the predicted value. The category cross-entropy loss is used to calculate the loss of the target category prediction: if the score corresponding to the true category of the target is high, the cross-entropy loss is small, and if the predicted score corresponding to the true category of the target is low, the category cross-entropy loss is large.
The position L1 loss is used to calculate the deviation of the target position predicted by the network; the position is fitted on the basis of the candidate frames obtained by the anchor refinement module. The formula of the L1 loss is:

L_loc = ∑_i |Y_i - Ŷ_i|

where Y_i refers to the true position of the target and Ŷ_i refers to the predicted position of the target. If the difference between the predicted position and the position of the real button is small, the L1 loss is small; otherwise, it is large.
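The two loss terms can be combined as in the following sketch. PyTorch's cross_entropy applies the softmax internally, so raw scores are passed in; the tensor shapes and the equal weighting of the two terms are assumptions, and the plain (rather than smooth) L1 loss follows the formula as written.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_scores, cls_true, loc_pred, loc_true):
    """cls_scores: (N, M) raw category scores; cls_true: (N,) true labels.
    loc_pred, loc_true: (N, 4) predicted and true positions.
    Implements L_cls = -sum_c y_c * log(p_c) plus L_loc = sum_i |Y_i - Y^_i|."""
    cls_loss = F.cross_entropy(cls_scores, cls_true, reduction="sum")  # category term
    loc_loss = F.l1_loss(loc_pred, loc_true, reduction="sum")          # position term
    return cls_loss + loc_loss

scores = torch.randn(8, 5)                 # 8 boxes, 5 categories
labels = torch.randint(0, 5, (8,))
boxes_pred, boxes_true = torch.randn(8, 4), torch.randn(8, 4)
print(detection_loss(scores, labels, boxes_pred, boxes_true))
```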
Finally, the target object (game control) in the image to be detected is detected based on the trained target detection model; for details, reference is made to the previous embodiments, which are not repeated herein.
In order to facilitate better implementation of the target detection method according to the embodiments of the present invention, an embodiment of the present invention further provides a target detection device (abbreviated as a detection device) based on the foregoing method. The terms have the same meanings as in the target detection method above; for specific implementation details, reference may be made to the description of the method embodiment.
Referring to fig. 3a, fig. 3a is a schematic structural diagram of a target detection device according to an embodiment of the present invention. The detection device may include a collection module 301, an extraction module 302, an acquisition module 303, a prediction module 304, a selection module 305, a fusion module 306 and a detection module 307, specifically as follows:

The collection module 301 is configured to collect an image to be detected.

The image to be detected may be an image captured in real time by a camera, an image captured from a video stream, or a game interface image. It may be obtained in various ways; for example, the collection module 301 may collect the image to be detected from the Internet and/or a designated database, as determined by the requirements of the practical application.
The extracting module 302 is configured to extract image features of the image to be detected under multiple scales.
For example, the extraction module 302 may perform feature extraction on the image to be detected through the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model.
An obtaining module 303, configured to obtain a reference object set corresponding to each scale.
And the prediction module 304 is configured to predict an area where the target object is located in the image to be detected according to the image feature, so as to obtain a predicted area.
For example, specifically, the prediction module 304 may predict the area where the target object is located in the image to be detected through the anchor refinement module (Anchor Refinement Module, ARM) in a trained target detection model.

Alternatively, in some embodiments, the prediction module 304 may be specifically configured to: obtain a trained target detection model, and predict the area where the target object in the image to be detected is located based on the target detection model and the image features, to obtain a prediction area.
Optionally, in some embodiments, referring to fig. 3b, the detection apparatus may further include a training module 308, where the training module 308 may specifically be configured to: collecting a plurality of sample images marked with region attributes; determining a sample image which currently needs to be trained from the collected plurality of sample images to obtain a current processing object; importing the current processing object into a preset initial detection model for training to obtain a predicted object corresponding to the current processing object; converging the predicted object of the current processing object towards the reference object corresponding to the current processing object so as to adjust the parameters of the preset initial detection model; and returning to the step of determining, from the collected plurality of sample images, the sample image which currently needs to be trained, until the plurality of sample images have all been trained.
The selection module 305 is configured to select a reference object matching the target object from the reference object set, so as to obtain the target reference object.
Wherein the selection module 305 may select a reference object from the reference object set, which meets the image characteristics of the target object, to obtain the target reference object, that is, optionally, in some embodiments, the selection module 305 is specifically configured to: and selecting a reference object which accords with the image characteristics of the target object from the reference object set to obtain the target reference object.
And the fusion module 306 is used for carrying out feature fusion on the image features under a plurality of scales to obtain fused image features.
In order to improve the expressive power of the features, feature fusion may be performed on the image features under multiple scales, that is, optionally, the fusion module 306 is specifically configured to: and extracting depth information corresponding to each image feature, and carrying out feature fusion on the image features under a plurality of scales based on the depth information to obtain fused image features.
The detection module 307 is configured to detect the target object based on the prediction area, the target reference object, and the fused image feature, so as to obtain a detection result.
For example, specifically, the detection module 307 may acquire the area where the target reference object is located in the image to be detected to obtain a reference area; the detection module 307 then adjusts the position of the prediction area according to the position information of the reference area and the position information of the prediction area to obtain an adjusted area; the detection module 307 then adjusts the position of the adjusted area in the image to be detected according to the fused image features to obtain a target area; and finally the target object is detected based on the target area to obtain the category to which the target object belongs.
Optionally, in some embodiments, the detection module 307 comprises:
the acquisition unit is used for acquiring the region where the target reference object is located in the image to be detected to obtain a reference region;
and the detection unit is used for detecting the target object based on the prediction area, the reference area and the fused image characteristics to obtain a detection result.
Optionally, in an embodiment, the detection unit may specifically include:
the adjustment subunit is used for adjusting the position of the prediction area according to the reference area to obtain an adjusted area;
and the detection subunit is used for detecting the target object based on the adjusted region and the fused image characteristics to obtain a detection result.
Optionally, in some embodiments, the adjusting subunit is specifically configured to: and calculating the position offset between the predicted area and the reference area, and adjusting the position of the predicted area based on the position offset to obtain an adjusted area.
Alternatively, in some embodiments, the detection subunit may be specifically configured to: and adjusting the position of the adjusted region in the image to be detected according to the fused image characteristics to obtain a target region, and detecting a target object based on the target region to obtain the category of the target object.
After the collection module 301 of the embodiment of the present invention collects an image to be detected, the extraction module 302 extracts image features of the image to be detected at a plurality of scales, and the acquisition module 303 acquires a reference object set corresponding to each scale; the prediction module 304 then predicts the area where the target object is located in the image to be detected according to the image features to obtain a prediction area; the selection module 305 selects a reference object matched with the target object from the reference object set to obtain a target reference object; the fusion module 306 performs feature fusion on the image features at the plurality of scales to obtain fused image features; and finally the detection module 307 detects the target object based on the prediction area, the target reference object and the fused image features to obtain a detection result.
In addition, the embodiment of the invention further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
the electronic device may include one or more processing cores 'processors 401, one or more computer-readable storage media's memory 402, power supply 403, and input unit 404, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby controlling the electronic device as a whole. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described herein again. In particular, in this embodiment, the processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing the following functions:
collecting an image to be detected; extracting image features of the image to be detected at a plurality of scales; acquiring a reference object set corresponding to each scale; predicting the area where a target object in the image to be detected is located according to the image features to obtain a prediction area; selecting a reference object matching the target object from the reference object set to obtain a target reference object; performing feature fusion on the image features at the plurality of scales to obtain fused image features; and detecting the target object based on the prediction area, the target reference object, and the fused image features to obtain a detection result.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated herein.
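Purely as an illustration of how these steps could be wired together, the following PyTorch-style sketch chains multi-scale feature extraction, dimension matching with 1x1 convolutions and resizing, summation into fused features, and simple box/class heads. Every layer, channel count, and name is a hypothetical assumption chosen for readability rather than the claimed implementation, and the reference-object matching step is omitted for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDetector(nn.Module):
    # Toy stand-in for the pipeline described above; all sizes are illustrative.
    def __init__(self, channels=(8, 16, 32), fused_dim=8, num_classes=3):
        super().__init__()
        # Multi-scale "backbone": three strided conv stages, one per scale.
        self.stages = nn.ModuleList()
        c_in = 3
        for c_out in channels:
            self.stages.append(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1))
            c_in = c_out
        # 1x1 convolutions convert each scale's features to a matched dimension.
        self.lateral = nn.ModuleList([nn.Conv2d(c, fused_dim, 1) for c in channels])
        # Heads predicting a box (prediction area) and a class score per location.
        self.box_head = nn.Conv2d(fused_dim, 4, 1)
        self.cls_head = nn.Conv2d(fused_dim, num_classes, 1)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = F.relu(stage(x))
            feats.append(x)  # image features at each scale
        # Match dimensions (channels via 1x1 conv, space via resizing),
        # then sum across scales to obtain the fused image features.
        h, w = feats[0].shape[-2:]
        fused = sum(F.interpolate(l(f), size=(h, w), mode="nearest")
                    for l, f in zip(self.lateral, feats))
        return self.box_head(fused), self.cls_head(fused)

boxes, scores = TinyDetector()(torch.randn(1, 3, 64, 64))
print(boxes.shape, scores.shape)  # (1, 4, 32, 32) and (1, 3, 32, 32)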
According to the embodiment of the present invention, after the image to be detected is collected, image features of the image to be detected at multiple scales are extracted, and a reference object set corresponding to each scale is acquired. The area where the target object is located in the image to be detected is then predicted according to the image features to obtain a prediction area, and a reference object matching the target object is selected from the reference object set to obtain a target reference object. The image features at the multiple scales are then fused to obtain fused image features, and finally the target object is detected based on the prediction area, the target reference object, and the fused image features to obtain a detection result.
Those of ordinary skill in the art will appreciate that all or some of the steps of the various methods of the above embodiments may be completed by instructions, or by instructions controlling related hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium storing a plurality of instructions that can be loaded by a processor to perform the steps of any of the target detection methods provided by the embodiments of the present invention. For example, the instructions may perform the following steps:
collecting an image to be detected; extracting image features of the image to be detected at a plurality of scales; acquiring a reference object set corresponding to each scale; predicting the area where a target object in the image to be detected is located according to the image features to obtain a prediction area; selecting a reference object matching the target object from the reference object set to obtain a target reference object; performing feature fusion on the image features at the plurality of scales to obtain fused image features; and detecting the target object based on the prediction area, the target reference object, and the fused image features to obtain a detection result.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated herein.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the storage medium may perform the steps of any of the target detection methods provided in the embodiments of the present invention, they can achieve the beneficial effects achievable by any of those methods; details are given in the previous embodiments and are not repeated herein.
The target detection method, apparatus, electronic device, and storage medium provided by the embodiments of the present invention have been described above in detail, with specific examples used to illustrate the principles and implementations of the present invention. The above description of the embodiments is intended only to help understand the method and core idea of the present invention. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in light of the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (14)

1. A target detection method, comprising:
collecting an image to be detected;
extracting image features of the image to be detected at a plurality of scales, and acquiring a reference object set corresponding to each scale;
predicting, based on a trained target detection model and the image features, an area where a target object is located in the image to be detected, to obtain a prediction area, wherein the target detection model is obtained by training with a plurality of sample images;
selecting a reference object matched with the target object from the reference object set to obtain a target reference object, wherein the reference object is a sample image used in training of the target detection model;
converting the image features at the plurality of scales into image features with matched dimensions and then summing the converted image features to obtain fused image features;
and detecting the target object based on the prediction area, the target reference object, and the fused image features to obtain a detection result.
2. The method according to claim 1, wherein detecting the target object based on the prediction area, the target reference object, and the fused image features to obtain a detection result includes:
acquiring an area where the target reference object is located in the image to be detected, to obtain a reference area;
and detecting the target object based on the prediction area, the reference area, and the fused image features to obtain a detection result.
3. The method according to claim 2, wherein detecting the target object based on the prediction area, the reference area, and the fused image features to obtain a detection result includes:
adjusting the position of the prediction area according to the reference area to obtain an adjusted area;
and detecting the target object based on the adjusted area and the fused image features to obtain a detection result.
4. The method according to claim 3, wherein the adjusting the position of the prediction area according to the reference area to obtain an adjusted area comprises:
calculating the position offset between the predicted area and the reference area;
and adjusting the position of the prediction area based on the position offset to obtain an adjusted area.
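As a minimal sketch of this offset-and-adjust step, assuming axis-aligned boxes in (x1, y1, x2, y2) form and a pure center shift (the claim fixes neither choice), one could write:

import torch

def adjust_by_offset(pred_boxes, ref_boxes):
    # Shift each predicted box by its positional offset from the matched
    # reference box; boxes are N x 4 tensors in (x1, y1, x2, y2) form.
    pred_centers = (pred_boxes[:, :2] + pred_boxes[:, 2:]) / 2
    ref_centers = (ref_boxes[:, :2] + ref_boxes[:, 2:]) / 2
    offset = ref_centers - pred_centers      # position offset per box
    # Apply the same shift to both corners, preserving width and height.
    return pred_boxes + offset.repeat(1, 2)

pred = torch.tensor([[10., 10., 30., 30.]])
ref = torch.tensor([[12., 14., 32., 34.]])
print(adjust_by_offset(pred, ref))           # tensor([[12., 14., 32., 34.]])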
5. The method according to claim 3, wherein the detecting the target object based on the adjusted area and the fused image features to obtain a detection result comprises:
adjusting the position of the adjusted area in the image to be detected according to the fused image features to obtain a target area;
and detecting the target object based on the target area to obtain the category to which the target object belongs.
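One illustrative reading of this step, in which the fused features are cropped at the target area, pooled to a fixed size, and classified (the crop-pool-classify scheme and all sizes below are assumptions, not the claimed mechanism):

import torch
import torch.nn as nn

def classify_region(fused_feature, box, classifier):
    # Crop the fused feature map at the (integer) target area and classify it.
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = fused_feature[:, :, y1:y2, x1:x2]           # region of interest
    pooled = nn.AdaptiveAvgPool2d(1)(crop).flatten(1)  # fixed-size descriptor
    return classifier(pooled).argmax(dim=1)            # predicted category

fused = torch.randn(1, 8, 32, 32)   # assumed fused image features
cls_head = nn.Linear(8, 3)          # assumed 3 candidate categories
print(classify_region(fused, (4, 4, 20, 20), cls_head))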
6. The method according to any one of claims 1 to 5, wherein the converting the image features at the plurality of scales into image features with matched dimensions and then summing to obtain fused image features includes:
extracting depth information corresponding to each image feature;
and converting the image features at the plurality of scales into image features with matched dimensions based on the depth information, and then summing to obtain the fused image features.
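A minimal sketch of such depth-aware fusion, where the "depth information" is read as each feature map's channel count and a 1x1 convolution plus resizing performs the dimension matching (both readings are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(feats, out_dim=8, size=(32, 32)):
    # Convert every scale to a matched dimension, then sum.
    fused = torch.zeros(feats[0].shape[0], out_dim, *size)
    for f in feats:
        depth = f.shape[1]                       # depth information
        proj = nn.Conv2d(depth, out_dim, 1)      # depth -> matched dimension
        fused = fused + F.interpolate(proj(f), size=size, mode="nearest")
    return fused

feats = [torch.randn(1, c, s, s) for c, s in [(8, 32), (16, 16), (32, 8)]]
print(fuse(feats).shape)                         # torch.Size([1, 8, 32, 32])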
7. The method according to any one of claims 1 to 5, wherein before the trained target detection model is obtained, the method further comprises:
collecting a plurality of sample images marked with regional attributes;
determining, from the collected plurality of sample images, a sample image that currently needs to be trained, to obtain a current processing object;
importing the current processing object into a preset initial detection model for training, to obtain a predicted object corresponding to the current processing object;
converging a reference object corresponding to the current processing object and the predicted object of the current processing object, so as to adjust parameters of the preset initial detection model;
and returning to the step of determining, from the collected plurality of sample images, a sample image that currently needs to be trained, until the plurality of sample images are all trained.
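In the spirit of this training loop, a generic sketch follows; the optimizer, the loss, and the model are placeholder assumptions, and "converging" the predicted object with its reference is realized here as minimizing a regression loss:

import torch
import torch.nn as nn

def train(model, samples, targets, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.SmoothL1Loss()                       # e.g. for box regression
    for _ in range(epochs):
        for sample, target in zip(samples, targets):  # current processing object
            pred = model(sample)                      # predicted object
            loss = loss_fn(pred, target)              # prediction vs. reference
            opt.zero_grad()
            loss.backward()                           # adjust model parameters
            opt.step()
    return model

model = nn.Linear(4, 4)                               # stand-in detection model
samples = [torch.randn(1, 4) for _ in range(8)]
targets = [torch.randn(1, 4) for _ in range(8)]
train(model, samples, targets)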
8. The method according to any one of claims 1 to 5, wherein selecting a reference object from the reference object set that matches the target object to obtain a target reference object includes:
and selecting, from the reference object set, a reference object that matches the image features of the target object, to obtain the target reference object.
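A toy illustration of such feature-based selection, using cosine similarity as one assumed notion of "matching the image features" (the claim does not prescribe a particular measure):

import torch
import torch.nn.functional as F

def select_reference(target_feat, ref_feats):
    # Pick the reference object whose feature vector best matches the target's.
    sims = F.cosine_similarity(ref_feats, target_feat.unsqueeze(0), dim=1)
    return int(sims.argmax())                # index of the target reference

refs = torch.randn(5, 16)    # features of 5 reference objects (assumed dim 16)
target = torch.randn(16)     # image feature of the target object
print(select_reference(target, refs))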
9. An object detection apparatus, comprising:
the collection module is used for collecting an image to be detected;
the extraction module is used for extracting image features of the image to be detected at a plurality of scales;
the acquisition module is used for acquiring a reference object set corresponding to each scale;
the prediction module is used for predicting, based on a trained target detection model and the image features, an area where a target object is located in the image to be detected, to obtain a prediction area, wherein the target detection model is obtained by training with a plurality of sample images;
the selection module is used for selecting a reference object matched with the target object from the reference object set to obtain a target reference object, wherein the reference object is a sample image used in training of the target detection model;
the fusion module is used for converting the image features at the plurality of scales into image features with matched dimensions and then summing the converted image features to obtain fused image features;
and the detection module is used for detecting the target object based on the prediction area, the target reference object, and the fused image features to obtain a detection result.
10. The apparatus of claim 9, wherein the detection module comprises:
the acquisition unit is used for acquiring an area where the target reference object is located in the image to be detected to obtain a reference area;
and the detection unit is used for detecting the target object based on the prediction area, the reference area, and the fused image features to obtain a detection result.
11. The apparatus according to claim 10, wherein the detection unit comprises:
an adjusting subunit, configured to adjust the position of the prediction area according to the reference area, so as to obtain an adjusted area;
and the detection subunit is used for detecting the target object based on the adjusted area and the fused image features to obtain a detection result.
12. The apparatus of claim 11, wherein the adjustment subunit is specifically configured to:
calculating the position offset between the predicted area and the reference area;
and adjusting the position of the prediction area based on the position offset to obtain an adjusted area.
13. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the target detection method according to any one of claims 1 to 8.
14. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the target detection method according to any one of claims 1 to 8.
CN202010070961.5A 2020-01-21 2020-01-21 Target detection method, target detection device, electronic equipment and storage medium Active CN111325204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070961.5A CN111325204B (en) 2020-01-21 2020-01-21 Target detection method, target detection device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111325204A CN111325204A (en) 2020-06-23
CN111325204B (en) 2023-10-31

Family

ID=71173228

Country Status (1)

Country Link
CN (1) CN111325204B (en)





Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code: HK; legal event code: DE; document number: 40023544
SE01 Entry into force of request for substantive examination
GR01 Patent grant