CN116935179A - Target detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116935179A
Authority
CN
China
Prior art keywords
image
feature map
target detection
detected
determining
Prior art date
Legal status
Granted
Application number
CN202311182443.2A
Other languages
Chinese (zh)
Other versions
CN116935179B (en)
Inventor
张建安
刘微
郑维学
Current Assignee
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202311182443.2A priority Critical patent/CN116935179B/en
Publication of CN116935179A publication Critical patent/CN116935179A/en
Application granted granted Critical
Publication of CN116935179B publication Critical patent/CN116935179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method and device, an electronic device, and a storage medium, relating to the technical field of computer vision. An image to be detected with a solid-color background is acquired, texture extraction is performed on the image to be detected, and a texture image corresponding to the image to be detected is determined; the image to be detected and the texture image are respectively input into a target detection model, and a target detection result in the image to be detected is determined based on the target detection model. The target detection model performs feature extraction on the image to be detected and the texture image respectively and fuses the results to determine a first feature map, and determines the target detection result based on the first feature map. Combining the image to be detected with the texture image makes the target detection model pay more attention to the foreground target, thereby improving the accuracy of target detection against a solid-color background. The technical scheme has universality, reliability, robustness, and interpretability, and conforms to trustworthiness characteristics.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to a target detection method, apparatus, electronic device, and storage medium.
Background
In the field of computer vision applications, it is often necessary to first frame out a specific target in an image for subsequent recognition processing; in computer vision this is referred to as target detection technology.
Common target detection techniques include the YOLO series, such as YOLOv1, YOLOv3, YOLOv5, YOLOv7, and YOLOv8; the Faster RCNN series, such as RCNN, Fast RCNN, Mask RCNN, Cascade Mask RCNN, and Libra RCNN; as well as SSD, RetinaNet, CornerNet, and the like. These target detection techniques focus on general-purpose target detection, i.e., detecting targets against multi-category complex backgrounds.
In the prior art, training a general-purpose target detection model requires collecting a large number of image samples with multi-category complex backgrounds, and the resulting model is used for target detection in every scene, so the training difficulty and cost of the general-purpose model are high. Moreover, although the general-purpose model is capable of target detection against multi-category complex backgrounds, such image samples are difficult to collect in large numbers and cover scenes poorly, so the accuracy of existing target detection cannot be guaranteed.
Disclosure of Invention
The application provides a target detection method, a target detection device, an electronic device, and a storage medium, which provide a highly accurate target detection solution for solid-color background images.
In a first aspect, the present application provides a target detection method, the method comprising:
obtaining an image to be detected of a solid background, carrying out texture extraction on the image to be detected, and determining a texture image corresponding to the image to be detected;
respectively inputting the image to be detected and the texture image into a target detection model, and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
In a second aspect, the present application provides an object detection apparatus, the apparatus comprising:
the first determining module is used for obtaining an image to be detected of a solid background, extracting textures of the image to be detected, and determining a texture image corresponding to the image to be detected;
the second determining module is used for respectively inputting the image to be detected and the texture image into a target detection model and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of the method when executing the program stored in the memory.
In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps.
The application provides a target detection method, a target detection device, electronic equipment and a storage medium, wherein the method comprises the following steps: obtaining an image to be detected of a solid background, carrying out texture extraction on the image to be detected, and determining a texture image corresponding to the image to be detected; respectively inputting the image to be detected and the texture image into a target detection model, and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
The technical scheme has the following advantages or beneficial effects:
in the application, aiming at a target detection scene of a solid background, after an image to be detected of the solid background is obtained, a texture image corresponding to the image to be detected is firstly determined through texture extraction; and then inputting the image to be detected and the texture image into a target detection model, carrying out feature extraction and fusion on the image to be detected and the texture image based on the target detection model to determine a first feature map, and determining a target detection result based on the first feature map. The combination of the image to be detected and the texture image enables the target detection model to pay more attention to the foreground target, so that the accuracy of target detection in the solid background is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a target detection process according to the present application;
FIG. 2 is a schematic diagram of a target detection model according to the present application;
FIG. 3 is a schematic diagram of a target detection process according to the present application;
FIG. 4 is a schematic diagram of a target detection model structure provided by the present application;
FIG. 5 is a schematic diagram of a target detection model structure provided by the present application;
FIG. 6 is a schematic diagram of a process for determining a target detection result according to the present application;
FIG. 7 is a schematic diagram of a target detection model structure provided by the present application;
FIG. 8 is a schematic diagram of a training process of the object detection model provided by the present application;
FIG. 9 is a schematic diagram of a prior art object detection model;
FIG. 10 is a schematic diagram of a target detection model according to the present application;
FIG. 11 is a schematic diagram of determining texture images based on Canny operators provided by the present application;
FIG. 12 is a schematic diagram of a target detection process according to the present application;
FIG. 13 is a schematic diagram of a target detection apparatus according to the present application;
fig. 14 is a schematic structural diagram of an electronic device according to the present application.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, an exemplary embodiment of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the present application are illustrated, it being apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims, and in the above-described figures are used for distinguishing between similar or like objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.
Fig. 1 is a schematic diagram of a target detection process provided by the present application, including the following steps:
S101: and obtaining an image to be detected of a solid background, extracting textures of the image to be detected, and determining a texture image corresponding to the image to be detected.
S102: respectively inputting the image to be detected and the texture image into a target detection model, and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
The application provides a target detection method for solid-color background images, such as pure white or pure black backgrounds. The method is applied to an electronic device, which may be a PC (personal computer), a tablet computer, a server, or the like, or may be an image acquisition device with image analysis and processing capability. If the electronic device is an image acquisition device, the image acquisition device acquires the image to be detected with the solid-color background, performs texture extraction on the image to be detected, determines the texture image corresponding to the image to be detected, and carries out the subsequent target detection process. If the electronic device is a PC, tablet computer, server, or the like, an image acquisition device acquires the image to be detected with the solid-color background and sends it to the electronic device; the electronic device then performs texture extraction on the image to be detected, determines the corresponding texture image, and carries out the subsequent target detection process.
After acquiring the image to be detected with the solid-color background, the electronic device first performs texture extraction on the image to be detected and determines the texture image corresponding to the image to be detected. Optionally, texture extraction may be performed on the image to be detected through an edge extraction algorithm, or through a contour extraction algorithm, to determine the texture image corresponding to the image to be detected.
The electronic equipment is provided with a target detection model which is trained in advance, after a texture image corresponding to an image to be detected is determined, the image to be detected and the texture image are respectively input into the target detection model, the target detection model respectively performs feature extraction and fusion on the image to be detected and the texture image, a first feature map is determined, and a target detection result is determined based on the first feature map. The texture image is used as attention force diagram to calculate the attention of the image to be detected, so that the purpose of focusing on the foreground is achieved, on one hand, the complexity of target detection is reduced, and on the other hand, the accuracy of target detection in a solid-color background is improved.
The target detection method for the solid background image has universality, reliability, robustness and interpretability, and accords with the credibility characteristic.
Interpretability: the texture map extracted by the Canny operator is used to enhance the input so that the foreground and the background are better separated, so the model is interpretable. Robustness: the Canny operator is robust and insensitive to changes such as image color, i.e., for small changes in pixel color the texture map output by the Canny algorithm remains unchanged, so the overall algorithm is robust. Universality: the method can be extended to scenes with any solid-color background, such as the white, black, and blue backgrounds common in the industrial inspection field, and it can likewise be used for various industrial inspection scenes, such as part anomaly detection; it therefore has universality. Reliability: focusing the input allows the model to concentrate on the foreground target, which makes detection reliable.
Fig. 2 is a schematic diagram of a target detection model provided by the present application, including an image extraction network to be detected and a texture image extraction network. Performing feature extraction on the image to be detected based on the image to be detected extraction network to obtain a feature map of the image to be detected; performing feature extraction on the texture image based on the texture image extraction network to obtain a texture image feature map; and fusing the image feature map to be detected and the texture image feature map, determining the first feature map, and determining the target detection result based on the first feature map.
Optionally, the image extraction network to be detected and the texture image extraction network each include a Conv structure, a BN structure, and a ReLU structure. The application does not limit the number of layers of the image extraction network to be detected and the texture image extraction network: each may include a single layer of Conv, BN, and ReLU structures, or two or more layers of Conv, BN, and ReLU structures. Fig. 2 is illustrated with the image extraction network to be detected and the texture image extraction network each including two layers of Conv, BN, and ReLU structures. Feature extraction is performed on the image to be detected based on the Conv, BN, and ReLU structures in the image extraction network to be detected to obtain the image feature map to be detected; feature extraction is performed on the texture image based on the Conv, BN, and ReLU structures in the texture image extraction network to obtain the texture image feature map. The image feature map to be detected and the texture image feature map are then fused to determine the first feature map, and the target detection result is determined based on the first feature map.
In the application, the fusion of the image feature map to be detected and the texture image feature map can be the dimension splicing fusion of the image feature map to be detected and the texture image feature map, or the weighted summation fusion of pixel points at the corresponding positions of the image feature map to be detected and the texture image feature map. Preferably, in order to make the object detection model focus more on the foreground object, in the present application, fusing the image feature map to be detected and the texture image feature map, determining the first feature map includes: and carrying out point-by-point multiplication fusion on the image feature map to be detected and the texture image feature map, and determining the first feature map.
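The texture-attention fusion described above can be illustrated with a short sketch. The following is a minimal PyTorch sketch rather than the patent's actual implementation; the class and variable names (TextureAttention, img_branch, tex_branch), the channel widths, and the input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # One Conv-BN-ReLU block, as used in both extraction branches.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TextureAttention(nn.Module):
    """Fuses the image branch and the texture branch by point-by-point multiplication."""
    def __init__(self, img_ch=3, tex_ch=1, feat_ch=32):
        super().__init__()
        # Two stacked Conv-BN-ReLU layers per branch, as in Fig. 2.
        self.img_branch = nn.Sequential(conv_bn_relu(img_ch, feat_ch),
                                        conv_bn_relu(feat_ch, feat_ch))
        self.tex_branch = nn.Sequential(conv_bn_relu(tex_ch, feat_ch),
                                        conv_bn_relu(feat_ch, feat_ch))

    def forward(self, image, texture):
        f_img = self.img_branch(image)    # feature map of the image to be detected
        f_tex = self.tex_branch(texture)  # feature map of the texture image
        return f_img * f_tex              # point-by-point multiplication -> first feature map

# Example: a 640x640 RGB image and its single-channel texture image.
first_feature_map = TextureAttention()(torch.rand(1, 3, 640, 640),
                                        torch.rand(1, 1, 640, 640))
```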
Fig. 3 is a schematic diagram of a target detection process provided by the present application, including the following steps:
s201: and obtaining an image to be detected of a solid background, extracting textures of the image to be detected, and determining a texture image corresponding to the image to be detected.
S202: and respectively inputting the image to be detected and the texture image into a target detection model, wherein the target detection model comprises an image extraction network to be detected and a texture image extraction network.
S203: performing feature extraction on the image to be detected based on the image to be detected extraction network to obtain a feature map of the image to be detected; and carrying out feature extraction on the texture image based on the texture image extraction network to obtain a texture image feature map.
S204: and carrying out point-by-point multiplication fusion on the image feature map to be detected and the texture image feature map, determining the first feature map, and determining the target detection result based on the first feature map.
Fig. 4 is a schematic structural diagram of a target detection model provided by the present application, where the target detection model further includes a first feature extraction network, a second feature extraction network, and a third feature extraction network;
performing feature extraction on the first feature map based on the first feature extraction network to obtain a second feature map; performing feature extraction on the second feature map based on the second feature extraction network to obtain a third feature map; performing feature extraction on the third feature map based on the third feature extraction network to obtain a fourth feature map; and fusing the second feature map, the third feature map and the fourth feature map, and determining the target detection result according to a feature map fusion result.
The first feature extraction network in the present application may be the front part of a backbone network structure, the second feature extraction network may be the middle part of the backbone network structure, and the third feature extraction network may be the rear part of the backbone network structure. The front, middle, and rear parts of the backbone network structure are formed by stacking CSP structures from YOLOv5 and ordinary convolution structures Conv. In the application, the image feature map to be detected and the texture image feature map are multiplied and fused point by point; after the first feature map is determined, the first feature map is input into the first feature extraction network, and feature extraction is performed on the first feature map based on the first feature extraction network to obtain the second feature map; the second feature map is input into the second feature extraction network, and feature extraction is performed on the second feature map based on the second feature extraction network to obtain the third feature map; the third feature map is input into the third feature extraction network, and feature extraction is performed on the third feature map based on the third feature extraction network to obtain the fourth feature map. Finally, the second feature map, the third feature map, and the fourth feature map are fused, and the target detection result is determined according to the feature map fusion result.
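As a rough illustration of the three cascaded feature extraction stages, the sketch below uses plain strided convolutions where the patent stacks YOLOv5 CSP structures; the channel widths and spatial sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # Each stage halves the spatial resolution of its input (downsampling).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

stage1, stage2, stage3 = stage(32, 64), stage(64, 128), stage(128, 256)  # stand-ins for the three networks

first_feature_map = torch.rand(1, 32, 160, 160)   # output of the texture-attention fusion
second_feature_map = stage1(first_feature_map)    # 1/2 the scale of the first feature map
third_feature_map = stage2(second_feature_map)    # 1/4 scale
fourth_feature_map = stage3(third_feature_map)    # 1/8 scale
```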
Optionally, fusing the second feature map, the third feature map and the fourth feature map may mean performing dimension-wise concatenation fusion on the three maps to obtain the feature map fusion result. Alternatively, the third feature map and the fourth feature map may be fused to obtain a first fused feature map; the second feature map and the first fused feature map are fused to obtain a second fused feature map; the first fused feature map and the second fused feature map are fused to obtain a third fused feature map; and the fourth feature map and the third fused feature map are fused to obtain a fourth fused feature map. The target detection result is then determined according to the second fused feature map, the third fused feature map, and the fourth fused feature map, which is beneficial for detecting both large and small targets.
For scenes where targets are detected against a solid-color background, for example industrial vision scenes, the targets to be detected usually occupy a relatively large area of the image, unlike general target images that contain a large number of small targets. Based on this consideration, in order to simplify the model structure for target detection while ensuring the accuracy of detecting large targets, Fig. 5 is a schematic diagram of the structure of the target detection model provided by the present application, where the target detection model further includes a first fusion network, a second fusion network, and a third fusion network;
Fusing the third feature map and the fourth feature map based on the first fusion network to obtain a first fusion feature map;
fusing the second feature map and the first fused feature map based on the second fusion network to obtain a second fused feature map;
fusing the first fused feature map and the second fused feature map based on the third fused network to obtain a third fused feature map;
and determining the target detection result according to the second fusion feature map and the third fusion feature map.
Optionally, the third feature map and the fourth feature map are concatenated, the concatenated feature map is input into the first fusion network, and the concatenated feature map is fused based on the first fusion network to obtain the first fused feature map. The second feature map and the first fused feature map are concatenated, the concatenated feature map is input into the second fusion network, and the concatenated feature map is fused based on the second fusion network to obtain the second fused feature map. The first fused feature map and the second fused feature map are concatenated, the concatenated feature map is input into the third fusion network, and the concatenated feature map is fused based on the third fusion network to obtain the third fused feature map. Finally, the target detection result is determined according to the second fused feature map and the third fused feature map. The first, second, and third fusion networks include Conv-BN-ReLU structures.
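A hedged sketch of these concatenate-then-fuse steps follows, assuming PyTorch and feature maps shaped like the backbone sketch above. The channel widths, the nearest-neighbor upsampling, and the decision to upsample the smaller map in each fusion are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_block(in_ch, out_ch):
    # Each fusion network adds a Conv-BN-ReLU block on top of the concatenated maps.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

second_fm = torch.rand(1, 64, 80, 80)    # second feature map
third_fm = torch.rand(1, 128, 40, 40)    # third feature map
fourth_fm = torch.rand(1, 256, 20, 20)   # fourth feature map

fuse1 = fuse_block(128 + 256, 256)
fuse2 = fuse_block(64 + 256, 128)
fuse3 = fuse_block(256 + 128, 256)

# First fusion: bring the fourth feature map to the third map's scale, concatenate, fuse.
up4 = F.interpolate(fourth_fm, scale_factor=2, mode="nearest")
first_fused = fuse1(torch.cat([third_fm, up4], dim=1))

# Second fusion: bring the first fused map to the second feature map's scale.
up_f1 = F.interpolate(first_fused, scale_factor=2, mode="nearest")
second_fused = fuse2(torch.cat([second_fm, up_f1], dim=1))

# Third fusion: again upsample the smaller-scale map before fusing, per the description.
up_f1b = F.interpolate(first_fused, scale_factor=2, mode="nearest")
third_fused = fuse3(torch.cat([up_f1b, second_fused], dim=1))
# second_fused and third_fused are the two prediction-level maps fed to the classification network.
```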
Compared with the structure described above, the target detection model shown in Fig. 5 omits the fusion network that fuses the fourth feature map with the third fused feature map to obtain a fourth fused feature map, so the structure of the target detection model is simpler and the network structure of the target detection model is improved.
Fig. 6 is a schematic diagram of a process for determining a target detection result according to the present application, including the following steps:
s301: and fusing the third feature map and the fourth feature map based on the first fusion network to obtain a first fusion feature map.
S302: and fusing the second feature map and the first fused feature map based on the second fused network to obtain a second fused feature map.
S303: and fusing the first fused feature map and the second fused feature map based on the third fused network to obtain a third fused feature map.
S304: and determining the target detection result according to the second fusion feature map and the third fusion feature map.
FIG. 7 is a schematic diagram of a target detection model structure provided by the present application, where the target detection model further includes a classification network;
inputting the second fusion feature map and the third fusion feature map into the classification network, and determining the target detection result; the classification network is used for determining the position information and the category information of each target in the image to be detected, and determining the target detection result through a non-maximum suppression algorithm and a confidence filtering algorithm according to the position information and the category information of each target.
The second fused feature map and the third fused feature map are input into the classification network, which determines the position information and category information of each target in the image to be detected, together with the confidence corresponding to the position information and the confidence corresponding to the category information. The category with the highest confidence is taken as the category information of the target. The position information of each target is the position information of its detection box. According to the confidence filtering algorithm, detection boxes whose position confidence is below a set confidence threshold are filtered out; then, for detection boxes with the same category information, a non-maximum suppression algorithm merges them to obtain the target detection result.
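The confidence filtering and class-wise non-maximum suppression can be sketched as follows. This is a hedged illustration: torchvision's axis-aligned NMS stands in for the rotated-box suppression implied by the 4-point labels, and the threshold values, tensor shapes, and function name postprocess are assumptions.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, box_scores, class_scores, conf_thr=0.25, iou_thr=0.5):
    # boxes: (N, 4) axis-aligned xyxy; box_scores: (N,); class_scores: (N, num_classes)
    _, cls_id = class_scores.max(dim=1)          # highest-confidence class per detection box
    keep = box_scores > conf_thr                 # confidence filtering on the position confidence
    boxes, box_scores, cls_id = boxes[keep], box_scores[keep], cls_id[keep]

    results = []
    for c in cls_id.unique():                    # non-maximum suppression per category
        m = cls_id == c
        kept = nms(boxes[m], box_scores[m], iou_thr)
        results.append((boxes[m][kept], box_scores[m][kept], int(c)))
    return results

# Example with random but valid x1y1x2y2 boxes.
xy = torch.rand(10, 2) * 100
out = postprocess(torch.cat([xy, xy + 20], dim=1), torch.rand(10), torch.rand(10, 3))
```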
Fig. 8 is a schematic diagram of a training process of the object detection model provided by the present application, including the following steps:
s401: and aiming at the solid-color background sample image in the training set, carrying out texture extraction on the sample image, and determining a sample texture image corresponding to the sample image.
S402: and inputting the sample image, the sample texture image and the target position information and the category information marked in the sample image into the target detection model, and training the target detection model.
In the application, aiming at a solid background sample image in a training set, texture extraction is carried out on the sample image through an edge extraction algorithm or a contour extraction algorithm, and a sample texture image corresponding to the sample image is determined. The sample image is marked with target position information and category information, and when the position information of the target in the sample image is marked, the target is marked with a rotary rectangular frame by using a marking tool. The specific labeling format is a 4-point format, namely, coordinates from the left upper corner, the right lower corner and the left lower corner of the target, and the formula is [ x1, y1, x2, y2, x3, y3, x4, y4].
And inputting the sample image, the sample texture image and the target position information and the category information marked in the sample image into a target detection model, and training the target detection model. Specifically, weight parameters of an image extraction network to be detected, a texture image extraction network, a first feature extraction network, a second feature extraction network, a third feature extraction network, a first fusion network, a second fusion network, a third fusion network and a classification network in the target detection model are trained.
Specifically, inputting a sample image, a sample texture image and target position information and category information marked in the sample image into a target detection model, and extracting features of the sample image based on an image extraction network to be detected to obtain a sample image feature map; performing feature extraction on the sample texture image based on a texture image extraction network to obtain a sample texture image feature map; and carrying out point-by-point multiplication fusion on the sample image feature map and the sample texture image feature map, and determining a first sample feature map. Performing feature extraction on the first sample feature map based on the first feature extraction network to obtain a second sample feature map; performing feature extraction on the second sample feature map based on the second feature extraction network to obtain a third sample feature map; and carrying out feature extraction on the third sample feature map based on the third feature extraction network to obtain a fourth sample feature map. Fusing the third sample feature map and the fourth sample feature map based on the first fusion network to obtain a first sample fusion feature map; fusing the second sample feature map and the first sample fusion feature map based on a second fusion network to obtain a second sample fusion feature map; and fusing the first sample fusion feature map and the second sample fusion feature map based on a third fusion network to obtain a third sample fusion feature map. Inputting the second sample fusion feature map and the third sample fusion feature map into a classification network, and determining the predicted position information and the predicted category information of each target in the sample image based on the classification network. According to the predicted position information and the predicted category information of each target, and the position information and the category information of each marked target, determining a loss value of the target detection model, adjusting weight parameters of each network in the target detection model according to the loss value, and when the loss value meets the requirement or reaches the model iteration number, completing weight parameter training of an image extraction network to be detected, a texture image extraction network, a first feature extraction network, a second feature extraction network, a third feature extraction network, a first fusion network, a second fusion network, a third fusion network and a classification network in the target detection model, namely completing training of the target detection model.
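A condensed sketch of one training iteration follows, assuming PyTorch; model, detection_loss, and the tensor arguments are placeholders standing in for the full detector, the combined box-regression and classification loss, and the annotated sample data, none of which are APIs defined by the patent.

```python
def train_step(model, detection_loss, optimizer, sample_image, sample_texture, gt_boxes, gt_labels):
    optimizer.zero_grad()
    pred_boxes, pred_scores = model(sample_image, sample_texture)        # forward pass over both inputs
    loss = detection_loss(pred_boxes, pred_scores, gt_boxes, gt_labels)  # loss against annotated targets
    loss.backward()        # gradients flow back through every sub-network of the detector
    optimizer.step()       # adjust the weight parameters according to the loss value
    return loss.item()
```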
For the problem of detecting rotated targets against a solid-color background, the application proposes an improved texture-attention-based deep convolutional network model (the target detection model) according to the solid-color characteristics. Specifically, taking into account the solid-color background characteristics of the processed image (such as pure white or pure black), a texture detection method is used to generate the texture image corresponding to the image to be detected, and the texture image and the image to be detected are used simultaneously as inputs to the target detection model, which improves the feature expression capability of the target detection model. In addition, considering large-target detection, only two levels of feature maps are retained for target prediction in the target detection model, compared with conventional structures. This simplifies the structure of the target detection model while ensuring the accuracy of large-target detection.
Fig. 9 is a schematic structural diagram of a related art object detection model, as shown in fig. 9, where the related art object detection model includes a backbone network structure front part, a backbone network structure middle part, a backbone network structure rear part, a first converged network, a second converged network, a third converged network, a fourth converged network, and a classification network. The front part of the backbone network structure corresponds to a first feature extraction network in the application, the middle part of the backbone network structure corresponds to a second feature extraction network in the application, and the rear part of the backbone network structure corresponds to a third feature extraction network in the application.
Fig. 10 is a schematic structural diagram of a target detection model provided by the present application, where, as shown in fig. 10, the target detection model includes an image extraction network to be detected, a texture image extraction network, a first feature extraction network, a second feature extraction network, a third feature extraction network, a first fusion network, a second fusion network, a third fusion network, and a classification network.
The application improves the input of the target detection model and the fusion network part of the target detection model. The input improvement to the object detection model is described below.
In industrial vision, the image usually presents a solid background, and in order to fully utilize the solid background information, the application improves the input of a general model, and specifically comprises the following steps: an image texture extraction module is added, and a texture attention structure is provided.
(1) Image texture extraction module.
Because the image presents a solid-color background, in order to guide the target detection model to pay attention to the foreground target, the application proposes converting the image to be detected into a texture image through a texture extraction algorithm such as edge or contour extraction. Let the image to be detected be I, the texture extraction algorithm or model be F, and the output texture image be E; then E = F(I). After the texture image corresponding to the image to be detected is obtained, the image to be detected and the texture image are respectively input into the target detection model, and the target detection result in the image to be detected is determined based on the target detection model. Specifically, the image to be detected is input into the image extraction network to be detected in the target detection model, and the texture image is input into the texture image extraction network in the target detection model; feature extraction is performed on the image to be detected based on the image extraction network to be detected to obtain the image feature map to be detected; feature extraction is performed on the texture image based on the texture image extraction network to obtain the texture image feature map; the image feature map to be detected and the texture image feature map are fused to determine the first feature map, and the target detection result is determined based on the first feature map. Using the texture image, the target detection model can focus on the contours in the image to be detected, retaining only contour-related information and discarding useless background information. Inputting the image to be detected and its corresponding texture image into the target detection model at the same time uses the texture image as an attention map for attention computation on the image to be detected, thereby focusing on the foreground target and reducing detection complexity. This attention computation, with the texture image as the attention map, is the point-by-point multiplication fusion of the image feature map to be detected and the texture image feature map that determines the first feature map.
Here, F may be any texture extraction algorithm, such as the Sobel operator, the Scharr operator, the Canny operator, or an HED convolutional network. Considering computational complexity and texture quality, the application is illustrated using the Canny operator as an example. Fig. 11 is a schematic diagram of determining a texture image based on the Canny operator provided by the present application. The image to be detected is processed as follows: 1. convert RGB to a grayscale map; 2. apply Gaussian filtering for noise reduction; 3. compute gradient magnitude and direction by differencing; 4. perform non-maximum suppression; 5. apply hysteresis thresholding. Through these five steps the texture image of the image to be detected is obtained.
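A minimal OpenCV sketch of these five steps follows; cv2.Canny internally performs the gradient, non-maximum suppression, and hysteresis stages, and the kernel size and threshold values shown are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_texture(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)   # 1. convert to a grayscale map
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)        # 2. Gaussian filtering for noise reduction
    # 3-5. gradient magnitude/direction, non-maximum suppression, hysteresis thresholds
    return cv2.Canny(blurred, 50, 150)

# Stand-in for an image to be detected (any BGR uint8 image works here).
image = (np.random.rand(640, 640, 3) * 255).astype(np.uint8)
texture = extract_texture(image)
```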
(2) Texture attention mechanism.
The texture attention proposed by the application uses the texture image as an attention map and performs attention computation on the image to be detected, so as to focus on the foreground target and reduce detection complexity. The specific computation is:

F_out = Φ_E(E) ⊙ Φ_I(I)

Here, Φ_E and Φ_I respectively denote the texture image extraction network and the image extraction network to be detected, each consisting of two stacked Conv-BN-ReLU structures; E denotes the texture image and I denotes the image to be detected; ⊙ denotes the point-by-point product; and F_out denotes the output feature map. Through the proposed texture attention computation, feature-map responses corresponding to solid-color regions of the image to be detected are filtered out, and only regions with correspondingly strong texture are retained.
The improvement of the fusion network part of the object detection model is described as follows.
In general, in order to deal with small-target problems, a three-level fusion structure is usually adopted in the structure of a target detection model (as shown in fig. 9). However, in industrial vision, in order to improve detection performance, all targets to be detected are usually turned into large and medium targets through camera adjustment and the like. Therefore, in order to reduce model redundancy and improve detection performance, the application improves the fusion network part of the target detection model: the three-level fusion structure of the general target detection model is changed into a two-level fusion structure, and the fusion structure that processes detail information and small targets is removed. This reduces model redundancy while making the target detection model focus only on large and medium targets, improving the performance of the target detection model and the accuracy of detecting large and medium targets.
In the feature extraction process of the target detection model, the deeper the network structure of the target detection model, the finer the extracted features, which is more favorable for extracting detail information of the target and more accurate for extracting small targets. That is, the fourth fusion network in fig. 9 is more advantageous for extracting detail information and detecting small targets. Its drawback is that the efficiency of target detection is low and the training difficulty of the target detection model is increased. In the application, for industrial vision scenes, all targets to be detected are usually turned into large and medium targets through camera adjustment and the like in order to improve detection performance. Therefore, the application removes the fourth fusion network in fig. 9, which does not affect the detection of large and medium targets, simplifies the model structure, improves target detection efficiency, and reduces the training difficulty of the target detection model.
As shown in fig. 10, the first feature extraction network performs downsampling on the input first feature map, that is, the scale of the second feature map obtained by feature extraction on the first feature map based on the first feature extraction network is smaller than that of the first feature map. The second feature extraction network performs downsampling on the input second feature map, that is, the scale of the third feature map obtained by feature extraction on the second feature map based on the second feature extraction network is smaller than that of the second feature map. The third feature extraction network performs downsampling on the input third feature map, that is, the scale of the fourth feature map obtained by feature extraction on the third feature map based on the third feature extraction network is smaller than that of the third feature map. When feature maps are fused, the smaller-scale feature map is first upsampled to the same scale as the larger-scale feature map, and then the two feature maps of the same scale are fused.
The training process of the target detection model is described as follows:
step 1: and (5) data acquisition and labeling.
First, a large number of sample images with solid-color backgrounds are collected on the project site, and then each target in the sample images is annotated with a rotated rectangular box using an annotation tool. The specific annotation format is a 4-point format, i.e., the coordinates of the four corners of the target starting from the top-left corner (top-left, top-right, bottom-right, bottom-left), in the form [x1, y1, x2, y2, x3, y3, x4, y4].
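A small sketch of producing this 4-point annotation format with OpenCV follows; the rotated-rectangle values are illustrative, and the corner order returned by cv2.boxPoints may need to be rearranged to start from the top-left corner as described above.

```python
import cv2

# A rotated rectangle given as (center, (width, height), angle in degrees); values illustrative.
rotated_rect = ((320.0, 240.0), (100.0, 60.0), 30.0)
corners = cv2.boxPoints(rotated_rect)        # 4 x 2 array of corner coordinates
label = corners.flatten().tolist()           # [x1, y1, x2, y2, x3, y3, x4, y4]
print(label)
```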
Step 2: according to the improvement scheme, the architecture of the target detection model is realized.
The first feature extraction network, the second feature extraction network, and the third feature extraction network are formed by stacking CSP structures from YOLOv5 and ordinary convolution structures Conv. Each fusion network is obtained by adding a Conv-BN-ReLU structure on top of the concatenated feature maps, and the classification network is divided into two branch structures, a regression module and a classification module. The overall structure is otherwise equivalent to the original YOLOv5 structure.
Step 3: and selecting super parameters such as an optimizer and a learning rate, and training the target detection model. Optionally, the application adopts an AdamW optimizer, the initial learning rate is 1e-3, and adopts a cosine planning method to train the target detection model.
After the training of the target detection model is completed, model reasoning application can be performed. Fig. 12 is a schematic diagram of a target detection process provided by the present application, which specifically includes the steps of:
s501: and inputting an image to be detected, and converting the image into a texture image by using a Canny algorithm.
S502: and inputting the texture image and the image to be detected into a texture attention module (an image extraction network to be detected and a texture image extraction network) to obtain an intermediate feature map.
S503: and inputting the intermediate feature map into a backbone network, carrying out feature extraction on the intermediate feature map by the backbone network, and obtaining a prediction feature map through a fusion network.
S504: and inputting the prediction feature map into a classification network, and predicting the position information and the category information of the target in the image to be detected by using the classification network.
S505: and performing post-processing operations such as non-maximum suppression NMS, confidence filtering algorithm and the like on the predicted result to obtain a final target detection result.
Fig. 13 is a schematic structural diagram of an object detection device according to the present application, including:
a first determining module 121, configured to obtain an image to be detected of a solid background, perform texture extraction on the image to be detected, and determine a texture image corresponding to the image to be detected;
a second determining module 122, configured to input the image to be detected and the texture image into a target detection model, and determine a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
A second determining module 122, configured to perform feature extraction on the image to be detected based on the image to be detected extracting network, so as to obtain a feature map of the image to be detected; performing feature extraction on the texture image based on the texture image extraction network to obtain a texture image feature map; and fusing the image feature map to be detected and the texture image feature map, determining the first feature map, and determining the target detection result based on the first feature map.
The second determining module 122 is configured to perform point-by-point multiplication fusion on the image feature map to be detected and the texture image feature map, and determine the first feature map.
A second determining module 122, configured to perform feature extraction on the first feature map based on the first feature extraction network to obtain a second feature map; performing feature extraction on the second feature map based on the second feature extraction network to obtain a third feature map; performing feature extraction on the third feature map based on the third feature extraction network to obtain a fourth feature map; and fusing the second feature map, the third feature map and the fourth feature map, and determining the target detection result according to a feature map fusion result.
A second determining module 122, configured to fuse the third feature map and the fourth feature map based on the first fusion network to obtain a first fusion feature map; fusing the second feature map and the first fused feature map based on the second fusion network to obtain a second fused feature map; fusing the first fused feature map and the second fused feature map based on the third fused network to obtain a third fused feature map; and determining the target detection result according to the second fusion feature map and the third fusion feature map.
A second determining module 122, configured to input the second fused feature map and the third fused feature map into the classification network, and determine the target detection result; the classification network is used for determining the position information and the category information of each target in the image to be detected, and determining the target detection result through a non-maximum suppression algorithm and a confidence filtering algorithm according to the position information and the category information of each target.
The apparatus further comprises:
the training module 123 is configured to perform texture extraction on a sample image for a solid background sample image in a training set, and determine a sample texture image corresponding to the sample image; and inputting the sample image, the sample texture image and the target position information and the category information marked in the sample image into the target detection model, and training the target detection model.
The present application also provides an electronic device, as shown in fig. 14, including: processor 131, communication interface 132, memory 133 and communication bus 134, wherein processor 131, communication interface 132, memory 133 accomplish the communication each other through communication bus 134;
the memory 133 has stored therein a computer program which, when executed by the processor 131, causes the processor 131 to perform any of the above method steps.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface 132 is used for communication between the above-described electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The application also provides a computer-readable storage medium having stored thereon a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform any of the above method steps.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of target detection, the method comprising:
obtaining an image to be detected of a solid background, carrying out texture extraction on the image to be detected, and determining a texture image corresponding to the image to be detected;
respectively inputting the image to be detected and the texture image into a target detection model, and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
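For illustration, the overall detection flow of claim 1 might look like the following sketch; `extract_texture` and `postprocess` are the hypothetical helpers sketched earlier in this document, and the model's output format is likewise an assumption.

```python
import torch

def detect(model, image_bgr):
    """End-to-end inference for a solid-background image to be detected."""
    texture = extract_texture(image_bgr)          # texture image corresponding to the input
    img_t = torch.from_numpy(image_bgr).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    tex_t = torch.from_numpy(texture).float().unsqueeze(0).unsqueeze(0) / 255.0
    with torch.no_grad():
        boxes, scores, labels = model(img_t, tex_t)   # output format is an assumption
    return postprocess(boxes, scores, labels)         # confidence filtering + NMS
```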
2. The method of claim 1, wherein the target detection model comprises a to-be-detected-image extraction network and a texture image extraction network;
performing feature extraction on the image to be detected based on the to-be-detected-image extraction network to obtain an image feature map to be detected; performing feature extraction on the texture image based on the texture image extraction network to obtain a texture image feature map;
and fusing the image feature map to be detected and the texture image feature map, determining the first feature map, and determining the target detection result based on the first feature map.
3. The method of claim 2, wherein fusing the image feature map to be detected and the texture image feature map, determining the first feature map comprises:
and carrying out point-by-point multiplication fusion on the image feature map to be detected and the texture image feature map, and determining the first feature map.
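The point-by-point multiplication fusion of claims 2 and 3 can be illustrated with the sketch below; the two shallow convolutional branches stand in for the to-be-detected-image extraction network and the texture image extraction network, and the channel and stride choices are assumptions rather than values given in the claims.

```python
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Two parallel extraction branches fused by point-by-point multiplication."""

    def __init__(self, out_channels=32):
        super().__init__()
        # Branch for the image to be detected (3-channel input).
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, out_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU())
        # Branch for the texture image (1-channel input).
        self.texture_branch = nn.Sequential(
            nn.Conv2d(1, out_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU())

    def forward(self, image, texture):
        img_feat = self.image_branch(image)      # image feature map to be detected
        tex_feat = self.texture_branch(texture)  # texture image feature map
        return img_feat * tex_feat               # first feature map (element-wise product)
```

One plausible motivation for multiplicative rather than additive fusion is that the texture branch then acts as a soft gate on the image features, though the claims themselves do not state this.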
4. A method according to any one of claims 1 to 3, wherein the target detection model further comprises a first feature extraction network, a second feature extraction network and a third feature extraction network;
performing feature extraction on the first feature map based on the first feature extraction network to obtain a second feature map; performing feature extraction on the second feature map based on the second feature extraction network to obtain a third feature map; performing feature extraction on the third feature map based on the third feature extraction network to obtain a fourth feature map; and fusing the second feature map, the third feature map and the fourth feature map, and determining the target detection result according to a feature map fusion result.
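A minimal sketch of the cascaded extraction in claim 4 follows; each stage is reduced here to a single strided convolution, whereas the actual extraction networks may be arbitrarily deeper, so the layer and channel choices below are assumptions.

```python
import torch.nn as nn

def conv_stage(in_ch, out_ch):
    """One illustrative down-sampling feature extraction stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.SiLU())

class Backbone(nn.Module):
    """Derives the second, third and fourth feature maps from the first feature map."""

    def __init__(self, in_ch=32):
        super().__init__()
        self.stage1 = conv_stage(in_ch, 64)   # first feature extraction network
        self.stage2 = conv_stage(64, 128)     # second feature extraction network
        self.stage3 = conv_stage(128, 256)    # third feature extraction network

    def forward(self, first_feature_map):
        second = self.stage1(first_feature_map)
        third = self.stage2(second)
        fourth = self.stage3(third)
        return second, third, fourth
```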
5. The method of claim 4, wherein the target detection model further comprises a first fusion network, a second fusion network and a third fusion network;
fusing the third feature map and the fourth feature map based on the first fusion network to obtain a first fused feature map;
fusing the second feature map and the first fused feature map based on the second fusion network to obtain a second fused feature map;
fusing the first fused feature map and the second fused feature map based on the third fusion network to obtain a third fused feature map;
and determining the target detection result according to the second fused feature map and the third fused feature map.
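The three fusion networks of claim 5 resemble a feature-pyramid-style neck. The sketch below uses upsampling (or pooling) followed by concatenation and a convolution as the fusion operation; this concrete operation, like the channel widths, is an assumption, since the claim only specifies which feature maps are fused, not how.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNeck(nn.Module):
    """Combines the second, third and fourth feature maps into the second and
    third fused feature maps that are passed to the classification network."""

    def __init__(self, c2=64, c3=128, c4=256, out_ch=128):
        super().__init__()
        self.fuse1 = nn.Conv2d(c3 + c4, out_ch, 1)                 # first fusion network
        self.fuse2 = nn.Conv2d(c2 + out_ch, out_ch, 1)             # second fusion network
        self.fuse3 = nn.Conv2d(out_ch * 2, out_ch, 3, padding=1)   # third fusion network

    def forward(self, second, third, fourth):
        # First fused feature map: third feature map + upsampled fourth feature map.
        up4 = F.interpolate(fourth, size=third.shape[-2:], mode="nearest")
        fused1 = self.fuse1(torch.cat([third, up4], dim=1))
        # Second fused feature map: second feature map + upsampled first fused map.
        up_f1 = F.interpolate(fused1, size=second.shape[-2:], mode="nearest")
        fused2 = self.fuse2(torch.cat([second, up_f1], dim=1))
        # Third fused feature map: first fused map + downsampled second fused map.
        down_f2 = F.max_pool2d(fused2, kernel_size=2, stride=2)
        fused3 = self.fuse3(torch.cat([fused1, down_f2], dim=1))
        return fused2, fused3
```

Used together with the earlier sketches, `FusionNeck` would consume the outputs of `Backbone` and feed its two outputs to the classification network of claim 6.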
6. The method of claim 5, wherein the target detection model further comprises a classification network;
inputting the second fused feature map and the third fused feature map into the classification network, and determining the target detection result; the classification network is used for determining the position information and the category information of each target in the image to be detected, and determining the target detection result through a non-maximum suppression algorithm and a confidence filtering algorithm according to the position information and the category information of each target.
7. The method of claim 1, wherein the training process of the target detection model comprises:
for a solid-background sample image in a training set, carrying out texture extraction on the sample image, and determining a sample texture image corresponding to the sample image;
and inputting the sample image, the sample texture image and the target position information and the category information marked in the sample image into the target detection model, and training the target detection model.
8. A target detection device, the device comprising:
the first determining module is used for obtaining an image to be detected of a solid background, extracting textures of the image to be detected, and determining a texture image corresponding to the image to be detected;
the second determining module is used for respectively inputting the image to be detected and the texture image into a target detection model and determining a target detection result in the image to be detected based on the target detection model; the target detection model is used for carrying out feature extraction and fusion on the image to be detected and the texture image respectively, determining a first feature map, and determining the target detection result based on the first feature map.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1 to 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-7.
CN202311182443.2A 2023-09-14 2023-09-14 Target detection method and device, electronic equipment and storage medium Active CN116935179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311182443.2A CN116935179B (en) 2023-09-14 2023-09-14 Target detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116935179A true CN116935179A (en) 2023-10-24
CN116935179B CN116935179B (en) 2023-12-08

Family

ID=88384651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311182443.2A Active CN116935179B (en) 2023-09-14 2023-09-14 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116935179B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019104767A1 (en) * 2017-11-28 2019-06-06 河海大学常州校区 Fabric defect detection method based on deep convolutional neural network and visual saliency
US20210312214A1 (en) * 2020-02-12 2021-10-07 Shenzhen Sensetime Technology Co., Ltd. Image recognition method, apparatus and non-transitory computer readable storage medium
CN114332799A (en) * 2022-01-04 2022-04-12 京东鲲鹏(江苏)科技有限公司 Target detection method and device, electronic equipment and storage medium
US20220147822A1 (en) * 2021-01-22 2022-05-12 Beijing Baidu Netcom Science And Technology Co., Ltd. Training method and apparatus for target detection model, device and storage medium
CN114648720A (en) * 2022-03-30 2022-06-21 北京市商汤科技开发有限公司 Neural network training method, image detection method, device, equipment and medium
CN115035343A (en) * 2022-06-17 2022-09-09 北京市商汤科技开发有限公司 Neural network training method, article detection method, apparatus, device and medium
CN115205547A (en) * 2022-08-01 2022-10-18 北京远鉴信息技术有限公司 Target image detection method and device, electronic equipment and storage medium
WO2022222569A1 (en) * 2021-04-20 2022-10-27 北京嘀嘀无限科技发展有限公司 Target discrimation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ye Qiuguo; Zha Xuan; Li Haibin: "Ship detection in high-resolution remote sensing images based on visual saliency", Hydrographic Surveying and Charting, no. 04 *
Yang Chao; Cai Xiaodong; Gan Kaijin; Wang Lijuan: "A dynamically weighted average pedestrian recognition model based on adaptive salient feature selection", Computer Engineering & Science, no. 05 *

Also Published As

Publication number Publication date
CN116935179B (en) 2023-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant