CN114332473A - Object detection method, object detection device, computer equipment, storage medium and program product - Google Patents


Info

Publication number
CN114332473A
Authority
CN
China
Prior art keywords
slice
feature
target
sample
defect
Prior art date
Legal status
Pending
Application number
CN202111154945.5A
Other languages
Chinese (zh)
Inventor
刘文龙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111154945.5A priority Critical patent/CN114332473A/en
Publication of CN114332473A publication Critical patent/CN114332473A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a target detection method, a target detection apparatus, a computer device, a storage medium and a program product, and relates to the technical fields of computer vision, image processing, artificial intelligence and the like. In the method, an image to be detected is segmented to obtain at least two slices, and pixel point features and pixel point context features of each slice are extracted through a feature extraction layer to obtain at least two feature maps; each feature map is then detected to obtain defect information of the corresponding slice. The feature extraction layer includes a target network layer for extracting pixel point features from the slices; because downsampling is omitted when the target network layer extracts the pixel point features of a slice, the detail features of the original image are retained, so that even small targets within a small area can be detected accurately. Moreover, the feature extraction layer is designed to extract both pixel point features and pixel point context features, which improves the robustness of detection for targets of different sizes and further improves the accuracy of target detection.

Description

Object detection method, object detection device, computer equipment, storage medium and program product
Technical Field
The present application relates to the technical fields of computer vision, image processing, artificial intelligence, and the like, and in particular to a target detection method, apparatus, computer device, storage medium, and program product.
Background
With the development of science and technology, computer vision technology is increasingly applied in industrial scenarios. For example, in modern industrial manufacturing, mechanical parts, electronic components, and other products produced on an industrial production line inevitably have defects. Therefore, in the quality inspection process, target detection is usually performed on product images to identify product defects.
In the related art, the target detection process may include: inputting an image of the product into a neural network for detection, so as to output the positions of possible defects in the image. However, such a neural network generally cannot handle images of higher resolution, which results in low accuracy and low applicability of target detection.
Disclosure of Invention
The application provides a target detection method, a target detection apparatus, a computer device, a storage medium and a program product, which can solve the problems of low accuracy and low applicability of target detection in the related art. The technical solution is as follows:
in one aspect, a target detection method is provided, and the method includes:
determining an image to be detected comprising a target object, and determining at least two slices of the image to be detected;
performing pixel point feature extraction and pixel point context feature extraction on the at least two slices through a feature extraction layer of a target model to obtain at least two feature maps, wherein any feature point of a feature map is used for indicating the feature and the context feature of a corresponding pixel point in the slice, the size of an intermediate feature map output by a target network layer of the feature extraction layer is the same as the size of the slice, and the target network layer is used for performing pixel point feature extraction on the slices;
and detecting the feature map corresponding to each slice to obtain defect information of each slice, wherein the defect information is used for indicating defects of the target object included in the slice.
In another aspect, an object detecting apparatus is provided, including:
a determining module, configured to determine an image to be detected including a target object, and to determine at least two slices of the image to be detected;
a feature extraction module, configured to perform pixel point feature extraction and pixel point context feature extraction on the at least two slices through a feature extraction layer of a target model to obtain at least two feature maps, wherein any feature point of a feature map is used for indicating the feature and the context feature of a corresponding pixel point in the slice, the size of an intermediate feature map output by a target network layer of the feature extraction layer is the same as the size of the slice, and the target network layer is used for performing pixel point feature extraction on the slices;
and a detection module, configured to detect the feature map corresponding to each slice to obtain defect information of each slice, wherein the defect information is used for indicating defects of the target object included in the slice.
In another aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the object detection method described above.
In another aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the object detection method described above.
In another aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the object detection method described above.
The beneficial effects brought by the technical solution provided by the application are as follows:
The image to be detected is segmented to obtain at least two slices, and pixel point features and pixel point context features of each slice are extracted through the feature extraction layer of the target model to obtain at least two feature maps; each feature map is detected to obtain defect information of the corresponding slice. The limitation on resolution is removed, so the method is applicable to images of any resolution. The feature extraction layer includes a target network layer for extracting pixel point features from the slices, and the size of the intermediate feature map output by the target network layer is the same as the size of the slice; that is, downsampling is omitted when the pixel point features of the slices are extracted, so the detail features of the original image are retained. In addition, the feature extraction layer is designed to extract both pixel point features and pixel point context features, which improves the robustness of detection for targets of different sizes, so that even small targets within a small area can be detected accurately, and the accuracy of target detection is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an implementation environment of a target detection method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a target detection method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a calculation method of an overlap ratio according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an offset parameter according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of data processing in a regression branch network training process according to an embodiment of the present disclosure;
FIG. 6 is a schematic view of a detected defect provided in an embodiment of the present application;
fig. 7 is a schematic diagram of target detection based on various network layers according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term; for example, "A and/or B" indicates an implementation as "A", an implementation as "B", or an implementation as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, blockchains, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement on a target, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The application provides a target detection method, and relates to the technical fields of artificial intelligence, machine learning and the like. For example, target detection on an image to be detected based on a trained model can be realized by using machine learning, cloud computing and other artificial intelligence technologies: the image to be detected is divided into a plurality of slices by using the target model; the pixel point features and context features of each slice are extracted by using the feature extraction layer of the target model; and small targets in the high-resolution image are detected by using the obtained feature maps of the slices. Of course, the above machine learning techniques may also be used to perform reinforcement learning on an initial model by using a sample set, so as to obtain a more robust target model.
In modern industrial manufacturing, production efficiency is improved by introducing assembly lines, but the complicated processes inevitably lead to product defects. These defects often depend on environmental conditions and occur probabilistically, so it is necessary to perform statistical analysis of the defects at a later stage. Detecting and diagnosing defects in finished products is therefore an essential link in the modern production process.
Traditionally, enterprises have mostly detected product defects by manual observation. In this case, the detection cost (personnel cost) is high for the enterprise; for the staff, because the defect areas are small (difficult to detect), the work intensity is high and the work content is monotonous, which leads to a high staff turnover rate; and for an algorithm, the defects vary in size, and both oversized and undersized defects can cause a certain degree of missed detection, which affects the actual yield of the production line.
At present, a Faster RCNN (Faster Region-based Convolutional Neural Network) is generally adopted to detect defects in images of articles. When Faster RCNN is adopted, the image is reduced by means of resize (image size conversion) before being input into the neural network, and the image is further downsampled inside Faster RCNN; small defect areas in the image are lost due to resizing, downsampling and the like, so the accuracy of target detection is low, services requiring high-resolution detection cannot be met, and the practicability is low.
Fig. 1 is a schematic diagram of an implementation environment of a target detection method provided in the present application. As shown in FIG. 1, the implementation environment may include: a computer device 101. For example, the computer device may be configured with a trained target model in advance, and perform target detection on the image to be detected by using the target model; alternatively, the computer apparatus 101 may also use a large number of samples to train to obtain the target model, so as to perform target detection on the image to be detected by using the target model.
In one possible scenario, as shown in fig. 1, the implementation environment may further include: an image acquisition device 102. The image acquisition device 102 can acquire an image to be detected of a target object to be detected and send the image to be detected to the computer device 101; the computer device 101 is configured with a trained target model in advance, performs target detection on the image to be detected by using the target model, and sends the detection result to the image acquisition device 102. For example, the image acquisition device may be a terminal 102 installed with an application having an object detection function, and the computer device may be a server 101 serving as a background server of the application. The terminal 102 and the server 101 can perform data interaction based on the application program, realizing real-time transmission of the image to be detected and the detection result.
In the scenario shown in fig. 1, the target detection method may be performed in any device such as a server, a terminal, a server cluster, or a cloud computing service cluster. For example, the server or the terminal may have both image acquisition and target detection functions, for example, the server acquires an image to be detected and performs target detection based on the image to be detected and the target model.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server or server cluster providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms. The network may include, but is not limited to: a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network and a wide area network, and the wireless network includes Bluetooth, Wi-Fi and other networks enabling wireless communication. The terminal may be a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Device), a PDA (Personal Digital Assistant), a desktop computer, a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal, a vehicle-mounted computer, etc.), a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, but are not limited thereto; the connection may also be determined based on the requirements of the actual application scenario, which is not limited herein.
Fig. 2 is a schematic flowchart of a target detection method according to an embodiment of the present application. The execution main body of the method may be a computer device, and the computer device may be a server, a terminal, or any electronic device with a target detection function, and the like, which is not specifically limited in this embodiment of the present application. In the embodiments of the present application, a server is taken as an example for description. As shown in fig. 2, the method includes the following steps.
Step 201, the server determines an image to be detected including a target object.
The target detection method provided by the embodiments of the present application can be used to detect, as targets, defects that may exist on the target object. The defects of the target object may be captured in the image to be detected. Illustratively, the server may acquire the image to be detected from the image acquisition device. Alternatively, the server may store the image to be detected in advance. For example, the image to be detected may be an image obtained by shooting the target object to be detected with the image acquisition device or the server.
The target detection process in the embodiments of the present application may refer to small target detection, that is, detection that targets defects within a small area. Small target detection may refer to the detection of defects whose size or relative size is small. In one possible example, small target detection may refer to detection within a size range below a certain threshold; for example, defects smaller than 32 × 32 pixels are taken as targets. In another possible example, small target detection may refer to defect detection within a relative-size range below a certain threshold; for example, defects whose width and height are each less than one tenth of the width and height of the image to be detected are taken as targets.
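As a concrete illustration of the two criteria above, the following Python sketch checks whether a defect counts as a small target; the thresholds (32 × 32 pixels and one tenth of the image size) are the examples given above, while the function name and the combination of the two criteria are assumptions made purely for illustration.

```python
def is_small_target(defect_w, defect_h, image_w, image_h,
                    abs_thresh=32, rel_thresh=0.1):
    """Return True if the defect qualifies as a 'small target' by either criterion."""
    absolute_small = defect_w < abs_thresh and defect_h < abs_thresh
    relative_small = (defect_w / image_w) < rel_thresh and (defect_h / image_h) < rel_thresh
    return absolute_small or relative_small
```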
It should be noted that the image to be detected may be a high-resolution image; that is, the image to be detected may be an image whose resolution exceeds a target resolution threshold and whose size exceeds a target size threshold, for example, a high-resolution, large-size original image of the target object acquired by a high-definition camera. Unlike common neural-network-based target detection methods, the target detection method provided by the embodiments of the present application achieves a better detection effect on high-resolution, large-size images. Of course, the image to be detected may also be a low-resolution image, and the target detection method of the embodiments of the present application is also applicable to low-resolution images.
In one possible example, the target object may be a 3C (Computer, Communication, Consumer Electronics) product or an accessory of a 3C product, such as a computer, a tablet, a mobile phone, a digital audio player, or a mobile phone camera holder. For example, in the embodiments of the present application, defects present in a 3C product accessory may be detected from image data of the accessory acquired by a camera.
Step 202, the server determines at least two slices of the image to be detected.
The server segments the original image to be detected in a sliding-window slicing manner; that is, the server may segment the image to be detected into a plurality of slices of a certain size. This step may include: the server segments the image to be detected based on a sliding window of a target step size and a target size to obtain at least two slices of the image to be detected. Illustratively, the target step size may be configured in association with the target size, e.g., the target step size may be 0.7 times the target size. In practice, the image to be detected may be sliced into as many slices as needed, and the number of slices may be tens, hundreds, or thousands. For example, when the resolution of the image to be detected is higher or the required detection accuracy is higher, the slicing step size is set smaller and the number of slices becomes larger, so that the detection granularity is finer and the detection accuracy is further improved.
For example, the image to be detected may be an original image, and the server may segment (Patch Split) the original image. For example, the server sets a fixed step size (stride) and configures a sliding window with a certain window size, and the high-resolution original image is automatically segmented by the sliding window; the relation between the fixed step size and the window size may be, for example, stride = size × (0.7 to 0.8).
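A minimal Python sketch of the sliding-window Patch Split described above is given below; the 640-pixel window and the handling of image borders are assumptions, and only the stride-to-window ratio (0.7–0.8) follows the example in the text.

```python
def patch_split(image, window=640, stride_ratio=0.7):
    """Split an H x W x C image array into overlapping slices, recording each slice's offset."""
    stride = int(window * stride_ratio)
    h, w = image.shape[:2]
    slices = []
    for top in range(0, max(h - window, 0) + 1, stride):
        for left in range(0, max(w - window, 0) + 1, stride):
            patch = image[top:top + window, left:left + window]
            # Keep the offset so slice-level detections can be mapped back to the original image.
            slices.append(((top, left), patch))
    return slices
```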
In this way, the target detection method can be applied to an image to be detected of any resolution; restrictions such as the resolution of the original image to be detected are removed, the application range is widened, and the applicability is improved. Moreover, subsequent detection can be performed directly on the high-resolution image, which increases the possibility of accurate detection.
Step 203, the server performs pixel point feature extraction and pixel point context feature extraction on the at least two slices through a feature extraction layer of the target model to obtain at least two feature maps.
Pixel point feature extraction may include extracting semantic features of the pixel points of a slice, and the features extracted in this process may include primary semantic features of the pixel points. The server performs semantic feature extraction and context feature extraction on the slices through the feature extraction layer of the target model to obtain multi-channel feature maps. Pixel point context feature extraction may include extracting the features of each pixel point together with the context features fused with that pixel point, and the features extracted in this process may include high-level semantic features of the pixel points.
Any feature point of a feature map is used for indicating the feature and the context feature of a corresponding pixel point in the slice; the size of the intermediate feature map output by the target network layer of the feature extraction layer is the same as the size of the slice, and the target network layer is used for extracting pixel point features from the slices. Each slice corresponds to a feature map with a plurality of channels. The server may first perform pixel point feature extraction on a slice through the feature extraction layer to obtain at least two slice feature maps, and then perform pixel point context feature extraction on the at least two slice feature maps to obtain the at least two feature maps. For example, the server may first perform pixel point feature extraction on the slice through the feature extraction layer to obtain a primary semantic feature map, and then perform context feature extraction on the primary semantic feature map to obtain a high-level semantic feature map.
In one possible implementation, the feature extraction layer includes a micro backbone network and an attention network; this step 203 can be achieved by the following steps 2031-2033.
Step 2031, the server extracts the pixel point feature of the slice through the micro backbone network to obtain a slice feature map.
The slice feature map may include the primary semantic features of the pixel points. Any feature point of the slice feature map is used for indicating the feature of a corresponding pixel point in the slice; the number of network layers included in the micro backbone network does not exceed a first threshold, and the number of convolution kernels included in each network layer does not exceed a second threshold. The target network layer may be located in the micro backbone network, the size of the intermediate slice feature map output by the target network layer is the same as the size of the slice, and the target network layer may be a convolutional layer that performs pixel point feature extraction on the original slice. Illustratively, the micro backbone network includes a plurality of network layers, and the server, following the data propagation order of the plurality of network layers, inputs the at least two slices into the first network layer, inputs the output of the first network layer into the second network layer, and so on, until the output of the last network layer of the micro backbone network is taken as the slice feature map.
The target network layer may be the first of the plurality of network layers. In the first network layer of the micro backbone network, the downsampling operation on the slice is omitted; that is, the size of the intermediate slice feature map output by the first network layer is the same as the size of the slice input to the first network layer.
For example, the micro Backbone network may be a TBNet (Tiny-Backbone-Net), which may be a lightweight residual network with a smaller number of network layers, a smaller number of convolution kernels, and a smaller number of parameters, as shown in table 1 below, and has the following structure:
TABLE 1
(Table 1, which gives the layer-by-layer structure of the TBNet micro backbone network, is provided as an image in the original document and is not reproduced here; the Conv-1 row is described below.)
The size of the slice may be 640 × 640. As shown in Table 1, the target network layer may be Conv-1 (convolutional layer 1); the size of the slice input to Conv-1 may be 640 × 640, and the size of the output of Conv-1 may also be 640 × 640. In Conv-1, the downsampling operation is omitted to preserve the detection performance on small targets and improve accuracy. The network structure column shows the structure of each network layer; for example, for Conv-1 the network structure is [3 × 3, 12] × 2, meaning that Conv-1 contains 2 groups of convolution kernels, each group including 12 convolution kernels of size 3 × 3. Of course, the micro backbone network may also adopt other lightweight networks that realize the same function, for example by pruning, compressing, or otherwise optimizing MobileNet or ShuffleNet to generate a micro backbone network with the same function. This is not specifically limited in the embodiments of the present application.
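The following PyTorch sketch illustrates a TBNet-style micro backbone under stated assumptions: only the Conv-1 stage (two 3 × 3 convolutions with 12 kernels, stride 1, no downsampling, preserving the 640 × 640 slice size) follows the Conv-1 description above; the later stages, their channel counts, and the omission of residual connections are assumptions made purely for illustration.

```python
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Sketch of a micro backbone whose first stage keeps the full slice resolution."""
    def __init__(self):
        super().__init__()
        # Conv-1: [3x3, 12] x 2, stride 1 -- no downsampling, so detail features are kept.
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 12, 3, stride=1, padding=1), nn.BatchNorm2d(12), nn.ReLU(inplace=True),
            nn.Conv2d(12, 12, 3, stride=1, padding=1), nn.BatchNorm2d(12), nn.ReLU(inplace=True),
        )
        # Assumed later stages with small numbers of convolution kernels.
        self.conv2 = nn.Sequential(
            nn.Conv2d(12, 24, 3, stride=2, padding=1), nn.BatchNorm2d(24), nn.ReLU(inplace=True),
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(24, 48, 3, stride=2, padding=1), nn.BatchNorm2d(48), nn.ReLU(inplace=True),
        )

    def forward(self, x):        # x: (N, 3, 640, 640) slice
        x = self.conv1(x)        # (N, 12, 640, 640) -- intermediate slice feature map, same size as the slice
        x = self.conv2(x)
        return self.conv3(x)     # slice feature map
```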
It should be noted that existing basic network structures such as VGG (Visual Geometry Group network), Inception networks or ResNet (Residual Neural Network) have large numbers of parameters, which makes their real-time efficiency low; in addition, with so many parameters, these basic network structures are prone to overfitting, especially when samples are limited. The micro backbone network shown in Table 1 has a small number of parameters, does not require pre-training, and avoids the domain gap between a pre-training data set and the target data set, thereby further improving the accuracy of the model.
Because the slice at its original resolution is input into the first network layer and the output of the first network layer is still an intermediate slice feature map with the same size as the slice, the detail features of the original image are retained, possible targets within a small area are not lost, and the accuracy of target detection is improved. In addition, the number of network layers of the micro backbone network, and the number of convolution kernels included in each network layer, are kept within certain thresholds; with this lightweight network structure, the slice feature map can be extracted quickly, which improves detection efficiency. Compared with commonly used networks that contain a large number of parameters, such as ResNet, MobileNet and ShuffleNet, the micro backbone network has fewer parameters and is less prone to overfitting, so it is not constrained by the size of the data set and is suitable for data sets of various scales. On the premise of supporting small target detection and ensuring detection accuracy, it reduces the amount of computation, greatly improves detection efficiency, and is suitable for deployment in various low-end hardware scenarios, which widens its application range and improves its applicability.
Step 2032, for each slice, the server extracts the context feature of each feature point in the slice feature map of the slice through the pooling layer of the attention network, so as to obtain the context feature map of the slice.
The pooling layer may be a multi-scale pooling layer. For the slice feature map of each slice, the server may perform context feature extraction on the slice feature map through a plurality of convolution kernels of different sizes in the pooling layer, and obtain the context feature map of the slice based on the resulting intermediate context feature maps. In one possible implementation, step 2032 may include: the server extracts the context feature of each feature point in the slice feature map through each of at least two convolution kernels included in the pooling layer, so as to obtain at least two first context feature maps of the slice feature map, where the at least two first context feature maps differ in size. Illustratively, the server may extract the context features of each slice feature map separately through each convolution kernel, obtaining, for each slice feature map, at least two first context feature maps corresponding to the at least two convolution kernels. For example, for a slice feature map A, context feature extraction may be performed on A through three convolution kernels of sizes 1 × 1, 2 × 2 and 4 × 4, respectively, to obtain three context feature maps A1, A2 and A3 of A.
It should be noted that the attention network may be a Global Attention Module (GAM) configured in the target model; the GAM module is connected after the TBNet and may be implemented as a feature pyramid pooling module, which may include different pyramid layers. For example, pooling of the slice feature map is performed by convolution kernels of three different pyramid layers, 1 × 1, 2 × 2 and 4 × 4, respectively. By extracting the context features of the slice features through the pooling layer of the attention network, each feature point in the context feature map can represent not only the pixel point feature of the corresponding pixel point but also the features of the pixel points in the surrounding area; the features of the pixel point corresponding to each feature point and of its surrounding context are fused, which improves the feature expression capability of each feature point in the context feature map, while some non-target information can be suppressed, further improving the accuracy of target detection.
Step 2033, the server performs feature fusion on the at least one context feature map to obtain a feature map of the slice.
In one possible implementation manner, when the server performs context feature extraction by using at least two convolution kernels to obtain at least two first context feature maps with different sizes, in this step, the server may perform size transformation first and then perform feature fusion. This step may include: the server up-samples the at least two first context feature maps to obtain at least two second context feature maps, wherein the sizes of the at least two second context feature maps are the same as the size of the slice feature map; and the server performs feature fusion on the at least two second context feature maps to obtain the feature map of the slice.
It should be noted that the server may implement an upsampling process by using a bilinear interpolation method through the GAM module, so as to restore the size of the context feature map to be the same as that of the slice feature map.
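A hedged PyTorch sketch of the behaviour described in steps 2031–2033 follows: the slice feature map is pooled at the 1 × 1, 2 × 2 and 4 × 4 pyramid scales, each result is restored to the slice feature map size by bilinear interpolation, and the maps are fused. The use of adaptive average pooling, the inclusion of the original slice feature map in the fusion, and fusion by concatenation plus a 1 × 1 convolution are assumptions; the text only specifies the pyramid sizes, bilinear upsampling and feature fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionModule(nn.Module):
    """Sketch of a GAM-style feature pyramid pooling module."""
    def __init__(self, channels, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # Assumed fusion: concatenate all maps, then mix with a 1x1 convolution.
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, feat):                          # feat: slice feature map (N, C, H, W)
        h, w = feat.shape[2:]
        context_maps = [feat]
        for size in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(feat, size)                       # first context feature map
            restored = F.interpolate(pooled, size=(h, w),
                                     mode="bilinear", align_corners=False)   # second context feature map
            context_maps.append(restored)
        return self.fuse(torch.cat(context_maps, dim=1))                     # fused feature map of the slice
```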
In the feature extraction layer, the micro backbone network performs lightweight computation and omits the downsampling process in its target network layer, so the slice features can be extracted quickly to obtain the slice feature map. The attention network then extracts the context features of the slice feature map, fusing global context information based on the features of the pixel point corresponding to each feature point and of its surrounding feature points, which further strengthens the feature representation capability of each feature point in the context feature map while suppressing some non-target information. Finally, the upsampled, pooled context feature maps are restored to the size of the slice feature map and fused, so the resulting feature map carries more context information and the receptive field can extend to the whole image. This facilitates the subsequent detection of targets of different sizes by the target model, improves the precision of target detection, and further improves the accuracy of small target detection.
Moreover, the basic network structures in the prior art are generally designed for image classification tasks, so the features they obtain are not well suited to detection tasks. Unlike the prior art, the present application not only extracts the features of each pixel point but also fuses the features of the surrounding pixel points, thereby fusing global context features and obtaining fused features with stronger expressive power that are better suited to detection tasks, which better matches the service requirements and improves the accuracy of target detection.
Step 204: the server detects the feature map corresponding to each slice to obtain the defect information of each slice.
The defect information is used to indicate a defect of a target object included in the slice. The target model may include a detector for target detection, and the server may detect the feature map through the detector to obtain defect information of the slice. Illustratively, the detector is used for target detection based on the input characteristic diagram and outputting defect information of the target object. In this step, for each slice, the server may determine, by a detector included in the target model, a target frame based on a candidate frame corresponding to each feature point in the feature map corresponding to the slice, and output a defect position and a defect classification result of the slice based on the target frame, where the target frame is used to indicate an area where a defect of the target object is located; the defect classification result includes a defect type and a type probability, the defect type refers to a defect type to which the defect belongs, and the type probability refers to a probability that the defect belongs to the defect type. For example, the defect category may include at least one of pressure injury, sticking, starving, and fouling, for example, one or more thereof. The defect position can be represented by the coordinates of the upper left corner and the lower right corner of the rectangular frame; of course, the lower left corner coordinate and the upper right corner coordinate may also be used for representation, and the representation manner of the defect position is not specifically limited in the embodiment of the present application.
For example, the server may generate at least two candidate frames for each feature point in the corresponding pixel region of the slice through the detector, and further regress the candidate frames into one target frame based on the confidence degrees of the at least two candidate frames corresponding to the feature points. This step may include: the server generates at least two candidate frames corresponding to each feature point in the feature map based on the offset parameters through a regression branch network of the detector; the server determines the contribution degree of each candidate frame except the current maximum candidate frame with the highest confidence degree in the at least two candidate frames based on the current maximum candidate frame, deletes a first candidate frame of which the contribution degree does not meet the target condition in each candidate frame, and executes the operation of determining the contribution degree and deleting the first candidate frame again based on the remaining candidate frames after each deletion until each candidate frame is traversed, and determines the target frame based on at least one second candidate frame remaining after the deletion operation; and the server classifies the area included by the target frame through the classification branch network of the detector and outputs the defect classification result. The offset parameter is used for indicating the offset distance between each boundary of the candidate frame and the positive sample pixel point in the corresponding slice of the feature map. The positive sample pixel point may be a pixel point included in a region where the defect is located in the slice. Illustratively, the contribution degree of the candidate box refers to a value degree of the candidate box in the process of determining the target box based on a plurality of candidate boxes. For example, in this step, the server may determine the target frame in a manner of multiple iterations, and the iteration process may include the following steps (1) - (4).
Step (1): the server sorts the candidate boxes in descending order of confidence and selects the current maximum-value candidate box with the highest confidence;
Step (2): for each candidate box other than the current maximum-value candidate box, the server determines in turn the overlap ratio between that candidate box and the current maximum-value candidate box;
Step (3): the server deletes the other candidate box when the overlap ratio between that candidate box and the current maximum-value candidate box exceeds a target threshold;
Step (4): the server repeats steps (1)-(3) on the currently remaining candidate boxes, that is, it selects the current maximum-value candidate box with the highest confidence among the remaining candidate boxes and performs the deletion operation on each other candidate box based on its overlap ratio with that current maximum-value candidate box, until all candidate boxes have been traversed. The server may then determine the target box based on the at least one second candidate box remaining after all candidate boxes have been traversed, for example by merging the at least one second candidate box to obtain the target box.
In one possible example, the overlap ratio between a candidate box and the current maximum-value candidate box may be used to represent the contribution degree of the candidate box; the larger the overlap ratio between the candidate box and the current maximum-value candidate box, the smaller the contribution degree of the candidate box. The overlap ratio is expressed as IoU (Intersection over Union), which is the overlap ratio between the candidate box and the original marked box, that is, as shown in fig. 3, the ratio between the intersection and the union of the candidate box and the original marked box; the numerator in fig. 3 may be the area of the black intersection region, and the denominator may be the area of the black union region. In this step, in the process of determining the target box based on the candidate boxes, the original marked box may be the current maximum-value candidate box, and the overlap ratio between a candidate box and the current maximum-value candidate box is the ratio between their intersection and their union.
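The following Python sketch illustrates the overlap ratio (IoU) and the greedy suppression loop of steps (1)-(4): the highest-confidence candidate box is kept and every remaining box whose IoU with it exceeds the target threshold is deleted, until all candidate boxes have been traversed. The 0.5 threshold and the (x1, y1, x2, y2) box convention are assumptions, and merging the surviving boxes into the final target box is not shown.

```python
def iou(box_a, box_b):
    """Overlap ratio of two boxes given as (x1, y1, x2, y2): intersection area / union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def suppress(candidates, iou_thresh=0.5):
    """candidates: list of (box, confidence); returns the candidate boxes that survive the deletions."""
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                                   # current maximum-value candidate box
        kept.append(best)
        remaining = [c for c in remaining if iou(c[0], best[0]) <= iou_thresh]
    return kept
```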
In one possible example, the confidence level is used to indicate the probability that the region within the candidate box is the region in which the defect is located. Each feature point in the feature map corresponds to a pixel region in the slice, and the pixel region includes a plurality of pixel points, for example, a feature point corresponds to a pixel region composed of 16 × 16 pixel points in the slice. The server may generate at least one candidate box for each feature point corresponding to a pixel region using the detector and calculate a confidence for each candidate box. The offset parameter is used for indicating the offset distance between each boundary of the target frame and the pixel point of the positive sample in the corresponding slice of the feature map. The positive sample pixel points may be pixel points corresponding to feature points within the target region of the slice. As shown in fig. 4, the offset parameters may include four parameters l, t, r, and b in the horizontal and vertical directions, wherein l represents the offset distance between the positive sample pixel point and the left boundary of the target frame, t represents the offset distance between the positive sample pixel point and the top boundary of the target frame, r represents the offset distance between the positive sample pixel point and the right boundary of the target frame, and b represents the offset distance between the positive sample pixel point and the bottom boundary of the target frame.
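As a small illustration of the offset parameters just described, the sketch below recovers a box from the four offsets (l, t, r, b) predicted for a positive-sample pixel point at coordinates (x, y); the function name and the (x1, y1, x2, y2) output convention are assumptions.

```python
def decode_box(x, y, l, t, r, b):
    """Box around positive-sample pixel (x, y): offsets to the left, top, right and bottom boundaries."""
    return (x - l, y - t, x + r, y + b)
```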
In one possible example, the server predicts, with the positive-sample pixel point as a reference, the offsets between that pixel point and the four boundaries of the target box, and fine-tunes the box based on the offset parameters. In one possible example, the regression branch network may be a Regression-type network; the classification branch network may be a Classification-type network and may include a classification function for determining the defect category. For example, the classification branch network may also perform category determination using the candidate boxes, for example by determining the defect category and category probability corresponding to each candidate box. The processes of steps (1)-(4) above are performed for candidate boxes belonging to the same defect category. For example, the defect category and category probability of the final target box may be obtained from the remaining at least one second candidate box, e.g., the category probability of the target box may be calculated by averaging or merging over the at least one second candidate box. The classification branch network may be implemented with 2 convolutional layers and 1 classification layer. The detector may be implemented as an anchor-free detector, which requires less computation and does not need anchor hyper-parameters to be preset from prior knowledge, so that results can be output quickly and accurately with a small amount of computation, thereby improving detection efficiency.
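A hedged PyTorch sketch of such an anchor-free detector head is shown below: a regression branch that predicts the four offsets (l, t, r, b) at every feature point, and a classification branch ending in a classification layer. The channel width, the 3 × 3 kernels, the number of convolutional layers in the regression branch, and the four defect categories are assumptions made for illustration.

```python
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Sketch of a detector head with a regression branch and a classification branch."""
    def __init__(self, in_channels=64, num_defect_classes=4):
        super().__init__()
        self.regression = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4, 1),                      # offsets l, t, r, b per feature point
        )
        self.classification = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, num_defect_classes, 1),     # classification layer
        )

    def forward(self, feature_map):
        offsets = self.regression(feature_map)                 # used to build candidate boxes
        class_logits = self.classification(feature_map)        # defect category scores
        return offsets, class_logits
```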
In one possible implementation, the target model may further include a classifier for outputting defect indication information for each slice, and the process may include: for each slice, the server outputs, through the classifier of the target model and based on the feature map of the slice, defect indication information indicating whether the local part of the target object included in the slice has defects. For example, the defect indication information may include a defect probability indicating whether the slice contains a defect. In one possible example, the classifier may be a Classifier-type network and may include a classification function, such as a binary classification function, for determining whether a defect exists. For example, the classifier may include two convolution modules (conv-bn-relu) and a Global Average Pooling (GAP) layer; the feature map is processed by the two convolution modules and subjected to global average pooling through the GAP, and a two-class classifier then outputs the defect probability indicating whether a defect exists.
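The slice-level classifier described above can be sketched in PyTorch as follows: two conv-bn-relu modules, global average pooling, and a two-class output giving the defect probability of the slice. The channel widths and the use of a fully connected layer as the final two-class classifier are assumptions.

```python
import torch.nn as nn

class SliceClassifier(nn.Module):
    """Sketch of the classifier that predicts whether a slice contains a defect."""
    def __init__(self, in_channels=64, mid_channels=64):
        super().__init__()
        def conv_bn_relu(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(conv_bn_relu(in_channels, mid_channels),
                                    conv_bn_relu(mid_channels, mid_channels))
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(mid_channels, 2)        # two-class classifier

    def forward(self, feature_map):                 # feature_map: (N, C, H, W) slice feature map
        x = self.gap(self.blocks(feature_map)).flatten(1)
        return self.fc(x).softmax(dim=-1)[:, 1]     # defect probability per slice
```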
In one possible implementation, the target model is finally obtained by jointly training the classifier and the detector. The training process of the target model includes: the server inputs a sample set into an initial model, where the sample set is a set including sample images of the target object and the truth labels of the sample images, and the initial model includes an initial detector and an initial classifier; the server determines, through a joint loss function, a joint difference degree between the joint result output by the initial model and the truth labels, based on the sample detection position and sample detection classification output by the initial detector, the sample indication information output by the initial classifier, and the truth labels of the sample set, where the joint result includes the sample detection position, the sample detection classification and the sample indication information; and the server adjusts the model parameters of the initial model based on the joint difference degree, and stops adjusting when a target condition is met, thereby obtaining the target model. The model parameters include at least the initial offset parameters of the initial detector. The server can optimize the parameters of each initial network layer of the initial model by means of the joint difference, run the sample set through the model again based on the optimized parameters, compute the joint difference based on the new output and the sample truth values, optimize the parameters again based on the latest joint difference, and iterate this optimization multiple times until the target condition is met and optimization stops. The target condition may be that the joint difference degree is smaller than a target difference threshold, or that the number of iterative adjustments exceeds a target number threshold, or the like.
In one possible example, the process by which the server calculates the joint difference degree based on the joint loss function may include: the server determines a first difference between a sample candidate box and a truth box based on the sample candidate box predicted by the initial regression branch network of the initial detector, the truth box of the sample set, and a first loss function; the server determines a second difference between the predicted category probability and the prediction box based on the overlap ratio between the truth box of the sample set and the prediction box, the category probability predicted by the initial classification branch network of the initial detector, and a second loss function, where the prediction box is the area where the defect is predicted to be based on the sample candidate boxes; the server determines a third difference between the sample indication information and the true probability based on the sample indication information predicted by the classifier, the true probability of the sample set, and a third loss function; and the server determines the joint difference degree through the joint loss function based on the first difference, the second difference, the third difference, and the supervision signal of the classifier. The server can predict the area where the defect is located based on a plurality of sample candidate boxes to obtain the prediction box; for example, a process similar to the iterative process of steps (1)-(4) above may be used to obtain the prediction box from the sample candidate boxes, which will not be repeated here. The sample indication information may be the probability, predicted by the classifier of the initial model, that the sample image includes a defect; the true probability is 1 when the sample image includes a defect and 0 otherwise. The truth box may be the area where the defect of the sample image actually lies, indicating the true value of the defect position.
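A minimal Python sketch of how such a joint difference might be assembled is given below, assuming the joint loss is a weighted sum of the first (regression), second (classification) and third (classifier) differences; the equal weights and the way the classifier's supervision signal gates its term are assumptions, since the text does not specify the exact form of the joint loss function.

```python
def joint_difference(first_diff, second_diff, third_diff,
                     classifier_supervision=1.0, weights=(1.0, 1.0, 1.0)):
    """Combine the three differences into the joint difference degree used for optimization."""
    w1, w2, w3 = weights
    return w1 * first_diff + w2 * second_diff + w3 * classifier_supervision * third_diff
```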
In one possible example, the server may calculate the first difference based on the prediction box predicted by the initial detector for a sample and on the prediction box and the truth box. The process may include: for each sample image, the server predicts at least two sample candidate boxes of the sample image through the initial regression branch network based on the initial offset parameters of the initial regression branch network; the server recombines the boundaries of the at least two sample candidate boxes to obtain at least two recombination boxes based on the deviations between the boundaries of the at least two sample candidate boxes and the boundaries of the truth box; and the server determines the first difference between each recombination box and the truth box based on the overlap ratio between each recombination box and the truth box, the number of pixel points of the truth box, and the first loss function. The initial offset parameters may be initial values of the offset parameters, for example, initial values of the four parameters l, t, r and b. The confidence of a recombination box is used to indicate the probability that the recombination box is the truth box, and the confidence of a sample candidate box is used to indicate the probability that the sample candidate box is the truth box. For example, the overlap ratio between the recombination box (or sample candidate box) and the truth box may be used as the confidence of the recombination box (or sample candidate box). The deviation between a boundary of a sample candidate box and a boundary of the truth box may be the deviation, for example a distance difference, between the two boundaries at the same relative position of the candidate box and the truth box. The relative position is the position of the boundary relative to the box, e.g., the left boundary on the left of the box or the top boundary at the top of the box, taken relative to the positive-sample pixel point of the box. Boundaries at the same relative position may be, for example, the left boundary of the candidate box and the left boundary of the truth box, or the top boundary of the candidate box and the top boundary of the truth box.
For example, the process by which the server reorganizes the sample candidate boxes to obtain the recombination boxes may include: the server decomposes the boundaries of the at least two sample candidate boxes into at least two boundary sets based on the confidence of each sample candidate box, where each boundary set includes the boundaries of the at least two sample candidate boxes at the same relative position; for each boundary set, the server calculates the deviation between each boundary in the set and the truth boundary of the corresponding truth box, and sorts the boundaries in the set based on their deviations; and the server recombines the boundaries with the same arrangement order in the respective boundary sets into a recombination box based on the arrangement order of each boundary in its boundary set, and calculates the overlap ratio between the recombination box and the truth box. For example, the candidate boxes, prediction boxes and truth boxes may be rectangular boxes. In one possible example, the process by which the server predicts the sample candidate boxes of a sample based on the initial detector and calculates the first difference may include the five processes of decomposition, sorting, recombination, assignment and difference calculation, corresponding to the following steps a-e.
Step a: decomposition. Based on the confidence of each predicted candidate box, the server represents the candidate box by the confidences of its four boundaries, and then divides the boundaries at the four relative positions into four groups, establishing four boundary sets:
left={l0,l1,l2…ln};right={r0,r1,r2…rn};
top={t0,t1,t2…tn};bottom={b0,b1,b2…bn};
where left is the set of confidences of the left boundaries of the candidate boxes; right is the set of confidences of the right boundaries of the candidate boxes; top is the set of confidences of the top boundaries of the candidate boxes; bottom is the set of confidences of the bottom boundaries of the candidate boxes. As shown in fig. 5, step a decomposes the boundaries of the three candidate boxes S0, S1 and S2 into four boundary sets. As shown in fig. 5 (a), the right boundary set contains the right boundaries of the three candidate boxes S0, S1 and S2.
Step b: ranking. For each of the four boundary sets left, right, top and bottom, the server calculates the deviation of each boundary from the truth boundary at the same relative position in the truth box, and sorts each boundary set based on these deviations;
as shown in fig. 5 (b), the right boundary set contains the right boundaries of the three candidate boxes S0, S1 and S2. Comparing their positions with the right boundary of the truth box, it is apparent that the right boundary of S2 is closest to the right boundary of the truth box, the right boundary of S0 is next, and the right boundary of S1 is farthest. The deviation of the right boundary of S2 from the right boundary of the truth box is δ3, that of S0 is δ1, and that of S1 is δ2. Sorting the right boundaries of S0, S1 and S2 in ascending order of deviation gives the rank: the right boundary of S2 first, the right boundary of S0 second, and the right boundary of S1 last. Step b (ranking) thus yields the rank of each boundary within its boundary set.
Step c: recombination. The boundaries with the same rank in the four boundary sets are recombined into a recombination box, and the overlap ratio between the recombination box and the truth box is calculated and used as the recombination confidence of the four boundaries of the recombination box.
As shown in fig. 5 (c), the boundaries with the same rank are recombined into new boxes, and the overlap ratio between each recombination box and the truth box is calculated. The recombination boxes derived from the three candidate boxes S0, S1 and S2 have confidences S0', S1' and S2'.
Step d: assignment. The confidence of each boundary in a recombination box is re-assigned based on the confidence of the recombination box and the confidence of the original candidate box.
For example, for each boundary in a recombination box, the larger of the confidence of the candidate box the boundary originally came from and the confidence of the recombination box it now belongs to may be used as the confidence of that boundary. That is, as shown in fig. 5 (d), each boundary now has two scores, one from the original candidate box and one from the recombination box; for example, a boundary taken from candidate box S1 and placed in recombination box S0' is assigned the value max(S1, S0'), i.e., the maximum of the confidence of the original sample candidate box S1 and the confidence of the recombination box S0'. The final confidence of each boundary is thus the higher of the two scores rather than always one of them.
It should be noted that if the confidence of a recombination box is low, that is, the boundaries it contains are far from the boundaries of the ground-truth box, the confidences of the four recombined boundaries could end up far lower than those of the original boundaries; such heavily shifted confidence scores would cause unstable gradient back-propagation in the training phase. Therefore, the group with the higher score is selected to ensure the stability and accuracy of training.
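For illustration only, the following is a minimal Python sketch of steps a to d, assuming axis-aligned boxes given as (l, t, r, b) tuples and using the IoU with the truth box as the confidence, as described above; the helper names iou and decompose_recombine are illustrative and do not reproduce the patented implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def decompose_recombine(cand_boxes, truth_box):
    """Steps a-d: decompose candidate boxes into boundary sets, rank each set by
    its deviation from the truth boundary, recombine same-rank boundaries into
    new boxes, and give each boundary the higher of its two confidences."""
    cand_boxes = np.asarray(cand_boxes, dtype=float)   # shape (n, 4): l, t, r, b
    truth_box = np.asarray(truth_box, dtype=float)
    cand_conf = np.array([iou(b, truth_box) for b in cand_boxes])     # S_I

    # Step a (decomposition): column k of cand_boxes is the boundary set for
    # relative position k. Step b (ranking): sort each set by deviation.
    deviations = np.abs(cand_boxes - truth_box)        # deviation per boundary
    order = np.argsort(deviations, axis=0)             # rank within each set

    # Step c (recombination): boundaries with the same rank form a new box.
    recomb_boxes = np.take_along_axis(cand_boxes, order, axis=0)
    recomb_conf = np.array([iou(b, truth_box) for b in recomb_boxes])  # S'_I

    # Step d (assignment): each boundary keeps the larger of its original box
    # confidence and the confidence of the recombination box it now belongs to.
    boundary_conf = np.maximum(cand_conf[order], recomb_conf[:, None])
    return recomb_boxes, recomb_conf, boundary_conf
```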
Step e: difference calculation. The first loss function may take the form of formula I below. The step in which the server determines the first difference between each recombination box and the truth box based on the overlap ratio of the recombination box with the truth box, the number of pixel points of the truth box, and the first loss function includes: the server determines the first difference between each recombination box and the truth box according to formula I, based on the overlap ratio of the recombination box with the truth box, the overlap ratio of the candidate box with the truth box, and the number of pixel points of the truth box.
The formula I is as follows:
L_1 = (1/N_pos) · Σ_I [ 1(S'_I > S_I) · L_iou(B'_I, T_I) + (1 − 1(S'_I > S_I)) · L_iou(B_I, T_I) ]
wherein N_pos represents the number of pixel points in the truth box, that is, the number of pixel points contained in the pixel region of the sample image corresponding to the truth box; 1(S'_I > S_I) is an indicator function whose value is 1 if the confidence of the recombination box is higher than the confidence of the sample candidate box and 0 otherwise; S'_I represents the score of the I-th recombination box among the recombination boxes (i.e., its confidence, the overlap ratio between the I-th recombination box and the truth box); S_I represents the score of the I-th sample candidate box before recombination (i.e., its confidence, the overlap ratio between the I-th sample candidate box and the truth box); L_iou(B'_I, T_I) represents the overlap ratio between recombination box B'_I and truth box T_I, i.e., the confidence of the recombination box; L_iou(B_I, T_I) represents the overlap ratio between sample candidate box B_I and truth box T_I, i.e., the confidence of the sample candidate box.
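Continuing the sketch above (and reusing its iou helper and NumPy import), the following illustrates one way the first difference of formula I might be computed; the selective form of the sum is an assumption drawn from the term definitions given here, not a verbatim reproduction of the original formula image.

```python
def first_difference(cand_boxes, recomb_boxes, truth_box, n_pos):
    """Formula I as reconstructed above: for each box index I, take the overlap
    ratio of the recombination box if its confidence exceeds that of the original
    candidate box, otherwise the candidate box's overlap ratio, then normalise by
    the number of positive pixel points N_pos."""
    cand_conf = np.array([iou(b, truth_box) for b in cand_boxes])      # S_I
    recomb_conf = np.array([iou(b, truth_box) for b in recomb_boxes])  # S'_I
    use_recomb = recomb_conf > cand_conf                               # indicator
    per_box = np.where(use_recomb, recomb_conf, cand_conf)
    return per_box.sum() / float(n_pos)
```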
In one possible implementation, the second loss function may be a QFL (Quality Focal Loss) function, which may take the form of formula II below. The step in which the server determines a second difference between the predicted class probability and the overlap ratio, based on the overlap ratio between the truth box and the prediction box of the sample set, the predicted class probability output by the initial classification branch network, and the second loss function, may include: the server calculates the second difference from the overlap ratio between the truth box and the prediction box and from the predicted class probability through formula II below:
the formula II is as follows: QFL(σ) = −|y − σ|^β · ((1 − y)·log(1 − σ) + y·log(σ));
wherein σ is the prediction output, i.e., the predicted class probability of a certain class, and y is the overlap ratio between the prediction box and the truth box, with 0 < y < 1.
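As a small illustration of formula II, the sketch below evaluates the QFL element-wise with NumPy; the exponent β defaults to 2 and the clipping epsilon is an implementation convenience, neither value being specified in this text.

```python
import numpy as np

def quality_focal_loss(sigma, y, beta=2.0, eps=1e-9):
    """QFL(sigma) = -|y - sigma|^beta * ((1 - y)*log(1 - sigma) + y*log(sigma)),
    where sigma is the predicted class probability and y is the overlap ratio of
    the prediction box with the truth box (the soft target)."""
    sigma = np.clip(np.asarray(sigma, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    modulator = np.abs(y - sigma) ** beta
    ce = (1.0 - y) * np.log(1.0 - sigma) + y * np.log(sigma)
    return -modulator * ce
```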
in one possible implementation, the third loss function may be the loss function of the classifier, which may be expressed as a softmax loss function; that is, the server may calculate the third difference based on the sample indication information predicted by the classifier, the truth probability of the sample, and the softmax loss function. In one possible example, the server may form a joint loss function from formula I, formula II and the softmax loss function, and the step in which the server determines the joint difference degree based on the first difference, the second difference, the third difference and the supervision signal of the classifier through the joint loss function may include: the server calculates the joint difference degree from the first difference, the second difference, the third difference and the supervision signal of the classifier through formula III, corresponding to the joint loss function, below:
the formula III is as follows:
L = L_cls + 1(n = 1) · L_reg + γ · L_n;
wherein L is the joint difference degree; n is the supervision signal corresponding to the classifier and can represent the truth probability, n ∈ {0, 1}, where n = 1 indicates that the currently input slice contains a defect to be detected and n = 0 indicates that it does not; L_cls = QFL denotes the loss function of the classification branch network of the detector, i.e., the second loss function; L_reg denotes the loss function of the regression branch network of the detector, i.e., the first loss function; 1(n = 1) is an indicator function whose value is 1 if n = 1 (the currently input slice contains the defect to be detected) and 0 otherwise; L_n denotes the loss of the classifier, i.e., the third loss function; γ is a hyper-parameter that can be configured as needed, for example γ = 0.25.
The server may calculate the joint difference degree through the joint loss function shown in formula III above, and optimize the parameters of each network layer based on the joint difference degree, such as the micro backbone network and attention network of the feature extraction layer of the target model, the regression branch network and classification branch network included in the detector, and the classifier. For example, a gradient descent algorithm may be used to optimize the parameters of each network layer, for instance the offset parameter of the detector, so that the detector outputs more accurate defect information with the optimized parameters.
By jointly using the output results of the classifier and the detector and a joint loss function, the parameters of each network layer in the model are optimized through joint training, so that the detector and the classifier reinforce each other during training, which reduces the false detection probability, improves the detection accuracy, and improves the detection performance. In addition, combining the first difference and the second difference through the respective loss functions of the regression branch and the classification branch of the detector further strengthens the optimization and improves the accuracy of the trained target model.
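The following schematic training step shows how the three differences could be combined into the joint difference degree and used for a gradient update; the callables detector, classifier, regression_loss, classification_loss, classifier_loss and optimizer are placeholders for whatever implementations are used, the composition follows the reconstruction of formula III above, and an autograd framework in the style of PyTorch is assumed for backward() and the optimizer interface.

```python
def joint_training_step(feature_map, truth_boxes, truth_prob, detector, classifier,
                        regression_loss, classification_loss, classifier_loss,
                        optimizer, gamma=0.25):
    """One optimization step driven by the joint difference degree (formula III)."""
    # Forward passes of the two heads on the fused slice feature map.
    pred_boxes, pred_class_prob = detector(feature_map)
    defect_indication = classifier(feature_map)

    l_reg = regression_loss(pred_boxes, truth_boxes)            # first difference
    l_cls = classification_loss(pred_class_prob, truth_boxes)   # second difference (QFL)
    l_n = classifier_loss(defect_indication, truth_prob)        # third difference

    # Supervision signal n: 1 if the current slice contains a defect, 0 otherwise.
    n = 1 if truth_prob > 0 else 0
    joint = l_cls + (l_reg if n == 1 else 0.0) + gamma * l_n    # formula III

    optimizer.zero_grad()
    joint.backward()
    optimizer.step()
    return joint
```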
Fig. 6 is a visualization of defects detected using the method of the present application. Fig. 7 is a schematic flowchart of target detection based on each network layer in the target model. In fig. 7, taking a mobile phone camera holder accessory as an example, the image to be detected may contain defects of the accessory such as pressure damage, adhesion, material shortage and dirt. Fig. 7 (a) shows the small slices obtained by segmenting the image to be detected; two of these small slices are shown. The two small slices are input into the micro backbone network TBNet of the target model; after the target network layer conv-1 of the micro backbone network, in which the downsampling operation is omitted, an intermediate feature map with the same size as the slices is output, and the remaining network layers of the TBNet network then produce the slice feature map. The slice feature map is input into the attention module GAM, where the pyramid layer pools it, for example with the three pooling kernels 1×1, 2×2 and 4×4 shown in fig. 7 (c), to obtain three context feature maps of different scales; the three context feature maps are restored to the same scale by upsampling, for example to the size of the slice feature map, and are fused into one feature map through the feature fusion layer Eltwise. The fused feature map is input into the Detector in (d) and into the Classifier in (e). In (d), the regression branch network Regression passes the feature map through 4 convolution layers and the D&R (Decomposition and Recombination) module; the feature map output by the feature fusion layer Eltwise has shape N×128×H×W, where N is the number of slices, H the slice height, W the slice width and 128 the number of channels, and after the D&R module a feature vector of N×4A×H×W is output, where 4A encodes the coordinates of the target box where the defect is located. Also in (d), the classification branch network Classification passes the data through 2 convolution layers and 1 classification layer while keeping the shape N×128×H×W; then, by multiplying the objectness probability data of N×A×H×W from the regression branch network Regression with the data output by the classification branch, a feature vector of N×KA×H×W is output, where K denotes K defect classes, so that the N×KA×H×W feature vector can represent the class probabilities corresponding to the K defect classes. In the Classifier in (e), after two convolution modules the data changes from N×128×H×W to N×128×(H/2)×(W/2), it is then processed by the global average pooling layer (Ave Pool) to output N×256 data, and finally a classifier, such as a binary classifier, outputs an N×2 matrix, which may indicate Defect y/n, i.e., whether a defect is present.
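As a rough illustration of the data flow just described, the sketch below wires a stride-1 first convolution (no downsampling) into a pyramid-pooling attention block whose 1×1, 2×2 and 4×4 pooled maps are upsampled back to the slice size and fused element-wise; the layer counts, the channel width of 128 and the class names are assumptions made for this example, not the exact TBNet or GAM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackboneSketch(nn.Module):
    """First layer keeps the slice size (stride 1, no pooling), so pixel-level
    detail is preserved; a few further conv layers produce the slice feature map."""
    def __init__(self, in_ch=3, ch=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, ch, kernel_size=3, stride=1, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                          # x: N x 3 x H x W (slices)
        return self.body(F.relu(self.conv1(x)))    # N x 128 x H x W

class PyramidAttentionSketch(nn.Module):
    """Pools the slice feature map to 1x1, 2x2 and 4x4 grids, upsamples each back
    to H x W, and fuses them element-wise (the 'Eltwise' step)."""
    def __init__(self, ch=128, grids=(1, 2, 4)):
        super().__init__()
        self.grids = grids
        self.proj = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in grids)

    def forward(self, feat):                       # feat: N x 128 x H x W
        h, w = feat.shape[-2:]
        fused = feat
        for g, proj in zip(self.grids, self.proj):
            ctx = F.adaptive_avg_pool2d(feat, g)   # N x 128 x g x g
            ctx = F.interpolate(proj(ctx), size=(h, w), mode="bilinear",
                                align_corners=False)
            fused = fused + ctx                    # element-wise fusion
        return fused

# Example shape check on two 64x64 slices.
slices = torch.randn(2, 3, 64, 64)
feat = PyramidAttentionSketch()(TinyBackboneSketch()(slices))
print(feat.shape)   # torch.Size([2, 128, 64, 64])
```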
As can be seen from fig. 7, in the target model of the present application, the network structure is clear, and each network layer and module has good generalization capability. By segmenting the high-resolution image to be detected, input images of any resolution can be supported for detection. Extracting pixel point features and pixel point context features through the feature extraction layer yields a feature map whose representation capability is greatly improved, which enhances the robustness of the model and further improves the detection accuracy. Small target detection in particular requires the model to be extremely robust to scale; the target detection method of the present application exploits the segmentation performed by the target model, the global context features, and the omission of downsampling in the target network layer, so that the feature map retains more detailed features while the receptive field expands to the whole image, improving the robustness of the model to targets of different scales.
For example, the following are experimental data for target detection by the method of the embodiment of the present application:
the model was trained on a self-constructed training set containing 7545 pictures and tested using 3458 pictures, with the results shown in table 2 below:
TABLE 2
Method | Picture size | Parameter count | Metrics (mAP/APs) | Inference time
Faster RCNN | 4096×3000 | 41.53M | 0.89/0.362 | 135ms
Method of the present application | 4096×3000 | 20.21M | 0.92/0.406 | 65ms
In table 2, the metrics include mAP (mean average precision) and APs (average precision on small targets). As shown in table 2, for pictures of the same size, the parameter count of the other method (Faster RCNN) is roughly twice as large, its metrics are clearly lower, and its inference takes longer. By contrast, the method of the present application achieves higher metrics with a smaller parameter count, takes less time, and is more efficient. As shown in fig. 6, which is a visualization of a test, even a defect as small as the one shown in fig. 6 can be detected when the target detection method of the present application is used, which improves the accuracy and precision of detection.
According to the target detection method of the present application, the image to be detected is segmented into at least two slices, pixel point features and pixel point context features are extracted from each slice through the feature extraction layer of the target model to obtain at least two feature maps, and the feature maps are detected to obtain the defect information of the slices. The limitation on resolution is removed, so the method is applicable to images of any resolution. The feature extraction layer includes a target network layer for extracting pixel point features from the slices, and the intermediate feature map output by the target network layer has the same size as the slices; that is, downsampling during pixel point feature extraction is omitted, so the detailed features of the original image are retained. The feature extraction layer is further designed to extract both pixel point features and pixel point context features, which improves the robustness of detection for targets of different sizes, so that even small targets within a small area can be detected accurately, improving the accuracy of target detection.
By using the joint loss function and combining the output results of the classifier and the detector, the parameters of each network layer in the model are optimized through joint training, so that the detector and the classifier reinforce each other during training, reducing the false detection probability and improving the detection accuracy and performance.
The micro backbone network has few network layers, few convolution kernels and few parameters, which reduces the amount of computation while still supporting small target detection and ensuring detection accuracy, greatly improves detection efficiency, makes the method suitable for deployment in various low-configuration hardware scenarios, and improves its applicability.
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:
a determining module 801, configured to determine an image to be detected including a target object, and determine at least two slices of the image to be detected;
a feature extraction module 802, configured to perform pixel feature extraction and pixel context feature extraction on the at least two slices through a feature extraction layer of the target model to obtain at least two feature maps, where any feature point of the feature maps is used to indicate a feature and a context feature of a corresponding pixel in the slice, a size of an intermediate feature map output by a target network layer of the feature extraction layer is the same as that of the slice, and the target network layer is used to perform pixel feature extraction on the slice;
and the detecting module 803 is configured to detect the feature map corresponding to each slice, so as to obtain defect information of each slice, where the defect information is used to indicate a defect of the target object included in the slice.
In one possible implementation, the feature extraction layer includes a micro backbone network and an attention network; the feature extraction module 802 includes:
a pixel point feature extraction unit, configured to perform pixel point feature extraction on the at least two slices through the micro backbone network to obtain at least two slice feature maps, where any feature point of a slice feature map is used to indicate a feature of a corresponding pixel point in the slice;
a context feature extraction unit, configured to, for each slice, extract, through a pooling layer of the attention network, context features of feature points in a slice feature map of the slice to obtain at least one context feature map of the slice, and perform feature fusion on the at least one context feature map to obtain a feature map of the slice;
the number of network layers included in the miniature backbone network does not exceed a first threshold, and the number of convolution kernels included in each network layer does not exceed a second threshold.
In a possible implementation manner, the context feature extraction unit is configured to extract, through each of at least two convolution kernels included in the pooling layer, a context feature of each feature point in the slice feature map, respectively, to obtain at least two first context feature maps of the slice feature map, where the at least two first context feature maps are different in size;
correspondingly, the context feature extraction unit is further configured to perform upsampling on the at least two first context feature maps to obtain at least two second context feature maps, where the sizes of the at least two second context feature maps are the same as the size of the slice feature map; and performing feature fusion on the at least two second context feature maps to obtain the feature map of the slice.
In a possible implementation manner, the detecting module 803 is configured to, for each slice, determine, by a detector included in the target model, a target frame based on a candidate frame corresponding to each feature point in a feature map corresponding to the slice, and output a defect position and a defect classification result of the slice based on the target frame, where the target frame is used to indicate a region where a defect of the target object is located;
the defect classification result includes a defect type and a type probability, the defect type refers to a defect type to which the defect belongs, and the type probability refers to a probability that the defect belongs to the defect type.
In a possible implementation manner, the detecting module 803 is configured to generate, through the regression branch network of the detector, at least two candidate frames corresponding to each feature point in the feature map based on an offset parameter, where the offset parameter is used to indicate a distance that each boundary of the candidate frames is offset from a positive sample pixel point in a corresponding slice of the feature map; determining the contribution degrees of other candidate frames except the current maximum candidate frame with the highest confidence degree in the at least two candidate frames, deleting a first candidate frame of which the contribution degrees do not meet the target condition in the other candidate frames, executing the operations of determining the contribution degrees and deleting the first candidate frame again on the basis of the candidate frames left after each deletion until each candidate frame is traversed, and determining the target frame on the basis of at least one second candidate frame left after the deletion operation; and classifying the area included by the target frame through a classification branch network of the detector, and outputting the defect classification result.
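A rough sketch of the iterative suppression described above is given below; since the text leaves the concrete definitions open, the contribution degree is assumed here to be the IoU of a remaining box with the current highest-confidence box, and the target condition a fixed threshold, which is the conventional non-maximum-suppression choice and may differ from the intended contribution measure.

```python
def box_iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def suppress_candidates(boxes, scores, contrib_thresh=0.5):
    """Repeatedly keep the highest-confidence remaining box and drop the other
    boxes whose contribution degree (here: IoU with it) fails the target condition."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # Recompute contribution degrees against the current maximum box and
        # delete first candidates that overlap it too heavily (redundant boxes).
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < contrib_thresh]
    return kept   # indices of the remaining second candidate boxes
```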
In one possible implementation, the apparatus further includes:
the classification module is used for outputting defect indication information of each slice through a classifier of the target model based on the feature map of the slice, wherein the defect indication information is used for indicating whether a local target object included in the slice has defects or not;
the target model is obtained by training combining the output result of the classifier and the output result of the detector; correspondingly, the apparatus further comprises a model training module, which comprises:
an input unit for inputting a sample set into an initial model, the sample set being a set comprising a sample image of a target object and truth labels of the sample image, the initial model comprising an initial detector and an initial classifier;
a joint difference determination unit, configured to determine, through a joint loss function, a joint difference degree between a joint result output by the initial model and the truth label based on the sample detection position and the sample detection classification output by the initial detector, the sample indication information output by the initial classifier, and the truth label of the sample set, where the joint result includes the sample detection position, the sample detection classification, and the sample indication information;
and the optimization unit is used for adjusting the model parameters of the initial model based on the joint difference degree until the model parameters meet target conditions, and stopping adjustment to obtain the target model, wherein the model parameters at least comprise the initial offset parameters of the initial detector.
In one possible implementation, the joint difference determining unit is configured to:
a first difference determination subunit, configured to determine a first difference between the sample candidate box and the truth box based on a sample candidate box predicted by an initial regression branch network of the initial detector, a truth box of the sample set, and a first loss function;
a second difference determining subunit, configured to determine, based on an overlap ratio between a true box and a prediction box of the sample set, a prediction class probability predicted by an initial classification branch network of the initial detector, and a second loss function, a second difference between the prediction class probability and the overlap ratio, where the prediction box is a region where a defect is predicted based on the sample candidate box;
a third difference determining subunit, configured to determine a third difference between the sample indication information and the true probability based on the sample indication information predicted by the classifier, the true probability of the sample set, and a third loss function;
a joint difference determining subunit, configured to determine the joint difference degree through the joint loss function based on the first difference, the second difference, the third difference, and the supervisory signal of the classifier.
In one possible implementation, the first difference determining subunit is configured to predict, for each sample image, at least two sample candidate frames of the sample image through the initial regression branch network based on an initial offset parameter of the initial regression branch network; recombining the boundaries of the at least two sample candidate frames to obtain at least two recombination frames based on the deviation between the boundaries of the at least two sample candidate frames and the boundaries of the true value frames; determining a first difference between each of the reconstruction boxes and the truth box based on an overlapping ratio of each of the reconstruction boxes and the truth box, the number of pixel points of the truth box, and the first loss function.
In one possible implementation, the first difference determining subunit is configured to decompose and partition boundaries of the at least two sample candidate boxes into at least two boundary sets based on the confidence of each sample candidate box, where each boundary set includes at least two boundaries of the at least two sample candidate boxes at the same relative position; for each boundary set, calculating the deviation between each boundary in the boundary set and a true value boundary of a corresponding true value box, and sequencing at least two boundaries in the boundary set based on the deviation corresponding to each boundary; and recombining the boundaries with the same arrangement sequence in each boundary set to obtain a recombination box based on the arrangement sequence of each boundary in each boundary set, and calculating the overlapping ratio between the recombination box and the truth-value box.
In a possible implementation manner, the determining module 801 is further configured to segment the image to be detected in a sliding window manner based on a target step size and a target size, so as to obtain at least two slices of the image to be detected.
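A minimal sketch of the sliding-window segmentation follows, assuming the image is a NumPy-style array indexed as (height, width, ...); handling of slices that run past the image border is simplified here, and the parameter names mirror the target step and target size mentioned above.

```python
def slice_image(image, target_size, target_step):
    """Slide a window of target_size over the image with stride target_step and
    return the slices as (top-left coordinate, cropped region) pairs."""
    h, w = image.shape[:2]
    size_h, size_w = target_size
    slices = []
    for top in range(0, max(h - size_h, 0) + 1, target_step):
        for left in range(0, max(w - size_w, 0) + 1, target_step):
            crop = image[top:top + size_h, left:left + size_w]
            slices.append(((top, left), crop))
    return slices
```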
The target detection apparatus provided by the present application segments the image to be detected into at least two slices, extracts pixel point features and pixel point context features from each slice through the feature extraction layer of the target model to obtain at least two feature maps, and detects the feature maps to obtain the defect information of the slices. The limitation on resolution is removed, so the apparatus is applicable to images of any resolution. The feature extraction layer includes a target network layer for extracting pixel point features from the slices, and the intermediate feature map output by the target network layer has the same size as the slices; that is, downsampling during pixel point feature extraction is omitted, so the detailed features of the original image are retained. The feature extraction layer is further designed to extract both pixel point features and pixel point context features, which improves the robustness of detection for targets of different sizes, so that even small targets within a small area can be detected accurately, improving the accuracy of target detection.
By using the joint loss function and combining the output results of the classifier and the detector, the parameters of each network layer in the model are optimized through joint training, so that the detector and the classifier reinforce each other during training, reducing the false detection probability and improving the detection accuracy and performance.
The micro backbone network has a small parameter count, a small number of convolution kernels and so on, which reduces the amount of computation while still supporting small target detection and ensuring detection accuracy, greatly improves detection efficiency, makes the apparatus suitable for deployment in various low-configuration hardware scenarios, and improves its applicability.
The target detection apparatus of this embodiment can perform the target detection method shown in the above embodiments of this application, and the implementation principles thereof are similar, and are not described herein again.
Fig. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present application. As shown in fig. 9, the computer apparatus includes: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements:
the method comprises the steps of obtaining at least two slices by segmenting an image to be detected, and extracting pixel point characteristics and pixel point context characteristics of each slice through a characteristic extraction layer of a target model to obtain at least two characteristic graphs; detecting the characteristic diagram to obtain defect information of the slice; the limitation on the resolution is removed, and the method is suitable for the condition of any resolution image; the characteristic extraction layer comprises a target network layer used for extracting pixel point characteristics from slices, the size of a middle characteristic graph output by the target network layer is the same as that of the slices, namely down sampling during extraction of the pixel point characteristics of the slices is omitted, so that the detail characteristics of an original image are reserved, the characteristic extraction layer is further designed to extract the pixel point characteristics and the pixel point context characteristics, the robustness of detection of the targets with different sizes is improved, even if small targets in a small area range are detected, the small targets can be accurately detected, and the accuracy of target detection is improved.
In an alternative embodiment, a computer device is provided. As shown in FIG. 9, the computer device 900 comprises a processor 901 and a memory 903, wherein the processor 901 is coupled to the memory 903, for example via a bus 902. Optionally, the computer device 900 may further include a transceiver 904, which may be used for data interaction between the computer device and other computer devices, such as transmission and/or reception of data. It should be noted that, in practical applications, the transceiver 904 is not limited to one, and the structure of the computer device 900 does not limit the embodiment of the present application.
The processor 901 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 901 may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 902 may include a path that transfers information between the above components. The bus 902 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 902 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus.
The Memory 903 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 903 is used for storing application program codes (computer programs) for executing the present application, and the processor 901 controls the execution. The processor 901 is configured to execute application program code stored in the memory 903 to implement the content shown in the foregoing method embodiments.
Among these, computer devices include, but are not limited to: servers, terminals or service clusters, etc.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content of the object detection method in the foregoing method embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the object detection method described above.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (15)

1. A method of object detection, the method comprising:
determining an image to be detected comprising a target object, and determining at least two slices of the image to be detected;
performing pixel point feature extraction and pixel point context feature extraction on the at least two slices through a feature extraction layer of a target model to obtain at least two feature maps, wherein any feature point of the feature maps is used for indicating the feature and the context feature of a corresponding pixel point in the slices, the size of a middle feature map output by a target network layer of the feature extraction layer is the same as that of the slices, and the target network layer is used for performing pixel point feature extraction on the slices;
and detecting the corresponding characteristic map of each slice to obtain defect information of each slice, wherein the defect information is used for indicating the defects of the target object included in the slice.
2. The object detection method of claim 1, wherein the feature extraction layer comprises a micro backbone network and an attention network; the pixel point feature extraction and the pixel point context feature extraction are carried out on the at least two slices through the feature extraction layer of the target model, so that at least two feature graphs are obtained, and the method comprises the following steps:
pixel point feature extraction is carried out on the at least two slices through the miniature backbone network to obtain at least two slice feature maps, and any feature point of the slice feature maps is used for indicating the feature of a corresponding pixel point in the slices;
for each slice, extracting the context feature of each feature point in the slice feature map of the slice through the pooling layer of the attention network to obtain at least one context feature map of the slice, and performing feature fusion on the at least one context feature map to obtain the feature map of the slice;
the number of network layers included in the miniature backbone network does not exceed a first threshold, and the number of convolution kernels included in each network layer does not exceed a second threshold.
3. The object detection method according to claim 2, wherein the extracting, through the pooling layer of the attention network, the context feature of each feature point in the slice feature map of the slice to obtain at least one context feature map of the slice comprises:
respectively extracting the context feature of each feature point in the slice feature map through each convolution kernel in at least two convolution kernels included in the pooling layer to obtain at least two first context feature maps of the slice feature map, wherein the sizes of the at least two first context feature maps are different;
correspondingly, the performing feature fusion on the at least one context feature map to obtain the feature map of the slice includes:
upsampling the at least two first context feature maps to obtain at least two second context feature maps, wherein the sizes of the at least two second context feature maps are the same as the size of the slice feature map;
and performing feature fusion on the at least two second context feature maps to obtain the feature map of the slice.
4. The object detection method of claim 1, wherein the detecting the feature map corresponding to each slice to obtain the defect information of each slice comprises:
for each slice, determining a target frame based on a candidate frame corresponding to each feature point in a feature map corresponding to the slice through a detector included in the target model, and outputting a defect position and a defect classification result of the slice based on the target frame, wherein the target frame is used for indicating an area where a defect of the target object is located;
the defect classification result comprises a defect class and a class probability, wherein the defect class refers to a defect class to which the defect belongs, and the class probability refers to a probability that the defect belongs to the defect class.
5. The method according to claim 4, wherein the determining, by the detector included in the target model, a target frame in which the defect is located based on the candidate frame corresponding to each feature point in the feature map corresponding to the slice, and outputting the defect position and the defect classification result of the slice based on the target frame includes:
generating at least two candidate frames corresponding to each feature point in the feature map based on an offset parameter through a regression branch network of the detector, wherein the offset parameter is used for indicating the offset distance between each boundary of the candidate frames and a positive sample pixel point in a corresponding slice of the feature map;
determining the contribution degree of each candidate frame except the current maximum candidate frame with the highest confidence degree in the at least two candidate frames, deleting a first candidate frame of which the contribution degree does not meet a target condition in the other candidate frames, executing the operation of determining the contribution degree and deleting the first candidate frame again on the basis of the candidate frames left after each deletion until each candidate frame is traversed, and determining the target frame on the basis of at least one second candidate frame left after the operation of deletion;
and classifying the region included by the target frame through a classification branch network of the detector, and outputting the defect classification result.
6. The object detection method of claim 4, further comprising:
for each slice, outputting defect indication information of the slice based on the feature map of the slice through a classifier of the target model, wherein the defect indication information is used for indicating whether a local target object included in the slice has defects or not;
the target model is obtained by training combining the output result of the classifier and the output result of the detector; correspondingly, the training process of the target model comprises the following steps:
inputting a sample set into an initial model, wherein the sample set refers to a set comprising a sample image of a target object and a truth label of the sample image, and the initial model comprises an initial detector and an initial classifier;
determining a joint difference degree between a joint result output by the initial model and a truth label through a joint loss function based on the sample detection position and the sample detection classification output by the initial detector, the sample indication information output by the initial classifier and the truth label of the sample set, wherein the joint result comprises the sample detection position, the sample detection classification and the sample indication information;
and adjusting model parameters of the initial model based on the joint difference degree, and stopping adjusting until target conditions are met to obtain the target model, wherein the model parameters at least comprise initial offset parameters of the initial detector.
7. The method of claim 6, wherein the determining the joint difference between the joint result of the initial model output and the truth label based on the sample detection position and the sample detection classification of the initial detector output, the sample indication information of the initial classifier output, and the truth label of the sample set through a joint loss function comprises:
determining a first difference between a sample candidate box and a truth box based on the sample candidate box, the truth box of a set of samples, and a first loss function predicted by an initial regression branch network of the initial detector;
determining a second difference between the prediction class probability and the overlapping ratio based on an overlapping ratio between a true value frame of the sample set and the prediction frame, the prediction class probability predicted by an initial classification branch network of the initial detector, and a second loss function, wherein the prediction frame is a region where the defect is predicted based on the sample candidate frame;
determining a third difference between sample indication information and a true probability based on the sample indication information predicted by the classifier, the true probability of the set of samples, and a third loss function;
determining the joint difference measure by the joint loss function based on the first difference, the second difference, the third difference and a supervisory signal of the classifier.
8. The method of claim 7, wherein determining the first difference between the sample candidate box and the truth box based on the sample candidate box, the truth box of the set of samples, and the first loss function predicted by the initial regression branch network of the initial detector comprises:
for each sample image, predicting, by the initial regression branch network, at least two sample candidate boxes for the sample image based on an initial offset parameter of the initial regression branch network;
recombining the boundaries of the at least two sample candidate frames to obtain at least two recombination frames based on the deviation between the boundaries of the at least two sample candidate frames and the boundaries of the true value frames;
determining a first difference between each reconstruction frame and a true value frame based on an overlapping ratio of each reconstruction frame and the true value frame, the number of pixel points of the true value frame, and the first loss function.
9. The method of claim 8, wherein the reconstructing the boundaries of the at least two sample candidate frames based on the deviations of the boundaries of the at least two sample candidate frames from the boundaries of the true value frame to obtain at least two reconstructed frames comprises:
decomposing and dividing boundaries of the at least two sample candidate boxes into at least two boundary sets based on the confidence of each sample candidate box, wherein each boundary set comprises at least two boundaries of the at least two sample candidate boxes on the same relative position;
for each boundary set, calculating the deviation between each boundary in the boundary set and a true value boundary of a corresponding true value box, and sequencing at least two boundaries in the boundary set based on the deviation corresponding to each boundary;
and recombining the boundaries with the same arrangement sequence in each boundary set to obtain a recombination box based on the arrangement sequence of each boundary in each boundary set, and calculating the overlapping ratio between the recombination box and the truth-value box.
10. The object detection method according to claim 1, wherein said determining at least two slices of said image to be detected comprises:
and segmenting the image to be detected by adopting a sliding window mode based on a target step length and a target size to obtain at least two slices of the image to be detected.
11. An object detection device, comprising:
the device comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining an image to be detected comprising a target object and determining at least two slices of the image to be detected;
the characteristic extraction module is used for carrying out pixel point characteristic extraction and pixel point context characteristic extraction on the at least two slices through a characteristic extraction layer of a target model to obtain at least two characteristic graphs, any characteristic point of the characteristic graphs is used for indicating the characteristic and the context characteristic of a corresponding pixel point in the slices, the size of a middle characteristic graph output by a target network layer of the characteristic extraction layer is the same as that of the slices, and the target network layer is used for carrying out pixel point characteristic extraction on the slices;
and the detection module is used for detecting the corresponding characteristic diagram of each slice to obtain the defect information of each slice, and the defect information is used for indicating the defects of the target object included in the slice.
12. The object detection device of claim 11, wherein the feature extraction layer comprises a micro backbone network and an attention network;
the feature extraction module comprises:
a pixel point feature extraction unit, configured to perform pixel point feature extraction on the at least two slices through the micro backbone network to obtain at least two slice feature maps, where any feature point of a slice feature map is used to indicate a feature of a corresponding pixel point in the slice;
a context feature extraction unit, configured to, for each slice, extract, through a pooling layer of the attention network, context features of feature points in a slice feature map of the slice to obtain at least one context feature map of the slice, and perform feature fusion on the at least one context feature map to obtain a feature map of the slice;
the number of network layers included in the miniature backbone network does not exceed a first threshold, and the number of convolution kernels included in each network layer does not exceed a second threshold.
13. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the object detection method of any of claims 1 to 10.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the object detection method of any one of claims 1 to 10.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the object detection method of any one of claims 1 to 10.
CN202111154945.5A 2021-09-29 2021-09-29 Object detection method, object detection device, computer equipment, storage medium and program product Pending CN114332473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111154945.5A CN114332473A (en) 2021-09-29 2021-09-29 Object detection method, object detection device, computer equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111154945.5A CN114332473A (en) 2021-09-29 2021-09-29 Object detection method, object detection device, computer equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN114332473A true CN114332473A (en) 2022-04-12

Family

ID=81044783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111154945.5A Pending CN114332473A (en) 2021-09-29 2021-09-29 Object detection method, object detection device, computer equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114332473A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953418A (en) * 2023-02-01 2023-04-11 公安部第一研究所 Method, storage medium and equipment for stripping notebook region in security check CT three-dimensional image
CN115953418B (en) * 2023-02-01 2023-11-07 公安部第一研究所 Notebook area stripping method, storage medium and device in security inspection CT three-dimensional image
CN117876798A (en) * 2024-03-11 2024-04-12 华南理工大学 Method, system, equipment and storage medium for detecting faceting defect of engine


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination