WO2022252565A1

WO2022252565A1 - Target detection system, method and apparatus, and device and medium

Info

Publication number: WO2022252565A1
Application number: PCT/CN2021/139062
Authority: WO
Inventors: 廖丹萍
Original assignee: 浙江智慧视频安防创新中心有限公司
Priority date: 2021-06-04
Filing date: 2021-12-17
Publication date: 2022-12-08
Also published as: CN113255682A; CN113255682B

Abstract

A target detection system, method and apparatus, and a medium and a device. The system comprises: an input module, which is used for receiving input image data; a feature extraction module, which is used for performing feature extraction on the image data by means of a convolutional neural network, so as to obtain an extracted feature map; a candidate region proposal module, which is used for receiving the feature map, and outputting a coarse bounding-box position of a foreground region that includes a target, and a bounding-box position of a background region; a candidate region extraction module, which is used for cropping the feature map to obtain a candidate background region and a candidate foreground region by using the bounding-box positions which are output by the candidate region proposal module, and adjusting the regions to the same size, so as to obtain candidate regions; and a detection module, which is used for classifying the obtained candidate regions, and further correcting a bounding-box position of a foreground candidate region by using a bounding-box regression algorithm, so as to obtain a final position of a detected target.

Description

A target detection system, method, device, equipment and medium

Technical field

The present disclosure relates to the technical field of deep learning, and more specifically, the present disclosure relates to a target detection system, method, device, equipment and medium.

Background

Object detection is an important research direction in computer vision and digital image processing, and is widely used in robot navigation, intelligent video surveillance, industrial inspection, aerospace and many other fields. The goal of object detection is to find the objects of interest in an image; it comprises two subtasks, object localization and object classification, that is, determining the category and the location of each object at the same time.

At present, training a convolutional neural network on a large amount of image data has become the mainstream approach to target detection in the industry. Neural-network-based algorithms can broadly be classified into two categories: two-stage algorithms represented by Faster R-CNN, and one-stage algorithms represented by YOLO and SSD.

The two-stage model represented by Faster R-CNN roughly includes five modules:

Input module: This module receives an input image.

Feature extraction module: This module extracts feature maps from the input image through a series of convolutional neural network layers.

Region Proposal Network (RPN): This module receives the feature map and outputs the coarse bounding-box positions of foreground regions that contain a target, together with the bounding-box positions of background regions.

Candidate region extraction module: This module uses the bounding-box positions output by the RPN to crop candidate background regions and foreground regions from the feature map, and resizes the candidate regions to the same size.

Detection module: This module classifies the obtained candidate regions and uses a bounding-box regression algorithm to further correct the bounding-box positions, yielding the final position of each detected region.
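The candidate region extraction step described above (crop a region from the feature map, then bring every region to a common size) can be illustrated with a minimal sketch. The function name crop_and_resize, the single-channel list-of-lists feature map, the exclusive (x1, y1, x2, y2) box convention and the use of nearest-neighbor sampling are all assumptions made for illustration; practical two-stage detectors typically use RoI pooling or RoI alignment instead.

```python
def crop_and_resize(feature_map, box, out_size=2):
    """Crop `box` from a 2D feature map and resize it to out_size x out_size.

    `box` is (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    Nearest-neighbor sampling is used purely to keep the sketch short.
    """
    x1, y1, x2, y2 = box
    height, width = y2 - y1, x2 - x1
    if height <= 0 or width <= 0:
        raise ValueError("box must have positive width and height")
    resized = []
    for row in range(out_size):
        # Map each output row to a source row inside the box.
        src_y = y1 + min(height - 1, row * height // out_size)
        out_row = []
        for col in range(out_size):
            # Map each output column to a source column inside the box.
            src_x = x1 + min(width - 1, col * width // out_size)
            out_row.append(feature_map[src_y][src_x])
        resized.append(out_row)
    return resized


feature_map = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
# Two boxes of different shapes both come out as 2 x 2 regions.
print(crop_and_resize(feature_map, (0, 0, 4, 4)))
print(crop_and_resize(feature_map, (1, 1, 3, 4)))
```

Whatever sampling scheme is used, the point is that regions of different shapes reach the detection module with identical dimensions.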

The detection module needs to classify each candidate region to determine which foreground object category, or the background, it belongs to. A prerequisite for classification is to construct a training set of candidate-region feature maps, consisting of the feature maps and the labels of the candidate regions. The label of a candidate region is generally determined by the intersection over union (IoU) between the candidate region and the ground-truth bounding boxes. Usually, the detection module sets a fixed IoU threshold. When the IoU between a candidate region and a ground-truth bounding box is greater than the IoU threshold, the label of the candidate region is the object category contained in that ground-truth box (positive sample). If the IoU between the candidate region and every ground-truth bounding box is less than the IoU threshold, its label is the background class (negative sample). Experimental observations show that when the IoU threshold is set relatively low, a large number of low-quality candidate regions are labeled as positive samples, and the detector then produces more inaccurate bounding boxes. When the IoU threshold is set relatively high, the quality of the candidate regions improves, but the number of positive samples drops sharply and the model overfits easily.
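The labeling rule with a single fixed IoU threshold can be sketched as follows. The helper names iou and assign_label, the (x1, y1, x2, y2) box format and the "background" label string are assumptions for illustration. The text does not say how an IoU exactly equal to the threshold is handled; this sketch treats it as a negative sample.

```python
BACKGROUND = "background"


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def assign_label(candidate, ground_truths, iou_threshold=0.5):
    """Return (label, best_iou) for one candidate region.

    `ground_truths` is a list of (box, category) pairs. The candidate takes
    the category of its best-matching ground-truth box when the IoU exceeds
    the threshold, and the background class otherwise.
    """
    best_iou, best_category = 0.0, BACKGROUND
    for gt_box, category in ground_truths:
        overlap = iou(candidate, gt_box)
        if overlap > best_iou:
            best_iou, best_category = overlap, category
    if best_iou > iou_threshold:
        return best_category, best_iou
    return BACKGROUND, best_iou


ground_truths = [((0, 0, 10, 8), "car")]
print(assign_label((0, 0, 10, 10), ground_truths, iou_threshold=0.5))
print(assign_label((0, 0, 10, 10), ground_truths, iou_threshold=0.9))
```

The same candidate is a positive sample at a threshold of 0.5 and a negative sample at 0.9, which is the trade-off the paragraph describes.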

Summary of the invention

In order to solve the technical problem that the accuracy of the existing target detection algorithm based on deep learning is not high enough, the present disclosure provides a target detection system, including:

An input module, configured to receive input image data;

A feature extraction module, configured to perform feature extraction on the image data by means of a convolutional neural network to obtain a feature map;

A candidate region proposal module, configured to receive the feature map and output a coarse bounding-box position of a foreground region containing a target and a bounding-box position of a background region;

A candidate region extraction module, configured to use the bounding-box positions output by the candidate region proposal module to crop a candidate background region and a candidate foreground region from the feature map, and to adjust the regions to the same size to obtain candidate regions;

A detection module, configured to classify the obtained candidate regions and to further correct the bounding-box position of a foreground candidate region by using a bounding-box regression algorithm, so as to obtain the final position of the detected target.
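The disclosure does not state how the bounding-box regression is parameterized. As a hedged illustration only, the sketch below uses the center-offset and log-scale encoding that is common in the R-CNN family of detectors; the function names encode_deltas and apply_deltas and the (x1, y1, x2, y2) box format are assumptions, not part of the disclosure.

```python
import math


def encode_deltas(box, target):
    """Regression targets (dx, dy, dw, dh) that move `box` onto `target`."""
    x1, y1, x2, y2 = box
    tx1, ty1, tx2, ty2 = target
    w, h = x2 - x1, y2 - y1
    tw, th = tx2 - tx1, ty2 - ty1
    dx = ((tx1 + 0.5 * tw) - (x1 + 0.5 * w)) / w
    dy = ((ty1 + 0.5 * th) - (y1 + 0.5 * h)) / h
    return dx, dy, math.log(tw / w), math.log(th / h)


def apply_deltas(box, deltas):
    """Correct a candidate box with predicted (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    # Shift the center in units of the box size, scale the size exponentially.
    new_cx = x1 + 0.5 * w + dx * w
    new_cy = y1 + 0.5 * h + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


candidate = (0.0, 0.0, 10.0, 10.0)
ground_truth = (2.0, 1.0, 14.0, 9.0)
deltas = encode_deltas(candidate, ground_truth)
print(apply_deltas(candidate, deltas))  # recovers the ground-truth box up to rounding
```

With this encoding, a detector that predicts the deltas perfectly recovers the ground-truth box, and an all-zero prediction leaves the candidate unchanged.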

Further, the detection module specifically includes at least one detector, wherein each detector is preset with a corresponding intersection-over-union (IoU) threshold for classifying candidate regions into positive samples and negative samples: a candidate region whose IoU with the ground-truth bounding box is greater than the IoU threshold is a positive sample, and a candidate region whose IoU with the ground-truth bounding box is smaller than the IoU threshold is a negative sample;

The detection module is specifically used for:

Screening the candidate regions extracted by the candidate region extraction module, calculating the IoU between each candidate region and the ground-truth bounding box, finding the detector whose IoU threshold corresponds to that IoU, and inputting the candidate region to the corresponding detector.

Further, the detection module is also used for:

After a candidate region is input to a detector, classifying the candidate region and adjusting its position, recalculating the IoU between the adjusted candidate region and the ground-truth label, and inputting the adjusted candidate region to the detector corresponding to its IoU value range.

Further, the number of the detectors is three, respectively the first detector, the second detector and the third detector;

The intersection-over-union ratio threshold of the first detector is preset to be 0.45-0.55;

The intersection-over-union ratio threshold of the second detector is preset to be 0.56-0.65;

The intersection-over-union ratio threshold of the third detector is preset to be 0.66-0.75.

In order to achieve the above-mentioned technical purpose, the present disclosure can also provide a target detection method, which is applied to the above-mentioned system, and the method includes:

Collecting image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

Inputting the image data to the target detection system to obtain the detection result of each detector;

Comparing the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

Further, after the step of comparing the detection result with the ground-truth label by using the loss function to obtain the loss of each detector, the method also includes:

Summing the losses of all the detectors to obtain the overall loss of the target detection system.

Further, when the system is used for target classification, the loss function is a cross-entropy loss function;

When the system is used for position regression analysis, the loss function is a Smooth L1 loss function or a GIoU loss function.
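The losses named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the beta parameter of the Smooth L1 loss, the (x1, y1, x2, y2) box format, and the choice to add the classification and regression terms within each detector before summing across detectors are not specified in the disclosure.

```python
import math


def cross_entropy(probabilities, target_index):
    """Classification loss for one candidate region given class probabilities."""
    return -math.log(max(probabilities[target_index], 1e-12))


def smooth_l1(predicted, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates (quadratic near zero, linear beyond beta)."""
    total = 0.0
    for p, t in zip(predicted, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def giou_loss(pred_box, gt_box):
    """1 - GIoU for two valid (x1, y1, x2, y2) boxes with positive area."""
    px1, py1, px2, py2 = pred_box
    gx1, gy1, gx2, gy2 = gt_box
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    # Smallest axis-aligned box that encloses both boxes.
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def total_loss(per_detector_losses):
    """Sum the (classification, regression) loss pairs of all detectors."""
    return sum(cls_loss + reg_loss for cls_loss, reg_loss in per_detector_losses)


# Hypothetical per-detector loss values for H1, H2 and H3.
print(total_loss([(0.5, 0.25), (0.3, 0.2), (0.1, 0.05)]))
```

Unlike Smooth L1, the GIoU loss stays informative for non-overlapping boxes, because the enclosing-box term still changes as the prediction moves toward the ground truth.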

In order to achieve the above technical purpose, the present disclosure can also provide a target detection device, including:

An image data collection module, configured to collect image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

A target detection module, configured to input the image data to the target detection system to obtain the detection result of each detector;

A loss calculation module, configured to compare the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

To achieve the above-mentioned technical purpose, the present disclosure can also provide a computer storage medium, on which a computer program is stored, and when the computer program is executed by a processor, it is used to realize the steps of the above-mentioned object detection method.

In order to achieve the above-mentioned technical purpose, the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and operable on the processor. When the processor executes the computer program, the steps of the above-mentioned target detection method are implemented.

The beneficial effects of the disclosure are:

Compared with traditional target detection systems and algorithm models, the present disclosure designs multiple detectors with different IoU thresholds and selects, for each detector, the candidate regions suited to that detector. This is more favorable to the training of each individual detector and can therefore improve detection performance.

Brief description of the drawings

Figure 1 shows a schematic structural diagram of Embodiment 1 of the present disclosure;

Figure 2 shows a schematic structural diagram of a preferred implementation of Embodiment 1 of the present disclosure;

Figure 3 shows a schematic diagram of the test phase of Embodiment 1 of the present disclosure;

Figure 4 shows a schematic flow diagram of Embodiment 2 of the present disclosure;

Figure 5 shows a schematic structural diagram of Embodiment 3 of the present disclosure;

Figure 6 shows a schematic structural diagram of Embodiment 5 of the present disclosure.

Detailed description

Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. It should be understood, however, that these descriptions are exemplary only, and are not intended to limit the scope of the present disclosure. Also, in the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concept of the present disclosure.

Various structural schematic diagrams according to embodiments of the present disclosure are shown in the accompanying drawings. The figures are not drawn to scale; certain details are exaggerated and others may be omitted for clarity of presentation. The shapes of the various regions and layers shown in the figures, as well as their relative sizes and positional relationships, are only exemplary and may deviate in practice due to manufacturing tolerances or technical limitations, and those skilled in the art can additionally design regions/layers with different shapes, sizes, and relative positions as needed.

Embodiment one:

As shown in Figure 1:

The present disclosure provides a target detection system, including:

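An input module, configured to receive input image data;

A feature extraction module, configured to perform feature extraction on the image data by means of a convolutional neural network to obtain a feature map;

A candidate region proposal module, configured to receive the feature map and output a coarse bounding-box position of a foreground region containing a target and a bounding-box position of a background region;

A candidate region extraction module, configured to use the bounding-box positions output by the candidate region proposal module to crop a candidate background region and a candidate foreground region from the feature map, and to adjust the regions to the same size to obtain candidate regions;

A detection module, configured to classify the obtained candidate regions and to further correct the bounding-box position of a foreground candidate region by using a bounding-box regression algorithm, so as to obtain the final position of the detected target.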

Further, the detection module specifically includes at least one detector, wherein each detector is preset with a corresponding IoU threshold for classifying candidate regions into positive samples and negative samples: a candidate region whose IoU with the ground-truth bounding box is greater than the IoU threshold is a positive sample, and a candidate region whose IoU with the ground-truth bounding box is smaller than the IoU threshold is a negative sample;

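The detection module is specifically used for:

Screening the candidate regions extracted by the candidate region extraction module, calculating the IoU between each candidate region and the ground-truth bounding box, finding the detector whose IoU threshold corresponds to that IoU, and inputting the candidate region to the corresponding detector.

Further, the detection module is also used for:

After a candidate region is input to a detector, classifying the candidate region and adjusting its position, recalculating the IoU between the adjusted candidate region and the ground-truth label, and inputting the adjusted candidate region to the detector corresponding to its IoU value range.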

The target detection system of the present disclosure is explained in detail below with reference to a preferred implementation of Embodiment 1:

As shown in Figure 2:

The detection module of this preferred embodiment has a total of three detectors, namely the first detector H1, the second detector H2 and the third detector H3;

The IoU threshold of the first detector H1 is preset to be 0.5;

The IoU threshold of the second detector H2 is preset to be 0.6;

The IoU threshold of the third detector H3 is preset to be 0.7.

During the detection process of the detection module, if the IoU between a candidate region and the ground-truth bounding box is between 0.5 and 0.6, the candidate region is input to the first detector H1, which outputs the adjusted candidate region B1 and the classification information C1;

If the IoU between the candidate region and the ground-truth bounding box is between 0.6 and 0.7, the candidate region is input to the second detector H2, which outputs the adjusted candidate region B2 and the classification information C2;

If the IoU between the candidate region and the ground-truth bounding box is higher than 0.7, the candidate region is input to the third detector H3, which outputs the candidate region B3 and the classification information C3;

At the same time, the candidate region B1 adjusted by the first detector H1 is screened again: if its IoU with the ground-truth bounding box is between 0.6 and 0.7, it is input to the second detector H2; if its IoU with the ground-truth bounding box is higher than 0.7, it is input to the third detector H3.

Likewise, the candidate region B2 adjusted by the second detector H2 is screened: if its IoU with the ground-truth bounding box is higher than 0.7, it is input to the third detector H3.
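The routing rule of this preferred implementation can be sketched as a small lookup. The names DETECTOR_RANGES, route and reroute are hypothetical. The text does not say which detector receives an IoU exactly equal to 0.6 or 0.7, nor what happens to candidates with an IoU below 0.5; this sketch assumes half-open ranges and returns None for candidates that match no range.

```python
# (detector name, lower bound inclusive, upper bound exclusive); bounds are an assumption.
DETECTOR_RANGES = [
    ("H1", 0.5, 0.6),
    ("H2", 0.6, 0.7),
    ("H3", 0.7, float("inf")),
]
DETECTOR_ORDER = [name for name, _, _ in DETECTOR_RANGES]


def route(iou_value):
    """Pick the detector whose IoU range contains `iou_value`, or None."""
    for name, low, high in DETECTOR_RANGES:
        if low <= iou_value < high:
            return name
    return None


def reroute(current_detector, adjusted_iou):
    """After a detector adjusts a box, forward it only to a later detector.

    Mirrors the text: boxes adjusted by H1 may move on to H2 or H3, and
    boxes adjusted by H2 may move on to H3.
    """
    target = route(adjusted_iou)
    if target is None:
        return None
    if DETECTOR_ORDER.index(target) > DETECTOR_ORDER.index(current_detector):
        return target
    return None


print(route(0.55), route(0.65), route(0.9))        # H1 H2 H3
print(reroute("H1", 0.65), reroute("H2", 0.65))    # H2 None
```

Because each detector sees only candidates within its own IoU range, the higher-threshold detectors receive additional samples as earlier detectors improve box quality.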

As shown in Figure 3, in the test phase, the image to be detected is input to the neural network. All candidate regions B0 obtained by the candidate region extraction module are input to detector H1 to obtain the adjusted candidate regions B1. All of B1 are input to detector H2 to obtain the candidate regions B2. B2 are input to H3 to obtain the detection regions B3 and the corresponding classification information C3. The non-maximum suppression (NMS) algorithm is then used to deduplicate B3 to obtain the final detection regions.
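The test-phase flow can be sketched as a chain of three detectors followed by greedy NMS. The detector interface used here (a callable that takes a box and returns an adjusted box, a category and a score), the function names, the class-agnostic NMS and the NMS threshold of 0.5 are assumptions for illustration; the disclosure only states that B0 passes through H1, H2 and H3 and that NMS deduplicates B3.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def cascade_inference(candidates, detectors, nms_threshold=0.5):
    """Pass all candidates B0 through each detector in turn, then apply NMS."""
    boxes = list(candidates)
    categories = [None] * len(boxes)
    scores = [0.0] * len(boxes)
    for detector in detectors:  # H1, then H2, then H3
        outputs = [detector(box) for box in boxes]
        boxes = [out[0] for out in outputs]
        categories = [out[1] for out in outputs]
        scores = [out[2] for out in outputs]
    keep = nms(boxes, scores, nms_threshold)
    return [(boxes[i], categories[i], scores[i]) for i in keep]


def stub_detector(box):
    """Stand-in for a trained detector: leaves the box unchanged, makes up a score."""
    return box, "car", 1.0 / (1.0 + box[0])


b0 = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
for detection in cascade_inference(b0, [stub_detector] * 3):
    print(detection)
```

Note that, unlike the IoU-based routing used with ground-truth boxes, the test phase has no ground truth available, so every candidate passes through all three detectors in order.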

Embodiment two:

As shown in Figure 4,

The present disclosure can also provide a target detection method, which is applied to the target detection system according to Embodiment 1, and the method includes:

S201: Collecting image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

S202: Inputting the image data to the target detection system to obtain the detection result of each detector;

S203: Comparing the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

Embodiment three:

As shown in Figure 5,

The present disclosure can also provide a target detection device, including:

An image data collection module 301, configured to collect image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

A target detection module 302, configured to input the image data to the target detection system to obtain the detection result of each detector;

A loss calculation module 303, configured to compare the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

Wherein, the image data collection module 301 described in this disclosure is connected in sequence with the target detection module 302 and the loss calculation module 303.

Embodiment four:

The present disclosure can also provide a computer storage medium, on which a computer program is stored, and when the computer program is executed by a processor, it is used to realize the steps of the above object detection method.

The computer storage medium of the present disclosure may be implemented using semiconductor memory, magnetic core memory, magnetic drum memory, or magnetic disk memory.

Semiconductor memory, mainly used in computers, has two main types of semiconductor memory elements: MOS and bipolar. MOS components are highly integrated and simple to manufacture, but slow. Bipolar components are complex to manufacture, consume more power and have low integration, but are fast. After the advent of NMOS and CMOS, MOS memory began to play a major role in semiconductor memory. NMOS is fast; for example, the access time of Intel's 1K-bit SRAM is 45 ns. CMOS consumes less power; the access time of a 4K-bit CMOS static memory is 300 ns. The above-mentioned semiconductor memories are all random access memories (RAM), that is, new content can be read and written randomly during operation. Semiconductor read-only memory (ROM) can be read randomly but cannot be written during operation, and is used to store fixed programs and data. ROM is divided into two types: non-rewritable fuse-type read-only memory (PROM) and rewritable read-only memory (EPROM).

Magnetic core memory has the characteristics of low cost and high reliability, and has more than 20 years of actual use experience. Before the mid-1970s, magnetic core memory was widely used as the main memory. Its storage capacity can reach more than 10 bits, and the fastest access time is 300ns. The typical magnetic core memory capacity in the world is 4MS ~ 8MB, and the access cycle is 1.0 ~ 1.5μs. After the rapid development of semiconductor storage replaced the magnetic core memory as the main memory, the magnetic core memory can still be used as a large-capacity expansion memory.

Drum memory is a magnetically recorded external memory. Because its capacity is small, it is gradually being replaced by disk storage, but thanks to its fast access speed and stable, reliable operation it is still used as external memory for real-time process control computers and for medium and large computers. To meet the needs of small computers and microcomputers, ultra-small magnetic drums have appeared, which are small in size, light in weight, highly reliable, and easy to use.

Disk storage is a type of magnetically recorded external storage. It combines the advantages of magnetic drum and magnetic tape storage: its storage capacity is larger than that of a magnetic drum, its access speed is faster than that of magnetic tape storage, and it can be stored offline. Therefore, disks are widely used as large-capacity external memory in various computer systems. Disks are generally divided into two categories: hard disks and floppy disks.

There are many types of hard disk storage. Structurally, hard disks can be divided into interchangeable and fixed types: the platters of an interchangeable disk can be exchanged, while the platters of a fixed disk cannot. Both interchangeable and fixed disks come in multi-platter and single-platter structures, and both can be further divided into fixed-head and moving-head types. A fixed-head disk has a small capacity, a low recording density and a high access speed, but a high cost. A moving-head disk has a high recording density (up to 1000-6250 bits/inch) and therefore a large capacity, but its access speed is lower than that of a fixed-head disk. The storage capacity of disk products can reach hundreds of megabytes, with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. Among them, multi-platter interchangeable disk storage has a large offline capacity because the disk pack can be replaced; it offers large capacity and high speed, can store large volumes of information, and is widely used in online information retrieval systems and database management systems.

Embodiment five:

The present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and operable on the processor. When the processor executes the computer program, the steps of the above target detection method are realized.

Figure 6 is a schematic diagram of the internal structure of an electronic device in one embodiment. As shown in Figure 6, the electronic device includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database may store control information sequences, and when the computer-readable instructions are executed by the processor, they may cause the processor to implement a target detection method. The processor of the electronic device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may also be stored in the memory of the computer device, and when executed by the processor, they may cause the processor to execute a target detection method. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art can understand that the structure shown in Figure 6 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.

The electronic devices include, but are not limited to, smart phones, computers, tablet computers, wearable smart devices, artificial intelligence devices, power banks, etc.

In some embodiments, the processor can be composed of integrated circuits; for example, it can be composed of a single packaged integrated circuit, or of multiple integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control core (Control Unit) of the electronic device. It connects the components of the entire electronic device through various interfaces and lines, and executes the functions of the electronic device and processes data by running or executing programs or modules stored in the memory (such as executing remote data read and write programs, etc.) and calling the data stored in the memory.

The bus may be a peripheral component interconnect standard (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement communication between the memory and at least one processor.

Figure 6 only shows an electronic device with components, and those skilled in the art can understand that the structure shown in Figure 6 does not constitute a limitation on the electronic device, which may include fewer or more components than shown in the figure, combine certain components, or have a different arrangement of components.

For example, although not shown, the electronic device may also include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor through a power management device, thereby implementing functions such as charge management, discharge management, and power consumption management. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators and any other components. The electronic device may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described here.

Further, the electronic device may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are usually used to establish a communication link between the electronic device and other electronic devices.

Optionally, the electronic device may further include a user interface. The user interface may be a display (Display) or an input unit (such as a keyboard (Keyboard)). Optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. Wherein, the display may also be properly referred to as a display screen or a display unit, and is used for displaying information processed in the electronic device and for displaying a visualized user interface.

Further, the computer-usable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, etc.; and the data storage area may store data created through use, etc.

In the several embodiments provided by the present invention, it should be understood that the disclosed devices, apparatuses and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative; the division of the modules is only a logical function division, and there may be other division methods in actual implementation.

The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional module in each embodiment of the present invention may be integrated into one processing unit, or each unit may physically exist separately, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the present disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of the present disclosure, and these substitutions and modifications should all fall within the scope of the present disclosure.

Claims

1. A target detection system, characterized in that it comprises:

An input module, configured to receive input image data;

A feature extraction module, configured to perform feature extraction on the image data by means of a convolutional neural network to obtain a feature map;

A candidate region proposal module, configured to receive the feature map and output a coarse bounding-box position of a foreground region containing a target and a bounding-box position of a background region;

A candidate region extraction module, configured to use the bounding-box positions output by the candidate region proposal module to crop a candidate background region and a candidate foreground region from the feature map, and to adjust the regions to the same size to obtain candidate regions;

A detection module, configured to classify the obtained candidate regions and to further correct the bounding-box position of a foreground candidate region by using a bounding-box regression algorithm, so as to obtain the final position of the detected target.

2. The system according to claim 1, wherein the detection module specifically includes at least one detector, wherein each detector is preset with a corresponding intersection-over-union (IoU) threshold for classifying candidate regions into positive samples and negative samples: a candidate region whose IoU with the ground-truth bounding box is greater than the IoU threshold is a positive sample, and a candidate region whose IoU with the ground-truth bounding box is smaller than the IoU threshold is a negative sample;

The detection module is specifically used for:

Screening the candidate regions extracted by the candidate region extraction module, calculating the IoU between each candidate region and the ground-truth bounding box, finding the detector whose IoU threshold corresponds to that IoU, and inputting the candidate region to the corresponding detector.

3. The system according to claim 2, wherein the detection module is also used for:

After a candidate region is input to a detector, classifying the candidate region and adjusting its position, recalculating the IoU between the adjusted candidate region and the ground-truth label, and inputting the adjusted candidate region to the detector corresponding to its IoU value range.

4. The system according to any one of claims 2 or 3, wherein the number of said detectors is three, being respectively a first detector, a second detector and a third detector;

The IoU threshold of the first detector is preset to be 0.45-0.55;

The IoU threshold of the second detector is preset to be 0.56-0.65;

The IoU threshold of the third detector is preset to be 0.66-0.75.

5. A target detection method applied to the system according to any one of claims 1 to 4, characterized in that the method comprises:

Collecting image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

Inputting the image data to the target detection system to obtain the detection result of each detector;

Comparing the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

6. The method according to claim 5, characterized in that, after the step of comparing the detection result with the ground-truth label by using a loss function to obtain the loss of each detector, the method also includes:

Summing the losses of all the detectors to obtain the overall loss of the target detection system.

7. The method according to any one of claims 5 or 6, characterized in that, when the system is used for target classification, the loss function is a cross-entropy loss function;

When the system is used for position regression analysis, the loss function is a Smooth L1 loss function or a GIoU loss function.

8. A target detection device, characterized in that it comprises:

An image data collection module, configured to collect image data and target labels corresponding to the image data, wherein the target labels include the object categories and bounding-box positions in the image;

A target detection module, configured to input the image data to the target detection system to obtain the detection result of each detector;

A loss calculation module, configured to compare the detection result with the ground-truth label by using a loss function to obtain the loss of each detector.

9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein when the processor executes the computer program, the steps of the target detection method according to any one of claims 5 to 7 are implemented.

10. A computer storage medium, on which computer program instructions are stored, wherein the program instructions, when executed by a processor, are used to implement the steps of the target detection method according to any one of claims 5 to 7.