CN116051879A - Object detection method, device, apparatus, storage medium, and program product


Info

Publication number
CN116051879A
Authority
CN
China
Prior art keywords
class
target
class center
image
features
Prior art date
Legal status
Pending
Application number
CN202111258132.0A
Other languages
Chinese (zh)
Inventor
张华
肖立强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Institute of Information Engineering of CAS
Original Assignee
Tencent Technology Shenzhen Co Ltd
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Institute of Information Engineering of CAS filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111258132.0A
Publication of CN116051879A


Abstract

The application discloses a target detection method, apparatus, device, storage medium, and program product. When target detection is performed on an image to be detected that was acquired under a target environmental condition, feature extraction is performed on the image to obtain proposed features, and a target class center associated with the proposed features is determined from a class center dataset according to the similarity between the proposed features and the class centers in the dataset. The class center dataset contains class centers of different classes, where a class center is a representative feature of the objects of one class. The proposed features are enhanced with the target class center to obtain enhanced features, and the target image features of the image to be detected are determined from the enhanced features, yielding richer, more comprehensive features that can represent objects of a given class and are more discriminative. Target detection is then performed on the image to be detected according to the target image features, so the detection result is more accurate and the accuracy of target detection is improved.

Description

Object detection method, device, apparatus, storage medium, and program product
Technical Field
The present application relates to the field of image processing, and in particular, to a target detection method, apparatus, device, storage medium, and program product.
Background
With the development of computer technology, target detection is needed in more and more scenarios, such as autonomous driving, intelligent transportation, and security systems.
In these scenarios, images can be acquired through sensors such as cameras, infrared sensors, thermal sensors, and lasers; features are then extracted from the acquired images for a classifier to classify targets, so as to find all targets of interest in the image and determine their categories (such as people, cars, and trees) and positions.
However, under severe environmental conditions such as haze, rain, or night, the effectiveness of the image data acquired by such equipment decreases, and it is difficult to obtain enough features for the classifier to classify targets, so false detections and missed detections may occur and the accuracy of target detection is low.
Disclosure of Invention
To solve this technical problem, the application provides a target detection method, apparatus, device, storage medium, and program product that yield a more accurate detection result and improve the accuracy of target detection.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a target detection method, including:
acquiring an image to be detected acquired under a target environmental condition;
extracting features of the image to be detected to obtain proposed features;
determining a target class center associated with the proposed feature from a class center dataset according to the similarity of the proposed feature and the class center in the class center dataset, wherein the class center dataset comprises class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
enhancing the proposed features through the target class center to obtain enhanced features;
determining target image characteristics corresponding to the image to be detected according to the enhancement characteristics;
and carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
On the other hand, an embodiment of the present application provides a target detection apparatus, which includes an acquisition unit, an extraction unit, a determination unit, an enhancement unit, and a detection unit:
the acquisition unit is used for acquiring an image to be detected acquired under the target environmental condition;
the extraction unit is used for extracting features of the image to be detected to obtain proposed features;
the determining unit is configured to determine, from a class center dataset, a target class center associated with the proposed feature according to a similarity between the proposed feature and a class center in the class center dataset, where the class center dataset includes class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
the enhancement unit is used for enhancing the proposed features through the target class center to obtain enhanced features;
the determining unit is further used for determining target image features corresponding to the image to be detected according to the enhancement features;
and the detection unit is used for carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
In another aspect, an embodiment of the present application provides an apparatus for target detection, the apparatus including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the aforementioned object detection method according to instructions in the program code.
In another aspect, embodiments of the present application provide a computer-readable storage medium for storing program code for performing the aforementioned object detection method.
In another aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the aforementioned object detection method.
According to the technical scheme, class centers of different classes are collected in advance; they are representative features of objects of different classes and can reflect the prototype of a class. When target detection is performed on an image to be detected acquired under a target environmental condition, feature extraction is performed on the image to obtain proposed features, and a target class center associated with the proposed features is determined from the class center dataset according to the similarity between the proposed features and the class centers in the dataset. Because the class center dataset contains class centers of different classes, each being a representative feature of one class of objects, the proposed features can be enhanced with the target class center to obtain enhanced features, and the target image features of the image to be detected can be determined from the enhanced features, yielding richer, more comprehensive features that represent objects of a given class. In particular, when the target environmental condition corrupts the image to be detected so that the extracted proposed features are insufficient and lack discriminability, enhancing the proposed features with the target class center assists the detection process, so the final target image features are richer and more discriminative. Target detection performed on the image to be detected according to these target image features therefore produces a more accurate detection result and improves the accuracy of target detection.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic system architecture diagram of a target detection method according to an embodiment of the present application;
fig. 2 is a flowchart of a target detection method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a process for performing object detection based on an object detection model according to an embodiment of the present application;
FIG. 4 is a flowchart of a training method of a target detection model according to an embodiment of the present application;
FIG. 5 is a diagram of an exemplary process for updating class centers of classes to which objects belong in a class center module (memory feature library) according to an embodiment of the present application;
FIG. 6 is a block diagram of an object detection device according to an embodiment of the present application;
fig. 7 is a block diagram of a terminal device according to an embodiment of the present application;
Fig. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Under extreme conditions such as haze, rain, or night, the extracted features of an object (e.g., proposed features) may lack discriminability; for example, a car in thick fog is easily missed or misidentified as another vehicle such as a bus.
To cope with poor visibility conditions such as haze, rain, and night, and to solve the technical problem of low target detection accuracy under such severe environmental conditions, an embodiment of the present application provides a target detection method.
Objects of the same class (e.g., people) are necessarily similar in some aspects, where different aspects correspond to different feature spaces such as size and shape. Objects are correctly classified by integrating features from multiple feature spaces; for example, vehicles of different classes, such as cars and trucks, are distinguished by their shape and size. Therefore, in the embodiments of the present application, class centers of different classes, which are representative features of the objects of those classes and can reflect the prototype of a class, can be collected in advance so that the recognition process is enhanced through the class centers. For example, feature extraction is performed on the image to be detected, acquired under the target environmental condition, to obtain proposed features; the proposed features are then enhanced through the target class center, which improves the discriminative representation of an image corrupted by the target environmental condition, makes the final target image features richer and more discriminative, and further improves the accuracy of target detection.
It should be noted that the method provided in the embodiments of the present application can be applied to various scenes, especially scenes in which the acquired image may be affected by environmental conditions, such as autonomous driving, intelligent transportation, and security systems; the embodiments of the present application do not limit the scene.
Referring to fig. 1, fig. 1 is a schematic system architecture diagram of a target detection method according to an embodiment of the present application. The system architecture includes an image acquisition device 101 and an image processing device 102. The image acquisition device 101 may be a camera, an infrared sensor, a thermal sensor, a laser sensor, or the like; fig. 1 takes a camera as the example. The image processing device 102 may be a terminal device or a server; fig. 1 takes a terminal device as the example. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted terminal, and the like. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited here.
Taking the security system scenario as an example: to improve security, image acquisition devices 101 are typically installed at various locations. For example, an image acquisition device 101 installed at an intersection can capture the scene near the intersection (see fig. 1, where the scene near the intersection includes people, cars, etc.) to obtain an image to be detected. The image to be detected is thus acquired by the image acquisition device under the target environmental condition and contains objects of various classes, such as people and cars. It is understood that the categories here mainly include people, cars, buses, bicycles, motorcycles, traffic lights, and the like.
Since the image acquisition device 101 captures an outdoor scene, the captured image to be detected may be affected by environmental conditions. If the image to be detected is captured under a target environmental condition (for example, rain), the image acquisition device 101 can send it to the image processing device 102 for target detection.
During object detection, the image processing device 102 performs feature extraction on the image to be detected to obtain proposed features. Because the image may be corrupted by the target environmental condition, the resulting proposed features may lack discriminability, so the target class center associated with the proposed features is determined from the class center dataset based on the similarity between the proposed features and the class centers in the dataset. The class center dataset contains class centers of different classes, each a representative feature of one class of objects; the proposed features can thus be enhanced with the target class center to obtain enhanced features, and the target image features of the image to be detected can be determined from the enhanced features, yielding richer, more comprehensive features that represent objects of a given class. Target detection performed on the image to be detected according to these target image features therefore produces a more accurate detection result and improves the accuracy of target detection.
It should be noted that, in some cases, the method provided in the embodiments of the present application may also be performed jointly by the terminal device and the server, that is, the image processing device 102 includes the terminal device and the server.
Next, an object detection method provided in an embodiment of the present application will be described with reference to the accompanying drawings. Referring to fig. 2, the method includes:
s201, acquiring an image to be detected acquired under a target environment condition.
In a real scene, the image acquisition device acquires images under various environmental conditions, some of which are severe; images acquired under severe conditions may be degraded. The method provided by the embodiment of the application performs target detection on images acquired under such conditions and obtains more accurate detection results. Taking one environmental condition as the target environmental condition, the image acquired under it for target detection is referred to as the image to be detected.
S202, extracting features of the image to be detected to obtain proposed features.
It should be noted that, in the embodiment of the present application, after the image to be detected is obtained, it can be input into a target detection model, which performs target detection on the image to obtain the detection result.
The object detection model comprises a feature extraction module, a class center module, a feature fusion module, and a recognition module, where the class center module is composed of the class centers in the class center dataset. Target detection with this model proceeds in two stages. The first stage mainly performs feature extraction: the feature extraction module extracts features from the image to be detected to obtain the proposed features. The second stage performs recognition and classification based on the extracted proposed features, mainly using the class center module, the feature fusion module, and the recognition module.
First, the first stage is described. Referring to fig. 3, the feature extraction module used in the first stage mainly includes a Region Proposal Network (RPN) and a region-of-interest alignment module such as ROIAlign (Region Of Interest Align) or ROIPooling (Region Of Interest Pooling). The possible positions of objects in the image to be detected are first obtained through the RPN, yielding region proposal boxes. Non-maximum suppression is then applied to these region proposal boxes to remove duplicated boxes for the same object, and the remaining boxes are passed through ROIPooling or ROIAlign to obtain the corresponding proposed features (e.g., 301 in fig. 3), which are passed to the second, recognition stage for the final classification and regression operations that produce the final class and location.
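For illustration, the following Python sketch outlines this first stage with torchvision operators: non-maximum suppression over the proposal boxes, then ROIAlign to pool each surviving box into a fixed-size proposal feature. All shapes, thresholds, and names here are assumptions for the example, not values from the patent.

```python
# First-stage sketch: NMS to drop duplicate proposals, then ROIAlign to
# produce per-proposal features. Assumes a single image in the batch.
import torch
from torchvision.ops import nms, roi_align

def first_stage(feature_map, boxes, scores, iou_threshold=0.7, output_size=(7, 7)):
    keep = nms(boxes, scores, iou_threshold)      # suppress duplicate boxes
    kept = boxes[keep]
    # Pool each kept box into a fixed-size region of the feature map.
    feats = roi_align(feature_map, [kept], output_size, spatial_scale=1.0)
    return feats.flatten(start_dim=1)             # per-proposal feature vectors P

# Random example inputs: one 256-channel feature map and 100 proposal boxes.
fmap = torch.randn(1, 256, 50, 50)
boxes = torch.rand(100, 4) * 20
boxes[:, 2:] += boxes[:, :2] + 1.0                # ensure x2 > x1 and y2 > y1
scores = torch.rand(100)
P = first_stage(fmap, boxes, scores)
print(P.shape)                                    # (num_kept, 256 * 7 * 7)
```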
Accordingly, the method provided in the embodiments of the present application may involve Artificial Intelligence (AI), which uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence; it studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason, and make decisions.
The method provided by the embodiments of the present application also involves Computer Vision (CV) technology in artificial intelligence. Computer vision is the science of making machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and further processing the images so that they become more suitable for human observation or for transmission to instruments for detection. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition. Here, the proposed features are obtained by feature extraction from the image to be detected, for example through semantic understanding of the image.
S203, determining a target class center associated with the proposed feature from the class center data set according to the similarity between the proposed feature and the class center in the class center data set.
In the embodiment of the application, a class center data set may be pre-constructed, where the class center data set includes class centers of different classes, and the class centers are representative features corresponding to objects of different classes. The class center data sets herein may constitute a class center module, which may also be referred to as a memory feature library (Memory Feature Bank).
The class centers in the class center dataset can be collected during training of the target detection model: throughout training, representative features of objects of each class are continuously collected as class centers of that class. The class centers not only participate in the training process but are also saved to assist target detection.
Because the image to be detected may be corrupted by the target environmental condition and the resulting proposed features may lack discriminability, a target class center associated with the proposed features can be selected from the class center dataset based on the similarity between the proposed features and the class centers in the dataset, as shown by dashed box 302 in fig. 3. For example, the similarity between a proposed feature and each class center in the dataset is calculated; the higher the similarity, the more likely the object represented by that class center and the object corresponding to the proposed feature belong to the same class, so class centers whose similarity exceeds a threshold can be used as target class centers. Because the class center dataset contains class centers of different classes, each being a representative feature of one class of objects, the determined target class centers are likely to be representative features of the object corresponding to the proposed feature; the proposed feature can therefore later be enhanced by the target class centers to obtain enhanced features that are richer, more comprehensive, and representative of a class of objects.
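As a concrete illustration of this selection step, the following Python sketch keeps every class center whose similarity to a proposed feature exceeds a threshold; the use of cosine similarity and the 0.6 threshold value are assumptions for the example (the training discussion below mentions a 0.6 similarity threshold in a related context).

```python
# A minimal numpy sketch: select target class centers whose similarity to
# the proposed feature is above a threshold. Cosine similarity and the
# threshold value are illustrative assumptions.
import numpy as np

def select_target_class_centers(proposal, centers, threshold=0.6):
    # proposal: (d,) proposed feature; centers: (m, d) class center dataset.
    sims = centers @ proposal / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(proposal) + 1e-8)
    return centers[sims > threshold], sims

centers = np.random.randn(50, 1024).astype(np.float32)   # m = 50 class centers
proposal = np.random.randn(1024).astype(np.float32)      # one proposed feature
targets, sims = select_target_class_centers(proposal, centers)
print(targets.shape)   # (number of centers above the threshold, 1024)
```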
In the case of performing object detection using the object detection model described above, the object class center associated with the proposed feature may be determined by the class center module in the object detection model in this embodiment.
In one possible implementation, for the same object, different environmental conditions may yield different representative features, so different class center datasets corresponding to different environmental conditions can be constructed in this embodiment. S203 can then be implemented by acquiring the class center dataset corresponding to the target environmental condition and determining the target class center associated with the proposed features from that dataset.
For example, class center datasets corresponding to haze, rain, and night are constructed respectively. If the image to be detected is acquired in rain, then when determining the target class center, the similarity between the proposed features and the class centers in the class center dataset corresponding to rain is calculated, and the target class center is determined from the class center dataset corresponding to rain.
Compared with prior approaches that enhance the feature representation of an object using a feature domain kept invariant across environmental conditions, this method accounts for the intrinsic properties of each environmental condition; target class centers collected under the same environmental condition provide better guidance information, which further improves the accuracy of subsequent target detection.
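As an illustration of this per-condition organization, the following Python sketch keeps one class center bank per environmental condition and looks up the bank matching the acquisition condition; the condition names, bank sizes, and random contents are hypothetical.

```python
# Hypothetical per-condition class center banks (m = 50 centers, d = 1024),
# randomly initialized here purely for illustration.
import numpy as np

class_center_banks = {
    "haze":  np.random.randn(50, 1024).astype(np.float32),
    "rain":  np.random.randn(50, 1024).astype(np.float32),
    "night": np.random.randn(50, 1024).astype(np.float32),
}

condition = "rain"                       # condition the image was acquired under
bank = class_center_banks[condition]     # only this bank is searched in S203
print(bank.shape)                        # (50, 1024)
```

The selection sketch shown after S203 above would then be run against this condition-specific bank rather than a single global class center dataset.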
S204, enhancing the proposed features through the target class center to obtain enhanced features.
In general, class centers of the same class (target class centers) provide better guidance information. The feature fusion module of this embodiment can be an attention-based feature fusion module that enhances the proposed features using the common dot-product attention formula. In one possible implementation, a similarity measure matrix is computed from the target class centers and the proposed features, where each element identifies the weight of a target class center for a proposed feature; the target class centers and the proposed features are then weighted and summed according to this matrix to obtain the enhanced features corresponding to the proposed features. The similarity measure matrix is computed as follows:
W = softmax(P·Cᵀ / √d)

where P is the matrix of proposed features with dimension n × d, n being the number of features in P (the number of objects in the image to be detected) and d being the feature dimension, typically 1024; C is the matrix of target class centers and Cᵀ is its transpose; W is the similarity measure matrix with dimension n × m, identifying the weight of each target class center for the proposed features P, where m is the number of class centers in the class center module; and softmax is the logistic regression (softmax) function. Next, an enhanced feature is constructed for the proposed feature of each object by weighted summation; to avoid losing the original features of the object, the enhancement is applied on top of the original (proposed) features, as shown below:

E = W·C + P

where E is the enhanced feature, W is the similarity measure matrix, C is the target class center, and P is the proposed feature obtained by feature extraction.
It should be noted that, when object detection is performed using the above object detection model, the target class center and the proposed features are fused by the feature fusion module of the object detection model to obtain the enhanced features, as shown in dashed box 303 in fig. 3. Dashed box 303 contains the whole feature fusion process, which includes three fully connected layers (FC), two matrix multiplication operations, and one element-wise addition. As fig. 3 shows, attention-based feature enhancement is realized through these three fully connected layers, two matrix multiplications, and one element-wise addition, producing the enhanced features (e.g., 304 in fig. 3).
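The following Python sketch implements the two formulas above, W = softmax(P·Cᵀ/√d) and E = W·C + P; for brevity it omits the three fully connected projection layers of the actual module in fig. 3, and all shapes are illustrative.

```python
# Dot-product attention enhancement: W = softmax(P @ C.T / sqrt(d)),
# E = W @ C + P (the residual term keeps the original proposed features).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enhance(P, C):
    d = P.shape[1]
    W = softmax(P @ C.T / np.sqrt(d))         # (n, m) similarity measure matrix
    return W @ C + P                          # (n, d) enhanced features

n, m, d = 8, 50, 1024
P = np.random.randn(n, d).astype(np.float32)  # proposed features
C = np.random.randn(m, d).astype(np.float32)  # target class centers
E = enhance(P, C)
print(E.shape)                                # (8, 1024)
```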
S205, determining the target image characteristics corresponding to the image to be detected according to the enhancement characteristics.
It should be noted that, when the object detection model described above is used, in this embodiment the feature fusion module may determine the target image features corresponding to the image to be detected according to the enhanced features.
In this embodiment of the present application, the target image features can be determined from the enhanced features in several ways. One way is to use the enhanced features directly as the target image features; another is to concatenate the enhanced features with the proposed features to obtain the target image features, as shown in dashed box 305 in fig. 3, where a concatenation (concat) operation joins the enhanced features and the proposed features to produce the target image features (e.g., 306 in fig. 3), avoiding the loss of original information in the image to be detected. If the dimension of the proposed features is n × d, the dimension of the resulting target image features is n × 2d.
S206, performing target detection on the image to be detected according to the target image characteristics to obtain a detection result.
When the target detection model is used, in this embodiment the recognition module performs target detection on the image to be detected according to the target image features to obtain the detection result. The detection result may include the category of each object in the image to be detected and the object's position in the image.
Referring to fig. 3, CLS denotes category recognition, i.e., recognizing the category of each object in the image to be detected, and REG denotes location regression, i.e., determining the location of each object of each category in the image, identified by a rectangular box in fig. 3.
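A short PyTorch sketch of this second stage: the enhanced and proposed features are concatenated into [n × 2d] target image features (the concat of dashed box 305) and fed to a classification head (CLS) and a box-regression head (REG). The class count, the class-agnostic 4-value regression, and the layer shapes are assumptions for illustration, not the patent's configuration.

```python
# Second-stage heads on the concatenated target image features.
import torch
import torch.nn as nn

n, d, num_classes = 8, 1024, 9
E = torch.randn(n, d)                        # enhanced features
P = torch.randn(n, d)                        # proposed features
target_feats = torch.cat([E, P], dim=1)      # (n, 2d) target image features

cls_head = nn.Linear(2 * d, num_classes + 1) # CLS: categories plus background
reg_head = nn.Linear(2 * d, 4)               # REG: class-agnostic box offsets

class_logits = cls_head(target_feats)        # category of each object
box_deltas = reg_head(target_feats)          # location of each object
print(class_logits.shape, box_deltas.shape)  # (8, 10) (8, 4)
```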
According to the technical scheme, class centers of different classes are collected in advance; they are representative features of objects of different classes and can reflect the prototype of a class. When target detection is performed on an image to be detected acquired under a target environmental condition, feature extraction is performed on the image to obtain proposed features, and a target class center associated with the proposed features is determined from the class center dataset according to the similarity between the proposed features and the class centers in the dataset. Because the class center dataset contains class centers of different classes, each being a representative feature of one class of objects, the proposed features can be enhanced with the target class center to obtain enhanced features, and the target image features of the image to be detected can be determined from the enhanced features, yielding richer, more comprehensive features that represent objects of a given class. In particular, when the target environmental condition corrupts the image to be detected so that the extracted proposed features are insufficient and lack discriminability, enhancing the proposed features with the target class center assists the detection process, so the final target image features are richer and more discriminative. Target detection performed on the image to be detected according to these target image features therefore produces a more accurate detection result and improves the accuracy of target detection.
In one possible implementation, to improve target detection accuracy under severe environmental conditions, an image restoration model could be pre-trained; after an image is acquired, it would be restored by that model to improve the quality of the image fed to the target detection model and thereby improve detection accuracy. Compared with such restoration-based schemes, the target detection method provided by the embodiment of the application needs no additional model, reduces the amount of data processing, and is more suitable for real scenes.
Tables 1-4 show the accuracy of target detection for objects of different classes using various methods under different environmental conditions. For example, table 1 reports, for each class (car, bus, person, bicycle, motorcycle, etc.), the accuracy of the domain-adaptive method DAFaster, Strong-Weak Distribution Alignment for Adaptive Object Detection (SWDA), PBDA, Mask R-CNN (a mask-based convolutional neural network), and Mask R-CNN+ (Mask R-CNN as the baseline with the class center module and feature fusion module described above inserted), together with the mean Average Precision (mAP) over all classes for each method. As the values in table 1 show, the method combined with this scheme obtains higher values, i.e., higher target detection accuracy. Similarly, tables 2, 3, and 4 give the per-class accuracy and the mAP of each method under rain, fog, and night conditions respectively; although the methods and classes differ from those in table 1, the methods combined with this scheme again obtain larger values, i.e., higher target detection accuracy.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
TABLE 2
[Table 2 is reproduced as an image in the original publication.]
TABLE 3
[Table 3 is reproduced as an image in the original publication.]
TABLE 4
[Table 4 is reproduced as an image in the original publication.]
Table 5 shows the accuracy of target detection for targets of different sizes using various methods under different environmental conditions, together with the recognition rate over 100 targets (the proportion of the 100 targets that can be recognized).
TABLE 5
[Table 5 is reproduced as an image in the original publication.]
As can be seen from table 5, the method combined with this scheme obtains larger values, i.e., the target detection accuracy and recognition rate of this scheme under severe environmental conditions are higher.
The results in tables 1-5 were obtained through experiments with the different methods; as they show, the target detection method provided by the embodiments of the present application reaches state-of-the-art performance for the unrestricted object detection task and has better robustness.
Next, a training method of the object detection model will be described, referring to fig. 4, the method further includes:
s401, acquiring a training sample image under the sample environment condition, wherein the training sample image is provided with a category label of an object included in the training sample image.
S402, inputting the training sample image into an initial detection model.
The initial detection model comprises an initial feature extraction module, an initial class center module, an initial feature fusion module and an initial recognition module, and the training sample image is provided with class labels of objects included in the training sample image.
S403, carrying out feature extraction on the training sample image through the initial feature extraction module to obtain sample proposal features.
S404, determining a target sample class center associated with the sample proposal feature from the initial class center module according to the similarity between the sample proposal feature and the class center in the class center dataset through the initial class center module.
S405, enhancing the sample proposal features by using the target sample class center through the initial feature fusion module to obtain sample enhancement features.
S406, determining target sample image features corresponding to the training sample images according to the sample enhancement features through the initial feature fusion module.
S407, performing target detection on the training sample image according to the target sample image features through the initial recognition module to obtain a sample detection result.
S408, training the initial detection model based on the category labels and the sample detection results to obtain the target detection model.
It should be noted that, in training the target detection model, S401 to S407 are similar to the target detection process of fig. 2 and are not described in detail here.
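For orientation, the following compressed PyTorch sketch shows what one such training step could look like, with a learnable class center bank, a stand-in fusion layer, and cross-entropy supervision of the detection result; every module, shape, and optimizer setting here is an assumption for illustration, not the patent's training recipe.

```python
# One illustrative training step for an initial detection model whose
# class centers are learned jointly with the rest of the network.
import torch
import torch.nn as nn

d, m, num_classes = 1024, 50, 9

class InitialDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(m, d))  # initial class center module
        self.fuse = nn.Linear(d, d)                     # stand-in feature fusion layer
        self.cls = nn.Linear(2 * d, num_classes + 1)    # initial recognition head

    def forward(self, proposals):                       # proposals: (n, d) sample proposal features
        W = torch.softmax(proposals @ self.centers.T / d ** 0.5, dim=-1)
        E = W @ self.centers + proposals                # sample enhancement features
        feats = torch.cat([self.fuse(E), proposals], dim=1)  # target sample image features
        return self.cls(feats)                          # sample detection result (class scores)

model = InitialDetector()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

proposals = torch.randn(8, d)                     # extracted from a training sample image
labels = torch.randint(0, num_classes + 1, (8,))  # category labels of the objects
loss = nn.functional.cross_entropy(model(proposals), labels)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```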
Accordingly, the method provided in the embodiments of the present application further relates to Machine Learning (ML), a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and more. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. In the embodiments of the present application, the target detection model is trained by machine learning.
A class center module can also be constructed while training the target detection model. One way to construct it is as follows: during training, feature extraction is performed on a training sample image by the feature extraction module to obtain the sample proposal features of the objects identified by the category labels, where the category labels identify the categories of the objects included in the training sample image; the class center of the category to which each object belongs is then updated in the class center module according to the object's sample proposal features.
In the embodiment of the application, the class center module is introduced and constructed in the process of training the target detection model, so that the discriminant representation of the damaged object is improved through the representative features in the class center module, and the method is beneficial to learning the robust target detection model.
It should be noted that one way of updating the class center of the class to which an object belongs is to directly use the object's sample proposal feature as a class center of that class in the class center module. However, class centers occupy storage space; the more class centers the module holds, the more storage is used, which wastes storage resources to some extent and may affect detection performance. Therefore, in one possible implementation, the update depends on the number of class centers already stored for the object's class. Normally, if that number is zero (i.e., the class center set Ck for the class is empty), the sample proposal feature of the object is taken as a class center of the class; if it is not zero (Ck is not empty), the similarity between the sample proposal feature and the class centers of the class is calculated, and the class centers of the class are updated in the class center module according to this similarity. The similarity between the sample proposal feature and a class center may be, for example, cosine similarity.
In some possible implementations, if the similarity is less than a similarity threshold (e.g., a threshold of 0.6), the sample proposal feature of the object is added directly as a class center of the class to which the object belongs. The similarity threshold can be set according to actual requirements; this embodiment does not limit its specific value.
To avoid an excessive number of class centers caused by such additions, in some possible implementations the class centers can be updated according to the similarity as follows: if the similarity is below the similarity threshold, the relation between the number of class centers of the object's class and a number threshold is determined, and the class centers of the class are then updated in the class center module according to that relation.
In general, if the relation indicates that the number of class centers of the class has reached the number threshold (which may be set to 50, for example), the class centers of that class in the class center module are full; in that case the stored class center with the highest similarity can be replaced by the proposed feature. The number threshold can be set according to actual requirements; this embodiment does not limit its specific value.
In this way, the number of class centers is kept as small as possible while still capturing representative features of the objects, reducing the occupied storage space and preserving target detection performance.
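A plain Python sketch of this update policy, using the cosine similarity, the 0.6 similarity threshold, and the 50-center cap mentioned above: add the feature when the class has no centers; discard it when it is already similar to a stored center; otherwise add it, or, when the bank is full, replace the most similar stored center.

```python
# Class center update for one class, following the policy described above.
import numpy as np

def cosine_sims(feature, centers):
    return centers @ feature / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(feature) + 1e-8)

def update_class_centers(centers, feature, sim_threshold=0.6, max_centers=50):
    # centers: list of (d,) arrays for one class; feature: (d,) sample proposal feature.
    if len(centers) == 0:                    # empty set Ck: adopt the feature directly
        centers.append(feature)
        return centers
    sims = cosine_sims(feature, np.stack(centers))
    if sims.max() >= sim_threshold:          # already well represented: discard
        return centers
    if len(centers) < max_centers:           # dissimilar and room left: add
        centers.append(feature)
    else:                                    # dissimilar but full: replace most similar
        centers[int(sims.argmax())] = feature
    return centers

Ck = []                                      # class center set for one class
for _ in range(200):
    Ck = update_class_centers(Ck, np.random.randn(1024).astype(np.float32))
print(len(Ck))                               # at most 50
```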
Thus, as training proceeds, a class center module containing representative features is obtained, and with this collection scheme the whole class center module can be trained end to end together with the target detection model.
Referring to fig. 5, fig. 5 shows an exemplary process of updating the class centers of the classes to which objects belong in the class center module (memory feature library). For the training sample images in each mini-batch, the ground-truth category labels are used to obtain all object boxes (ground-truth boxes), and the proposed features of the objects in those boxes are obtained through ROIAlign. Each proposed feature has a true class label and is used to update the class centers corresponding to that label. In fig. 5, the class labels include class label 1 (label 1), class label 2 (label 2), class label 3 (label 3), and so on. The number of class centers for label 1 has reached the number threshold; since the new proposed feature for label 1 has relatively low similarity to all stored class centers of label 1, the stored class center most similar to it is replaced by the new proposed feature (501 in fig. 5). The number of class centers for label 2 has not reached the threshold, and the similarity of the new proposed feature for label 2 to all stored class centers of label 2 is below the similarity threshold, so the proposed feature is added to the class centers of label 2 (502 in fig. 5). Although the number of class centers for label 3 has not reached the threshold, the new proposed feature for label 3 is very similar to the stored class centers of label 3, so it is discarded.
It should be noted that, when the class centers of the classes to which objects belong are updated in the class center module, a target loss function can be constructed to supervise the enhanced features in order to obtain enhanced features E with better representation and discriminability. For example, a target loss function is constructed from the difference between the class labels and the obtained sample enhancement features, and the class centers in the class center module are updated according to this loss function to construct the class center module.
The target loss function can be of different types, such as an L2-norm loss, an L2 loss function, or an L1 loss function. Taking the L2 norm as an example, the formula can be:

Ll2 = L2norm(L·Lt − E·Et)

where L is each class label, L2norm denotes L2 normalization, E is the sample enhancement feature, and Lt and Et are coefficients.
In the embodiments of the present application, when the class centers of the classes to which objects belong are updated in the class center module, a target loss function with an embedded structure is designed for the class center module, so that representative features can be stored dynamically and propagated accurately to improve feature discrimination.
Based on the target detection method provided in the foregoing embodiments, the embodiment of the present application further provides a target detection apparatus 600. Referring to fig. 6, the apparatus 600 includes an acquisition unit 601, an extraction unit 602, a determination unit 603, an enhancement unit 604, and a detection unit 605:
The acquiring unit 601 is configured to acquire an image to be detected acquired under a target environmental condition;
the extracting unit 602 is configured to perform feature extraction on the image to be detected to obtain proposed features;
the determining unit 603 is configured to determine, from a class center dataset, a target class center associated with the proposed feature according to a similarity between the proposed feature and a class center in the class center dataset, where the class center dataset includes class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
the enhancing unit 604 is configured to enhance the proposed feature through the target class center to obtain an enhanced feature;
the determining unit 603 is further configured to determine a target image feature corresponding to the image to be detected according to the enhancement feature;
the detecting unit 605 is configured to perform target detection on the image to be detected according to the target image feature, so as to obtain a detection result.
In a possible implementation manner, the device further includes an input unit, where the input unit is configured to input the image to be detected into a target detection model, where the target detection model includes a feature extraction module, a class center module, a feature fusion module, and an identification module, and the class center module is configured by class centers in the class center dataset;
The extracting unit 602 is configured to perform feature extraction on the image to be detected through the feature extracting module to obtain the proposed feature;
the determining unit 603 is configured to determine, by the class center module, the target class center associated with the proposed feature;
the enhancing unit 604 is configured to fuse, by using the feature fusion module, the target class center with the proposed feature to obtain the enhanced feature;
the determining unit 603 is further configured to determine, through the feature fusion module, the target image features corresponding to the image to be detected according to the enhanced features;
the detecting unit 605 is configured to perform target detection on the image to be detected according to the target image feature through the identifying module, so as to obtain the detection result.
In a possible implementation manner, the apparatus further includes a construction unit, where the construction unit is configured to construct the class center module by:
in the process of training to obtain the target detection model, extracting features of a training sample image through the feature extraction module to obtain sample proposal features of objects identified by category labels, wherein the category labels are used for identifying categories of the objects included in the training sample image;
And updating the class center of the class to which the object belongs in the class center module according to the sample proposal characteristics of the object.
In a possible implementation, the building unit is further configured to:
if the number of class centers of the class to which the object belongs is zero, taking the sample proposal feature of the object as the class center of the class to which the object belongs;
if the number of class centers of the class to which the object belongs is not zero, calculating the similarity between the sample proposal feature of the object and the class center of the class to which the object belongs;
and updating the class center of the class to which the object belongs in the class center module according to the similarity.
In a possible implementation, the building unit is further configured to:
and if the similarity is smaller than a similarity threshold, adding the sample proposal characteristic of the object into the class center of the class to which the object belongs.
In a possible implementation, the building unit is further configured to:
if the similarity is smaller than a similarity threshold, determining the magnitude relation between the number of class centers of the class to which the object belongs and the number threshold;
and updating the class center of the class to which the object belongs in the class center module according to the size relation.
In a possible implementation, the building unit is further configured to:
constructing a target loss function according to the difference between the class label and the obtained sample enhancement feature;
and updating the class center in the class center module according to the target loss function to construct the class center module.
In a possible implementation, the determining unit 603 is configured to:
acquiring a class center data set corresponding to the target environmental condition;
the target class center associated with the proposed feature is determined from a class center dataset corresponding to the target environmental condition.
In a possible implementation manner, the determining unit 603 is configured to:
and splicing the enhancement features and the proposal features to obtain the target image features.
In one possible implementation, the enhancing unit 604 is configured to:
calculating a similarity measurement matrix according to the target class centers and the proposed features, wherein elements in the similarity measurement matrix are used for identifying weight values of each target class center for the proposed features;
and carrying out weighted summation on the target class center and the proposed features according to the similarity measurement matrix to obtain the enhanced features corresponding to the proposed features.
In a possible implementation manner, the apparatus further includes a training unit, where the training unit is configured to:
acquiring a training sample image under a sample environment condition, wherein the training sample image is provided with a category label of an object included in the training sample image;
inputting the training sample image into an initial detection model, wherein the initial detection model comprises an initial feature extraction module, an initial class center module, an initial feature fusion module and an initial identification module, and the training sample image is provided with class labels of objects included in the training sample image;
performing feature extraction on the training sample image through the initial feature extraction module to obtain sample proposal features;
determining, by the initial class center module, a target sample class center associated with the sample proposal feature from the initial class center module according to the similarity between the sample proposal feature and the class centers in the class center dataset;
the initial feature fusion module is used for enhancing the sample proposal features by utilizing the target sample class center to obtain sample enhancement features;
determining target sample image features corresponding to the training sample image according to the sample enhancement features through the initial feature fusion module;
performing target detection on the training sample image according to the target sample image features through the initial recognition module to obtain a sample detection result;
training the initial detection model based on the class labels and the sample detection results to obtain the target detection model.
According to the technical scheme, class centers of different classes are collected in advance; they are representative features of objects of different classes and can reflect the prototype of a class. When target detection is performed on an image to be detected acquired under a target environmental condition, feature extraction is performed on the image to obtain proposed features, and a target class center associated with the proposed features is determined from the class center dataset according to the similarity between the proposed features and the class centers in the dataset. Because the class center dataset contains class centers of different classes, each being a representative feature of one class of objects, the proposed features can be enhanced with the target class center to obtain enhanced features, and the target image features of the image to be detected can be determined from the enhanced features, yielding richer, more comprehensive features that represent objects of a given class. In particular, when the target environmental condition corrupts the image to be detected so that the extracted proposed features are insufficient and lack discriminability, enhancing the proposed features with the target class center assists the detection process, so the final target image features are richer and more discriminative. Target detection performed on the image to be detected according to these target image features therefore produces a more accurate detection result and improves the accuracy of target detection.
Based on the above embodiments, an embodiment of the present application further provides an apparatus for target detection. The apparatus may be the foregoing image processing apparatus, which may in turn be a terminal device; the terminal device is exemplified below by a smart phone:
fig. 7 is a block diagram of part of the structure of a smart phone serving as the terminal device provided in an embodiment of the present application. Referring to fig. 7, the smart phone includes: a Radio Frequency (RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (WiFi) module 770, a processor 780, and a power supply 790. The input unit 730 may include a touch panel 731 and other input devices 732, the display unit 740 may include a display panel 741, and the audio circuit 760 may include a speaker 761 and a microphone 762. Those skilled in the art will appreciate that the structure shown in fig. 7 does not constitute a limitation on the smart phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 720 may be used to store software programs and modules, and the processor 780 executes the various functional applications and data processing of the smart phone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phone book, etc.) created according to the use of the smart phone. In addition, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 780 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions and processes data of the smart phone by running or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby performing overall monitoring of the smart phone. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 780.
In this embodiment, the processor 780 in the smartphone may perform the following steps:
acquiring an image to be detected acquired under a target environmental condition;
extracting features of the image to be detected to obtain proposed features;
determining a target class center associated with the proposed feature from a class center dataset according to the similarity of the proposed feature and the class center in the class center dataset, wherein the class center dataset comprises class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
enhancing the proposed features through the target class center to obtain enhanced features;
determining target image characteristics corresponding to the image to be detected according to the enhancement characteristics;
and carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
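One embodiment of the application determines the target image features by splicing the enhanced features with the proposed features (see claim 9 below). A minimal sketch of that step, with concatenation along the last (feature) dimension as an assumed realization of the splice:

```python
import torch

def splice_features(enhanced_feats: torch.Tensor,
                    proposal_feats: torch.Tensor) -> torch.Tensor:
    # Target image features as the splice of enhanced and proposed
    # features; the concatenation axis is an assumption.
    return torch.cat([enhanced_feats, proposal_feats], dim=-1)
```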
The image processing apparatus provided in this embodiment of the present application may also be a server. As shown in fig. 8, fig. 8 is a block diagram of a server 800 provided in an embodiment of the present application. The server 800 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may be transitory or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processor 822 in the server 800 may perform the following steps:
acquiring an image to be detected acquired under a target environmental condition;
extracting features of the image to be detected to obtain proposed features;
determining a target class center associated with the proposed feature from a class center dataset according to the similarity of the proposed feature and the class center in the class center dataset, wherein the class center dataset comprises class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
enhancing the proposed features through the target class center to obtain enhanced features;
determining target image characteristics corresponding to the image to be detected according to the enhancement characteristics;
and carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing program code for performing the object detection method according to the foregoing embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the above-described figures are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It is to be understood that data so used may be interchanged where appropriate, such that the embodiments of the present application described herein can, for example, be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, article, or apparatus.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (15)

1. A method of target detection, the method comprising:
acquiring an image to be detected acquired under a target environmental condition;
extracting features of the image to be detected to obtain proposed features;
determining a target class center associated with the proposed feature from a class center dataset according to the similarity of the proposed feature and the class center in the class center dataset, wherein the class center dataset comprises class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
enhancing the proposed features through the target class center to obtain enhanced features;
determining target image characteristics corresponding to the image to be detected according to the enhancement characteristics;
and carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
2. The method according to claim 1, wherein the method further comprises:
inputting the image to be detected into a target detection model, wherein the target detection model comprises a feature extraction module, a class center module, a feature fusion module and an identification module, and the class center module is composed of class centers in the class center data set;
The feature extraction of the image to be detected to obtain proposed features comprises the following steps:
performing feature extraction on the image to be detected through the feature extraction module to obtain the proposed features;
the determining a target class center associated with the proposed feature from the class center dataset according to the similarity of the proposed feature to a class center in the class center dataset, comprising:
determining, by the class center module, the target class center associated with the proposed feature;
the enhancing the proposed feature by the target class center to obtain an enhanced feature comprises:
fusing the target class center and the proposed features through the feature fusion module to obtain the enhanced features;
the determining the target image features corresponding to the image to be detected according to the enhancement features comprises the following steps:
determining, through the feature fusion module, the target image features corresponding to the image to be detected according to the enhanced features;
performing target detection on the image to be detected according to the target image characteristics to obtain a detection result, including:
and carrying out target detection on the image to be detected according to the target image characteristics by the identification module to obtain the detection result.
3. The method of claim 2, further comprising building the class center module, the building the class center module comprising:
in the process of training to obtain the target detection model, extracting features of a training sample image through the feature extraction module to obtain sample proposal features of objects identified by category labels, wherein the category labels are used for identifying categories of the objects included in the training sample image;
and updating the class center of the class to which the object belongs in the class center module according to the sample proposal characteristics of the object.
4. A method according to claim 3, wherein updating the class center of the class to which the object belongs in the class center module according to the sample proposal feature of the object comprises:
if the number of class centers of the class to which the object belongs is zero, taking the sample proposal feature of the object as the class center of the class to which the object belongs;
if the number of class centers of the class to which the object belongs is not zero, calculating the similarity between the sample proposal feature of the object and the class center of the class to which the object belongs;
and updating the class center of the class to which the object belongs in the class center module according to the similarity.
5. The method of claim 4, wherein updating the class center of the class to which the object belongs in the class center module according to the similarity comprises:
and if the similarity is smaller than a similarity threshold, adding the sample proposal characteristic of the object into the class center of the class to which the object belongs.
6. The method of claim 4, wherein updating the class center of the class to which the object belongs in the class center module according to the similarity comprises:
if the similarity is smaller than a similarity threshold, determining the magnitude relation between the number of class centers of the class to which the object belongs and the number threshold;
and updating the class center of the class to which the object belongs in the class center module according to the size relation.
7. A method according to claim 3, wherein said constructing said class center module comprises:
constructing a target loss function according to the difference between the class label and the obtained sample enhancement feature;
and updating the class center in the class center module according to the target loss function to construct the class center module.
8. The method according to any one of claims 1-7, wherein different environmental conditions correspond to different class center datasets, and the determining a target class center associated with the proposed feature from the class center dataset according to the similarity between the proposed feature and the class centers in the class center dataset comprises:
acquiring a class center dataset corresponding to the target environmental condition;
determining the target class center associated with the proposed feature from the class center dataset corresponding to the target environmental condition.
9. The method according to any one of claims 1-7, wherein determining the target image feature corresponding to the image to be detected according to the enhancement feature comprises:
and splicing the enhanced features and the proposed features to obtain the target image features.
10. The method of any of claims 1-7, wherein the enhancing the proposed feature by the target class center results in an enhanced feature comprising:
calculating a similarity measurement matrix according to the target class centers and the proposed features, wherein elements in the similarity measurement matrix are used for identifying weight values of each target class center for the proposed features;
and performing, according to the similarity measurement matrix, a weighted summation over the target class centers and the proposed features to obtain the enhanced features corresponding to the proposed features.
11. The method according to any one of claims 2-7, further comprising:
acquiring a training sample image under a sample environmental condition, wherein the training sample image is provided with a class label of an object included in the training sample image;
inputting the training sample image into an initial detection model, wherein the initial detection model comprises an initial feature extraction module, an initial class center module, an initial feature fusion module and an initial identification module;
performing feature extraction on the training sample image through the initial feature extraction module to obtain sample proposal features;
determining, through the initial class center module, a target sample class center associated with the sample proposal features according to the similarity between the sample proposal features and the class centers in the class center dataset;
enhancing, through the initial feature fusion module, the sample proposal features by using the target sample class center to obtain sample enhancement features;
determining, through the initial feature fusion module, target sample image features corresponding to the training sample image according to the sample enhancement features;
performing target detection on the training sample image according to the target sample image features through the initial identification module to obtain a sample detection result;
training the initial detection model based on the class labels and the sample detection results to obtain the target detection model.
12. An object detection device, characterized in that the device comprises an acquisition unit, an extraction unit, a determination unit, an enhancement unit and a detection unit:
the acquisition unit is used for acquiring an image to be detected acquired under the target environmental condition;
the extraction unit is used for extracting the characteristics of the image to be detected to obtain proposed characteristics;
the determining unit is configured to determine, from a class center dataset, a target class center associated with the proposed feature according to a similarity between the proposed feature and a class center in the class center dataset, where the class center dataset includes class centers of different classes, and the class centers are representative features corresponding to objects of different classes;
the enhancement unit is used for enhancing the proposed features through the target class center to obtain enhanced features;
the determining unit is further used for determining target image features corresponding to the image to be detected according to the enhancement features;
and the detection unit is used for carrying out target detection on the image to be detected according to the target image characteristics to obtain a detection result.
13. An apparatus for object detection, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-11 according to instructions in the program code.
14. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a program code for performing the method of any one of claims 1-11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-11.
CN202111258132.0A 2021-10-27 2021-10-27 Object detection method, device, apparatus, storage medium, and program product Pending CN116051879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111258132.0A CN116051879A (en) 2021-10-27 2021-10-27 Object detection method, device, apparatus, storage medium, and program product


Publications (1)

Publication Number Publication Date
CN116051879A (en) 2023-05-02

Family

ID=86129991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111258132.0A Pending CN116051879A (en) 2021-10-27 2021-10-27 Object detection method, device, apparatus, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN116051879A (en)

Similar Documents

Publication Publication Date Title
CN108596277B (en) Vehicle identity recognition method and device and storage medium
US9767565B2 (en) Synthesizing training data for broad area geospatial object detection
CN107153817B (en) Pedestrian re-identification data labeling method and device
CN112037142B (en) Image denoising method, device, computer and readable storage medium
CN110222572B (en) Tracking method, tracking device, electronic equipment and storage medium
CN111241989A (en) Image recognition method and device and electronic equipment
CN114170516B (en) Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN111078946A (en) Bayonet vehicle retrieval method and system based on multi-target regional characteristic aggregation
CN112699834A (en) Traffic identification detection method and device, computer equipment and storage medium
CN116563583B (en) Image matching method, map information updating method and related device
CN115620090A (en) Model training method, low-illumination target re-recognition method and device and terminal equipment
CN115577768A (en) Semi-supervised model training method and device
CN110619280A (en) Vehicle heavy identification method and device based on deep joint discrimination learning
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN113706550A (en) Image scene recognition and model training method and device and computer equipment
CN112861776A (en) Human body posture analysis method and system based on dense key points
CN116958606A (en) Image matching method and related device
CN111767839A (en) Vehicle driving track determining method, device, equipment and medium
CN115203352B (en) Lane level positioning method and device, computer equipment and storage medium
CN116823884A (en) Multi-target tracking method, system, computer equipment and storage medium
CN115661444A (en) Image processing method, device, equipment, storage medium and product
CN116051879A (en) Object detection method, device, apparatus, storage medium, and program product
CN114119757A (en) Image processing method, apparatus, device, medium, and computer program product
CN114332174A (en) Track image alignment method and device, computer equipment and storage medium
CN114387496A (en) Target detection method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination