CN114549926A - Target detection and target detection model training method and device - Google Patents

Target detection and target detection model training method and device Download PDF

Info

Publication number
CN114549926A
CN114549926A
Authority
CN
China
Prior art keywords
image
preset direction
processed
target
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210082040.XA
Other languages
Chinese (zh)
Inventor
王康康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210082040.XA
Publication of CN114549926A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a target detection method and apparatus, a training method and apparatus for a target detection model, an electronic device and a readable storage medium, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning and computer vision, and can be applied to scenes such as image processing and image detection. The target detection method comprises the following steps: acquiring an image to be processed; extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction; and obtaining a target detection result of the image to be processed according to the image features. The training method of the target detection model comprises the following steps: acquiring a training set; constructing a neural network model comprising an input network, a feature extraction network and an output network; and training the neural network model by using a plurality of sample images and the target labeling results of the plurality of sample images to adjust the parameters of each network in the neural network model and the parameters of the non-centralized convolution kernel corresponding to the preset direction, thereby obtaining a target detection model.

Description

Target detection and target detection model training method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as image processing, image detection and the like. Specifically, a method and a device for training a target detection and target detection model, an electronic device and a readable storage medium are provided.
Background
When target detection is performed on an image, the detection effect can be improved by referring to the context features of the detected target. For example, in face detection, the head, the shoulders and the human body are context features, and referring to their positions is very beneficial to detecting the face.
In the prior art, target detection with reference to context features is generally implemented in two ways: 1) directly enlarging the receptive field; 2) explicitly combining the context features. However, both methods suffer from introducing a large amount of irrelevant background and from high labeling cost.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided an object detection method, including: acquiring an image to be processed; extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction; and obtaining a target detection result of the image to be processed according to the image characteristics.
According to a second aspect of the present disclosure, there is provided a training method of a target detection model, including: acquiring a training set, wherein the training set comprises a plurality of sample images and target labeling results of the plurality of sample images; constructing a neural network model comprising an input network, a feature extraction network and an output network, wherein the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network; and training the neural network model by using the target labeling results of the multiple sample images and the multiple sample images to adjust the parameters of each network in the neural network model and the parameters of the non-centralized convolution kernel corresponding to the preset direction to obtain a target detection model.
According to a third aspect of the present disclosure, there is provided an object detection apparatus comprising: the first acquisition unit is used for acquiring an image to be processed; the processing unit is used for extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction; and the detection unit is used for obtaining a target detection result of the image to be processed according to the image characteristics.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for an object detection model, comprising: the second acquisition unit is used for acquiring a training set, and the training set comprises a plurality of sample images and target labeling results of the sample images; the device comprises a construction unit, a prediction unit and a prediction unit, wherein the construction unit is used for constructing a neural network model comprising an input network, a feature extraction network and an output network, the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network; and the training unit is used for training the neural network model by using the plurality of sample images and the target labeling results of the plurality of sample images so as to adjust the parameters of each network in the neural network model and the parameters of the non-centralized convolution kernel corresponding to the preset direction to obtain a target detection model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical scheme of the present disclosure, by providing non-centralized convolution kernels corresponding to preset directions, the image features extracted from the image to be processed can attend not only to the target in the image to be processed but also to areas other than the target, which enriches the information contained in the extracted image features and improves the accuracy of target detection.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing the target detection or training method of the target detection model of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the target detection method of this embodiment specifically includes the following steps:
s101, acquiring an image to be processed;
s102, extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction;
s103, obtaining a target detection result of the image to be processed according to the image characteristics.
According to the target detection method of this embodiment, after the image to be processed is acquired, a non-centered convolution kernel corresponding to a preset direction is used to extract image features from the image to be processed, and then the target detection result of the image to be processed is obtained according to the image features.
In the embodiment, when S101 is executed to acquire an image to be processed, an image input by the input terminal may be used as the image to be processed, or an image selected by the input terminal in a network may be used as the image to be processed.
After the to-be-processed image is acquired in S101, S102 is performed to extract image features from the acquired to-be-processed image by using a non-centered convolution kernel corresponding to a preset direction.
The non-centered convolution kernel used in S102 in this embodiment means that when the convolution kernel extracts image features, information around the current position is not extracted equally but with emphasis, the emphasis lying in the preset direction corresponding to the non-centered convolution kernel used; the size of the non-centered convolution kernel in this embodiment may be set in advance, for example 3 × 3 or 5 × 5.
In the embodiment, when S102 is executed, different non-centered convolution kernels correspond to different preset directions; the preset direction in this embodiment is at least one of eight directions, namely upper left, up, upper right, left, right, lower left, down and lower right, and the non-centered convolution kernel corresponding to each preset direction is used to extract information in the range lying in that direction.
If the size of the non-centered convolution kernel used when S102 is executed is 3 × 3, then in this embodiment: the non-centered convolution kernel corresponding to the upper left direction is {(X11, X12, 0), (X21, X22, 0), (0, 0, 0)}; the one corresponding to the up direction is {(X11, X12, X13), (X21, X22, X23), (0, 0, 0)}; the one corresponding to the upper right direction is {(0, X12, X13), (0, X22, X23), (0, 0, 0)}; the one corresponding to the right direction is {(0, X12, X13), (0, X22, X23), (0, X32, X33)}; the one corresponding to the lower right direction is {(0, 0, 0), (0, X22, X23), (0, X32, X33)}; the one corresponding to the down direction is {(0, 0, 0), (X21, X22, X23), (X31, X32, X33)}; the one corresponding to the lower left direction is {(0, 0, 0), (X21, X22, 0), (X31, X32, 0)}; and the one corresponding to the left direction is {(X11, X12, 0), (X21, X22, 0), (X31, X32, 0)}.
In the non-centered convolution kernels corresponding to different preset directions, Xij denotes the parameter value at row i and column j of the kernel; for example, X11 denotes the parameter value in the first row and the first column. It will be appreciated that the parameter values at different positions in the non-centered convolution kernel may be preset in this embodiment.
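For illustration only, the eight kernels listed above can be viewed as a single learnable 3 × 3 weight combined with a direction-specific zero mask. The following minimal PyTorch sketch shows this view; the mask table and the helper name masked_kernel are assumptions made for the example and are not taken from the patent text.

```python
import torch

# Assumed realization: a mask entry of 1 keeps the parameter X_ij at that
# position, a 0 zeroes it out, reproducing the eight kernels listed above.
DIRECTION_MASKS = {
    "upper_left":  [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    "up":          [[1, 1, 1], [1, 1, 1], [0, 0, 0]],
    "upper_right": [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    "right":       [[0, 1, 1], [0, 1, 1], [0, 1, 1]],
    "lower_right": [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
    "down":        [[0, 0, 0], [1, 1, 1], [1, 1, 1]],
    "lower_left":  [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    "left":        [[1, 1, 0], [1, 1, 0], [1, 1, 0]],
}

def masked_kernel(weight: torch.Tensor, direction: str) -> torch.Tensor:
    """Zero out the positions of an (out_c, in_c, 3, 3) conv weight that lie
    outside the given preset direction, matching the kernels listed above."""
    mask = torch.tensor(DIRECTION_MASKS[direction],
                        dtype=weight.dtype, device=weight.device)
    return weight * mask  # broadcasts over the (out_c, in_c) dimensions
```

Sharing one weight tensor across all directions is just one possible design; storing eight independently learned masked weights would match the text equally well.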
Specifically, when S102 is executed to extract image features from the image to be processed by using a non-centered convolution kernel corresponding to the preset direction, the present embodiment may adopt the following optional implementation manners: determining a preset direction according to attribute information of the image to be processed, wherein the attribute information in the embodiment can be type information of a target in the image to be processed, scene information of the image to be processed and the like; and extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction.
That is to say, in this embodiment, by setting the corresponding relationship between the attribute information and the preset direction, that is, different attribute information corresponds to different preset directions, the purpose of determining the preset direction according to the attribute information of the image to be processed is achieved, so that the determined non-centralized convolution kernel corresponding to the preset direction can be more matched with the image to be processed, and the accuracy of the extracted image features is improved.
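As a purely hypothetical illustration of this correspondence, one could keep a lookup table from attribute information (such as target type) to preset directions, falling back to all eight directions when no entry matches; the table contents below are invented for the example and reuse the direction names of DIRECTION_MASKS from the earlier sketch.

```python
# Hypothetical attribute-to-direction table; the actual correspondence is
# application-specific and would be configured per deployment.
ATTRIBUTE_TO_DIRECTIONS = {
    "face":    ["down", "lower_left", "lower_right"],  # body context lies below a face
    "vehicle": ["left", "right"],                      # road context lies to the sides
}

def directions_for(attribute_info: dict) -> list:
    """Pick preset directions from the attribute info of the image to be processed."""
    target_type = attribute_info.get("target_type")
    return ATTRIBUTE_TO_DIRECTIONS.get(target_type, list(DIRECTION_MASKS))
```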
In this embodiment, when performing S102 to extract image features from the image to be processed by using a non-centered convolution kernel corresponding to the preset direction, the optional implementation manners that can be adopted are: respectively using non-centralized convolution kernels corresponding to the preset directions to perform convolution processing on the image to be processed; according to the convolution result corresponding to each preset direction, the image characteristics of the image to be processed are obtained, and the embodiment may use the addition result between the convolution results as the image characteristics.
For example, if the predetermined directions determined by performing S102 in the present embodiment are the upper left direction, the upper direction and the upper right direction, the embodiment performs S102 to obtain the convolution result corresponding to the upper left direction, the convolution result corresponding to the upper direction and the convolution result corresponding to the upper right direction.
When the convolution processing is performed on the image to be processed by using the non-centered convolution kernel corresponding to the preset direction in S102, the embodiment may first obtain a feature map (feature map) of the image to be processed by using a convolution neural network, and then perform the convolution processing on the obtained feature map by using the non-centered convolution kernel corresponding to the preset direction.
It can be understood that, when performing S102 to add the convolution results corresponding to each preset direction to obtain the image feature of the image to be processed, the embodiment may adopt an alternative implementation manner as follows: according to the image to be processed, a weight value corresponding to each preset direction is obtained, and in this embodiment, the feature map of the image to be processed may be input into a known network structure (for example, SEnet), and the weight value corresponding to each preset direction output by the known network structure is obtained; according to the convolution result corresponding to each preset direction and the weight value, the image feature of the image to be processed is obtained, in this embodiment, the convolution result corresponding to each preset direction may be multiplied by the weight value, and then the addition result between the multiplication results corresponding to each preset direction is used as the image feature.
That is to say, in this embodiment, the weighted values corresponding to different preset directions may also be obtained according to the image to be processed, so that the image feature is obtained through the convolution result and the weighted value corresponding to each preset direction, and thus the accuracy of the image feature is improved.
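Putting these pieces together, the following is a hedged sketch of this weighted variant of S102: directional convolutions on a feature map, per-direction weight values from an SE-style gate (the patent names SEnet only as one example of a known network structure), and a weighted sum. It reuses masked_kernel from the earlier sketch; all class and parameter names are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DirectionalConv(nn.Module):
    """Sketch of S102: convolve a feature map with the non-centered kernel
    of each chosen preset direction, weight each convolution result by an
    SE-style gate, and sum the weighted results into the image feature."""

    def __init__(self, channels: int,
                 directions=("upper_left", "up", "upper_right")):
        super().__init__()
        self.directions = list(directions)
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        # SE-style gate: global pool -> bottleneck MLP -> one weight per direction
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, len(self.directions)), nn.Sigmoid(),
        )

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        w = self.gate(fmap)  # (batch, n_directions) weight values
        out = torch.zeros_like(fmap)
        for i, d in enumerate(self.directions):
            conv = F.conv2d(fmap, masked_kernel(self.weight, d), padding=1)
            out = out + w[:, i].view(-1, 1, 1, 1) * conv
        return out
```

For instance, DirectionalConv(64)(torch.randn(1, 64, 32, 32)) returns the weighted sum of the upper left, up and upper right convolution results, matching the example in the preceding paragraphs.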
In addition, when S102 is executed to extract image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction, the embodiment may further input the image to be processed into the target detection model, and extract image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction through the feature extraction network in the target detection model.
After executing S102 to extract image features from the image to be processed, executing S103 to obtain a target detection result of the image to be processed according to the extracted image features; in this embodiment, the target detection result obtained in S103 is executed, specifically, a bounding box (bounding box) surrounding the target in the image to be processed.
When S103 is executed to obtain a target detection result of the image to be processed according to the extracted image features, the present embodiment may input the extracted image features into the target detection model, and output the target detection result of the image to be processed according to the input image features through an output network in the target detection model.
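Such an output network can take many forms; the following deliberately minimal sketch predicts a single (x1, y1, x2, y2) bounding box per image from the extracted features, purely as an assumed illustration (real detection heads emit many candidate boxes plus class scores).

```python
class BoxHead(nn.Module):
    """Assumed minimal output network: one bounding box per image."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse the spatial dimensions
        self.fc = nn.Linear(channels, 4)      # (x1, y1, x2, y2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feats).flatten(1))
```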
According to the target detection method provided by the embodiment, the non-centralized convolution kernels corresponding to the preset directions are used for extracting the image features of the image to be processed, and the non-centralized convolution kernels corresponding to different preset directions can extract the information in the corresponding direction ranges, so that the purpose of obtaining the image features by referring to the context features of the target is achieved, the information contained in the extracted image features is enriched, and the accuracy of target detection is improved.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in fig. 2, the training method of the target detection model of the present embodiment specifically includes the following steps:
s201, obtaining a training set, wherein the training set comprises a plurality of sample images and target labeling results of the plurality of sample images;
s202, constructing a neural network model comprising an input network, a feature extraction network and an output network, wherein the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network;
s203, training the neural network model by using the target labeling results of the multiple sample images and the multiple sample images to adjust parameters of each network in the neural network model and parameters of the non-centralized convolution kernel corresponding to the preset direction to obtain a target detection model.
In the training method of the target detection model of this embodiment, the constructed neural network model includes the feature extraction network composed of at least one target feature extraction layer, so that the neural network model can extract image features by using a non-centered convolution kernel corresponding to a preset direction, accuracy of the extracted image features is improved, and a detection effect of the target detection model is enhanced.
In the embodiment, in the training set obtained in S201, the target labeling result of the sample image is a bounding box enclosing the target in the sample image; in this embodiment, the plurality of sample images included in the training set obtained in S201 may correspond to an application scene in real life, such as face detection in security service, vehicle detection in urban traffic, and the like.
In this embodiment, after the target labeling result including a plurality of sample images and a plurality of sample images is obtained in step S201, step S202 is performed to construct a neural network model including an input network, a feature extraction network, and an output network.
In the neural network model constructed in S202, the input network is used to input the input sample image into the feature extraction network, so that the feature extraction network can extract image features from the sample image.
In the embodiment, in the neural network model constructed in step S202, the feature extraction network is composed of at least one feature extraction layer and at least one target feature extraction layer; the basic architecture of the feature extraction network in this embodiment is a backbone-based feature extraction network, such as VGG, ResNet, and the like.
Specifically, in the neural network model constructed in S202 in this embodiment, when the image features are obtained according to the image features output by the previous layer and the non-centered convolution kernel corresponding to the preset direction in at least one target feature extraction layer included in the feature extraction network, an optional implementation manner that may be adopted is as follows: obtaining a first image characteristic according to the image characteristic output by the previous layer, namely, performing convolution processing on the image characteristic output by the previous layer by the current target characteristic extraction layer, and taking a convolution result as the first image characteristic; performing convolution processing on the first image characteristics by using non-centralized convolution kernels corresponding to the preset directions respectively; and obtaining image characteristics according to the convolution result corresponding to each preset direction.
The preset direction in this embodiment is at least one of eight directions, namely upper left, up, upper right, left, right, lower left, down and lower right.
That is to say, the target feature extraction layer in the feature extraction network of this embodiment obtains the image feature through two convolution operations, the second of which uses a non-centered convolution kernel corresponding to the preset direction, so that the image feature obtained by each feature extraction layer includes information from different preset direction ranges, thereby improving the accuracy of the obtained image feature.
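The two-stage layer described above can be sketched as follows, reusing DirectionalConv and DIRECTION_MASKS from the earlier sketches; the ReLU and the exact wiring between the two convolutions are assumptions for the example.

```python
class TargetFeatureLayer(nn.Module):
    """Sketch of a target feature extraction layer: an ordinary convolution
    produces the first image feature, then the non-centered directional
    convolutions produce the layer's output image feature."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.first = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.directional = DirectionalConv(out_channels,
                                           directions=list(DIRECTION_MASKS))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_feat = F.relu(self.first(x))   # first convolution processing
        return self.directional(first_feat)  # second, direction-aware convolution
```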
It can be understood that the number and the positions of the target feature extraction layers included in the feature extraction network in this embodiment may be preset; for a non-target feature extraction layer in the feature extraction network, only one convolution process is needed.
In the neural network model constructed in S202 in this embodiment, when at least one target feature extraction layer included in the feature extraction network obtains image features according to a convolution result corresponding to each preset direction, an optional implementation manner that may be adopted is as follows: according to the first image feature, a weight value corresponding to each preset direction is obtained, and in this embodiment, the first image feature may be input into a known network structure (for example, SEnet), and a weight value corresponding to each preset direction output by the known network structure is obtained; and obtaining image characteristics according to the convolution result corresponding to each preset direction and the weight value.
That is to say, for a specific task or a specific scene, different non-centralized convolution kernels have different importance, so that the embodiment further obtains a weight value corresponding to each preset direction according to the first image feature obtained by the target feature extraction layer, and further obtains the image features output by the feature extraction layer according to the convolution result and the weight value corresponding to each preset direction, so that different preset directions are distinguished by using different weight values, and the accuracy of the image features output by each feature extraction layer is further improved.
In addition, for a specific task or a specific scene, sometimes, not all the non-centered convolution kernels corresponding to different preset directions are needed, but only one or several non-centered convolution kernels in specific directions are needed, so that the image features can be accurately extracted.
Therefore, in the present embodiment, in the neural network model constructed in step S202, when the image features are obtained according to the convolution result and the weight value corresponding to each preset direction in at least one target feature extraction layer included in the feature extraction network, an optional implementation manner that may be adopted is as follows: regularizing the weight value corresponding to each preset direction to obtain a regularization result of the weight value corresponding to each preset direction; and obtaining image characteristics according to the convolution result corresponding to each preset direction and the weight value regularization result.
That is to say, the target feature extraction layer in the feature extraction network of this embodiment also uses regularization to make unneeded non-centered convolution kernels converge to 0 during training, so as to prune the non-centered convolution kernels corresponding to preset directions that are not needed, thereby increasing the speed of the neural network model in feature extraction.
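One common way to realize this prune-by-training behavior, sketched under the assumption of a learnable scalar per direction penalized with an L1 term (the patent does not fix the exact regularization form), is shown below; it builds on DirectionalConv and masked_kernel from the earlier sketches.

```python
class GatedDirectionalConv(DirectionalConv):
    """DirectionalConv variant with a learnable scale per direction so an
    L1 penalty can drive unneeded directions toward 0 during training."""
    def __init__(self, channels: int, directions=tuple(DIRECTION_MASKS)):
        super().__init__(channels, directions)
        self.direction_scale = nn.Parameter(torch.ones(len(self.directions)))

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        w = self.gate(fmap) * self.direction_scale  # gate modulated by prunable scales
        out = torch.zeros_like(fmap)
        for i, d in enumerate(self.directions):
            conv = F.conv2d(fmap, masked_kernel(self.weight, d), padding=1)
            out = out + w[:, i].view(-1, 1, 1, 1) * conv
        return out

def sparsity_penalty(layer: GatedDirectionalConv, coeff: float = 1e-4) -> torch.Tensor:
    # L1 regularization: directions whose scale converges to 0 can be pruned away
    return coeff * layer.direction_scale.abs().sum()
```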
In the neural network model constructed in S202, the output network is configured to obtain a target prediction result of the sample image according to the image features output by the feature extraction network, where the obtained target prediction result is a bounding box surrounding the target output by the neural network model for the target in the sample image.
In this embodiment, after the step S202 of constructing the neural network model including the input network, the feature extraction network and the output network is performed, the step S203 of training the neural network model using the plurality of sample images and the target labeling results of the plurality of sample images is performed to adjust parameters of each network in the neural network model and parameters of the non-centered convolution kernel corresponding to the preset direction, so as to obtain the target detection model.
Specifically, when S203 is executed to train the neural network model using the target labeling results of the multiple sample images and the multiple sample images, an optional implementation manner that can be adopted in the embodiment is as follows: respectively inputting the multiple sample images into a neural network model, and acquiring a target prediction result output by the neural network model aiming at each sample image; calculating a loss function value according to target labeling results and target prediction results of a plurality of sample images; and adjusting parameters of each network in the neural network model and parameters of the non-centralized convolution kernel corresponding to the preset direction according to the calculated loss function value until the loss function value is converged to obtain a target detection model.
It can be understood that, if at least one target feature extraction layer in the feature extraction network obtains image features in combination with the weighted values and the regularization processing in the neural network model constructed in step S202, in this embodiment, when step S203 is executed to train the neural network model using the target labeling results of the multiple sample images and the multiple sample images, the weighted values and the regularization parameters corresponding to different preset directions may also be adjusted.
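A minimal training-loop sketch of S203 under these assumptions follows; the smooth L1 box loss and the Adam optimizer are illustrative choices only, and GatedDirectionalConv and sparsity_penalty come from the earlier sketch.

```python
def train_detector(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Forward sample images, compare predicted boxes with labeled boxes,
    and update all parameters -- including the non-centered kernel weights,
    the gate weights and the per-direction scales -- by backpropagation."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labeled_boxes in loader:
            pred_boxes = model(images)
            loss = F.smooth_l1_loss(pred_boxes, labeled_boxes)  # assumed box loss
            # add the sparsity penalty of every prunable directional layer
            for m in model.modules():
                if isinstance(m, GatedDirectionalConv):
                    loss = loss + sparsity_penalty(m)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

In practice training would stop when the loss function value converges, as described above; the fixed epoch count here is a simplification.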
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, this embodiment shows a flowchart of the target feature extraction layer obtaining image features by using non-centered convolution kernels corresponding to preset directions: the non-centered convolution kernels in fig. 3 are, from left to right, the non-centered convolution kernels corresponding to the lower left, lower right, down, upper right, upper left and up directions.
Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. As shown in fig. 4, this embodiment shows a flowchart of the target feature extraction layer obtaining image features by using non-centered convolution kernels corresponding to preset directions: the non-centered convolution kernels in fig. 4 are, from left to right, the non-centered convolution kernels corresponding to the lower left, lower right, upper left and left directions; unlike fig. 3, this embodiment further combines regularization of the weight values when obtaining the image features.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in fig. 5, the object detecting apparatus 500 of the present embodiment includes:
a first obtaining unit 501, configured to obtain an image to be processed;
the processing unit 502 is configured to extract image features from the image to be processed by using a non-centered convolution kernel corresponding to a preset direction;
the detecting unit 503 is configured to obtain a target detection result of the image to be processed according to the image feature.
When acquiring the image to be processed, the first acquiring unit 501 may use the image input from the input terminal as the image to be processed, or use the image selected by the input terminal in the network as the image to be processed.
In this embodiment, after the first acquisition unit 501 acquires the image to be processed, the processing unit 502 extracts image features from the acquired image to be processed by using a non-centered convolution kernel corresponding to a preset direction.
The non-centered convolution kernel used by the processing unit 502 means that when the convolution kernel extracts image features, information around the current position is not extracted equally but with emphasis, the emphasis lying in the preset direction corresponding to the non-centered convolution kernel used.
In this embodiment, different non-centered convolution kernels correspond to different preset directions; the preset direction is at least one of eight directions, namely upper left, up, upper right, left, right, lower left, down and lower right, and the non-centered convolution kernel corresponding to each preset direction is used to extract information in the range lying in that direction.
In the non-centered convolution kernels corresponding to different preset directions, Xij denotes the parameter value at row i and column j of the kernel; for example, X11 denotes the parameter value in the first row and the first column. It will be appreciated that the parameter values at different positions in the non-centered convolution kernel may be preset in this embodiment.
Specifically, when the processing unit 502 extracts image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction, the optional implementation manners that can be adopted are: determining a preset direction according to the attribute information of the image to be processed; and extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction.
That is to say, the processing unit 502 realizes the purpose of determining the preset direction according to the attribute information of the image to be processed by setting the corresponding relationship between the attribute information and the preset direction, and different attribute information corresponds to different preset directions, so that the determined non-centralized convolution kernel corresponding to the preset direction can be more matched with the image to be processed, and the accuracy of the extracted image features is improved.
When the processing unit 502 extracts image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction, the optional implementation manner that can be adopted is as follows: respectively using non-centralized convolution kernels corresponding to the preset directions to perform convolution processing on the image to be processed; and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction.
When the processing unit 502 performs convolution processing on the image to be processed by using the non-centered convolution kernel corresponding to the preset direction, a feature map of the image to be processed may first be obtained by using a convolutional neural network, and then the non-centered convolution kernel corresponding to the preset direction may be used to perform convolution processing on the obtained feature map.
It can be understood that, when the processing unit 502 adds the convolution results corresponding to each preset direction to obtain the image features of the image to be processed, the optional implementation manner that can be adopted is as follows: acquiring a weight value corresponding to each preset direction according to the image to be processed; and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction and the weight value.
That is to say, the processing unit 502 may further obtain weight values corresponding to different preset directions according to the image to be processed, so that the image features are obtained according to the convolution result and the weight value corresponding to each preset direction, and thus the accuracy of the image features is improved.
In addition, when the processing unit 502 extracts image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction, the image to be processed may be input into the target detection model, and the feature extraction network in the target detection model extracts image features from the image to be processed by using the non-centered convolution kernel corresponding to the preset direction.
After the processing unit 502 extracts the image features from the image to be processed, the detection unit 503 obtains the target detection result of the image to be processed according to the extracted image features; the target detection result obtained by the detection unit 503 is specifically a bounding box (bounding box) surrounding the target in the image to be processed.
When obtaining the target detection result of the image to be processed according to the extracted image features, the detection unit 503 may input the extracted image features into the target detection model, and output the target detection result of the image to be processed according to the input image features through an output network in the target detection model.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in fig. 6, the training apparatus 600 for the target detection model of the present embodiment includes:
the second obtaining unit 601 is configured to obtain a training set, where the training set includes a plurality of sample images and target labeling results of the plurality of sample images;
the device comprises a construction unit 602, a neural network model, a feature extraction network and an output network, wherein the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network;
the training unit 603 is configured to train the neural network model using the multiple sample images and the target labeling results of the multiple sample images to adjust parameters of each network in the neural network model and parameters of the non-centered convolution kernel corresponding to the preset direction, so as to obtain a target detection model.
In the training set acquired by the second acquiring unit 601, the target labeling result of the sample image is a bounding box surrounding the target in the sample image; the plurality of sample images included in the training set acquired by the second acquiring unit 601 may correspond to an application scene in real life, such as human face detection in security service, vehicle detection in urban traffic, and the like.
In this embodiment, after the second obtaining unit 601 obtains the target labeling result including the plurality of sample images and the plurality of sample images, the constructing unit 602 constructs a neural network model including an input network, a feature extraction network, and an output network.
In the neural network model constructed by the construction unit 602, the input network is used to input the input sample image into the feature extraction network, so that the feature extraction network can extract image features from the sample image.
In the neural network model constructed by the construction unit 602, the feature extraction network is composed of at least one feature extraction layer and at least one target feature extraction layer; the basic architecture of the feature extraction network in this embodiment is a backbone-based feature extraction network, such as VGG, ResNet, and the like.
Specifically, in the neural network model constructed by the construction unit 602, when the image features are obtained according to the image features output by the previous layer and the non-centered convolution kernel corresponding to the preset direction in at least one target feature extraction layer included in the feature extraction network, an optional implementation manner that can be adopted is as follows: obtaining a first image characteristic according to the image characteristic output by the previous layer; performing convolution processing on the first image characteristics by using non-centralized convolution kernels corresponding to the preset directions respectively; and obtaining image characteristics according to the convolution result corresponding to each preset direction.
The preset direction in this embodiment is at least one of eight directions, namely upper left, up, upper right, left, right, lower left, down and lower right.
That is to say, the target feature extraction layer in the feature extraction network constructed by the construction unit 602 obtains the image feature through two convolution operations, the second of which uses a non-centered convolution kernel corresponding to the preset direction, so that the image feature obtained by each feature extraction layer includes information from different preset direction ranges, thereby improving the accuracy of the obtained image feature.
It can be understood that the number and the positions of the target feature extraction layers included in the feature extraction network in this embodiment may be preset; for a non-target feature extraction layer in the feature extraction network, only one convolution process is needed.
In the neural network model constructed by the construction unit 602, when the at least one target feature extraction layer included in the feature extraction network obtains the image features according to the convolution result corresponding to each preset direction, an optional implementation manner that can be adopted is as follows: acquiring a weight value corresponding to each preset direction according to the first image characteristics; and obtaining image characteristics according to the convolution result corresponding to each preset direction and the weight value.
That is to say, for a specific task or a specific scene, different non-centralized convolution kernels have different importance, so the target feature extraction layer constructed by the construction unit 602 also obtains a weight value corresponding to each preset direction according to the obtained first image feature, and further obtains the image features output by the feature extraction layer according to the convolution result and the weight value corresponding to each preset direction, so that different preset directions are distinguished by using different weight values, and the accuracy of the image features output by each feature extraction layer is further improved.
In addition, for a specific task or a specific scene, sometimes, not all the non-centered convolution kernels corresponding to different preset directions are needed, but only one or several non-centered convolution kernels in specific directions are needed, so that the image features can be accurately extracted.
Therefore, in the neural network model constructed by the construction unit 602, when the image features are obtained according to the convolution result and the weight value corresponding to each preset direction in at least one target feature extraction layer included in the feature extraction network, an optional implementation manner that may be adopted is as follows: regularizing the weight value corresponding to each preset direction to obtain a regularization result of the weight value corresponding to each preset direction; and obtaining image characteristics according to the convolution result corresponding to each preset direction and the weight value regularization result.
That is to say, the target feature extraction layer in the feature extraction network constructed by the construction unit 602 also uses regularization to make unneeded non-centered convolution kernels converge to 0 during training, so as to prune the non-centered convolution kernels corresponding to preset directions that are not needed, thereby improving the speed of the neural network model in feature extraction.
In the neural network model constructed by the construction unit 602, the output network is configured to obtain a target prediction result of the sample image according to the image features output by the feature extraction network, where the obtained target prediction result is a bounding box, output by the neural network model, surrounding the target in the sample image.
In this embodiment, after the construction unit 602 constructs the neural network model including the input network, the feature extraction network, and the output network, the training unit 603 trains the neural network model using the plurality of sample images and the target labeling results of the plurality of sample images to adjust parameters of each network in the neural network model and parameters of the non-centered convolution kernel corresponding to the preset direction, thereby obtaining the target detection model.
Specifically, when the training unit 603 trains the neural network model using the multiple sample images and the target labeling results of the multiple sample images, the optional implementation manner that can be adopted is as follows: respectively inputting the multiple sample images into a neural network model, and acquiring a target prediction result output by the neural network model aiming at each sample image; calculating a loss function value according to the target labeling result and the target prediction result of the plurality of sample images; and adjusting parameters of each network in the neural network model and parameters of the non-centralized convolution kernel corresponding to the preset direction according to the calculated loss function value until the loss function value is converged to obtain the target detection model.
It can be understood that, if at least one target feature extraction layer in the feature extraction network in the neural network model constructed by the construction unit 602 obtains the image features in combination with the weighted values and the regularization processing, the training unit 603 may further adjust the weighted values and the regularization parameters corresponding to different preset directions when the neural network model is trained by using the target labeling results of the multiple sample images and the multiple sample images.
In the technical scheme of the present disclosure, the acquisition, storage and application of the personal information of related users all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 7 is a block diagram of an electronic device for the target detection method or the training method of the target detection model according to embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the target detection method or the training method of the target detection model. For example, in some embodiments, the target detection method or the training method of the target detection model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708.
In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the target detection method or the training method of the target detection model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the target detection method or the training method of the target detection model.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A method of target detection, comprising:
acquiring an image to be processed;
extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction;
and obtaining a target detection result of the image to be processed according to the image characteristics.
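Claim 1 does not fix how the non-centralized convolution kernel is realized. As a purely illustrative reading, the PyTorch sketch below (the function name, the four-direction set, and the asymmetric-padding trick are all assumptions, not the patented implementation) treats it as a standard kernel whose anchor is shifted off-center so the receptive field leans toward the preset direction:

```python
# Illustrative sketch only: emulating a "non-centralized" convolution by
# shifting the kernel anchor with asymmetric padding (an assumption; the
# claim does not specify the mechanism).
import torch
import torch.nn.functional as F

def non_centralized_conv(image, kernel, direction):
    # image: (N, C, H, W); kernel: (C_out, C, k, k) with odd k
    k = kernel.shape[-1]
    p = k // 2
    # F.pad takes (left, right, top, bottom); padding one side twice and
    # the opposite side not at all shifts the sampled window that way.
    pad = {
        'left':  (2 * p, 0, p, p),
        'right': (0, 2 * p, p, p),
        'up':    (p, p, 2 * p, 0),
        'down':  (p, p, 0, 2 * p),
    }[direction]
    return F.conv2d(F.pad(image, pad), kernel)  # spatial size is preserved
```

For a 3x3 kernel and direction 'left', each output pixel is computed from a window lying entirely at or to the left of that pixel, instead of a window centered on it.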
2. The method of claim 1, wherein the extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction comprises:
determining a preset direction according to the attribute information of the image to be processed;
and extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction.
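The attribute information in claim 2 is left open. A hypothetical mapping (the attribute key and the four-direction quantization are invented for illustration) might look like:

```python
# Hypothetical: derive the preset direction from image attribute metadata.
def choose_direction(attributes: dict) -> str:
    orientation = attributes.get("capture_orientation", 0)  # degrees, assumed field
    return ['right', 'down', 'left', 'up'][round(orientation / 90) % 4]
```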
3. The method according to any one of claims 1-2, wherein the extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction comprises:
performing convolution processing on the image to be processed by using non-centralized convolution kernels corresponding to preset directions respectively;
and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction.
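Claim 3 produces one convolution result per preset direction and then combines them. A minimal sketch, reusing non_centralized_conv from the sketch after claim 1 and assuming an unweighted sum as the combination (claim 4 below replaces the sum with learned weights):

```python
DIRECTIONS = ['left', 'right', 'up', 'down']  # assumed set of preset directions

def directional_features(image, kernels):
    # kernels: dict mapping each direction to its non-centralized kernel
    results = [non_centralized_conv(image, kernels[d], d) for d in DIRECTIONS]
    return torch.stack(results, dim=0).sum(dim=0)  # plain unweighted fusion
```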
4. The method according to claim 3, wherein the obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction comprises:
acquiring a weight value corresponding to each preset direction according to the image to be processed;
and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction and the weight value.
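Claim 4 obtains a weight per direction "according to the image to be processed" without saying how. One assumed realization is a small head (global average pooling plus a linear layer, both invented here) that emits one scalar per direction and uses it to weight the per-direction convolution results:

```python
import torch.nn as nn

class WeightedDirectionalFusion(nn.Module):
    """Assumed weighting head: global average pooling + linear layer."""
    def __init__(self, in_channels, n_directions=4):
        super().__init__()
        self.weight_head = nn.Linear(in_channels, n_directions)

    def forward(self, image, conv_results):
        # conv_results: list of per-direction tensors, each (N, C, H, W)
        pooled = image.mean(dim=(2, 3))             # (N, in_channels)
        w = self.weight_head(pooled)                # (N, n_directions)
        stacked = torch.stack(conv_results, dim=1)  # (N, D, C, H, W)
        return (w[:, :, None, None, None] * stacked).sum(dim=1)
```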
5. The method of claim 1, wherein the extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction comprises:
inputting the image to be processed into a target detection model;
and processing the image to be processed by a feature extraction network in the target detection model so as to output the image features of the image to be processed.
6. The method according to any one of claims 1-5, wherein the obtaining of the target detection result of the image to be processed according to the image feature comprises:
inputting the image features into a target detection model;
and processing the image characteristics by an output network in the target detection model to output a target detection result of the image to be processed.
7. A method of training an object detection model, comprising:
acquiring a training set, wherein the training set comprises a plurality of sample images and target labeling results of the plurality of sample images;
constructing a neural network model comprising an input network, a feature extraction network and an output network, wherein the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network;
and training the neural network model by using the plurality of sample images and the target labeling results of the plurality of sample images to adjust the parameters of each network in the neural network model and the parameters of the non-centralized convolution kernel corresponding to the preset direction to obtain a target detection model.
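The three-part structure in claim 7 (input network, feature extraction network containing the directional target layer, output network) could be wired together as below. Only that split comes from the claim; the layer widths, kernel shapes, and the 1x1 detection head are placeholders, and the earlier sketches supply non_centralized_conv, DIRECTIONS, and WeightedDirectionalFusion:

```python
class TargetDetectionModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.input_net = nn.Conv2d(3, 16, 3, padding=1)      # input network
        # non-centralized kernels registered as parameters, so training
        # adjusts them together with the rest of the network (claim 7)
        self.kernels = nn.ParameterDict({
            d: nn.Parameter(torch.randn(16, 16, 3, 3) * 0.01)
            for d in DIRECTIONS
        })
        self.fusion = WeightedDirectionalFusion(16)          # target layer
        self.output_net = nn.Conv2d(16, num_classes + 4, 1)  # class + box head

    def forward(self, x):
        x = self.input_net(x)
        results = [non_centralized_conv(x, self.kernels[d], d)
                   for d in DIRECTIONS]
        return self.output_net(self.fusion(x, results))
```

Under these placeholder choices, TargetDetectionModel(num_classes=3) applied to a (1, 3, 64, 64) tensor yields a (1, 7, 64, 64) map of raw per-pixel predictions.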
8. The method of claim 7, wherein the obtaining, by the at least one target feature extraction layer, the image feature according to the image feature output by the previous layer and the non-centralized convolution kernel corresponding to the preset direction comprises:
obtaining a first image characteristic according to the image characteristic output by the previous layer;
performing convolution processing on the first image characteristics by using non-centralized convolution kernels corresponding to preset directions respectively;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction.
9. The method according to claim 8, wherein the obtaining the image feature according to the convolution result corresponding to each preset direction comprises:
acquiring a weight value corresponding to each preset direction according to the first image characteristics;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction and the weight value.
10. The method according to claim 9, wherein the obtaining the image feature according to the convolution result corresponding to each preset direction and the weight value comprises:
regularizing the weight value corresponding to each preset direction to obtain a regularization result of the weight value corresponding to each preset direction;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction and the weight value regularization result.
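Claim 10 says only that the weight values are regularized before use. A softmax over the direction axis is one assumed choice, forcing the weights to be non-negative and to sum to one per image:

```python
def fuse_with_regularized_weights(conv_results, raw_weights):
    w = torch.softmax(raw_weights, dim=1)       # assumed regularization
    stacked = torch.stack(conv_results, dim=1)  # (N, D, C, H, W)
    return (w[:, :, None, None, None] * stacked).sum(dim=1)
```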
11. The method of claim 7, wherein the training the neural network model using the plurality of sample images and the target labeling results of the plurality of sample images comprises:
respectively inputting the plurality of sample images into the neural network model, and acquiring a target prediction result output by the neural network model for each sample image;
calculating a loss function value according to the target labeling results and the target prediction results of the plurality of sample images;
and adjusting parameters of each network in the neural network model and parameters of the non-centralized convolution kernel corresponding to the preset direction according to the loss function value until the loss function value is converged to obtain the target detection model.
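Claim 11 describes a standard train-until-convergence loop: predict, compute a loss against the labels, and adjust both the network parameters and the non-centralized kernels. A sketch with a placeholder loss (the patent names none) and a fixed epoch count standing in for the convergence test:

```python
def train(model, loader, epochs=10, lr=1e-3):
    # model.parameters() already includes the non-centralized kernels,
    # so the optimizer updates them alongside the network weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in loader:
            loss = F.mse_loss(model(images), targets)  # placeholder loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```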
12. An object detection device comprising:
the first acquisition unit is used for acquiring an image to be processed;
the processing unit is used for extracting image characteristics from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction;
and the detection unit is used for obtaining a target detection result of the image to be processed according to the image characteristics.
13. The apparatus according to claim 12, wherein the processing unit, when extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction, specifically performs:
determining a preset direction according to the attribute information of the image to be processed;
and extracting image features from the image to be processed by using a non-centralized convolution kernel corresponding to a preset direction.
14. The apparatus according to any one of claims 12-13, wherein the processing unit, when extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction, specifically performs:
performing convolution processing on the image to be processed by using non-centralized convolution kernels corresponding to preset directions respectively;
and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction.
15. The apparatus according to claim 14, wherein the processing unit, when obtaining the image feature of the image to be processed according to the convolution result corresponding to each preset direction, specifically performs:
acquiring a weight value corresponding to each preset direction according to the image to be processed;
and obtaining the image characteristics of the image to be processed according to the convolution result corresponding to each preset direction and the weight value.
16. The apparatus according to claim 12, wherein the processing unit, when extracting image features from the image to be processed using a non-centralized convolution kernel corresponding to a preset direction, specifically performs:
inputting the image to be processed into a target detection model;
and processing the image to be processed by a feature extraction network in the target detection model so as to output the image features of the image to be processed.
17. The apparatus according to any one of claims 12 to 16, wherein the detecting unit, when obtaining the target detection result of the image to be processed according to the image feature, specifically performs:
inputting the image features into a target detection model;
and processing the image characteristics by an output network in the target detection model to output a target detection result of the image to be processed.
18. A training apparatus for an object detection model, comprising:
the second acquisition unit is used for acquiring a training set, wherein the training set comprises a plurality of sample images and target labeling results of the plurality of sample images;
the device comprises a construction unit, a prediction unit and a prediction unit, wherein the construction unit is used for constructing a neural network model comprising an input network, a feature extraction network and an output network, the input network is used for inputting a sample image into the feature extraction network, the feature extraction network comprises at least one target feature extraction layer, the at least one target feature extraction layer is used for obtaining image features according to image features output by a previous layer and a non-centralized convolution kernel corresponding to a preset direction, and the output network is used for obtaining a target prediction result of the sample image according to the image features output by the feature extraction network;
and the training unit is used for training the neural network model by using the plurality of sample images and the target labeling results of the plurality of sample images so as to adjust the parameters of each network in the neural network model and the parameters of the non-centralized convolution kernel corresponding to the preset direction to obtain a target detection model.
19. The apparatus according to claim 18, wherein the at least one target feature extraction layer constructed by the construction unit, when obtaining the image feature according to the image feature output by the previous layer and the non-centralized convolution kernel corresponding to the preset direction, specifically performs:
obtaining a first image characteristic according to the image characteristic output by the previous layer;
performing convolution processing on the first image characteristics by using non-centralized convolution kernels corresponding to preset directions respectively;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction.
20. The apparatus according to claim 19, wherein the at least one target feature extraction layer constructed by the construction unit, when obtaining the image feature according to the convolution result corresponding to each preset direction, specifically performs:
acquiring a weight value corresponding to each preset direction according to the first image characteristics;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction and the weight value.
21. The apparatus according to claim 20, wherein the at least one target feature extraction layer constructed by the construction unit, when obtaining the image feature according to the convolution result corresponding to each preset direction and the weight value, specifically performs:
regularizing the weight value corresponding to each preset direction to obtain a regularization result of the weight value corresponding to each preset direction;
and obtaining the image characteristics according to the convolution result corresponding to each preset direction and the weight value regularization result.
22. The apparatus according to claim 18, wherein the training unit, when training the neural network model using the plurality of sample images and the target labeling results of the plurality of sample images, specifically performs:
respectively inputting the plurality of sample images into the neural network model, and acquiring a target prediction result output by the neural network model for each sample image;
calculating a loss function value according to the target labeling results and the target prediction results of the plurality of sample images;
and adjusting parameters of each network in the neural network model and parameters of the non-centralized convolution kernel corresponding to the preset direction according to the loss function value until the loss function value is converged to obtain the target detection model.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.

Priority Applications (1)

Application Number: CN202210082040.XA | Priority Date: 2022-01-24 | Filing Date: 2022-01-24 | Title: Target detection and target detection model training method and device


Publications (1)

Publication Number: CN114549926A | Publication Date: 2022-05-27

Family

ID: 81671191

Family Applications (1)

Application Number: CN202210082040.XA | Status: Pending | Publication: CN114549926A | Priority Date: 2022-01-24 | Filing Date: 2022-01-24 | Title: Target detection and target detection model training method and device

Country Status (1)

Country: CN | Publication: CN114549926A


Legal Events

Code: PB01 | Event: Publication
Code: SE01 | Event: Entry into force of request for substantive examination