CN115311652A - Object detection method and device, electronic equipment and readable storage medium - Google Patents

Object detection method and device, electronic equipment and readable storage medium

Info

Publication number
CN115311652A
CN115311652A (application CN202210820212.9A)
Authority
CN
China
Prior art keywords
image
sample
images
characteristic
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210820212.9A
Other languages
Chinese (zh)
Inventor
邓嘉新
陈晓炬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Thunder Software Technology Co ltd
Original Assignee
Nanjing Thunder Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Thunder Software Technology Co ltd filed Critical Nanjing Thunder Software Technology Co ltd
Priority to CN202210820212.9A
Publication of CN115311652A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses an object detection method, an object detection device, an electronic device and a readable storage medium, wherein the method comprises the following steps: acquiring a view to be detected, wherein the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting a target object in a surrounding manner; performing feature extraction on the view to be detected to obtain a feature image of each image; fusing at least the feature images of the two connected images based on an attention weight matrix to obtain a target feature image, the attention weight matrix being used for representing the degree of attention of the image on the left side of the two images to the image on the right side; and detecting object information of the target object according to the target feature image, wherein the object information comprises: category information of the target object, and/or position information of the target object in the view to be detected. According to the embodiment of the application, the target object can be detected accurately.

Description

Object detection method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an object detection method and apparatus, an electronic device, and a readable storage medium.
Background
At present, with the continuous development of artificial intelligence, neural network models are also widely used in product inspection, for example to inspect the products output on a production line and pick out the defective ones, which reduces manual labour and improves detection efficiency.
A product is usually a three-dimensional structure; displacement and rotation inevitably occur during shooting and, under the influence of the shooting environment, the form of the product's defects changes, so that the final detection result is inaccurate.
Disclosure of Invention
The embodiment of the application provides an object detection method, an object detection device, an electronic device and a readable storage medium, which can solve the problem that the accuracy of the current object detection is not high.
In a first aspect, an embodiment of the present application provides an object detection method, where the method includes:
acquiring a view to be detected, wherein the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting a target object in a surrounding manner;
extracting features of the view to be detected to obtain a feature image of each image;
at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used for representing the attention degree of the image positioned at the left side in the two images to the image positioned at the right side;
detecting object information of the target object according to the target characteristic image, wherein the object information comprises: the category information of the target object and/or the position information of the target object in the view to be detected.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the device comprises an acquisition module, wherein the acquisition module is used for acquiring a view to be detected, the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting a target object in a surrounding manner;
the extraction module is used for extracting the features of the view to be detected to obtain a feature image of each image;
the fusion module is used for at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used for representing the attention degree of the image positioned at the left side in the two images to the image positioned at the right side;
a detection module, configured to detect object information of a target object according to a target feature image, where the object information includes: the category information of the target object and/or the position information of the target object in the view to be detected.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, performs the method as in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the method as in the first aspect or any possible implementation manner of the first aspect.
According to the object detection method in the embodiment of the application, a view to be detected is acquired, the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting the target object in a surrounding manner. Feature extraction is performed on the view to be detected to obtain a feature image of each image. At least the feature images of the two connected images are fused, based on an attention weight matrix used for representing the degree of attention of the image on the left side of the two images to the image on the right side, to obtain a target feature image. In this way, the feature images obtained from different view angles are fully utilized for fusion: the feature image of one view angle effectively supplements the feature image of the adjacent view angle connected to it, so that the information contained in the fused target feature image is richer. The object information of the target object is then detected according to the target feature image; the object information can be inferred quickly and accurately, and the detection efficiency and detection precision are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments of the present application are briefly described below; those skilled in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating a training process and an application process of an object detection model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a training method for an object detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model structure provided in an embodiment of the present application;
fig. 4 is a flowchart of an object detection method provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Features of various aspects and exemplary embodiments of the present application will be described in detail below, and in order to make objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by illustrating examples thereof.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
First, technical terms related to embodiments of the present application will be described.
Attention mechanism: a method that automatically assigns attention to different parts or regions of an image based on image features, assigning higher attention to key regions.
The feature extraction network may be a convolutional neural network; its input may be an original image, and its output is a three-dimensional feature matrix of shape M×N×C, where M×N is the number of feature vectors and C is the dimension of each feature vector. Generally, M and N decrease with network depth, and C increases with network depth.
The detection head is a convolutional neural network; its input is a feature matrix of shape M×N×C output by the feature extraction network, and its output is a matrix of size M×N×D, where D = 1 + 4 + NUM_C and NUM_C is the number of defect classes.
Of each D-dimensional vector, 4 components are used for regressing the defect position as a rectangular box, namely the centre x coordinate, the centre y coordinate, the box width and the box height; 1 component indicates whether a defect exists; and the remaining NUM_C components represent the category to which the defect belongs.
A detection model generally comprises a feature extraction network and detection heads; if the feature extraction network extracts N feature matrices, N detection heads are arranged correspondingly (an illustrative sketch of this layout is given after the terms below).
Down-sampling, also called decimation: for a sequence of samples, taking one sample every several samples, so that the new sequence obtained is a down-sampled version of the original sequence.
Over-detection: identifying a normal sample as a defective product.
Missed detection: identifying a defective product as a normal sample.
View: the image taken by a camera at a fixed angle; that is, the images taken by cameras at different angles are different views.
Inflating valve: an automobile tyre valve used for inflating and deflating tyres; it has a copper core inside, wrapped in rubber on the outside, and its shape is mostly cylindrical.
Valve defect: a defect visible to the naked eye on the surface of the valve, caused by external pressure or the production process, with a width and height greater than 1 mm.
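To make the detection-model layout described above more concrete, the following is a minimal sketch, assuming a PyTorch-style implementation that the patent does not prescribe: a 1×1 convolution maps an M×N×C feature map to an M×N×D output with D = 1 + 4 + NUM_C. The channel count and the number of defect classes here are illustrative assumptions.

    import torch
    import torch.nn as nn

    NUM_C = 5  # assumed number of defect classes, for illustration only

    class DetectionHead(nn.Module):
        """Maps a (B, C, M, N) feature map to (B, D, M, N) with D = 1 + 4 + NUM_C."""

        def __init__(self, in_channels: int, num_classes: int = NUM_C):
            super().__init__()
            # 1 "defect present" score + 4 box values (centre x, centre y, width,
            # height) + num_classes category scores, per spatial position
            self.conv = nn.Conv2d(in_channels, 1 + 4 + num_classes, kernel_size=1)

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            return self.conv(feat)

    feat = torch.randn(1, 512, 7, 7)   # a hypothetical M = N = 7, C = 512 feature map
    out = DetectionHead(512)(feat)     # shape (1, 10, 7, 7) when NUM_C = 5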
The object detection method provided in the embodiment of the present application can be applied to at least the following application scenarios, which are explained below.
With the continuous development of artificial intelligence, neural network models are also widely used in product inspection, for example to inspect the products output on a production line and pick out the defective ones, which reduces manual labour and improves detection efficiency.
The product is generally a three-dimensional structure. Taking the valve as an example, the valve is cylindrical and its arc-shaped surface scatters light. Multiple cameras are required to shoot it from around, so that the images captured by two adjacent cameras overlap and the areas where scattering occurs can be detected, but this increases the cost of the cameras. Moreover, the images shot by the multiple cameras are not subjected to information fusion processing, so when the angle or the light is disturbed, the form of a defect changes greatly, the detection result is strongly affected, and the probability of over-detection and missed detection increases.
Because the valve inevitably undergoes displacement and rotation during shooting, the form of its defects can change; at the same time, with a certain probability a defect becomes difficult to determine due to environmental influences such as surface dust. The same applies to other products: displacement and rotation inevitably occur during shooting and, under the influence of the shooting environment, the form of the product's defects changes, so that the final detection result is inaccurate.
Based on the application scenario, the object detection method provided in the embodiment of the present application is described in detail below.
The object detection model provided in the embodiments of the present application will be explained in entirety first.
Fig. 1 is a schematic diagram of a training process and an application process of an object detection model according to an embodiment of the present application, and as shown in fig. 1, the training process 110 and the application process 120 are divided.
In the training process 110, a plurality of sample data are obtained; each sample data includes a sample image 111 and preset sample information 115, and the sample image 111 at least includes a first sample image and a second sample image, where the first sample image and the second sample image are two connected sample images among a plurality of sample images obtained by shooting a sample object in a surrounding manner. The sample image 111 is input into a preset model 112, and feature extraction is performed on the sample image 111 to obtain a sample feature image of each sample image 111. The two sample feature images are fused, based on a sample attention weight matrix determined from the sample feature image of each sample image, to obtain a target sample feature image 113; the feature images obtained from different view angles are thus fully utilized for fusion, and the sample feature image of each view is supplemented with information from the sample feature image of the connected view.
Then, sample object information 114 of the sample object is detected from the target sample feature image, the sample object information 114 including: category information of the sample object, and/or position information of the sample object in the sample image; according to the sample object information 114 and the preset sample information 115, the preset model 112 is trained, so that the detection capability of the model can be continuously improved until the preset model meets the preset training condition, and the object detection model 122 is obtained.
In the application process 120, a view 121 to be detected is obtained, where the view 121 to be detected at least includes a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting a target object in a surrounding manner. The view 121 to be detected is input into the trained object detection model 122, and feature extraction is performed on the view 121 to be detected to obtain a feature image of each image. At least the feature images of the two connected images are fused based on the attention weight matrix to obtain a target feature image 123; the attention weight matrix is used to characterize the degree of attention of the image on the left side of the two images to the image on the right side. Here, the feature images obtained from different view angles are fully utilized for fusion, and the feature image of one view angle can effectively supplement the feature image of the adjacent view angle connected to it. Finally, the object information 124 of the target object is detected from the target feature image 123. The trained object detection model 122 has good inference capability for features that are difficult to identify or determine, so the detection efficiency and detection accuracy can be effectively improved.
The training method and the object determination method of the object detection model provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Fig. 2 is a flowchart of a training method of an object detection model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method of the object detection model may include steps 210 to 240, which are specifically as follows:
step 210, obtaining a plurality of sample data, where each sample data includes a sample image and preset sample information, and the sample image at least includes: the image processing device comprises a first sample image and a second sample image, wherein the sample images are two connected sample images in a plurality of sample images obtained by shooting sample objects in a surrounding mode.
Step 220, inputting the sample image into a preset model, and detecting sample object information of the sample object, wherein the sample object information includes: category information of the sample object, and/or position information of the sample object in the sample image.
And step 230, training a preset model according to the sample object information and the preset sample information until the preset model meets preset training conditions to obtain an object detection model.
The contents of steps 210-230 are described below:
step 210 is involved.
Obtaining a plurality of sample data, wherein each sample data comprises a sample image and preset sample information, and the sample image at least comprises a first sample image and a second sample image, the sample images being two connected sample images among a plurality of sample images obtained by shooting a sample object in a surrounding manner.
That the sample images are two connected sample images among a plurality of sample images obtained by shooting the sample object in a surrounding manner means that the first sample image and the second sample image included in the sample data can overlap in at least part of their image areas.
Wherein the preset sample information may include: category information of the pre-marked sample object, and/or position information of the pre-marked sample object in the sample image.
Step 220 is involved.
Inputting a sample image into a preset model, and detecting sample object information of a sample object, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image.
In one possible embodiment, the preset model comprises: an initial feature extraction network, an initial attention network and an initial detection network; in step 220, the following steps may be specifically included:
inputting the sample images into a preset model, and performing feature extraction on the sample images through an initial feature extraction network to obtain a sample feature image of each sample image;
inputting the sample characteristic image of each sample image into an initial attention network, and fusing the two sample images based on a sample attention weight matrix to obtain a target sample characteristic image; the sample attention weight matrix is used for representing the attention degree of the sample image positioned on the left side in the two sample images to the sample image positioned on the right side;
inputting the target sample characteristic image into an initial detection network, and detecting sample object information of a sample object according to the target sample characteristic image, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image.
Specifically, as shown in fig. 3, first, a sample image is input to a preset model, and feature extraction is performed on the sample image through an initial feature extraction network, so as to obtain a sample feature image of each sample image, that is, a feature map in fig. 3.
The initial feature extraction network may adopt the feature extraction network ResNet18 to extract image features by using a convolutional neural network to obtain a sample feature image of each sample image.
The feature extraction network ResNet18 is a convolutional neural network comprising a plurality of groups of convolution blocks; after each group of convolution blocks, the obtained feature map is down-sampled by a factor of 2 to obtain a new sample feature image, and the total down-sampling factor can reach 32.
After each sample image is subjected to feature extraction by ResNet18, a sample feature image with the size of (M, N, C) is obtained.
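As an illustration of the 32× down-sampling described above, the following is a minimal sketch that builds a ResNet18-based feature extractor with torchvision; this is one possible implementation assumed for illustration, not code taken from the patent.

    import torch
    import torchvision

    # ResNet18 backbone without the average-pooling and fully connected layers,
    # so the output is the last convolutional feature map
    backbone = torchvision.models.resnet18(weights=None)
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

    image = torch.randn(1, 3, 224, 224)     # one sample view (batch size 1)
    feature_map = feature_extractor(image)  # (1, 512, 7, 7): 224 / 32 = 7
    C, M, N = feature_map.shape[1:]         # the sample feature image has size (M, N, C)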
Inputting the sample feature image of each sample image into an initial attention network, and fusing the two sample feature images based on a sample attention weight matrix to obtain a target sample feature image, namely a feature image after information fusion; the sample attention weight matrix is used to characterize the degree of attention of the sample image on the left side of the two sample images to the sample image on the right side.
In the initial attention network, a sample attention weight matrix used for representing the attention degree of a sample image positioned on the left side in the two sample images to a sample image on the right side is determined according to the two sample images, and then the two sample images are fused based on the sample attention weight matrix to obtain a target sample characteristic image.
Then, inputting the target sample characteristic image into an initial detection network, namely a detection head, and detecting sample object information of the sample object according to the target sample characteristic image, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image. The sample object information of the sample object can be detected by the full-connection module.
The full-connection module is a neural network consisting of two fully connected layers with an activation layer in between; its input and output shapes are the same. The information-fused feature map is input into the full-connection module, which outputs a final feature map, and the shape of the feature map is restored from (M×N, C) to the original shape (M, N, C). The feature map with its shape restored is then input into the detection head, which performs regression of the defect positions and classification of the defect types.
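A hedged sketch of the full-connection module described above follows: two linear layers with an activation in between, applied to the flattened (M×N, C) fused feature map and then reshaped back to (M, N, C). The layer sizes and the choice of activation are assumptions.

    import torch
    import torch.nn as nn

    class FullConnectionModule(nn.Module):
        """Two fully connected layers with an activation in between; input and output shapes match."""

        def __init__(self, channels: int):
            super().__init__()
            self.fc1 = nn.Linear(channels, channels)
            self.act = nn.ReLU()
            self.fc2 = nn.Linear(channels, channels)

        def forward(self, fused: torch.Tensor, M: int, N: int) -> torch.Tensor:
            # fused: (M*N, C) information-fused feature map
            out = self.fc2(self.act(self.fc1(fused)))
            # restore the original spatial shape (M, N, C) before the detection head
            return out.reshape(M, N, -1)

    fused = torch.randn(7 * 7, 512)                    # flattened fused feature map
    restored = FullConnectionModule(512)(fused, 7, 7)  # (7, 7, 512)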
Step 230 is involved.
And step 230, training a preset model according to the sample object information and the preset sample information until the preset model meets preset training conditions to obtain an object detection model.
Specifically, a loss value can be determined according to the sample object information and the preset sample information, and the preset model is trained according to the loss value until the preset model meets a preset training condition, for example, the loss value of the preset model meets a preset convergence condition, or the loss value is smaller than a preset threshold value, so that a trained object detection model is obtained.
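The stopping criterion described above (train until the loss converges or falls below a preset threshold) can be sketched as follows; the stand-in model, loss function, optimizer and threshold are illustrative assumptions rather than the patent's actual configuration.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 5)          # stand-in for the preset model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()            # assumed loss between sample object info and preset info
    threshold = 1e-3                  # assumed preset threshold

    inputs = torch.randn(16, 10)      # stand-in for sample images
    targets = torch.randn(16, 5)      # stand-in for preset sample information

    for step in range(10000):
        predictions = model(inputs)   # detected sample object information
        loss = loss_fn(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < threshold:   # preset training condition met
            break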
According to the training method of the object detection model, a plurality of sample data are obtained; each sample data comprises a sample image and preset sample information, and the sample images are two connected sample images among a plurality of sample images obtained by shooting the sample object in a surrounding manner. The sample images are input into a preset model, and feature extraction is performed on the sample images through an initial feature extraction network to obtain a sample feature image of each sample image. The sample feature image of each sample image is input into an initial attention network, and the two sample feature images are fused based on a sample attention weight matrix to obtain a target sample feature image; the sample feature images obtained from different view angles are thus fully fused, and the sample feature image on one side supplements information to the sample feature image on the other side connected with it. The target sample feature image is input into an initial detection network, and sample object information of the sample object is detected according to the target sample feature image, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image. Finally, the preset model is trained according to the sample object information and the preset sample information, so that the detection capability of the model is continuously improved, until the preset model meets the preset training condition and the object detection model is obtained.
Based on the training method of the object detection model shown in fig. 2, the present application also provides a model construction method, which may specifically include:
and constructing a preset network structure comprising a plurality of levels, wherein the preset network structure comprises a feature extraction network, an attention network and a detection network.
Each level of the preset network structure comprises a feature extraction network for feature extraction; the attention network is used for carrying out fusion processing on the output of the feature extraction network; the detection network is used for calculating sample object information according to the output of the attention network and calculating a loss value according to the sample object information and preset object information;
training a preset network structure according to the loss values to determine parameters of a trained plurality of neural network layers at each of a plurality of levels;
and determining the trained preset network structure as a detection model.
Fig. 4 is a flowchart of an object detection method according to an embodiment of the present application.
As shown in fig. 4, the object detection method may include steps 410 to 440, and the method is applied to an object detection apparatus, and is specifically as follows:
step 410, obtaining a view to be detected, where the view to be detected at least includes a first image and a second image, and the first image and the second image are two connected images in a plurality of images obtained by shooting a target object in a surrounding manner.
And step 420, performing feature extraction on the view to be detected to obtain a feature image of each image.
Step 430, at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used to characterize the degree of attention of the image on the left side of the two images to the image on the right side.
Step 440, detecting object information of the target object according to the target feature image, wherein the object information includes: the category information of the target object and/or the position information of the target object in the view to be detected.
According to the object detection method, a view to be detected is obtained, the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting the target object in a surrounding manner. Feature extraction is performed on the view to be detected to obtain a feature image of each image. At least the feature images of the two connected images are fused, based on an attention weight matrix used for representing the degree of attention of the image on the left side of the two images to the image on the right side, to obtain a target feature image; in this way, the feature images obtained from different view angles are fully utilized for fusion, the feature image of one view angle effectively supplements the feature image of the adjacent view angle connected to it, and the information contained in the fused target feature image is richer. The object information of the target object is detected according to the target feature image; the object information can be inferred quickly and accurately, and the detection efficiency and detection precision are improved.
The contents of step 410 to step 440 are described below:
step 410 is involved.
The method comprises the steps of obtaining a view to be detected, wherein the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images in a plurality of images obtained by shooting a target object in a surrounding mode.
The first image and the second image are two images connected with each other in a plurality of images obtained by imaging the target object in a surrounding manner, and the first image and the second image may overlap at least partial image areas.
Step 420 is involved.
And performing feature extraction on the view to be detected to obtain a feature image of each image.
In a possible embodiment, the view to be detected corresponds to view identification information, and step 420 may specifically include the following steps:
performing feature extraction on a view to be detected to obtain a view feature image;
determining a first projection image according to the view identification information;
determining a second projection image according to the view characteristic image;
and adding the view characteristic image, the first projection image and the second projection image for each view to be detected to obtain a characteristic image of each image.
The view identification information is used to identify a view, for example by numbering each view in the order of the surround shooting, for example: a first image, a second image, ..., an Nth view, and so on, N being an integer greater than 1.
Determining a first projection image according to the view identification information; in the step of determining the second projection image according to the view feature image, the second projection image may be determined by using the following position projection formula:
concat(sin(2πXBᵀ), cos(2πXBᵀ))
wherein X represents the original coordinate vector: if the coordinate is one-dimensional the vector dimension is 1, if it is two-dimensional the vector dimension is 2, and so on. X comes from a feature map of shape (M, N, C).
B is a projection matrix whose elements are drawn from a standard Gaussian distribution; its shape is (Z/2, original coordinate vector dimension), where Z represents the dimension of the projected position vector. concat represents a stitching operation that stitches the two vectors of shape (1, Z/2) together, so the final projected position vector has dimension Z.
The Gaussian distribution, also called the normal distribution, is the most important continuous probability distribution in statistics. Research shows that in the physical sciences and in economics the distribution of large amounts of data generally follows a Gaussian distribution, so when the underlying distribution of the data is unclear, a Gaussian distribution can be used preferentially as an approximate or exact description.
A matrix A is called a projection matrix if it is both a symmetric matrix (symmetric about its main diagonal, i.e. corresponding elements are equal) and an idempotent matrix.
For each view to be detected, the view feature image, the first projection image and the second projection image are added to obtain a new feature image containing position information, namely the feature image corresponding to that view to be detected; its shape is still (M, N, C).
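A minimal sketch of the position projection concat(sin(2πXBᵀ), cos(2πXBᵀ)) described above, with B drawn from a standard Gaussian distribution. In practice B would be sampled once and reused; the dimension Z and the example coordinates are assumptions.

    import torch

    def position_projection(X: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # X: (..., coord_dim) original coordinates; B: (Z/2, coord_dim) Gaussian
        # projection matrix, so the concatenated output has dimension Z
        proj = 2 * torch.pi * X @ B.t()
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

    Z = 512
    B_view = torch.randn(Z // 2, 1)   # projection matrix for the 1-D view identifier
    B_pos = torch.randn(Z // 2, 2)    # projection matrix for 2-D feature-map coordinates

    view_index = torch.tensor([[3.0]])                           # e.g. the 3rd view
    first_projection = position_projection(view_index, B_view)   # (1, Z)

    coords = torch.stack(torch.meshgrid(torch.arange(7.0), torch.arange(7.0),
                                        indexing="ij"), dim=-1).reshape(-1, 2)
    second_projection = position_projection(coords, B_pos)       # (M*N, Z)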
The step of determining the first projection image according to the view identification information may specifically include the following steps:
converting the view identification information into a first position vector based on a preset projection relation;
determining a first projection image according to the first position vector;
the step of determining the second projection image according to the view feature image may specifically include the following steps:
converting the view characteristic image into a second position vector based on a preset projection relation;
a second projection image is determined based on the second position vector.
Converting the view identification information into a first position vector based on a preset projection relation;
a first projection image is determined based on the first position vector.
Specifically, the view identification information, namely the identifier of the view the image belongs to, is converted into a first position vector by using the position projection formula, where the dimension of the first position vector is the same as the vector dimension of the view feature image.
Each view feature image has a view position projection of shape (1, C), which is expanded to (1, 1, C) and copied (broadcast) to the size (M, N, C).
Converting the view characteristic image into a second position vector based on a preset projection relation;
a second projection image is determined based on the second position vector.
The positions of the feature map vectors, namely the 2-dimensional coordinates within the feature map, are converted into a second position vector by using the position projection formula, where the vector dimension of the second position vector is the same as the dimension of the view feature image.
Each view feature image has a second projection image having a shape (M, N, C).
In a possible embodiment, performing feature extraction on the view to be detected to obtain a feature image of each image, includes:
inputting an image to be detected into a pre-trained object detection model, wherein the object detection model comprises a feature extraction network, an attention network and a detection network, and extracting features of the view to be detected through the feature extraction network to obtain a feature image of each image;
based on the attention weight matrix, at least fusing the characteristic images of the two connected images to obtain a target characteristic image, wherein the method comprises the following steps:
inputting the characteristic images of the two connected images into an attention network, and determining an attention weight matrix;
at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image;
detecting object information of a target object according to the target characteristic image, comprising:
and inputting the target characteristic image into a detection network to obtain the object information of the target object.
The pre-trained object detection model can quickly and accurately extract the feature image of each image, and the feature images of the first image and the second image are thus obtained. The attention weight matrix can then be determined quickly and accurately from the feature images of the two connected images.
At least the feature images of the two connected images are fused, based on the attention weight matrix determined from the feature images of the two connected images, to obtain the target feature image. In this way, the feature images obtained from different view angles are fully utilized for fusion, and the feature image of one view angle effectively supplements the feature image of the adjacent view angle connected to it, so that the details in the fused target feature image are richer. Finally, the trained detection network can infer the object information of the target object quickly and accurately from the target feature image, which improves the detection efficiency and detection precision.
Step 430 is involved.
At least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used to characterize the degree of attention of the image on the left side of the two images to the image on the right side.
In a possible embodiment, before step 430, the following steps may be further included:
from the feature images of both images, an attention weight matrix is determined.
Specifically, the attention weight matrix may be determined according to the following Softmax formula and attention weight calculation formula:
softmax(Z)_d = exp(Z_d) / Σ_{j=1}^{C} exp(Z_j)
where Z represents a matrix formed by C vectors, Z_d represents the d-th vector in the Z matrix, and the softmax outputs sum to a scalar y equal to 1.
The output of Softmax is the probability of each class being taken. It maps its inputs to real numbers between 0 and 1 and normalizes them so that they are guaranteed to sum to 1; the sum of the probabilities over the classes is therefore also exactly 1.
The output of the Softmax function is interrelated with the sum of the probabilities always being 1, e.g., 0.04+0.21+0.05+0.70=1.00. So in the softmax function, the probability of one class increases and the probability of the other classes decreases. If the output of the model is mutually exclusive and only one category can be selected, the Softmax function is employed.
Determining an attention weight matrix according to the feature images of the two images, specifically adopting the following attention weight calculation formula:
A = softmax(QKᵀ / √d)
where Q and K represent the matrices formed by the query vectors and the key vectors respectively, each of shape (N, d); d is the dimension of the vectors and N is the number of vectors. After the softmax, an attention weight matrix A of shape (N, N) is obtained; the values in each row of the matrix sum to 1, and the element A_ij represents the attention allocated by the i-th query Q_i to the j-th key K_j.
Wherein Q is the characteristic image of the first image, and K is the characteristic image of the second image.
The query vector Q is used to match the key vectors K of all elements in the sequence: the attention weights are obtained by taking the dot product of the element's Q with the K of every element; the attention weights are then multiplied by the value vectors at the corresponding positions and summed to obtain the self-attention result of the current element. The query vector Q is mainly used for calculating attention for the current element, and the key vector K is mainly used for calculating attention for the other elements.
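Based on the description above, the attention weight computation can be sketched as A = softmax(QKᵀ/√d), where Q is the flattened feature image of the left view and K the flattened feature image of the right view, both of shape (N, d). The scaling by √d and the example sizes are illustrative assumptions.

    import torch

    def attention_weights(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        # A[i, j] is the attention allocated by the i-th query Q_i to the j-th key K_j
        d = Q.shape[-1]
        scores = Q @ K.t() / d ** 0.5         # (N, N) raw attention scores
        return torch.softmax(scores, dim=-1)  # each row sums to 1

    Q = torch.randn(49, 512)     # flattened feature image of the left (query) view
    K = torch.randn(49, 512)     # flattened feature image of the right (key) view
    A = attention_weights(Q, K)  # (49, 49) attention weight matrix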
The step of fusing at least the feature images of the two connected images based on the attention weight matrix to obtain the target feature image may be specifically implemented according to the following formula:
Q̃ = A·K
This step represents extracting information from K for fusion, according to the attention of Q to each K. After fusion, the shape of Q̃ is the same as that of the original Q.
Here, Q̃ is the target feature image, K is the feature image of the second image, and A is the first attention weight matrix.
The feature map containing the position information is flattened, that is, its shape is changed to (M×N, C); the attention weight matrix between each feature image and the adjacent view is calculated; and information fusion is then performed based on the attention weights to update the feature map, thus obtaining the target feature image.
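Putting the two steps together, the fusion described above can be sketched as follows: flatten each (M, N, C) feature image containing position information to (M×N, C), compute the attention weight matrix against the adjacent view, and take Q̃ = A·K as the fused feature. This is a sketch under the same assumptions as above, not the patent's implementation.

    import torch

    def fuse_with_neighbour(feat_left: torch.Tensor, feat_right: torch.Tensor) -> torch.Tensor:
        # feat_left / feat_right: (M, N, C) feature images of two adjacent views
        M, N, C = feat_left.shape
        Q = feat_left.reshape(M * N, C)                  # flattened left-view features
        K = feat_right.reshape(M * N, C)                 # flattened right-view features
        A = torch.softmax(Q @ K.t() / C ** 0.5, dim=-1)  # attention weight matrix (M*N, M*N)
        fused = A @ K                                    # Q_tilde, same shape as Q
        return fused.reshape(M, N, C)                    # restore the spatial shape

    left = torch.randn(7, 7, 512)                        # feature image of the left view
    right = torch.randn(7, 7, 512)                       # feature image of the adjacent right view
    target_feature = fuse_with_neighbour(left, right)    # (7, 7, 512) target feature image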
In a possible embodiment, the aforementioned view to be detected further includes a third image connected to the other side of the first image or the second image, and the step of determining the attention weight matrix according to the feature images of the two images may specifically include the following steps:
determining a first attention weight matrix according to the characteristic image of the first image and the characteristic image of the second image; and (c) a second step of,
determining a second attention weight matrix according to the characteristic image of the third image and the characteristic image of the image connected with the third image;
based on the attention weight matrix, at least fusing the characteristic images of the two connected images to obtain a target characteristic image, wherein the method comprises the following steps:
fusing the characteristic image of the first image and the characteristic image of the second image based on the first attention weight matrix to obtain a fused characteristic image;
and fusing the characteristic image of the third image and the fused characteristic image based on the second attention weight matrix to obtain a target characteristic image.
The views to be detected are obtained by shooting the target object in a surrounding manner in sequence. The views to be detected also include a third image that borders on the other side of the first image or the second image. That is, if the third image is contiguous with the first image, the shooting sequence may be: the third image, the first image, the second image; or the second image, the first image, the third image. If the third image is contiguous with the second image, the shooting sequence may be: the first image, the second image, the third image; or the third image, the second image, the first image.
The step of determining the attention weight matrix according to the feature images of the two images may specifically include the following steps: determining a first attention weight matrix according to the characteristic image of the first image and the characteristic image of the second image; and determining a second attention weight matrix according to the characteristic image of the third image and the characteristic image of the image connected with the third image.
Take the shooting sequence as the second image, the first image, and the third image as an example. The first image and the second image are contiguous, as are the first image and the third image. Determining a first attention weight matrix according to the first image and the second image respectively; and determining a second attention weight matrix from the first image and the third image.
Correspondingly, the step of fusing at least the feature images of the two connected images based on the attention weight matrix to obtain the target feature image may specifically include the following steps:
fusing the characteristic image of the first image and the characteristic image of the second image based on the first attention weight matrix to obtain a fused characteristic image;
and fusing the characteristic image of the third image and the fused characteristic image based on the second attention weight matrix to obtain a target characteristic image.
Specifically, for each feature image, information of a right feature image adjacent to the feature image may be fused first, and then information of a left feature image adjacent to the feature image may be fused, so that information fusion of a plurality of feature images is achieved.
Therefore, by making full use of the feature images obtained from different view angles for fusion, the feature image of the right-hand view can effectively supplement the feature image of the view connected to it, yielding a fused feature image. The fused feature image is then effectively supplemented by the feature image of the left-hand view to obtain the target feature image, so that the information contained in the fused target feature image is richer, which facilitates the subsequent detection of the object information of the target object from the target feature image.
Step 440, detecting object information of the target object according to the target feature image, wherein the object information includes: category information of the object, and/or position information of the object in the view to be detected.
According to the object detection method, a view to be detected is obtained, the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images among a plurality of images obtained by shooting the target object in a surrounding manner. Feature extraction is performed on the view to be detected to obtain a feature image of each image. At least the feature images of the two connected images are fused, based on an attention weight matrix used for representing the degree of attention of the image on the left side of the two images to the image on the right side, to obtain a target feature image; the feature images obtained from different view angles are thus fully utilized for fusion, the feature image of one view angle effectively supplements the feature image of the adjacent view angle connected to it, and the information contained in the fused target feature image is richer. The object information of the target object is detected according to the target feature image; the object information can be inferred quickly and accurately, and the detection efficiency and detection precision are improved.
Based on the object detection method shown in fig. 4, an embodiment of the present application further provides an object detection apparatus, as shown in fig. 5, where the apparatus 500 may include:
the acquiring module 510 is configured to acquire a view to be detected, where the view to be detected at least includes a first image and a second image, and the first image and the second image are two connected images of a plurality of images obtained by shooting a target object in a surrounding manner.
The extracting module 520 is configured to perform feature extraction on the view to be detected to obtain a feature image of each image.
A fusion module 530, configured to fuse at least the feature images of the two connected images based on the attention weight matrix to obtain a target feature image; the attention weight matrix is used to characterize the degree of attention of the image on the left side of the two images to the image on the right side.
A detecting module 540, configured to detect object information of the target object according to the target feature image, where the object information includes: the category information of the target object, and/or the position information of the target object in the view to be detected.
In one possible implementation, the apparatus 500 may further include:
the first determining module is used for determining an attention weight matrix according to the characteristic images of the two images.
In a possible implementation manner, the view to be detected further includes a third image connected to the other side of the first image or the second image, and the first determining module is specifically configured to:
determining a first attention weight matrix according to the characteristic image of the first image and the characteristic image of the second image; and (c) a second step of,
a second attention weight matrix is determined from the feature image of the third image and the feature images of the images bordering the third image.
The fusion module 530 is specifically configured to:
fusing the characteristic image of the first image and the characteristic image of the second image based on the first attention weight matrix to obtain a fused characteristic image;
and fusing the characteristic image of the third image and the fused characteristic image based on the second attention weight matrix to obtain a target characteristic image.
In a possible implementation manner, the view to be detected corresponds to view identification information, and the extraction module 520 is specifically configured to:
and performing feature extraction on the view to be detected to obtain a view feature image.
The extracting module 520 may specifically include:
and the second determining module is used for determining the first projection image according to the view identification information.
And the third determining module is used for determining a second projection image according to the view characteristic image.
And the adding module is used for adding the view characteristic image, the first projection image and the second projection image to each view to be detected to obtain a characteristic image of each image.
In a possible implementation manner, the second determining module is specifically configured to:
converting the view identification information into a first position vector based on a preset projection relation;
a first projection image is determined based on the first position vector.
A third determining module, specifically configured to:
converting the view characteristic image into a second position vector based on a preset projection relation;
a second projection image is determined based on the second position vector.
In a possible implementation manner, the extracting module 520 is specifically configured to:
inputting the image to be detected into a pre-trained object detection model, wherein the object detection model comprises a feature extraction network, an attention network and a detection network, and extracting features of the view to be detected through the feature extraction network to obtain a feature image of each image.
The fusion module 530 is specifically configured to:
inputting the characteristic images of the two connected images into an attention network, and determining an attention weight matrix;
and at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image.
The detecting module 540 is specifically configured to input the target feature image to a detection network, so as to obtain object information of the target object.
In one possible implementation, the apparatus 500 may further include:
the first acquisition module is used for acquiring a plurality of sample data, each sample data comprises a sample image and preset sample information, and the sample image at least comprises: the image processing device comprises a first sample image and a second sample image, wherein the sample images are two connected sample images in a plurality of sample images obtained by shooting sample objects in a surrounding mode.
The input module is used for inputting the sample image into a preset model and detecting sample object information of the sample object, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image.
And the training module is used for training the preset model according to the sample object information and the preset sample information until the preset model meets the preset training condition, so as to obtain the object detection model.
In one possible implementation, the preset model includes: an initial feature extraction network, an initial attention network and an initial detection network; the input module is specifically configured to:
inputting the sample images into a preset model, and performing feature extraction on the sample images through an initial feature extraction network to obtain a sample feature image of each sample image;
inputting the sample characteristic image of each sample image into an initial attention network, and fusing the two sample images based on a sample attention weight matrix to obtain a target sample characteristic image; the sample attention weight matrix is used for representing the attention degree of the sample image positioned on the left side in the two sample images to the sample image positioned on the right side;
inputting the target sample characteristic image into an initial detection network, and detecting sample object information of a sample object according to the target sample characteristic image, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image.
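Under the same assumptions as above, one training iteration of the preset model might look like the sketch below; the particular loss terms (cross-entropy for the category information, L1 for the position information) are illustrative choices, not losses specified by the patent.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_images, gt_classes, gt_boxes):
    """One illustrative training iteration for the preset model."""
    model.train()
    optimizer.zero_grad()
    # Forward pass through the initial feature extraction, attention and detection networks.
    pred = model(sample_images)
    cls_loss = F.cross_entropy(pred["classes"], gt_classes)   # category information
    box_loss = F.l1_loss(pred["boxes"], gt_boxes)              # position information
    loss = cls_loss + box_loss
    loss.backward()
    optimizer.step()
    return loss.item()

The preset training condition mentioned above would then be checked between iterations, for example a loss threshold or a fixed number of epochs (both assumptions).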
In the object detection device, the acquired view to be detected comprises at least a first image and a second image, the first image and the second image being two connected images among a plurality of images obtained by shooting the target object in a surrounding manner. Feature extraction is performed on the view to be detected to obtain a feature image of each image. The feature images of at least the two connected images are fused, based on an attention weight matrix representing the degree of attention that the image on the left side of the two images pays to the image on the right side, to obtain a target feature image. In this way the feature images obtained from different viewing angles are fully utilized in the fusion: the feature image at one viewing angle effectively supplements the feature image at the adjacent viewing angle, so that the fused target feature image contains richer information. The object information of the target object is then detected from the target feature image, so that it can be inferred quickly and accurately, improving both detection efficiency and detection precision.
Fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
The electronic device may comprise a processor 601 and a memory 602 in which computer program instructions are stored.
Specifically, the processor 601 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 602 may include mass storage for data or instructions. By way of example and not limitation, the memory 602 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 602 may include removable or non-removable (or fixed) media, where appropriate. The memory 602 may be internal or external to the electronic device, where appropriate. In a particular embodiment, the memory 602 is a non-volatile solid-state memory. In certain embodiments, the memory 602 comprises read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 601 implements any one of the object detection methods in the illustrated embodiments by reading and executing computer program instructions stored in the memory 602.
In one example, the electronic device may also include a communication interface 603 and a bus 610. As shown in fig. 6, the processor 601, the memory 602, and the communication interface 603 are connected via a bus 610 to complete communication therebetween.
The communication interface 603 is mainly used for implementing communication between modules, apparatuses, units and/or devices in this embodiment.
The bus 610 includes hardware, software, or both, coupling the components of the electronic device to one another. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Bus 610 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device may execute the object detection method in the embodiment of the present application, thereby implementing the object detection method described in conjunction with fig. 1 to 4.
In addition, in combination with the object detection method in the foregoing embodiments, embodiments of the present application may provide a computer-readable storage medium. The computer-readable storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the object detection method described in conjunction with figs. 1 to 4.
It is to be understood that the present application is not limited to the particular arrangements and instrumentalities described above and shown in the attached drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions, or change the order between the steps, after comprehending the spirit of the present application.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an Erasable ROM (EROM), a floppy disk, a CD-ROM, an optical disk, a hard disk, an optical fiber medium, a Radio Frequency (RF) link, and so forth. The code segments may be downloaded via computer networks such as the internet, intranets, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed at the same time.
As described above, only the specific embodiments of the present application are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application.

Claims (11)

1. An object detection method, characterized in that the method comprises:
acquiring a view to be detected, wherein the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images in a plurality of images obtained by shooting a target object in a surrounding manner;
extracting the features of the view to be detected to obtain a feature image of each image;
at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used for representing the attention degree of the image positioned on the left side in the two images to the image positioned on the right side;
detecting object information of the target object according to the target feature image, wherein the object information comprises: the type information of the target object, and/or the position information of the target object in the view to be detected.
2. The method according to claim 1, wherein before the fusing at least the feature images of the two connected images based on the attention weight matrix to obtain the target feature image, the method further comprises:
and determining the attention weight matrix according to the characteristic images of the two images.
3. The method according to claim 2, wherein the view to be detected further comprises a third image bordering the other side of the first or second image, and wherein determining the attention weight matrix from the feature images of the two images comprises:
determining a first attention weight matrix according to the characteristic image of the first image and the characteristic image of the second image; and
determining a second attention weight matrix according to the characteristic image of the third image and the characteristic image of the image connected with the third image;
the fusing at least the feature images of the two connected images based on the attention weight matrix to obtain a target feature image, including:
fusing the characteristic image of the first image and the characteristic image of the second image based on the first attention weight matrix to obtain a fused characteristic image;
and fusing the characteristic image of the third image and the fused characteristic image based on the second attention weight matrix to obtain the target characteristic image.
4. The method according to claim 1, wherein the view to be detected corresponds to view identification information, and the extracting the features of the view to be detected to obtain the feature image of each image comprises:
extracting features of the view to be detected to obtain a view feature image;
determining a first projection image according to the view identification information;
determining a second projection image according to the view characteristic image;
and for each view to be detected, adding the view characteristic image, the first projection image and the second projection image to obtain a characteristic image of each image.
5. The method of claim 4, wherein determining a first projection image based on the view identification information comprises:
converting the view identification information into a first position vector based on a preset projection relation;
determining the first projection image according to the first position vector;
determining a second projection image according to the view characteristic image, including:
converting the view characteristic image into a second position vector based on the preset projection relation;
determining the second projection image according to the second position vector.
6. The method according to any one of claims 1 to 5, wherein the extracting features of the view to be detected to obtain a feature image of each image comprises:
inputting the view to be detected into a pre-trained object detection model, wherein the object detection model comprises a feature extraction network, an attention network and a detection network, and extracting features of the view to be detected through the feature extraction network to obtain the feature image of each image;
the fusion of the feature images of at least the two connected images based on the attention weight matrix to obtain a target feature image comprises:
inputting the characteristic images of the two connected images into the attention network, and determining the attention weight matrix;
at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain the target characteristic image;
the detecting object information of the target object according to the target feature image includes:
and inputting the target characteristic image to the detection network to obtain the object information of the target object.
7. The method according to claim 6, wherein before the feature extraction of the view to be detected to obtain the feature image of each image, the method further comprises:
obtaining a plurality of sample data, wherein each sample data comprises a sample image and preset sample information, and the sample image at least comprises: a first sample image and a second sample image, wherein the first sample image and the second sample image are two connected sample images among a plurality of sample images obtained by shooting a sample object in a surrounding manner;
inputting the sample image into a preset model, and detecting sample object information of the sample object, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image;
and training the preset model according to the sample object information and the preset sample information until the preset model meets preset training conditions to obtain the object detection model.
8. The method of claim 7, wherein the pre-set model comprises: an initial feature extraction network, an initial attention network and an initial detection network; the inputting the sample image into a preset model, and detecting sample object information of the sample object, includes:
inputting the sample images into the preset model, and performing feature extraction on the sample images through the initial feature extraction network to obtain a sample feature image of each sample image;
inputting the sample characteristic image of each sample image into the initial attention network, and fusing the two sample images based on a sample attention weight matrix to obtain a target sample characteristic image; the sample attention weight matrix is used for representing the attention degree of the sample image positioned on the left side in the two sample images to the sample image positioned on the right side;
inputting the target sample characteristic image into the initial detection network, and detecting sample object information of the sample object according to the target sample characteristic image, wherein the sample object information comprises: category information of the sample object, and/or position information of the sample object in the sample image.
9. An object detection apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a view to be detected, the view to be detected at least comprises a first image and a second image, and the first image and the second image are two connected images in a plurality of images obtained by shooting a target object in a surrounding manner;
the extraction module is used for extracting the characteristics of the view to be detected to obtain a characteristic image of each image;
the fusion module is used for at least fusing the characteristic images of the two connected images based on the attention weight matrix to obtain a target characteristic image; the attention weight matrix is used for representing the attention degree of the image positioned on the left side in the two images to the image positioned on the right side;
a detection module, configured to detect object information of the target object according to the target feature image, where the object information includes: the type information of the target object, and/or the position information of the target object in the view to be detected.
10. An electronic device, characterized in that the device comprises: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements an object detection method as claimed in any of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the object detection method of any one of claims 1 to 8.
CN202210820212.9A 2022-07-13 2022-07-13 Object detection method and device, electronic equipment and readable storage medium Pending CN115311652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210820212.9A CN115311652A (en) 2022-07-13 2022-07-13 Object detection method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210820212.9A CN115311652A (en) 2022-07-13 2022-07-13 Object detection method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115311652A true CN115311652A (en) 2022-11-08

Family

ID=83856054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210820212.9A Pending CN115311652A (en) 2022-07-13 2022-07-13 Object detection method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115311652A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630725A (en) * 2023-07-24 2023-08-22 广东北斗翔晨科技有限公司 Multi-dimensional screening-based garbage classification processing method, device, equipment and medium
CN116630725B (en) * 2023-07-24 2023-10-31 广东北斗翔晨科技有限公司 Multi-dimensional screening-based garbage classification processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN104700099B (en) The method and apparatus for recognizing traffic sign
CN112070807B (en) Multi-target tracking method and electronic device
CN110781756A (en) Urban road extraction method and device based on remote sensing image
CN110263920B (en) Convolutional neural network model, training method and device thereof, and routing inspection method and device thereof
CN111753643B (en) Character gesture recognition method, character gesture recognition device, computer device and storage medium
CN112733885A (en) Point cloud identification model determining method and point cloud identification method and device
CN112001403B (en) Image contour detection method and system
CN107220962B (en) Image detection method and device for tunnel cracks
CN111008576B (en) Pedestrian detection and model training method, device and readable storage medium
CN113012157B (en) Visual detection method and system for equipment defects
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN114140683A (en) Aerial image target detection method, equipment and medium
CN110796230A (en) Method, equipment and storage medium for training and using convolutional neural network
CN114049356A (en) Method, device and system for detecting structure apparent crack
CN115527050A (en) Image feature matching method, computer device and readable storage medium
CN115311652A (en) Object detection method and device, electronic equipment and readable storage medium
CN113537163B (en) Model training method and system for parking space detection
CN112149713A (en) Method and device for detecting insulator image based on insulator image detection model
CN112949785B (en) Object detection method, device, equipment and computer storage medium
CN117242489A (en) Target tracking method and device, electronic equipment and computer readable medium
CN111325709A (en) Wireless capsule endoscope image detection system and detection method
CN110119736B (en) License plate position identification method and device and electronic equipment
CN111582057B (en) Face verification method based on local receptive field
CN113240744A (en) Image processing method and device
CN114611635B (en) Object identification method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination