CN116246254A - Target object identification method and device, electronic equipment and storage medium - Google Patents

Target object identification method and device, electronic equipment and storage medium

Info

Publication number
CN116246254A
CN116246254A (application CN202310260940.3A)
Authority
CN
China
Prior art keywords
matrix
image
layer
feature
decomposition
Prior art date
Legal status
Pending
Application number
CN202310260940.3A
Other languages
Chinese (zh)
Inventor
李宁
朱磊
贾双成
万如
Current Assignee
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202310260940.3A
Publication of CN116246254A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/54 Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of vehicle lights or traffic lights
    • G06V 20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the field of data identification and discloses a target object identification method and apparatus, an electronic device, and a storage medium. The method comprises: inputting an image to be identified that contains a target object into a target recognition model to obtain a result image. The target recognition model comprises an encoder, a decomposer and a decoder: the encoder comprises a plurality of convolution layers, the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers. Each deconvolution layer fuses the image it receives, the first feature map output by the convolution layer of the current level, and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and passes that second feature map to the next deconvolution layer. The scheme provided by the application performs a comprehensive analysis of multi-level features through the target recognition model, obtains more semantic features through feature decomposition and fusion, improves recognition accuracy, and addresses the low recognition accuracy of conventional recognition models.

Description

Target object identification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data identification technologies, and in particular, to a method and apparatus for identifying a target object, an electronic device, and a storage medium.
Background
In the field of data recognition, objects frequently need to be recognized in images. In an autonomous driving scenario in particular, the accuracy of target object recognition directly affects the safety of autonomous driving.
In the related art, the task of identifying the target object is generally performed by a machine learning algorithm, for example a conventional recognition model built on a deep learning network such as a CNN (Convolutional Neural Network) or an FCN (Fully Convolutional Network). When such a conventional recognition model is used, the structure of the deep learning network limits the recognition accuracy for target objects in the image; the recognition effect is particularly poor for small targets, and it is difficult to meet the higher-accuracy data recognition requirements of practical applications.
Disclosure of Invention
In order to solve or partially solve the problems in the related art, the application provides a target object identification method and apparatus, an electronic device and a storage medium, which can improve the recognition accuracy for smaller targets by improving the structure of the target recognition model.
The first aspect of the present application provides a method for identifying a target, including:
Inputting an image to be identified containing a target object into a target identification model to obtain a result image output by the target identification model;
the result image comprises an identification result of marking the target object in the image to be identified;
the object recognition model comprises an encoder, a decomposer and a decoder;
the encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting semantic features corresponding to a received image at a current level and outputting a first feature map containing the semantic features to a next convolution layer connected with the current convolution layer and the decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is connected to one convolution layer and one deconvolution layer, performs feature decomposition on the received first feature map to obtain a feature matrix, and outputs the feature matrix to the deconvolution layer connected to the current decomposition layer;
and each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected to the current deconvolution layer.
According to the target object identification method provided by the application, each decomposition layer carries out feature decomposition on the received first feature map specifically through the following process:
determining an image matrix corresponding to the received first feature map;
performing singular value decomposition on the image matrix to obtain a singular value decomposition result;
and determining a feature matrix corresponding to the first feature map according to the singular value decomposition result.
According to the target object identification method provided by the application, the singular value decomposition is performed on the image matrix to obtain a singular value decomposition result, which comprises the following steps:
determining the transpose of the image matrix, and respectively determining a left singular matrix and a right singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix; the left singular matrix comprises a plurality of left singular vectors, and the right singular matrix comprises a plurality of right singular vectors;
determining a singular value matrix corresponding to the image matrix according to the image matrix, the left singular vectors and the right singular vectors;
and taking the left singular matrix, the right singular matrix and the singular value matrix as the singular value decomposition result.
According to the method for identifying a target object provided in the present application, determining, according to the singular value decomposition result, a feature matrix corresponding to the first feature map includes:
and determining a feature matrix corresponding to the first feature map according to at least one of a left singular matrix, a right singular matrix and a singular value matrix in the singular value decomposition result.
According to the target object identification method provided by the application, according to the image matrix and the transpose of the image matrix, determining a left singular matrix corresponding to the image matrix comprises the following steps:
multiplying the image matrix by the transpose of the image matrix to obtain a first matrix;
performing feature decomposition on the first matrix to obtain a plurality of left singular vectors;
and splicing the plurality of left singular vectors to obtain the left singular matrix.
According to the target object identification method provided by the application, the right singular matrix corresponding to the image matrix is determined according to the image matrix and the transpose of the image matrix, and the method comprises the following steps:
multiplying the transpose of the image matrix with the image matrix to obtain a second matrix;
performing feature decomposition on the second matrix to obtain a plurality of right singular vectors;
and splicing the plurality of right singular vectors to obtain the right singular matrix.
According to the target object identification method provided by the application, the deconvolution layer fuses the received image, the first feature map output by the convolution layer corresponding to the current level and the feature matrix output by the connected decomposition layer specifically through the following processes:
determining an image matrix of the received image to obtain a third matrix;
determining an image matrix of a first feature map output by a convolution layer corresponding to the current level to obtain a fourth matrix;
and fusing the third matrix, the fourth matrix and the feature matrix output by the connected decomposition layer.
A second aspect of the present application provides an apparatus for identifying an object, the apparatus comprising:
the recognition module is used for inputting an image to be recognized containing a target object into a target recognition model to obtain a result image output by the target recognition model;
the result image comprises an identification result of marking the target object in the image to be identified;
the object recognition model comprises an encoder, a decomposer and a decoder;
the encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting semantic features corresponding to a received image at a current level and outputting a first feature map containing the semantic features to a next convolution layer connected with the current convolution layer and the decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is connected to one convolution layer and one deconvolution layer, performs feature decomposition on the received first feature map to obtain a feature matrix, and outputs the feature matrix to the deconvolution layer connected to the current decomposition layer;
and each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected to the current deconvolution layer.
A third aspect of the present application provides an electronic device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as described above.
The technical solution provided by the application can include the following beneficial effects:
The encoder in the target recognition model uses multiple convolution layers to extract semantic features at multiple levels; the decomposer decomposes the received first feature maps to obtain feature matrices; and the decoder uses multiple deconvolution layers to fuse the semantic features of the corresponding level with the feature matrix output by the connected decomposition layer. The target recognition model can therefore analyze the multi-level features of the image to be identified comprehensively and obtain more semantic features through feature decomposition and fusion, so that the semantic features in the image to be identified are extracted more fully, smaller targets in the image can be recognized more accurately, and the recognition accuracy of the target object is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is a flow chart illustrating a method for identifying an object according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an improved object recognition model according to an embodiment of the present application;
FIG. 3 is a schematic view of a road image including a guideboard in an embodiment of the present application;
FIG. 4 is a schematic diagram of a guideboard label corresponding to a road image including a guideboard in an embodiment of the present application;
FIG. 5 is a schematic diagram of a guideboard recognition result obtained by a conventional recognition model according to an embodiment of the present application;
FIG. 6 is a second schematic diagram of a guideboard recognition result obtained by a conventional recognition model in an embodiment of the present application;
FIG. 7 is a schematic diagram of a guideboard recognition result obtained by the improved target recognition model according to an embodiment of the present application;
FIG. 8 is a second schematic diagram of a guideboard recognition result obtained by the improved target recognition model in the embodiment of the present application;
fig. 9 is a schematic structural diagram of an identification device of an object according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
This embodiment relates to the field of data identification and can in particular be applied to target object recognition scenarios, such as recognizing target objects in the driving environment during automatic driving. In the related art, because the network structure of the machine learning model used for target object recognition is not reasonable enough, relatively few semantic features can be extracted from the image to be identified; as a result, smaller targets in the image to be identified cannot be recognized effectively and the recognition accuracy is low.
In view of the above problems, embodiments of the present application provide a method for identifying an object, which can improve accuracy of identifying an object with a smaller size by improving a structure of an object identification model.
The following describes in detail the technical solutions of the target object identification method, apparatus, electronic device and storage medium of the embodiments of the present application with reference to fig. 1 to 10.
Fig. 1 is a flow chart of a method for identifying an object according to an embodiment of the present application.
Referring to fig. 1, the method for identifying a target object provided in the embodiment of the present application specifically includes:
step 101: inputting an image to be identified containing a target object into a target identification model to obtain a result image output by the target identification model;
the result image comprises an identification result of marking the target object in the image to be identified;
The object recognition model comprises an encoder, a decomposer and a decoder;
the encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting semantic features corresponding to a received image at a current level and outputting a first feature map containing the semantic features to a next convolution layer connected with the current convolution layer and a decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is connected to one convolution layer and one deconvolution layer, performs feature decomposition on the received first feature map to obtain a feature matrix, and outputs the feature matrix to the deconvolution layer connected to the current decomposition layer;
each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected to the current deconvolution layer.
In this embodiment, the execution subject of the target object identification method may be a processor or a server, for example, in an autopilot scenario, the execution subject of the target object identification method may be a processor disposed on a vehicle or a processor or a server disposed outside the vehicle.
It can be appreciated that the image to be identified may be acquired by an image acquisition device; for example, in an automatic driving scenario, it may be captured by a camera mounted on the vehicle or by roadside devices installed on both sides of the road.
In this embodiment, the identification of the target object may be understood as locating the edge of the target object from the image to be identified, and further determining the outline of the target object, so as to extract the target object from the image to be identified, where the target object may be a marker that is required in a subsequent decision making process, for example, in an automatic driving scene, the target object may be a traffic light, a lane line, or the like.
The result image is an image obtained by marking the position of the target object on the basis of the image to be identified, and the position of the target object, namely the identification result of the target object, can be obtained through the result image.
It should be noted that, the target recognition model is obtained after training the machine learning model based on the image sample to be recognized, and semantic features of related targets are extracted from the image to be recognized in a semantic segmentation manner, so as to realize recognition of the targets.
In this embodiment, the network structure of the target recognition model is improved: by constructing the encoder, the decomposer and the decoder, semantic features can be extracted at different levels, and a feature matrix can additionally be extracted from the first feature map produced by each convolution layer, so that more semantic features are obtained. Feature extraction is therefore more thorough, the recognition accuracy for smaller targets in the image to be identified is higher, and the recognition effect is better.
In some embodiments, the encoder may be constructed by connecting multiple convolution layers in sequence: the input end of each convolution layer is connected to the output end of the previous convolution layer, and the output end of each convolution layer is connected to the input end of the next convolution layer. The semantic features extracted by each convolution layer are thus passed on layer by layer until they reach the last convolution layer, which realizes the function of acquiring the multi-level semantic features of the image to be identified.
In an exemplary embodiment, the encoder may include a first convolution layer, a second convolution layer, a third convolution layer, and a fourth convolution layer, which are sequentially connected;
the first convolution layer is used for extracting semantic features of the image to be identified at a first level, and outputting a first feature map containing the semantic features of the first level to the second convolution layer and the decoder respectively;
the second convolution layer is used for extracting semantic features of the image to be identified at a second level based on the semantic features of the first level, and outputting first feature images containing the semantic features of the second level to the third convolution layer and the decoder respectively;
the third convolution layer is used for extracting semantic features of the image to be identified at a third level based on the semantic features of the second level, and outputting first feature images containing the semantic features of the third level to the fourth convolution layer and the decoder respectively;
The fourth convolution layer is used for extracting semantic features of the image to be identified at a fourth level based on the semantic features of the third level, and outputting a first feature map containing the semantic features of the fourth level to a decoder.
In this embodiment, the encoder includes four convolution layers. Referring to fig. 2, the network architecture of the entire target recognition model can be roughly divided into four base levels, i.e. the four rows of layer structures from top to bottom in fig. 2, which are further divided into three parts: the encoder 201, the decoder 202 and the decomposer 203. The encoder 201 and the decoder 202 may be connected through splicing channels. The network of the target recognition model shown in fig. 2 uses a kernel size of 3×3 with padding of 1 and stride of 1.
As shown in fig. 2, the four convolution layers inside the encoder 201 are connected in sequence and are named according to the direction of data transmission: the first convolution layer, the second convolution layer, the third convolution layer and the fourth convolution layer correspond, from top to bottom, to the rows of the encoder 201 in fig. 2. The semantic features output by each convolution layer in the encoder 201 are carried by a first feature map that represents the semantic information of the current level.
On the one hand, each convolution layer performs a downsampling operation on the input image, which doubles the number of channels and halves the image size (specifically, the length and width of the image); for example, an input image of size 480×800 becomes 240×400 after the downsampling operation. On the other hand, each convolution layer applies a convolution operation, a normalization operation and an activation operation to the input image so as to extract the semantic features of the input image at the current level; extracting semantic features at several levels is thus realized by the multiple convolution layers.
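As an illustration of the behaviour just described, the following PyTorch sketch shows one plausible encoder convolution-layer block; the module name, the channel bookkeeping and the use of max pooling for the downsampling step are assumptions made for the example rather than details taken from the application.

```python
import torch.nn as nn

class EncoderConvLayer(nn.Module):
    """One encoder level: a 3x3 convolution (padding 1, stride 1) that doubles the
    channel count, followed by normalization, activation and 2x downsampling."""
    def __init__(self, in_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(in_channels * 2),  # normalization operation
            nn.ReLU(inplace=True),            # activation operation
            nn.MaxPool2d(kernel_size=2),      # downsampling: halves the length and width
        )

    def forward(self, x):
        # e.g. an input of shape (N, C, 480, 800) becomes (N, 2C, 240, 400)
        return self.block(x)
```

Stacking four such blocks reproduces the four levels of first feature maps that are passed on to the decomposer and the decoder.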
In some embodiments, the decomposer may specifically include multiple decomposition layers, where an input end of each decomposition layer is connected to a convolution layer corresponding to a current level, and an output end of each decomposition layer is connected to a deconvolution layer corresponding to a next level;
each decomposition layer is used for carrying out feature decomposition on the first feature map of the current level to obtain a feature matrix, and inputting the feature matrix into the deconvolution layer corresponding to the next level.
Referring to fig. 2, in the present embodiment the decoder 202 has the same number of levels as the encoder 201. In the scenario shown in fig. 2, the decoder 202 includes four deconvolution layers connected in sequence and named according to the direction of data flow: the first deconvolution layer, the second deconvolution layer, the third deconvolution layer and the fourth deconvolution layer correspond, from bottom to top, to the rows of the decoder 202 in fig. 2.
Referring to fig. 2, three decomposition layers are provided in the present embodiment, and the three decomposition layers are respectively provided at a first level, a second level, and a third level corresponding to the convolution layers, that is, the decomposer 203 in the present embodiment specifically includes a first decomposition layer, a second decomposition layer, and a third decomposition layer;
the input end of the first decomposition layer is connected to the first convolution layer and its output end is connected to the third deconvolution layer; the input end of the second decomposition layer is connected to the second convolution layer and its output end is connected to the second deconvolution layer; the input end of the third decomposition layer is connected to the third convolution layer and its output end is connected to the first deconvolution layer;
the first decomposition layer is used for carrying out feature decomposition on the first feature map of the first level to obtain a feature matrix of the first level, and transmitting the feature matrix of the first level to the third deconvolution layer;
the second decomposition layer is used for carrying out feature decomposition on the first feature map of the second level to obtain a feature matrix of the second level, and transmitting the feature matrix of the second level to the second deconvolution layer;
the third decomposition layer is used for carrying out feature decomposition on the first feature map of the third level to obtain a feature matrix of the third level, and transmitting the feature matrix of the third level to the first deconvolution layer.
It should be noted that no decomposition layer is provided for the fourth-level convolution layer in this embodiment, mainly because the first feature map of the fourth level is already small after multiple downsampling operations, so feature decomposition of that feature map can be omitted. If the target object identification method is applied to a scenario with higher accuracy requirements, a fourth decomposition layer may also be provided for the fourth convolution layer so that semantic features are extracted even more fully. The number of decomposition layers in the decomposer can be set reasonably according to the requirements of the actual application scenario, which is not repeated here.
The decomposition layers in this embodiment yield additional semantic features on top of those extracted by the convolution layers, so that the extracted semantic features are richer.
In some embodiments, each decomposition layer may specifically perform feature decomposition on the received first feature map through the following process:
determining an image matrix corresponding to the received first feature map;
performing singular value decomposition on the image matrix to obtain a singular value decomposition result;
and determining a feature matrix corresponding to the first feature map according to the singular value decomposition result.
In this embodiment, considering that the first feature map is still an image in nature, the first feature map may be represented in the form of an image matrix, for example, pixel values of each pixel point in the first feature map may be taken as matrix elements, and an image matrix may be constructed, where the image matrix may represent pixel information corresponding to the first feature map.
After the image matrix is built, the feature matrix corresponding to the first feature map can be determined by an SVD (Singular Value Decomposition) algorithm. SVD factorizes the image matrix into several sub-matrices that represent its feature information, so the feature matrix of the first feature map can be determined conveniently and accurately from the singular value decomposition result.
In some embodiments, singular value decomposition is performed on the image matrix to obtain a singular value decomposition result, which specifically includes:
determining the transpose of the image matrix, and respectively determining a left singular matrix and a right singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix; the left singular matrix comprises a plurality of left singular vectors, and the right singular matrix comprises a plurality of right singular vectors;
Determining a singular value matrix corresponding to the image matrix according to the image matrix, the left singular vectors and the right singular vectors;
and taking the left singular matrix, the right singular matrix and the singular value matrix as singular value decomposition results.
In this embodiment, the image matrix is denoted A, where A is an m×n matrix. The singular value decomposition of the image matrix A can be expressed as:

A = UΣV^T    (1)

where A denotes the image matrix; U denotes the left singular matrix, an m×m matrix; Σ denotes the singular value matrix, an m×n matrix whose elements are all 0 except those on the main diagonal, each element on the main diagonal being called a singular value; V denotes the right singular matrix, an n×n matrix; and V^T denotes the transpose of the right singular matrix.
In practice, the left singular matrix U and the right singular matrix V can be solved from the image matrix A and its transpose A^T. After U and V have been obtained, the singular value matrix Σ can be further determined from the image matrix together with the left singular vectors in U and the right singular vectors in V, which yields the singular value decomposition result.
In an exemplary embodiment, determining a left singular matrix corresponding to the image matrix according to the image matrix and a transpose of the image matrix specifically includes:
multiplying the image matrix by the transpose of the image matrix to obtain a first matrix;
performing feature decomposition on the first matrix to obtain a plurality of left singular vectors;
and splicing the plurality of left singular vectors to obtain a left singular matrix.
In this embodiment, when determining the left singular matrix, the image matrix A is first multiplied by its transpose A^T to obtain an m×m square matrix AA^T, i.e. the first matrix. Because the first matrix AA^T is a square matrix, it can be eigendecomposed; performing eigendecomposition on AA^T yields eigenvalues and eigenvectors (i.e. left singular vectors) that satisfy:

(AA^T)u_i = λ_i u_i    (2)

where AA^T denotes the first matrix, u_i denotes the i-th left singular vector, and λ_i denotes the eigenvalue corresponding to the i-th left singular vector.

After each left singular vector has been obtained, all the left singular vectors are spliced together to form the m×m left singular matrix U.
In an exemplary embodiment, determining a right singular matrix corresponding to the image matrix according to the image matrix and a transpose of the image matrix specifically includes:
Multiplying the transpose of the image matrix with the image matrix to obtain a second matrix;
performing feature decomposition on the second matrix to obtain a plurality of right singular vectors;
and splicing the right singular vectors to obtain a right singular matrix.
In this embodiment, when determining the right singular matrix, the transpose A^T of the image matrix is multiplied by the image matrix A to obtain an n×n square matrix A^T A, i.e. the second matrix. Because the second matrix A^T A is a square matrix, it can be eigendecomposed; performing eigendecomposition on A^T A yields eigenvalues and eigenvectors (i.e. right singular vectors) that satisfy:

(A^T A)v_i = λ_i v_i    (3)

where A^T A denotes the second matrix, v_i denotes the i-th right singular vector, and λ_i denotes the eigenvalue corresponding to the i-th right singular vector.

After each right singular vector has been obtained, all the right singular vectors are spliced together to form the n×n right singular matrix V.
After the left singular matrix U and the right singular matrix V have been determined, since the elements of the singular value matrix Σ are all 0 except those on the diagonal, all the singular values can be solved first and the singular value matrix Σ can then be determined.
Each singular value can be solved by the following formula:

σ_i = √λ_i    (4)

where σ_i denotes the i-th singular value and λ_i denotes the i-th eigenvalue; this eigenvalue may be the one corresponding to the i-th left singular vector or the one corresponding to the i-th right singular vector.

After all the singular values have been determined, each singular value is taken as an element on the diagonal of the singular value matrix Σ and the remaining elements are set to 0, which gives the singular value matrix Σ.
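As a minimal NumPy sketch of equations (1) to (4), the following function derives U, Σ and V from the eigendecompositions of AA^T and A^T A; it only illustrates the procedure described above and is not the application's implementation.

```python
import numpy as np

def svd_via_eigendecomposition(A):
    """Singular value decomposition of an m x n image matrix A following
    equations (2)-(4): eigendecompose A @ A.T and A.T @ A, then take the
    singular values as the square roots of the eigenvalues."""
    m, n = A.shape
    eigvals, U = np.linalg.eigh(A @ A.T)   # equation (2): left singular vectors
    _, V = np.linalg.eigh(A.T @ A)         # equation (3): right singular vectors
    # eigh returns ascending eigenvalues; reorder to the usual descending convention
    U, eigvals, V = U[:, ::-1], eigvals[::-1], V[:, ::-1]
    # equation (4): sigma_i = sqrt(lambda_i); clip guards against tiny negative round-off
    singular_values = np.sqrt(np.clip(eigvals, 0.0, None))
    Sigma = np.zeros((m, n))               # zeros except on the main diagonal
    np.fill_diagonal(Sigma, singular_values[: min(m, n)])
    return U, Sigma, V
```

Note that eigenvector signs are arbitrary, so reconstructing A exactly as UΣV^T requires aligning the signs of the corresponding columns of U and V; library routines such as np.linalg.svd handle this internally.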
In some embodiments, determining the feature matrix corresponding to the first feature map according to the singular value decomposition result specifically includes:
and determining a feature matrix corresponding to the first feature map according to at least one of the left singular matrix, the right singular matrix and the singular value matrix in the singular value decomposition result.
In this embodiment, any one of the left singular matrix, the right singular matrix and the singular value matrix may be used as the feature matrix corresponding to the first feature map, or any several of the left singular matrix, the right singular matrix and the singular value matrix may be spliced to obtain the feature matrix corresponding to the first feature map.
It can be understood that in this embodiment the dimension of the feature matrix obtained by the decomposition layer is half that of the corresponding first feature map (a ratio of 1:2). Singular value decomposition thus also reduces the data dimensionality, which helps the model parameters converge quickly during the training stage and improves training efficiency.
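A minimal sketch of a decomposition layer along these lines is shown below, assuming each channel of the first feature map is treated as an image matrix and the singular values are kept as the feature matrix; the application equally allows the left singular matrix, the right singular matrix, or a splice of several of them to be used instead.

```python
import torch
import torch.nn as nn

class DecompositionLayer(nn.Module):
    """Sketch: run SVD on every channel of the first feature map and keep the
    singular values as a compact, lower-dimensional feature matrix."""
    def forward(self, feature_map):                       # (N, C, H, W)
        _, singular_values, _ = torch.linalg.svd(feature_map, full_matrices=False)
        return singular_values                            # (N, C, min(H, W))
```

In a full model this reduced representation would still need to be reshaped or broadcast to a layout compatible with the deconvolution layer's fusion step; that bookkeeping is omitted here.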
In this embodiment, the decomposer allows the first feature map output by a convolution layer to be feature-decomposed; by determining the feature matrix, semantic features beyond those produced by the convolution layer are obtained, so the extracted semantic features are richer.
In some embodiments, the decoder specifically comprises a plurality of deconvolution layers connected in sequence;
each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer to obtain a second feature map containing the fusion features of the current level, and for transmitting that second feature map to the next deconvolution layer connected to the current deconvolution layer, until the final deconvolution layer outputs the result image.
It can be understood that the image received by a deconvolution layer in this embodiment may be the second feature map of the previous level passed on by the previous deconvolution layer connected to the current deconvolution layer, or it may be a feature map other than the first feature map output by the convolution layer of the current level; for example, it may be the first feature map output by the last convolution layer, i.e. the feature map produced by the final downsampling operation.
In this embodiment, the decoder is formed by connecting multiple deconvolution layers in sequence. It mainly fuses the semantic features in the first feature maps of different levels output by the encoder and, on that basis, can also fuse the feature matrices obtained by the decomposition layers, so that the semantic features contained in the image to be identified are analyzed more fully.
In this embodiment, the deconvolution layers correspond one-to-one with the convolution layers: each deconvolution layer corresponds to the convolution layer of one level, and a convolution layer can output the first feature map of the current level to its corresponding deconvolution layer through a splicing channel. At the same time, at least some deconvolution layers are connected to at least one decomposition layer. After receiving the first feature map of the current level, a decomposition layer extracts its feature matrix and passes that feature matrix to the deconvolution layer corresponding to the next level. For example, the decomposition layer at the second level receives the first feature map output by the second-level convolution layer and, after extracting the feature matrix, passes it to the third-level deconvolution layer.
In an exemplary embodiment, when the encoder includes four convolution layers, the decoder includes a first deconvolution layer, a second deconvolution layer, a third deconvolution layer and a fourth deconvolution layer, connected in sequence;
the first deconvolution layer is used for fusing the first feature map containing the semantic features of the fourth level, output by the fourth convolution layer in the encoder, with the feature matrix output by the connected decomposition layer, and for passing the resulting second feature map containing the fusion features of the first level to the second deconvolution layer;
the second deconvolution layer is used for fusing the second feature map output by the first deconvolution layer, the first feature map containing the semantic features of the third level output by the third convolution layer in the encoder, and the feature matrix output by the connected decomposition layer, and for passing the resulting second feature map containing the fusion features of the second level to the third deconvolution layer;
the third deconvolution layer is used for fusing the second feature map output by the second deconvolution layer, the first feature map containing the semantic features of the second level output by the second convolution layer in the encoder, and the feature matrix output by the connected decomposition layer, and for passing the resulting second feature map containing the fusion features of the third level to the fourth deconvolution layer;
the fourth deconvolution layer is used for fusing the second feature map output by the third deconvolution layer with the first feature map containing the semantic features of the first level output by the first convolution layer in the encoder, so as to obtain the result image.
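The level-by-level routing just described can be summarized in the following sketch, in which enc, decomp and deconv are hypothetical callables standing in for the convolution, decomposition and deconvolution layers; only the wiring is shown.

```python
def target_recognition_forward(x, enc, decomp, deconv):
    """enc: [enc1..enc4], decomp: [decomp1..decomp3], deconv: [deconv1..deconv4]."""
    f1 = enc[0](x)    # first feature map, level 1
    f2 = enc[1](f1)   # level 2
    f3 = enc[2](f2)   # level 3
    f4 = enc[3](f3)   # level 4

    m1 = decomp[0](f1)  # level-1 feature matrix, routed to the third deconvolution layer
    m2 = decomp[1](f2)  # level-2 feature matrix, routed to the second deconvolution layer
    m3 = decomp[2](f3)  # level-3 feature matrix, routed to the first deconvolution layer

    d1 = deconv[0](f4, m3)      # first deconvolution layer
    d2 = deconv[1](d1, f3, m2)  # second deconvolution layer
    d3 = deconv[2](d2, f2, m1)  # third deconvolution layer
    return deconv[3](d3, f1)    # fourth deconvolution layer outputs the result image
```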
In some embodiments, in order to fuse and analyze the semantic features of different levels more fully, the first deconvolution layer may additionally be connected to the input end of the fourth convolution layer through a splicing channel, so that the first feature map of the third level fed into the fourth convolution layer can also be fused during feature fusion. This makes the feature fusion effect better and can further improve the recognition accuracy of the target recognition model.
On the one hand, each deconvolution layer performs an upsampling operation on the input feature map, which halves the number of channels and doubles the size of the feature map (specifically, the length and width of the image); for example, an input feature map of size 480×800 becomes 960×1600 after the upsampling operation;
on the other hand, the deconvolution layer can fuse the first feature map of the current level output by the encoder, the feature map output by the connected previous deconvolution layer and the feature matrix output by the connected decomposition layer, so as to extract more detailed semantic features.
In some embodiments, the deconvolution layer may specifically fuse the received image, the first feature map of the convolution layer output corresponding to the current level, and the feature matrix of the connected decomposition layer output by:
Determining an image matrix of the received image to obtain a third matrix;
determining an image matrix of a first feature map output by a convolution layer corresponding to the current level to obtain a fourth matrix;
and fusing the characteristic matrixes output by the third matrix, the fourth matrix and the connected decomposition layer.
In this embodiment, the feature fusion performed by the deconvolution layer can be understood as fusion at the matrix level, which can be implemented by matrix merging. In practice, the matrix merging operation is performed on whatever data is actually fed into the deconvolution layer. For example, when the input consists of three parts (the feature map output by the previous deconvolution layer, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer), the image matrices of the input feature maps are obtained first. In this embodiment, the image matrix of the feature map output by the previous deconvolution layer is called the third matrix, and the image matrix of the first feature map output by the convolution layer of the current level is called the fourth matrix; the third matrix, the fourth matrix and the feature matrix output by the connected decomposition layer are then merged. In practice this merging can be performed with the cat operation in PyTorch.
In this embodiment, the deconvolution layers in the decoder 202 fuse the received image, the first feature map output by the encoder 201 and the feature matrix output by the decomposition layer, which yields richer semantic features; through the cooperation of the encoder 201, the decoder 202 and the decomposer 203, the target recognition model can output a more accurate result image.
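The fusion inside one deconvolution layer might look like the following PyTorch sketch; the channel bookkeeping (all three inputs are assumed to carry the same number of channels at the same spatial size after upsampling) and the transposed-convolution upsampling are assumptions made for the example.

```python
import torch
import torch.nn as nn

class FusionDeconvLayer(nn.Module):
    """One decoder level: upsample the received image, concatenate it with the
    encoder skip connection and the decomposition-layer feature matrix (the cat
    operation described above), and convolve the merged tensor into the second
    feature map of the current level."""
    def __init__(self, channels):
        super().__init__()
        # upsampling halves the channel count and doubles the length and width
        self.up = nn.ConvTranspose2d(channels * 2, channels, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 3, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, received, skip_feature_map, feature_matrix):
        third = self.up(received)     # "third matrix": image matrix of the received image
        fourth = skip_feature_map     # "fourth matrix": first feature map of the current level
        merged = torch.cat([third, fourth, feature_matrix], dim=1)
        return self.fuse(merged)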
In some embodiments, the object recognition model may be specifically obtained by training the following process:
generating sample data based on the acquired image samples and the tag information;
dividing the sample data into training data and test data, and storing the training data and the test data into a sample database;
and retrieving training data and test data from a sample database, training a pre-constructed target recognition network through the training data, and testing the target recognition network after training through the test data to obtain a target recognition model.
It will be appreciated that an image sample is an image containing the target object to be identified; for example, when the target object is a guideboard, the image sample may be an image containing a guideboard. The label information is the label of the target object, specifically the label of the target object's edges, for example the label of the guideboard.
The embodiment can train to obtain the target recognition model in a supervised training mode, and the target recognition model can be used in a recognition scene of the target object after passing the test.
It should be noted that during training of the target recognition model, after the training data and test data have been generated, they are first stored in a sample database, for example an MDB (Microsoft Data Base) database. When the target recognition model is then trained, the training data and test data can be retrieved directly from the sample database. Specifically, the data read from the sample database can be parsed into matrices and fed into the network for training; for example, data in MDB format can be parsed into 512×512×3 matrices and input into the target recognition network for training, which yields the trained target recognition model.
Compared with constructing the training data and test data on the fly, the store-first-then-retrieve approach of this embodiment trains the target recognition model more efficiently and improves the efficiency of the training process.
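A highly simplified training sketch under these assumptions is given below; the tensor inputs, batch size, loss function and optimizer are illustrative choices, and the data are assumed to have already been retrieved from the sample database and parsed into tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_target_recognition(model, train_images, train_labels,
                             test_images, test_labels, epochs=10, lr=1e-3):
    """Minimal supervised loop: train on the training data, then evaluate on the test data."""
    criterion = torch.nn.BCEWithLogitsLoss()   # illustrative pixel-wise loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(train_images, train_labels), batch_size=4, shuffle=True)

    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    with torch.no_grad():                      # test the trained network
        test_loss = criterion(model(test_images), test_labels)
    return model, test_loss.item()
```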
In some embodiments, before separating the sample data into the training data and the test data, it may further include:
correcting the abnormal data after determining that the abnormal data exists in the sample data;
the abnormal data are data with wrong label information corresponding to the image sample.
In this embodiment, before the sample data is divided into training data and test data, it can be preprocessed. Specifically, abnormal data in the sample data can be extracted and corrected, for example sample data whose label information does not correspond to the image sample because of incorrect labeling. Because such sample data is not standard enough, using it directly for training would affect the training accuracy of the target recognition model.
Through this preprocessing operation, nonstandard label information can be corrected so that the sample data are more accurate and standard, providing an accurate and reliable data basis for training the target recognition model.
Preferably, the method for identifying an object provided in the embodiment of the present application may further include, before inputting an image to be identified including the object into the object identification model:
And adjusting the image size of the image to be identified to a preset size.
In order to obtain a more accurate recognition result, the size of the image to be recognized can be adjusted firstly, and then the image to be recognized with the adjusted size is input into the target recognition model, so that the recognition precision of the target object can be improved.
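For example, the resizing step might look like the sketch below; the preset size of 480×800 is borrowed from the downsampling example earlier in this description and is only an assumption here.

```python
import torch
import torch.nn.functional as F

def recognize(model, image, preset_size=(480, 800)):
    """Resize the image to the preset size, then run the target recognition model."""
    x = image.unsqueeze(0) if image.dim() == 3 else image   # add a batch dimension if needed
    x = F.interpolate(x, size=preset_size, mode="bilinear", align_corners=False)
    with torch.no_grad():
        return model(x)   # result image with the target object marked
```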
In order to verify the improvement in recognition accuracy achieved by the target object identification method provided in this embodiment, the same image to be identified was recognized with both a conventional recognition model and the target recognition model; the conventional recognition model in this embodiment can be implemented based on the Canny operator.
In the comparison, the image to be identified is a road image containing guideboards. Fig. 3 shows such an image to be identified, captured by a camera mounted at the front of a vehicle; it contains vehicles on the driving road as well as guideboards and trees on both sides of the road. The recognition target in this embodiment is to locate the key guideboards in the road image. Specifically, the image to be identified shown in fig. 3 contains 4 guideboards, outlined by the rectangular boxes in fig. 3, and the corresponding guideboard labels are shown in fig. 4. The guideboard recognition results obtained with the conventional recognition model are shown in fig. 5 and fig. 6, and the results obtained with the improved target recognition model provided in this embodiment are shown in fig. 7 and fig. 8.
Comparing the guideboard recognition results shown in fig. 5 and fig. 6 and those shown in fig. 7 and fig. 8 with the guideboard recognition tags shown in fig. 4 shows that the results in fig. 5 and fig. 6 recognize the farthest guideboard poorly, while the results in fig. 7 and fig. 8 recognize the farthest guideboard well.
Therefore, for a distant guideboard, which occupies only a small area in the road image (for example, fewer than ten pixels), the traditional recognition model has difficulty recognizing it accurately. Thanks to its improved structure, the target recognition model provided by this embodiment extracts the semantic features in the road image more fully and can still accurately recognize guideboards of smaller size in the road image.
Corresponding to the foregoing method embodiments, the present application further provides a target object identification device, an electronic device, and corresponding embodiments.
Fig. 9 is a schematic structural diagram of an object recognition device according to an embodiment of the present application.
Referring to fig. 9, an apparatus for identifying an object provided in an embodiment of the present application specifically includes:
the recognition module 301 is configured to input an image to be recognized including a target object into a target recognition model, and obtain a result image output by the target recognition model;
the result image comprises an identification result of marking the target object in the image to be identified;
the object recognition model comprises an encoder, a decomposer and a decoder;
the encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting, at the current level, semantic features corresponding to the received image and outputting a first feature map containing the semantic features to the next convolution layer connected with the current convolution layer and to the decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is respectively connected with one convolution layer and one deconvolution layer, and each decomposition layer is used for carrying out feature decomposition on the received first feature map to obtain a feature matrix and outputting the feature matrix to the corresponding deconvolution layer;
each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected with the current deconvolution layer.
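The following PyTorch sketch illustrates how the encoder, decomposer and decoder described above could be wired together; the number of levels, the channel sizes, the use of a rank-k reconstruction as the feature matrix and channel-wise concatenation as the fusion operator are all assumptions made for illustration, not features fixed by this embodiment.

import torch
import torch.nn as nn

class DecompositionLayer(nn.Module):
    """Performs singular value decomposition on each channel of the first feature map
    and keeps the k largest components as the feature matrix (k is an assumed choice)."""
    def __init__(self, k=1):
        super().__init__()
        self.k = k

    def forward(self, feature_map):                        # (B, C, H, W)
        u, s, vh = torch.linalg.svd(feature_map, full_matrices=False)
        k = min(self.k, s.shape[-1])
        return (u[..., :k] * s[..., None, :k]) @ vh[..., :k, :]

class TargetRecognitionModel(nn.Module):
    """Two-level encoder / decomposer / decoder sketch; channel sizes are assumed."""
    def __init__(self):
        super().__init__()
        # Encoder: each convolution layer halves the resolution and outputs a first feature map.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Decomposer: one decomposition layer per level.
        self.dec1 = DecompositionLayer()
        self.dec2 = DecompositionLayer()
        # Decoder: each deconvolution layer fuses its received image, the first feature map
        # of the same level and the feature matrix output by the connected decomposition layer.
        self.up2 = nn.ConvTranspose2d(64 + 64 + 64, 32, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(32 + 32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.enc1(x)                                  # first feature map, level 1
        f2 = self.enc2(f1)                                 # first feature map, level 2
        m1 = self.dec1(f1)                                 # feature matrix, level 1
        m2 = self.dec2(f2)                                 # feature matrix, level 2
        # The deepest deconvolution layer is assumed to receive the encoder output as its image.
        s2 = self.up2(torch.cat([f2, f2, m2], dim=1))      # second feature map, level 2
        s1 = self.up1(torch.cat([s2, f1, m1], dim=1))      # result image marking the target object
        return s1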
In some embodiments, in the identification module 301, each decomposition layer may specifically perform feature decomposition on the received first feature map through the following process (a minimal sketch of this process is given after the list):
determining an image matrix corresponding to the received first feature map;
performing singular value decomposition on the image matrix to obtain a singular value decomposition result;
and determining a feature matrix corresponding to the first feature map according to the singular value decomposition result.
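A minimal NumPy sketch of these three steps is given below for a single-channel first feature map; returning the rank-k reconstruction is only one possible way to form the feature matrix from the decomposition result, and k is an assumed parameter.

import numpy as np

def decompose_feature_map(feature_map, k=8):
    """Image matrix -> singular value decomposition -> feature matrix (rank-k reconstruction assumed)."""
    A = np.asarray(feature_map, dtype=np.float64)        # image matrix of the first feature map
    U, S, Vt = np.linalg.svd(A, full_matrices=False)      # singular value decomposition result
    k = min(k, S.size)
    return (U[:, :k] * S[:k]) @ Vt[:k, :]                 # feature matrix built from the largest components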
In some embodiments, in the identification module 301, each decomposition layer may specifically perform singular value decomposition on the image matrix through the following process to obtain a singular value decomposition result:
determining the transposition of an image matrix, and respectively determining a left singular matrix and a right singular matrix corresponding to the image matrix according to the image matrix and the transposition of the image matrix; the left singular matrix comprises a plurality of left singular vectors, and the right singular matrix comprises a plurality of right singular vectors;
determining a singular value matrix corresponding to the image matrix according to the image matrix, the left singular vectors and the right singular vectors;
and taking the left singular matrix, the right singular matrix and the singular value matrix as singular value decomposition results.
In some embodiments, in the identifying module 301, each decomposition layer may specifically determine the feature matrix corresponding to the first feature map according to the singular value decomposition result through the following process:
and determining a feature matrix corresponding to the first feature map according to at least one of the left singular matrix, the right singular matrix and the singular value matrix in the singular value decomposition result.
In some embodiments, in the identification module 301, each decomposition layer may specifically determine the left singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix through the following process:
multiplying the image matrix by the transpose of the image matrix to obtain a first matrix;
performing feature decomposition on the first matrix to obtain a plurality of left singular vectors;
and splicing the plurality of left singular vectors to obtain a left singular matrix.
In some embodiments, in the identification module 301, each decomposition layer may specifically determine the right singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix through the following process (a combined sketch covering both the left and the right singular matrices is given after this list):
multiplying the transpose of the image matrix with the image matrix to obtain a second matrix;
performing feature decomposition on the second matrix to obtain a plurality of right singular vectors;
and splicing the right singular vectors to obtain a right singular matrix.
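The computation of the left singular matrix, the right singular matrix and the singular value matrix described in the lists above can be sketched as follows in NumPy; note that this sketch does not resolve the sign ambiguity of the eigenvectors, which a production implementation (or numpy.linalg.svd) handles internally.

import numpy as np

def singular_matrices(A):
    """Left/right singular matrices from A·Aᵀ and Aᵀ·A, plus the singular value matrix."""
    At = A.T                                              # transpose of the image matrix
    # First matrix A·Aᵀ: its eigenvectors are the left singular vectors, spliced column-wise.
    eigvals_l, U = np.linalg.eigh(A @ At)
    U = U[:, np.argsort(eigvals_l)[::-1]]                 # left singular matrix
    # Second matrix Aᵀ·A: its eigenvectors are the right singular vectors, spliced column-wise.
    eigvals_r, V = np.linalg.eigh(At @ A)
    V = V[:, np.argsort(eigvals_r)[::-1]]                 # right singular matrix
    # Singular value matrix from the image matrix and the singular vectors: sigma_i = u_iᵀ·A·v_i
    # (values may come out negative here because eigenvector signs are arbitrary in this sketch).
    Sigma = np.zeros(A.shape)
    for i in range(min(A.shape)):
        Sigma[i, i] = U[:, i] @ A @ V[:, i]
    return U, Sigma, V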
In some embodiments, in the identification module 301, the deconvolution layer may specifically fuse the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer through the following process (a sketch is given after this list):
determining an image matrix of the received image to obtain a third matrix;
determining an image matrix of a first feature map output by a convolution layer corresponding to the current level to obtain a fourth matrix;
and fusing the characteristic matrixes output by the third matrix, the fourth matrix and the connected decomposition layer.
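The fusion step can be sketched as below; the embodiment does not fix the fusion operator, so channel-wise concatenation is used here as one assumed choice (element-wise addition would be another).

import torch

def fuse(received_image, first_feature_map, feature_matrix):
    """Fuse the third matrix, the fourth matrix and the feature matrix of the connected decomposition layer."""
    third_matrix = received_image          # image matrix of the image received by the deconvolution layer
    fourth_matrix = first_feature_map      # image matrix of the first feature map of the current level
    return torch.cat([third_matrix, fourth_matrix, feature_matrix], dim=1)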
In summary, because the target recognition model in the object recognition device provided by the embodiment of the application comprehensively analyzes multi-level features in the image to be recognized and obtains additional semantic features through feature decomposition and fusion, it can fully capture the semantic features in the image to be recognized, recognize smaller targets in the image more accurately, and thereby improve the recognition accuracy of the target object.
Fig. 10 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Referring to fig. 10, an electronic device 400 includes a memory 401 and a processor 402.
The processor 402 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 401 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 402 or other modules of the computer. The persistent storage may be a readable and writable storage device, that is, a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data required by some or all of the processors at runtime.
Furthermore, memory 401 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some embodiments, memory 401 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 401 has stored thereon executable code which, when processed by the processor 402, may cause the processor 402 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of identifying an object, comprising:
inputting an image to be identified containing a target object into a target identification model to obtain a result image output by the target identification model;
the result image comprises an identification result of marking the target object in the image to be identified;
the object recognition model comprises an encoder, a decomposer and a decoder;
the encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting semantic features corresponding to a received image at a current level and outputting a first feature map containing the semantic features to a next convolution layer connected with the current convolution layer and the decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is respectively connected with one convolution layer and one deconvolution layer, and each decomposition layer is used for carrying out feature decomposition on the received first feature map to obtain a feature matrix and outputting the feature matrix to the deconvolution layer connected with the current decomposition layer;
and each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected with the current deconvolution layer.
2. The method for identifying an object according to claim 1, wherein each decomposition layer performs feature decomposition on the received first feature map by specifically:
determining an image matrix corresponding to the received first feature map;
performing singular value decomposition on the image matrix to obtain a singular value decomposition result;
and determining a feature matrix corresponding to the first feature map according to the singular value decomposition result.
3. The method for identifying an object according to claim 2, wherein the performing singular value decomposition on the image matrix to obtain a singular value decomposition result includes:
determining the transposition of the image matrix, and respectively determining a left singular matrix and a right singular matrix corresponding to the image matrix according to the image matrix and the transposition of the image matrix; the left singular matrix comprises a plurality of left singular vectors, and the right singular matrix comprises a plurality of right singular vectors;
determining a singular value matrix corresponding to the image matrix according to the image matrix, the left singular vectors and the right singular vectors;
and taking the left singular matrix, the right singular matrix and the singular value matrix as the singular value decomposition result.
4. The method for identifying a target object according to claim 3, wherein determining the feature matrix corresponding to the first feature map according to the singular value decomposition result includes:
and determining a feature matrix corresponding to the first feature map according to at least one of a left singular matrix, a right singular matrix and a singular value matrix in the singular value decomposition result.
5. The method of claim 3, wherein determining the left singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix comprises:
multiplying the image matrix by the transpose of the image matrix to obtain a first matrix;
performing feature decomposition on the first matrix to obtain a plurality of left singular vectors;
and splicing the plurality of left singular vectors to obtain the left singular matrix.
6. The method of claim 3, wherein determining the right singular matrix corresponding to the image matrix according to the image matrix and the transpose of the image matrix comprises:
multiplying the transpose of the image matrix with the image matrix to obtain a second matrix;
Performing feature decomposition on the second matrix to obtain a plurality of right singular vectors;
and splicing the right singular vectors to obtain the right singular matrix.
7. The method for identifying an object according to claim 1, wherein the deconvolution layer fuses the received image, the first feature map output by the convolution layer corresponding to the current level, and the feature matrix output by the connected decomposition layer by:
determining an image matrix of the received image to obtain a third matrix;
determining an image matrix of a first feature map output by a convolution layer corresponding to the current level to obtain a fourth matrix;
and fusing the third matrix, the fourth matrix and the feature matrix output by the connected decomposition layer.
8. An apparatus for identifying an object, comprising:
the recognition module is used for inputting an image to be recognized containing a target object into a target recognition model to obtain a result image output by the target recognition model;
the result image comprises an identification result of marking the target object in the image to be identified;
the object recognition model comprises an encoder, a decomposer and a decoder;
The encoder comprises a plurality of convolution layers, wherein each convolution layer is used for extracting semantic features corresponding to a received image at a current level and outputting a first feature map containing the semantic features to a next convolution layer connected with the current convolution layer and the decoder;
the decomposer comprises at least one decomposition layer, and the decoder comprises a plurality of deconvolution layers; each decomposition layer is respectively connected with one convolution layer and one deconvolution layer, and each decomposition layer is used for carrying out feature decomposition on the received first feature map to obtain a feature matrix and outputting the feature matrix to the deconvolution layer connected with the current decomposition layer;
and each deconvolution layer is used for fusing the received image, the first feature map output by the convolution layer corresponding to the current level and the feature matrix output by the connected decomposition layer to obtain a second feature map of the current level, and transmitting the second feature map of the current level to the next deconvolution layer connected with the current deconvolution layer.
9. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable code which when executed by a processor of an electronic device causes the processor to perform the method of any of claims 1-7.
CN202310260940.3A 2023-03-17 2023-03-17 Target object identification method and device, electronic equipment and storage medium Pending CN116246254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310260940.3A CN116246254A (en) 2023-03-17 2023-03-17 Target object identification method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116246254A true CN116246254A (en) 2023-06-09

Family

ID=86631231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310260940.3A Pending CN116246254A (en) 2023-03-17 2023-03-17 Target object identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116246254A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination