CN112132215B - Method, device and computer readable storage medium for identifying object type - Google Patents

Method, device and computer readable storage medium for identifying object type

Info

Publication number
CN112132215B
Authority
CN
China
Prior art keywords
matrix
image
activation
identified
feature map
Prior art date
Legal status
Active
Application number
CN202011004689.7A
Other languages
Chinese (zh)
Other versions
CN112132215A (en)
Inventor
吴晓东 (Wu Xiaodong)
Current Assignee
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202011004689.7A priority Critical patent/CN112132215B/en
Publication of CN112132215A publication Critical patent/CN112132215A/en
Application granted granted Critical
Publication of CN112132215B publication Critical patent/CN112132215B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of computer applications, and provides a method and an apparatus for identifying an object type. The method includes: inputting an image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate-box feature maps in the first matrix by means of feature clustering; performing interest-based (region-of-interest) pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map. Through these steps, relatively comprehensive features of the target object can be obtained, which improves the comprehensiveness and accuracy of recognition.

Description

Method, device and computer readable storage medium for identifying object type
Technical Field
The present application belongs to the technical field of computer applications, and in particular, relates to a method and apparatus for identifying an object type, and a computer readable storage medium.
Background
The vehicle model is an attribute of the vehicle, is essential information for vehicle identity authentication, and its recognition is one of the important links in an urban intelligent traffic management system. In the related art, the type or style of a vehicle is generally detected by image recognition. In practical applications, however, the driving environment of the vehicle changes easily, for example in haze, rain, or at night, which interferes with the accuracy of vehicle model recognition and easily leads to inaccurate recognition and low efficiency.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for identifying an object type, which can solve the problems of inaccurate and inefficient vehicle model recognition.
In a first aspect, an embodiment of the present application provides a method for identifying an object type, including: inputting an image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate-box feature maps in the first matrix by means of feature clustering; performing interest-based pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map.
In a possible implementation manner of the first aspect, before the image to be identified is input into the cascade network composed of at least two backbone networks to obtain the first matrix corresponding to the target object in the image, the method further includes: acquiring the size of the image to be identified; and adjusting the size of the image to be identified based on the acquired size and a set size to obtain an adjusted image to be identified.
In a possible implementation manner of the first aspect, the inputting the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image includes: inputting the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the first aspect, the inputting the first matrix into the candidate area network and obtaining, by means of feature clustering, a second matrix corresponding to the candidate-box feature maps in the first matrix includes: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result; performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps.
In a possible implementation manner of the first aspect, the performing interest-based pooling on the second matrix to obtain a third matrix includes: partitioning each second matrix to obtain corresponding blocks; and performing maximum pooling on the blocks, selecting the maximum value in each set area of a block, to obtain the third matrix.
In a possible implementation manner of the first aspect, the method further includes: identifying the type of the vehicle in a captured vehicle image to obtain a type identifier of the vehicle; and simulating a traveling scene of the vehicle based on the type identifier of the vehicle.
In a second aspect, embodiments of the present application provide an apparatus for identifying an object type, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program: inputting an image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate-box feature maps in the first matrix by means of feature clustering; performing interest-based pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map.
In one possible implementation manner of the second aspect, before the image to be identified is input into the cascade network composed of at least two backbone networks to obtain the first matrix corresponding to the target object in the image, the method further includes: acquiring the size of the image to be identified; and adjusting the size of the image to be identified based on the acquired size and a set size to obtain an adjusted image to be identified.
In a possible implementation manner of the second aspect, the inputting the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image includes: inputting the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the second aspect, the inputting the first matrix into the candidate area network and obtaining, by means of feature clustering, a second matrix corresponding to the candidate-box feature maps in the first matrix includes: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result; performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps.
In a possible implementation manner of the second aspect, the performing interest-based pooling on the second matrix to obtain a third matrix includes: partitioning each second matrix to obtain corresponding blocks; and performing maximum pooling on the blocks, selecting the maximum value in each set area of a block, to obtain the third matrix.
In a possible implementation manner of the second aspect, the method further includes: identifying the type of the vehicle in a captured vehicle image to obtain a type identifier of the vehicle; and simulating a traveling scene of the vehicle based on the type identifier of the vehicle.
In a third aspect, an embodiment of the present application provides an apparatus for identifying the type of an object, including: a first matrix unit, configured to input an image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image; a second matrix unit, configured to input the first matrix into a candidate area network and obtain a second matrix corresponding to the candidate-box feature maps in the first matrix by means of feature clustering; a third matrix unit, configured to perform interest-based pooling on the second matrix to obtain a third matrix; a feature map unit, configured to perform full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and an identification unit, configured to identify the type of the target object based on the feature map.
In a possible implementation manner of the third aspect, the apparatus further includes: a size acquisition unit for acquiring the size of the image to be identified; and the size adjusting unit is used for adjusting the size of the image to be identified based on the size and the set size to obtain the adjusted image to be identified.
In a possible implementation manner of the third aspect, the first matrix unit is configured to: input the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network; partition the output matrices to obtain a preset number of matrix blocks; extract the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix; and sum the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the third aspect, the second matrix unit is configured to: input the first matrix into a convolution network to obtain a first convolution result, and perform non-monotonic neural activation processing on the first convolution result to obtain a second convolution result; perform classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and perform regression-based convolution processing on the second convolution result to obtain a fourth convolution result; perform non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cut the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps.
In a possible implementation manner of the third aspect, the third matrix unit is configured to: partition each second matrix to obtain corresponding blocks; and perform maximum pooling on the blocks, selecting the maximum value in each set area of a block, to obtain the third matrix.
In a possible implementation manner of the third aspect, the apparatus further includes: a vehicle type identification unit, configured to identify the type of the vehicle in a captured vehicle image to obtain a vehicle type identifier; and a scene simulation unit, configured to simulate a traveling scene of the vehicle based on the vehicle type identifier.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a terminal device, causes the terminal device to perform the method of identifying an object type according to any one of the first aspects above.
It will be appreciated that the advantageous effects of the second to fifth aspects may be found in the related description of the first aspect, and are not repeated here.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: the image to be identified is processed so that all images to be identified have the same size; a first matrix corresponding to the target object is generated through a cascade network composed of a plurality of backbone networks; feature clustering is performed on the candidate-box feature maps in the first matrix to obtain a second matrix; interest-based pooling is performed on the second matrix to obtain a third matrix; and finally, full-connection and activation processing is performed on the third matrix to obtain a feature map corresponding to the target object, so that the type of the target object is identified based on the feature map. Through these steps, relatively comprehensive features of the target object can be obtained, which improves the comprehensiveness and accuracy of recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart of a method of identifying an object type according to an embodiment of the present application;
fig. 2 is a schematic diagram of image feature extraction of a multi-backbone cascade network according to an embodiment of the present application;
fig. 3 is a schematic diagram of feature clustering of a matrix based on a candidate area network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of pooling a matrix according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a vehicle model identification algorithm based on a cascade multi-backbone according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an apparatus for identifying object types according to an embodiment of the present application;
fig. 7 is a schematic diagram of an apparatus for identifying an object type according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to a determination", or "in response to detection". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Referring to fig. 1, fig. 1 is a flowchart of a method for identifying an object type according to an embodiment of the present application. The execution subject of the method in this embodiment is a device having the function of identifying an object type, including but not limited to a computer, a server, a tablet computer, or a terminal. The method of identifying an object type shown in fig. 1 may include the following steps:
S110: inputting the image to be identified into a cascade network formed by at least two backbone networks to obtain a first matrix corresponding to a target object in the image.
In an embodiment of the present application, before the type of the target object is identified, an image of the target object is acquired as the image to be identified. After the image to be identified is obtained, it can be preprocessed to obtain an adjusted image to be identified. The preprocessing may include, among other things, adjustment of the image size, adjustment of the image gray scale, and so on.
Further, in an embodiment of the present application, the following steps may also be performed before step S110: acquiring the size of the image to be identified; and adjusting the size of the image to be identified based on the acquired size and a set size to obtain the adjusted image to be identified.
Specifically, the images to be identified may differ in size, or the camera focal lengths used during shooting may be inconsistent, which easily leads to inconsistent image sizes. In these cases, the size of each image to be identified can be adjusted to a set size in this embodiment to obtain the adjusted image to be identified.
For example, in this embodiment, the original input image may first be resized to a preset size while preserving its aspect ratio. The preset size may be 1000×600, or may be configured to other sizes. The image of the corresponding size can be obtained by padding, compressing, or cropping, as in the sketch below.
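As an illustration of this step, the following Python sketch scales an input into the preset 1000×600 canvas while preserving the aspect ratio and pads the remainder; the use of OpenCV, the zero padding, and the function name are assumptions for illustration, not details fixed by the patent.

```python
import cv2
import numpy as np

def resize_keep_aspect(image: np.ndarray, target_w: int = 1000, target_h: int = 600) -> np.ndarray:
    """Scale a 3-channel image into a target_w x target_h canvas without distortion (illustrative helper)."""
    h, w = image.shape[:2]
    scale = min(target_w / w, target_h / h)       # largest scale that still fits the canvas
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))   # cv2.resize takes (width, height)
    canvas = np.zeros((target_h, target_w, image.shape[2]), dtype=image.dtype)
    canvas[:new_h, :new_w] = resized              # pad so every input reaches the same preset size
    return canvas
```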
In an embodiment of the present application, after the image to be identified is obtained, it is input into a cascade network composed of at least two preset backbone networks, so that the target object in the image is processed and a first matrix containing the features of the target object is obtained.
Further, step S110 specifically includes the following steps: inputting the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In an embodiment of the present application, the vehicle features in the image are then extracted through a multi-backbone cascade network (CascadeNet), and a feature map matrix FMap is obtained.
Fig. 2 is a schematic diagram of image feature extraction of a multi-backbone cascade network according to an embodiment of the present application.
As shown in fig. 2, the input 210 represents the output matrix corresponding to the image to be identified after the original input image has undergone the size change; backbone 1 (Backbone_1) (220), backbone 2 (Backbone_2), and backbone n (Backbone_n) represent n different feature extraction networks.
Illustratively, Backbone_1 may represent DarkNet53, Backbone_2 may represent ResNet50, Backbone_n may represent EfficientNet, and so forth. Block_1, Block_2, and Block_n represent partitioning operations that produce blocks 1 to n (230); the max pooling operations (240) may include MaxPooling_1, MaxPooling_2, and MaxPooling_n; the output is finally obtained through a summing operation (250).
Specifically, the DarkNet53 network structure is built from 1×1 and 3×3 convolutions and contains 53 convolution layers; the ResNet50 network structure first performs a convolution operation on the input, then contains 4 residual stages, and finally performs a full-connection operation for the classification task; EfficientNet scales the network width, the network depth, and the input image resolution together in a compound way to achieve a better network.
For example, assuming that n is 3 and the input matrix size is 1000×600×512, the output matrix sizes after Backbone_1, Backbone_2, and Backbone_3 are 300×400×128, 500×600×256, and 700×800×512, respectively. Block processing is then performed, where the block grid is configurable, for example 60×40 blocks; max pooling then takes the maximum pixel value from each block, yielding output matrices of sizes 60×40×128, 60×40×256, and 60×40×512. Finally, the three output matrices are summed along the channel dimension, that is, joined channel-wise, to obtain a final output matrix of size 60×40×(128+256+512) = 60×40×896. A sketch of this fusion is given below.
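A minimal numpy sketch of this fusion, following the shapes in the worked example above; the helper name and random inputs are illustrative assumptions, and the final step joins the pooled maps along the channel axis, matching the 128+256+512 = 896 arithmetic.

```python
import numpy as np

def block_max_pool(fmap: np.ndarray, grid_h: int = 60, grid_w: int = 40) -> np.ndarray:
    """Partition an (H, W, C) map into a grid_h x grid_w grid and keep the max pixel per block."""
    H, W, C = fmap.shape
    bh, bw = H // grid_h, W // grid_w            # integer block size; remainder rows/cols are dropped
    trimmed = fmap[:bh * grid_h, :bw * grid_w]
    blocks = trimmed.reshape(grid_h, bh, grid_w, bw, C)
    return blocks.max(axis=(1, 3))               # -> (grid_h, grid_w, C)

outs = [np.random.rand(300, 400, 128),           # Backbone_1 output
        np.random.rand(500, 600, 256),           # Backbone_2 output
        np.random.rand(700, 800, 512)]           # Backbone_3 output
pooled = [block_max_pool(o) for o in outs]       # each (60, 40, C_i)
fmap = np.concatenate(pooled, axis=-1)           # (60, 40, 896): the feature map matrix FMap
print(fmap.shape)
```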
S120: and inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate frame feature map in the first matrix in a feature clustering mode.
In an embodiment of the present application, after the first matrix containing the target image is obtained, it is input into the candidate area network, and the second matrix corresponding to each candidate-box feature map in the first matrix is obtained by performing feature clustering on the first matrix corresponding to the target image.
Specifically, the process in step S120 includes the following steps: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result; performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps.
In an embodiment of the present application, the candidate area network (RPN) is executed on the feature map FMap, and all candidate boxes CBox are generated using anchor boxes obtained in advance by kmeans++ clustering. Specifically, in this embodiment, the distance function used when the kmeans++ algorithm clusters anchor boxes is 1-IOU, where IOU = I/U, I represents the intersection area of two anchor boxes, and U represents the union area of the two anchor boxes.
In one embodiment of the present application, kmeans++ improves the original kmeans clustering algorithm mainly by replacing the random selection of the k initial cluster centers with roulette-wheel selection, where k is, for example, 9. A sketch of the clustering is given below.
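A sketch of the anchor clustering under stated assumptions: boxes are (w, h) pairs, the IoU of two boxes is computed as if they shared a corner (a common convention for anchor clustering, assumed here), kmeans++ seeding uses the roulette-wheel rule with probabilities proportional to the 1-IOU distance, and the refinement loop is plain kmeans.

```python
import numpy as np

def iou_wh(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one (w, h) box and many, with all boxes aligned at a shared corner."""
    inter = np.minimum(box[0], boxes[:, 0]) * np.minimum(box[1], boxes[:, 1])
    union = box[0] * box[1] + boxes[:, 0] * boxes[:, 1] - inter
    return inter / union

def kmeanspp_anchors(boxes: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # kmeans++ seeding: roulette-wheel selection, each new center drawn with
    # probability proportional to its 1-IOU distance to the nearest chosen center
    centers = [boxes[rng.integers(len(boxes))]]
    while len(centers) < k:
        d = np.min([1 - iou_wh(c, boxes) for c in centers], axis=0)
        centers.append(boxes[rng.choice(len(boxes), p=d / d.sum())])
    centers = np.stack(centers)
    for _ in range(iters):                         # standard kmeans refinement, same distance
        d = np.stack([1 - iou_wh(c, boxes) for c in centers])   # (k, n)
        assign = d.argmin(axis=0)
        centers = np.stack([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
    return centers                                 # k anchor (w, h) pairs
```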
Fig. 3 is a schematic diagram of feature clustering of a matrix based on a candidate area network according to an embodiment of the present application.
The specific implementation flow is shown in fig. 3: the input 310 is the feature map matrix FMap, which passes through convolution plus a non-monotonic neural activation function (CM) (320); the classification and regression branches then apply convolution plus a mapping activation function (CS) (330) and convolution (340), followed by non-maximum suppression (NMS) (350); finally, the input feature map is cut (360) using the coordinates of the candidate boxes obtained after NMS screening.
In one embodiment of the present application, assume the input matrix size is 60×40×512; the matrix size after CM is still 60×40×512. The classification matrix and the coordinate matrix of all candidate boxes are then obtained through the CS and Conv operations in the cls and reg branches; their sizes are 60×40×(9×2) = 60×40×18 (9 is the number of anchor boxes, 2 is the foreground/background classification) and 60×40×(9×4) = 60×40×36 (9 is the number of anchor boxes, 4 is the number of coordinates of each anchor box, namely its center point coordinates x, y and its width and height w, h). NMS is then performed on all candidate boxes: the boxes are sorted in descending order of classification probability, the box with the highest probability is retained, and the other boxes that overlap it (i.e., IOU > 0.5) are deleted; after the redundant boxes are deleted, the first n (configurable, for example 300) candidate boxes are kept. Finally, a Cut operation is performed for the n filtered candidate boxes on the input feature map FMap to obtain the final Output matrix (i.e., the candidate-box feature map matrices). A sketch of the NMS step follows.
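The NMS step sketched in numpy, under the assumption that boxes are stored as (x1, y1, x2, y2) corners; the threshold 0.5 and the cap of 300 survivors follow the values in the text.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5, top_n: int = 300) -> np.ndarray:
    """Return indices of candidate boxes kept by non-maximum suppression."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # descending classification probability
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        # intersection of the best box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]   # delete boxes that overlap the kept one too much
    return np.array(keep)
```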
In the cutting process, assuming that the size of the input matrix Input is 60×40×512 and the coordinates of one of the 300 candidate boxes screened by the NMS are (9,15,20,30), the Cut operation crops the candidate box with width and height (20, 30) at the (9, 15) position on the input feature map FMap, yielding the candidate-box feature map matrix of size 20×30×512. Similarly, 300 candidate-box feature map matrices of different sizes may be obtained, as in the slice sketched below.
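The Cut operation itself reduces to a plain array slice; the sketch below assumes (x, y, w, h) means top-left corner plus width and height, as in the example coordinates (9, 15, 20, 30).

```python
import numpy as np

fmap = np.random.rand(60, 40, 512)   # the input feature map FMap from the example
x, y, w, h = 9, 15, 20, 30           # one candidate box kept after NMS
crop = fmap[y:y + h, x:x + w, :]     # height-30, width-20 slice: the 20x30x512 candidate-box feature map
```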
S130: and carrying out interest-based pooling treatment on the second matrix to obtain a third matrix.
In an embodiment of the present application, after the second matrix is obtained, interest-based pooling is performed on it to obtain a third matrix. The specific processing includes the following steps: partitioning each second matrix to obtain corresponding blocks; and performing maximum pooling on the blocks, selecting the maximum value in each set area of a block, to obtain the third matrix.
Fig. 4 is a schematic diagram of pooling a matrix according to an embodiment of the present application.
As shown in fig. 4, in an embodiment of the present application, the input 410 is all the candidate boxes CBox screened in the previous step. A RoIPooling operation is then performed to uniformly partition all the differently sized candidate boxes CBox obtained in the previous step into a preset grid (420), for example 7×7. For example, when the 300 candidate boxes are unified in size, the 300 candidate-box feature map matrices of size m×n×512 become one matrix of size 300×7×7×512. Assuming the Input (i.e., CBox) contains 300 candidate boxes and one candidate box has size 20×30×512, a block (420) operation first divides the 20×30×512 matrix into 7×7 blocks of fixed, rounded-down size, each block measuring (20/7)×(30/7) = 2.86×4.29 ≈ 2×4; a max pooling (430) operation is then performed on each block, i.e., only the pixel with the largest value among the 2×4 = 8 pixels is retained. One candidate box thus yields a final output matrix of size 7×7×512 after the RoIPooling operation; since there are 300 boxes in total, the final output matrix (440) has size 300×7×7×512. A sketch of this step follows.
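A numpy sketch of this RoIPooling step; the rounding-down of block sizes matches the 2×4 example above, and in this simplified version pixels beyond the 7×7 grid of full blocks are simply dropped.

```python
import numpy as np

def roi_pool(cbox: np.ndarray, out: int = 7) -> np.ndarray:
    """Max-pool an (H, W, C) candidate-box feature map onto a fixed out x out grid."""
    H, W, C = cbox.shape
    bh, bw = max(H // out, 1), max(W // out, 1)   # rounded block size, e.g. 20/7 x 30/7 -> 2 x 4
    pooled = np.empty((out, out, C), dtype=cbox.dtype)
    for i in range(out):
        for j in range(out):
            pooled[i, j] = cbox[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].max(axis=(0, 1))
    return pooled

cboxes = [np.random.rand(20, 30, 512) for _ in range(300)]   # NMS-filtered candidate boxes
output = np.stack([roi_pool(c) for c in cboxes])             # (300, 7, 7, 512)
print(output.shape)
```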
S140: and performing full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object.
In an embodiment of the present application, regression (reg) of the vehicle detection box and classification (cls) of the vehicle model are then performed through two FCR layers (full connection + ReLU activation function) and two different branches, FC (full connection) and FCS (full connection + Softmax activation function), to obtain the feature map corresponding to the vehicle, as sketched below.
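A numpy sketch of this head under illustrative assumptions: the hidden width 1024, the class count 20, and the random weights are placeholders, while the structure (two full-connection + ReLU layers, then an FC regression branch and an FCS softmax branch) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.random((300, 7 * 7 * 512))               # flattened RoI-pooled candidate boxes
W1 = rng.random((7 * 7 * 512, 1024)) * 0.01      # illustrative weights (biases omitted)
W2 = rng.random((1024, 1024)) * 0.01
W_reg = rng.random((1024, 4)) * 0.01             # FC branch: box regression
W_cls = rng.random((1024, 20)) * 0.01            # FCS branch: 20 illustrative vehicle-model classes

h = relu(relu(x @ W1) @ W2)                      # two FCR (full connection + ReLU) layers
boxes = h @ W_reg                                # reg: (300, 4) detection-box coordinates
probs = softmax(h @ W_cls)                       # cls: (300, 20) class probabilities
```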
S150: based on the feature map, a type of the target object is identified.
In an embodiment of the present application, the coordinates of the vehicle predicted on the feature map are finally mapped back to coordinates on the original input image, thereby recognizing the vehicle model in the image; a minimal sketch of the mapping is given below.
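A minimal sketch of this mapping under the assumption that the predicted coordinates live in the resized input's pixel space, so only the resize scale from step S110 needs to be undone; any stride handling inside the network is not modeled here.

```python
import numpy as np

def map_to_original(boxes_xyxy: np.ndarray, resize_scale: float) -> np.ndarray:
    """Map (x1, y1, x2, y2) boxes from the resized input back to original-image pixels."""
    return boxes_xyxy / resize_scale

# e.g. the original image was scaled by 0.5 to reach the preset input size
print(map_to_original(np.array([[9.0, 15.0, 29.0, 45.0]]), resize_scale=0.5))
```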
In an embodiment of the application, the original feature extraction network is improved from a single-backbone network to a cascade of multiple backbones, which significantly enhances the feature expression capability in difficult scenes such as haze, rain, night, and vehicle side views, and thus improves the overall accuracy and recall of vehicle model recognition. The acquisition of anchor boxes is improved from the original manual setting to automatic kmeans++ clustering, which greatly improves the quality of the generated anchor boxes and the positioning precision of the detection box, further improving the overall accuracy and recall of vehicle model recognition.
Fig. 5 is a schematic diagram of a vehicle model identification algorithm based on cascade multi-backbone according to an embodiment of the present application.
In an embodiment of the present application, as shown in fig. 5, the image to be identified 510 undergoes a size transformation 520, and the original feature extraction network is improved from a single-backbone network to a multi-backbone cascade network 530, which significantly enhances feature expression in difficult scenes such as haze, rain, night, and vehicle side views and thus improves the overall accuracy and recall of vehicle model recognition. After the feature map matrix FMap 540 is obtained through the cascade network 530, the anchor boxes are obtained by kmeans++ automatic clustering instead of the original manual setting; the feature map matrix FMap 540 is input into the candidate area network RPN (550) to obtain the candidate boxes CBox (560), followed by the pooling processing 570, the full-connection correction activation FCR (580), and the two branches full-connection activation FCS (590) and full connection FC (511). This greatly improves the quality of the generated anchor boxes and the positioning precision of the detection box, further improving the overall accuracy and recall of vehicle model recognition.
In the embodiment of the application, the image to be identified is processed so that all images to be identified have the same size; a first matrix corresponding to the target object is generated through a cascade network composed of a plurality of backbone networks; feature clustering is performed on the candidate-box feature maps in the first matrix to obtain a second matrix; interest-based pooling is performed on the second matrix to obtain a third matrix; and finally, full-connection and activation processing is performed on the third matrix to obtain a feature map corresponding to the target object, so that the type of the target object is identified based on the feature map. Through these steps, relatively comprehensive features of the target object can be obtained, which improves the comprehensiveness and accuracy of recognition.
Referring to fig. 6, fig. 6 is a schematic diagram of an apparatus for identifying an object type according to an embodiment of the present application. The apparatus 600 for identifying the object type may be a mobile terminal such as a smart phone or a tablet computer. The apparatus 600 of this embodiment includes units for performing the steps in the embodiment corresponding to fig. 1; refer to fig. 1 and the related description of that embodiment, which are not repeated herein. The apparatus 600 of this embodiment includes: a first matrix unit 601, configured to input an image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; a second matrix unit 602, configured to input the first matrix into a candidate area network and obtain a second matrix corresponding to the candidate-box feature maps in the first matrix by means of feature clustering; a third matrix unit 603, configured to perform interest-based pooling on the second matrix to obtain a third matrix; a feature map unit 604, configured to perform full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and an identifying unit 605, configured to identify the type of the target object based on the feature map.
In an embodiment of the present application, the apparatus 600 further includes: a size acquisition unit for acquiring the size of the image to be identified; and the size adjusting unit is used for adjusting the size of the image to be identified based on the size and the set size to obtain the adjusted image to be identified.
In an embodiment of the present application, the first matrix unit 601 is configured to: input the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network; partition the output matrices to obtain a preset number of matrix blocks; extract the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix; and sum the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In an embodiment of the present application, the second matrix unit 602 is configured to: input the first matrix into a convolution network to obtain a first convolution result, and perform non-monotonic neural activation processing on the first convolution result to obtain a second convolution result; perform classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and perform regression-based convolution processing on the second convolution result to obtain a fourth convolution result; perform non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cut the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps.
In an embodiment of the present application, the third matrix unit 603 is configured to: partition each second matrix to obtain corresponding blocks; and perform maximum pooling on the blocks, selecting the maximum value in each set area of a block, to obtain the third matrix.
In an embodiment of the present application, the apparatus further includes: a vehicle type identification unit, configured to identify the type of the vehicle in a captured vehicle image to obtain a vehicle type identifier; and a scene simulation unit, configured to simulate a traveling scene of the vehicle based on the vehicle type identifier.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 7 is a schematic diagram of an apparatus for identifying an object type according to an embodiment of the present application. As shown in fig. 7, the apparatus 7 for identifying an object type of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70. The processor 70, when executing the computer program 72, implements the steps of the various method embodiments described above for identifying object types, such as the steps shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, performs the functions of the modules/units of the apparatus embodiments described above.
By way of example, the computer program 72 may be partitioned into one or more modules/units that are stored in the memory 71 and executed by the processor 70 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program 72 in the object type identifying device 7.
The device 7 for identifying the object type may be a computing device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. It will be appreciated by those skilled in the art that fig. 7 is merely an example of the device 7 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal device may further include an input-output device, a network access device, a bus, and the like.
The processor 70 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), an off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the device 7 for identifying the object type, such as a hard disk or memory of the device 7. The memory 71 may also be an external storage device of the device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card (FC) provided on the device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the device 7. The memory 71 is used to store the computer program as well as other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the above-described method of identifying an object type.
It should be noted that the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying a computer-readable program. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic or optical forms, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. With this understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, and the computer program may be stored in a computer-readable storage medium.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (8)

1. A method of identifying a type of object, comprising:
inputting an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image;
inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to a candidate-box feature map in the first matrix by means of feature clustering; generating candidate boxes by using anchor boxes obtained in advance through kmeans++ clustering; wherein the distance function used when clustering the anchor boxes with kmeans++ is 1-IOU, where IOU = I/U, I represents the intersection area of two anchor boxes, and U represents the union area of the two anchor boxes;
performing interest-based pooling on the second matrix to obtain a third matrix;
performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object;
identifying a type of the target object based on the feature map; wherein,
the inputting the image to be identified into a cascade network formed by at least two backbone networks to obtain a first matrix corresponding to a target object in the image includes:
inputting the image to be identified into the feature extraction network corresponding to each backbone network respectively, to obtain an output matrix corresponding to each backbone network;
partitioning the output matrix to obtain a preset number of matrix blocks;
extracting the maximum pixel value from the matrix block to obtain a pooling matrix corresponding to each output matrix;
summing the pooling matrices corresponding to the backbone networks to obtain the first matrix;
the inputting the first matrix into a candidate area network and obtaining a second matrix corresponding to the candidate-box feature map in the first matrix by means of feature clustering includes the following steps:
inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result;
performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result;
performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result;
cutting the suppression result to obtain matrices corresponding to a preset number of candidate-box feature maps;
and the performing full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object includes:
performing full-connection correction activation on the third matrix, and then passing the result through two different branches, namely full connection and full-connection activation, to obtain the feature map corresponding to the target object; wherein the full-connection correction activation is full-connection plus ReLU activation processing, and the full-connection activation is full-connection plus Softmax activation processing.
2. The method for identifying object types according to claim 1, wherein before inputting the image to be identified into a cascade network consisting of at least two backbone networks to obtain the first matrix corresponding to the target object in the image, the method further comprises:
acquiring the size of the image to be identified;
and adjusting the size of the image to be identified based on the size and the set size to obtain the adjusted image to be identified.
3. The method of identifying object types of claim 1, wherein pooling the second matrix based on interest to obtain a third matrix comprises:
partitioning the second matrices to obtain blocks corresponding to each second matrix;
and performing maximum pooling on the blocks, selecting the maximum value in a set area of each block, to obtain the third matrix.
4. The method of identifying an object type of claim 1, further comprising:
identifying the type of the vehicle in a captured vehicle image to obtain a type identifier of the vehicle;
and simulating a traveling scene of the vehicle based on the type identification of the vehicle.
5. An apparatus for identifying a type of object, comprising:
The first matrix unit is used for inputting the image to be identified into a cascade network formed by at least two backbone networks to obtain a first matrix corresponding to the target object in the image;
the second matrix unit is used for inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate-box feature map in the first matrix by means of feature clustering; generating candidate boxes by using anchor boxes obtained in advance through kmeans++ clustering; wherein the distance function used when clustering the anchor boxes with kmeans++ is 1-IOU, where IOU = I/U, I represents the intersection area of two anchor boxes, and U represents the union area of the two anchor boxes;
the third matrix unit is used for performing interest-based pooling on the second matrix to obtain a third matrix;
the feature map unit is used for carrying out full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object;
an identifying unit, configured to identify a type of the target object based on the feature map; wherein,
the inputting the image to be identified into a cascade network formed by at least two backbone networks to obtain a first matrix corresponding to a target object in the image includes:
respectively inputting the image to be identified into the feature extraction network corresponding to each backbone network to obtain an output matrix corresponding to each backbone network;
partitioning each output matrix to obtain a preset number of matrix blocks;
extracting the maximum pixel value from each matrix block to obtain a pooling matrix corresponding to each output matrix;
and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix (a fusion sketch follows this claim);
the inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate frame feature map in the first matrix in a feature clustering manner, comprises:
inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result;
performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result;
performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result;
cropping the suppression result to obtain the matrices corresponding to a preset number of candidate frame feature maps;
and the performing full-connection and activation processing on the third matrix to obtain the feature map corresponding to the target object comprises:
performing full-connection rectification activation on the third matrix, and then performing full-connection activation in two different branches, so as to obtain the feature map corresponding to the target object; wherein the full-connection rectification activation is a full-connection plus ReLU activation process, and the full-connection activation is a full-connection plus Softmax activation process.
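Claim 5's cascade — each backbone's output matrix is block-partitioned, reduced to its maximum pixel values, and the resulting pooling matrices are summed into the first matrix — can be sketched like this. The 2×2 pooling block and the assumption that all backbones produce equally sized outputs are illustrative choices, not stated in the patent:

```python
import numpy as np

def max_pool(matrix, block=2):
    """Partition into block x block cells and keep each cell's maximum pixel value."""
    h, w = matrix.shape
    trimmed = matrix[:h // block * block, :w // block * block]
    return trimmed.reshape(h // block, block, w // block, block).max(axis=(1, 3))

def fuse_backbones(output_matrices):
    """Sum the pooling matrices of all backbone networks into the first matrix."""
    return np.sum([max_pool(m) for m in output_matrices], axis=0)
```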
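The anchor-clustering detail is the most concrete formula in the claim: kmeans++ under the distance 1 − IoU, with IoU = I/U. Below is a sketch that clusters (width, height) pairs YOLO-style, aligning boxes at a common origin so the IoU needs only widths and heights; the Lloyd-style refinement loop and all parameter values are assumptions:

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU = I / U for one (w, h) box against k clusters, all anchored at the origin."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeanspp_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with kmeans++ seeding under the 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centers = [boxes[rng.integers(len(boxes))]]
    while len(centers) < k:                       # kmeans++ seeding
        c = np.asarray(centers)
        d = np.array([np.min(1.0 - iou_wh(b, c)) for b in boxes])
        centers.append(boxes[rng.choice(len(boxes), p=d / d.sum())])
    centers = np.asarray(centers)
    for _ in range(iters):                        # Lloyd refinement
        assign = np.array([np.argmin(1.0 - iou_wh(b, centers)) for b in boxes])
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers
```

Called with the (w, h) sizes of the training-set ground-truth boxes, the returned centres play the role of the pre-clustered anchor boxes that the candidate area network uses to generate candidate frames.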
6. The apparatus for identifying an object type as in claim 5, further comprising:
a size acquisition unit, configured to acquire the size of the image to be identified;
and a size adjusting unit, configured to adjust the size of the image to be identified based on the acquired size and a set size to obtain an adjusted image to be identified.
7. An apparatus for identifying an object type, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method for identifying an object type according to any one of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of identifying an object type according to any one of claims 1 to 4.
CN202011004689.7A 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type Active CN112132215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011004689.7A CN112132215B (en) 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type

Publications (2)

Publication Number Publication Date
CN112132215A CN112132215A (en) 2020-12-25
CN112132215B true CN112132215B (en) 2024-04-16

Family

ID=73842649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011004689.7A Active CN112132215B (en) 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type

Country Status (1)

Country Link
CN (1) CN112132215B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733652B (en) * 2020-12-31 2024-04-19 深圳赛安特技术服务有限公司 Image target recognition method, device, computer equipment and readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384112A (en) * 2016-09-08 2017-02-08 西安电子科技大学 Rapid image text detection method based on multi-channel and multi-dimensional cascade filter
CN109284757A (en) * 2018-08-31 2019-01-29 湖南星汉数智科技有限公司 A kind of licence plate recognition method, device, computer installation and computer readable storage medium
CN109919045A (en) * 2019-02-18 2019-06-21 北京联合大学 Small scale pedestrian detection recognition methods based on concatenated convolutional network
CN110070536A (en) * 2019-04-24 2019-07-30 南京邮电大学 A kind of pcb board component detection method based on deep learning
CN110717481A (en) * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network
CN111199543A (en) * 2020-01-07 2020-05-26 南京航空航天大学 Refrigerator-freezer surface defect detects based on convolutional neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cascaded Detection Framework Based on a Novel Backbone Network and Feature Fusion; Zhuangzhuang Tian et al.; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; 2019-07-03; vol. 12, no. 9; full text *
Research on Intelligent Recognition and Sorting Technology for Specific Targets; Huang Wanjun; China Master's Theses Full-text Database, Engineering Science and Technology I (monthly); 2020-07-15; no. 07; full text *
Computer-Vision-Based Component Classification, Extraction and Defect Detection for High-Speed Railway Catenary Support Devices; Han Ye; China Doctoral Dissertations Full-text Database, Engineering Science and Technology II (monthly); 2020-03-15; no. 03; full text *
End-to-End License Plate Recognition with Multi-Level Fine-Grained Feature Fusion; Zhang Wenchao et al.; Journal of Shenyang Ligong University; 2018-10-31; vol. 37, no. 5; full text *

Also Published As

Publication number Publication date
CN112132215A (en) 2020-12-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant