CN112132215A - Method and device for identifying object type and computer readable storage medium - Google Patents

Method and device for identifying object type and computer readable storage medium

Info

Publication number
CN112132215A
CN112132215A (application CN202011004689.7A)
Authority
CN
China
Prior art keywords
matrix
image
identifying
type
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011004689.7A
Other languages
Chinese (zh)
Other versions
CN112132215B (en)
Inventor
吴晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202011004689.7A priority Critical patent/CN112132215B/en
Publication of CN112132215A publication Critical patent/CN112132215A/en
Application granted granted Critical
Publication of CN112132215B publication Critical patent/CN112132215B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Abstract

The application is applicable to the technical field of computer applications, and provides a method and a device for identifying object types, wherein the method comprises the following steps: inputting an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix; performing interest-based pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map. Through these steps the features of the target object can be obtained more comprehensively, improving the comprehensiveness and accuracy of feature recognition.

Description

Method and device for identifying object type and computer readable storage medium
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a method and an apparatus for identifying an object type, and a computer-readable storage medium.
Background
The vehicle model (style) is an attribute of a vehicle and an important, indispensable item of information in vehicle identity authentication; automatic recognition of the vehicle model has become an important link in urban intelligent traffic management systems. In the related art, the type or model of a vehicle is generally detected by image recognition. In practical applications, however, the driving environment changes easily, for example in haze, rain, or at night, which interferes with the accuracy of vehicle model recognition and readily leads to inaccurate, inefficient recognition.
Disclosure of Invention
The embodiments of the present application provide a method and a device for identifying object types, which can alleviate the problems of inaccurate and inefficient vehicle model recognition.
In a first aspect, an embodiment of the present application provides a method for identifying a type of an object, including: inputting an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix; performing interest-based pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map.
In a possible implementation manner of the first aspect, before the inputting of the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to a target object in the image, the method further includes: acquiring the size of the image to be identified; and adjusting the size of the image to be identified based on that size and a set size, to obtain the adjusted image to be identified.
In a possible implementation manner of the first aspect, the inputting of the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image includes: respectively inputting the image to be identified into the feature extraction network corresponding to each backbone network to obtain the output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain the pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the first aspect, the inputting of the first matrix into a candidate area network and obtaining, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix includes: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation on the first convolution result to obtain a second convolution result; performing classification-based convolution and mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
In a possible implementation manner of the first aspect, the performing of interest-based pooling on the second matrix to obtain a third matrix includes: partitioning each second matrix into blocks to obtain the blocks corresponding to each second matrix; and performing maximum pooling on the blocks, selecting the maximum value within each set region of a block, to obtain the third matrix.
In a possible implementation manner of the first aspect, the method further includes: identifying the vehicle model type in a captured vehicle image to obtain a vehicle model type identifier; and simulating a traveling scene of the vehicle based on the vehicle model type identifier.
In a second aspect, an embodiment of the present application provides an apparatus for identifying a type of an object, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program: inputting an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; inputting the first matrix into a candidate area network, and obtaining, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix; performing interest-based pooling on the second matrix to obtain a third matrix; performing full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and identifying the type of the target object based on the feature map.
In a possible implementation manner of the second aspect, before the inputting of the image to be identified into a cascade network composed of at least two backbone networks to obtain the first matrix corresponding to the target object in the image, the steps further include: acquiring the size of the image to be identified; and adjusting the size of the image to be identified based on that size and a set size, to obtain the adjusted image to be identified.
In a possible implementation manner of the second aspect, the inputting of the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image includes: respectively inputting the image to be identified into the feature extraction network corresponding to each backbone network to obtain the output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain the pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the second aspect, the inputting of the first matrix into a candidate area network and obtaining, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix includes: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation on the first convolution result to obtain a second convolution result; performing classification-based convolution and mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
In a possible implementation manner of the second aspect, the performing of interest-based pooling on the second matrix to obtain a third matrix includes: partitioning each second matrix into blocks to obtain the blocks corresponding to each second matrix; and performing maximum pooling on the blocks, selecting the maximum value within each set region of a block, to obtain the third matrix.
In a possible implementation manner of the second aspect, the steps further include: identifying the vehicle model type in a captured vehicle image to obtain a vehicle model type identifier; and simulating a traveling scene of the vehicle based on the vehicle model type identifier.
In a third aspect, an embodiment of the present application provides an apparatus for identifying a type of an object, including: a first matrix unit, configured to input an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image; a second matrix unit, configured to input the first matrix into a candidate area network and obtain, by feature clustering, a second matrix corresponding to each candidate box feature map in the first matrix; a third matrix unit, configured to perform interest-based pooling on the second matrix to obtain a third matrix; a feature map unit, configured to perform full-connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; and an identification unit, configured to identify the type of the target object based on the feature map.
In a possible implementation manner of the third aspect, the apparatus further includes: a size acquisition unit, configured to acquire the size of the image to be identified; and a size adjustment unit, configured to adjust the size of the image to be identified based on that size and a set size, to obtain the adjusted image to be identified.
In a possible implementation manner of the third aspect, the first matrix unit is configured to: respectively input the image to be identified into the feature extraction network corresponding to each backbone network to obtain the output matrix corresponding to each backbone network; partition the output matrices to obtain a preset number of matrix blocks; extract the maximum pixel value from each matrix block to obtain the pooling matrix corresponding to each output matrix; and sum the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In a possible implementation manner of the third aspect, the second matrix unit is configured to: input the first matrix into a convolution network to obtain a first convolution result, and perform non-monotonic neural activation on the first convolution result to obtain a second convolution result; perform classification-based convolution and mapping activation on the second convolution result to obtain a third convolution result, and perform regression-based convolution on the second convolution result to obtain a fourth convolution result; perform non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cut the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
In a possible implementation manner of the third aspect, the third matrix unit is configured to: partition each second matrix into blocks to obtain the blocks corresponding to each second matrix; and perform maximum pooling on the blocks, selecting the maximum value within each set region of a block, to obtain the third matrix.
In a possible implementation manner of the third aspect, the apparatus further includes: a vehicle model identification unit, configured to identify the vehicle model type based on a captured vehicle image to obtain a vehicle model type identifier; and a scene simulation unit, configured to simulate a traveling scene of the vehicle based on the vehicle model type identifier.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method for identifying an object type according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiments of the present application have the following advantages: the images to be identified are processed to obtain images of a uniform size; a first matrix corresponding to the target object is generated by a cascade network formed from a plurality of backbone networks; feature clustering is performed on the candidate box feature maps in the first matrix to obtain a second matrix; interest-based pooling is performed on the second matrix to obtain a third matrix; and finally full-connection and activation processing is performed on the third matrix to obtain the feature map corresponding to the target object, so that the type of the target object is identified based on the feature map. Through these steps the features of the target object can be obtained more comprehensively, improving the comprehensiveness and accuracy of feature recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flow chart of a method for identifying an object type according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of image feature extraction in a multi-backbone cascade network according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating feature clustering performed on a matrix based on a candidate area network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of pooling matrices according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a vehicle model recognition algorithm based on a cascaded multi-backbone network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an apparatus for identifying an object type according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of an apparatus for identifying an object type according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Referring to fig. 1, fig. 1 is a flowchart of a method for identifying a type of an object according to an embodiment of the present application. The main execution body of the method for identifying the object type in this embodiment is a device having a function of identifying the object type, and includes, but is not limited to, a computer, a server, a tablet computer, or a terminal. The method of identifying the type of object as shown in the figure may comprise the steps of:
s110: and inputting the image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to the target object in the image.
In an embodiment of the present application, before the type of the target object is identified, an image of the target object is acquired as the image to be identified. After the image to be identified is acquired, it may be preprocessed; the preprocessing may include image size adjustment, image gray-scale adjustment, and the like.
Further, in an embodiment of the present application, before the step, the following step may be further included: acquiring the size of the image to be identified; and adjusting the size of the image to be recognized based on the size and the set size to obtain the adjusted image to be recognized.
Specifically, because the images to be identified come in different sizes, or the camera focal length is inconsistent between shots, the input image sizes are easily inconsistent.
In this embodiment, the original input image to be identified may first undergo a size transformation that preserves the aspect ratio, so as to obtain an image to be identified of a preset size. For example, the preset size in this embodiment may be 1000 × 600, and other sizes may also be configured. The image of the corresponding size can be obtained by stretching, compressing, or cropping.
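This size transformation can be sketched in a few lines (a minimal Python/NumPy sketch; the nearest-neighbour resampling, the zero-padding of the remainder, and the name resize_keep_aspect are illustrative assumptions, since the embodiment only requires that the aspect ratio be preserved while reaching the set size, e.g. 1000 × 600):

```python
import numpy as np

def resize_keep_aspect(image: np.ndarray, target_w: int = 1000, target_h: int = 600) -> np.ndarray:
    """Scale an H x W x C image to fit inside target_w x target_h without
    changing its aspect ratio, then zero-pad the remainder."""
    h, w = image.shape[:2]
    scale = min(target_w / w, target_h / h)            # largest scale that still fits
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    # Nearest-neighbour resampling keeps the sketch dependency-free.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    padded = np.zeros((target_h, target_w, image.shape[2]), dtype=image.dtype)
    padded[:new_h, :new_w] = resized                   # top-left anchored
    return padded
```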
In an embodiment of the application, after the image to be identified is acquired, it is input into a preset cascade network formed from at least two backbone networks, which processes the target object in the image and yields a first matrix containing the features of the target object.
Further, step S110 specifically includes the following steps: respectively inputting the image to be identified into the feature extraction network corresponding to each backbone network to obtain the output matrix corresponding to each backbone network; partitioning the output matrices to obtain a preset number of matrix blocks; extracting the maximum pixel value from each matrix block to obtain the pooling matrix corresponding to each output matrix; and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In an embodiment of the present application, the vehicle features in the image are extracted through a backbone cascade network (CascadeNet), yielding a feature map matrix FMap.
Fig. 2 is a schematic diagram of image feature extraction in a multi-backbone cascade network according to an embodiment of the present application.
As shown in fig. 2, the input 210 is the matrix corresponding to the image to be identified after the original input image has undergone the size transformation; Backbone_1 (220), Backbone_2, and Backbone_n respectively represent n different feature extraction networks.
Illustratively, Backbone_1 may be DarkNet53, Backbone_2 may be ResNet50, Backbone_n may be EfficientNet, and so on. Block_1, Block_2, and Block_n denote the blocking operations, yielding blocks 1 to n (230); the maximum pooling operations (240) are Max Pooling_1, Max Pooling_2, and Max Pooling_n; finally the output is obtained by the summing operation (250).
Specifically, the DarkNet53 network is built from 1 × 1 and 3 × 3 convolutions and contains 53 convolution layers. The ResNet50 network first applies a convolution to the input, then contains four residual stages, and ends with a full-connection operation to support the classification task. EfficientNet is obtained by compound scaling, jointly scaling the network width, the network depth, and the input image resolution to obtain a better network.
Exemplarily, assume n is 3 and the input matrix size is 1000 × 600 × 512. The output matrix sizes after Backbone_1, Backbone_2, and Backbone_3 are 300 × 400 × 128, 500 × 600 × 256, and 700 × 800 × 512, respectively. Blocking is then performed, with a configurable block grid, for example 60 × 40 blocks. Maximum pooling follows, taking the maximum pixel value in each block and producing output matrices of sizes 60 × 40 × 128, 60 × 40 × 256, and 60 × 40 × 512. Finally, the three output matrices are combined along the channel dimension (the channel arithmetic shows that this "summing" is a channel-wise stacking), giving a final output matrix of size 60 × 40 × (128 + 256 + 512) = 60 × 40 × 896.
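The blocking, per-block maximum pooling, and channel-wise combination described above can be sketched as follows (a minimal Python/NumPy sketch; the function names are illustrative, the inputs are assumed to be the backbone output matrices with H ≥ 60 and W ≥ 40, and channel concatenation is used so that the 60 × 40 × 896 arithmetic holds):

```python
import numpy as np

def block_max_pool(feat: np.ndarray, out_h: int = 60, out_w: int = 40) -> np.ndarray:
    """Partition an H x W x C output matrix into out_h x out_w blocks
    (block sizes rounded down) and keep the maximum pixel of each block."""
    h, w, _ = feat.shape
    bh, bw = h // out_h, w // out_w
    pooled = np.empty((out_h, out_w, feat.shape[2]), dtype=feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = feat[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].max(axis=(0, 1))
    return pooled

def cascade_features(backbone_outputs: list) -> np.ndarray:
    """Pool every backbone output to the common 60 x 40 grid and stack the
    results along the channel axis, e.g. 128 + 256 + 512 = 896 channels."""
    return np.concatenate([block_max_pool(f) for f in backbone_outputs], axis=-1)
```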
S120: and inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to the candidate frame feature map in the first matrix in a feature clustering mode.
In an embodiment of the present application, after the first matrix containing the target object is obtained, it is input into the candidate area network, and the second matrix corresponding to each candidate box feature map in the first matrix is obtained by feature clustering.
Specifically, the process in step S120 includes the following steps: inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation on the first convolution result to obtain a second convolution result; performing classification-based convolution and mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution on the second convolution result to obtain a fourth convolution result; performing non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cutting the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
In an embodiment of the present application, a region proposal network (RPN, the candidate area network) is run on the feature map FMap, and all candidate boxes CBox are generated using anchor boxes obtained in advance by kmeans++ clustering. Specifically, in this embodiment, the distance function used when clustering the anchor boxes with the kmeans++ algorithm is 1 - IOU, where IOU = I/U, I is the intersection area of two anchor boxes, and U is their union area.
In an embodiment of the present application, kmeans++ is an improvement on the original kmeans clustering algorithm; it mainly replaces the uniformly random selection of the k initial cluster centers in kmeans with roulette-wheel selection, where k is 9, for example.
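A sketch of the anchor-box clustering under these definitions follows (Python/NumPy; the centre-aligned IOU on (w, h) pairs and a roulette selection probability proportional to the 1 - IOU distance are assumptions (classic kmeans++ weights by the squared distance), and the function names are illustrative):

```python
import numpy as np

def iou_wh(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IOU between one (w, h) pair `a` and an N x 2 array `b`, treating all
    boxes as centre-aligned, so I and U depend only on widths and heights."""
    inter = np.minimum(a[0], b[:, 0]) * np.minimum(a[1], b[:, 1])
    union = a[0] * a[1] + b[:, 0] * b[:, 1] - inter
    return inter / union

def kmeanspp_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth (w, h) pairs with kmeans++ under d = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centers = wh[[rng.integers(len(wh))]]              # first centre: uniform draw
    while len(centers) < k:
        # distance of every sample to its nearest centre so far
        d = np.min([1 - iou_wh(c, wh) for c in centers], axis=0)
        centers = np.vstack([centers, wh[rng.choice(len(wh), p=d / d.sum())]])
    for _ in range(iters):                             # standard kmeans refinement
        dist = np.stack([1 - iou_wh(c, wh) for c in centers])
        assign = dist.argmin(axis=0)
        centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return centers                                     # k anchor (w, h) pairs
```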
Fig. 3 is a schematic diagram of feature clustering performed on a matrix based on a candidate area network according to an embodiment of the present application.
The specific execution flow is shown in fig. 3. The input 310 is the feature map matrix FMap. It passes through convolution plus a non-monotonic neural activation function (CM) (320); then through the classification branch, convolution plus a mapping activation function (CS) (330), and the regression branch, convolution (340); non-maximum suppression (NMS) (350) follows; finally the input feature map is cut (360) using the coordinates of the candidate boxes retained after NMS screening.
In one embodiment of the present application, assume the input matrix size is 60 × 40 × 512. The matrix obtained by executing CM is still 60 × 40 × 512. The CS and Conv operations of the two branches, cls and reg, are then performed to obtain the classification matrix and the coordinate matrix of all candidate boxes, with sizes 60 × 40 × (9 × 2) = 60 × 40 × 18 (9 is the number of anchor boxes, 2 the foreground/background binary classification) and 60 × 40 × (9 × 4) = 60 × 40 × 36 (9 is the number of anchor boxes, 4 the coordinates of each anchor box, i.e. the center coordinates x, y and the width and height w, h). NMS is then performed on all candidate boxes: sort them in descending order of classification probability, retain the box with the highest probability, delete the redundant boxes whose IOU with it exceeds 0.5, and repeat on the remaining boxes until the top n (configurable, for example 300) candidate boxes are kept. Finally, the Cut operation is performed on the input feature map FMap with the n screened candidate boxes to obtain the final output matrix Output (i.e., the candidate box feature map matrices).
In the cutting process, assume the size of the input matrix Input is 60 × 40 × 512 and the coordinates of one of the 300 candidate boxes screened by NMS are (9, 15, 20, 30). The Cut operation then cuts, at position (9, 15) on the input matrix (i.e., the feature map FMap), a candidate box with width and height (20, 30), namely a candidate box feature map matrix of size 20 × 30. Similarly, 300 candidate box feature map matrices of different sizes are obtained.
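The NMS screening described above can be sketched as follows (Python/NumPy; boxes are assumed to be given as (x, y, w, h) with (x, y) the top-left corner, matching the Cut example, and the function name nms is illustrative):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5, top_n: int = 300) -> np.ndarray:
    """Greedy non-maximum suppression: sort by classification probability,
    keep the best box, delete boxes overlapping it with IOU > iou_thr,
    repeat; return the indices of at most top_n kept boxes."""
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    areas = boxes[:, 2] * boxes[:, 3]
    order = scores.argsort()[::-1]                     # descending probability
    keep = []
    while order.size and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        iw = np.maximum(0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]                   # drop redundant boxes
    return np.array(keep)
```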
S130: and performing interest-based pooling on the second matrix to obtain a third matrix.
In an embodiment of the present application, after the second matrix is obtained, interest-based pooling is performed on it to obtain the third matrix. The specific process comprises: partitioning each second matrix into blocks to obtain the blocks corresponding to each second matrix; and performing maximum pooling on the blocks, selecting the maximum value within each set region of a block, to obtain the third matrix.
Fig. 4 is a schematic diagram of pooling matrices according to an embodiment of the present disclosure.
As shown in fig. 4, in an embodiment of the present application, the input 410 consists of all candidate boxes CBox screened in the previous step. The RoIPooling operation is then performed: all the differently sized candidate boxes CBox obtained in the previous step are uniformly partitioned to a preset grid size (420), for example 7 × 7. When the candidate box feature maps are unified in this way, the 300 of them form one candidate box feature map matrix of 300 × 7 × 7 × 512, i.e. 300 × M × N × 512 with M = N = 7. Assume Input (i.e., CBox) has 300 candidate boxes in total and one of them is 20 × 30 × 512. A blocking (420) operation first divides the 20 × 30 × 512 matrix into 7 × 7 fixed-size blocks, rounding down, so each block has size (20/7) × (30/7) ≈ 2.86 × 4.29, truncated to 2 × 4. A maximum pooling (430) operation is then performed on each block, i.e. only the largest of the 2 × 4 = 8 pixels is retained. The final output matrix for this candidate box, after the RoIPooling operation, has size 7 × 7 × 512. Since there are 300 candidate boxes in total, the final output matrix (440) has size 300 × 7 × 7 × 512.
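The RoIPooling of one cropped candidate box can be sketched as follows (Python/NumPy; block sizes are truncated by integer division exactly as in the 20/7 and 30/7 example, so in this sketch edge pixels beyond the rounded blocks are dropped and crops are assumed to be at least 7 × 7; the name roi_pool is illustrative):

```python
import numpy as np

def roi_pool(crop: np.ndarray, out: int = 7) -> np.ndarray:
    """Max-pool one cropped candidate box feature map (h x w x C) onto a
    fixed out x out grid; block sizes use integer division, e.g.
    20 // 7 = 2 and 30 // 7 = 4 for a 20 x 30 crop."""
    h, w, c = crop.shape
    bh, bw = max(h // out, 1), max(w // out, 1)
    pooled = np.empty((out, out, c), dtype=crop.dtype)
    for i in range(out):
        for j in range(out):
            pooled[i, j] = crop[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].max(axis=(0, 1))
    return pooled

# Stacking the 300 pooled crops yields the 300 x 7 x 7 x 512 output above.
```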
S140: and carrying out full connection and activation processing on the third matrix to obtain a characteristic diagram corresponding to the target object.
In an embodiment of the application, regression (reg) of the vehicle detection box and classification (cls) of the vehicle model are performed through two shared FCR layers (full connection + ReLU activation) followed by two different branches, FCC (full connection, for regression) and FCS (full connection + Softmax activation, for classification), obtaining the feature map corresponding to the vehicle.
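A sketch of these heads follows (Python/NumPy; the parameter dictionary, layer widths, and the function name heads are assumptions, since the embodiment specifies only the FCR/FCC/FCS structure):

```python
import numpy as np

def heads(x: np.ndarray, p: dict) -> tuple:
    """Two shared FCR layers (full connection + ReLU), then an FCC branch
    (full connection) for box regression and an FCS branch (full connection
    + Softmax) for vehicle model classification. `p` holds weights/biases."""
    relu = lambda v: np.maximum(v, 0)
    h = relu(x.reshape(x.shape[0], -1) @ p["w1"] + p["b1"])     # FCR layer 1
    h = relu(h @ p["w2"] + p["b2"])                             # FCR layer 2
    reg = h @ p["w_reg"] + p["b_reg"]                           # FCC: box deltas
    logits = h @ p["w_cls"] + p["b_cls"]                        # FCS, pre-Softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    cls = e / e.sum(axis=1, keepdims=True)                      # Softmax
    return reg, cls
```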
S150: and identifying the type of the target object based on the feature map.
In an embodiment of the application, the vehicle coordinates predicted on the feature map are mapped back to coordinates on the original input image, thereby realizing vehicle model recognition for the image.
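This mapping back to image coordinates can be sketched as follows (a hypothetical helper; the horizontal and vertical strides, e.g. 1000/60 and 600/40 for a 60 × 40 feature map of a 1000 × 600 input, and the resize scale factor are assumptions, since the embodiment does not give them explicitly):

```python
def map_box_to_image(box_fm, stride_x=1000 / 60, stride_y=600 / 40, scale=1.0):
    """Map an (x, y, w, h) box from feature-map coordinates back onto the
    original input image: undo the feature-map stride, then undo the
    aspect-preserving resize applied before the network (both assumed)."""
    x, y, w, h = box_fm
    return (x * stride_x / scale, y * stride_y / scale,
            w * stride_x / scale, h * stride_y / scale)
```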
In an embodiment of the application, the original feature extraction network is improved from a single backbone network to a multi-backbone cascade network, which markedly strengthens feature expression in difficult scenes such as haze, rain, night, and side views of vehicles, and thus improves the overall accuracy and recall of vehicle model recognition. The anchor boxes, originally set manually, are instead obtained by kmeans++ automatic clustering, which greatly improves their quality, thereby improving the positioning precision of the detection boxes and further improving the overall accuracy and recall of vehicle model recognition.
Fig. 5 is a schematic diagram of a vehicle model recognition algorithm based on a cascaded multi-backbone network according to an embodiment of the present application.
In an embodiment of the present application, as shown in fig. 5, the image to be identified 510 undergoes size transformation 520, and the original feature extraction network is improved from a single backbone network to the multi-backbone cascade network 530, which markedly strengthens feature expression in difficult scenes such as haze, rain, night, and side views of vehicles and thus improves the overall accuracy and recall of vehicle model recognition. After the feature map matrix FMap 540 is obtained through the cascade network 530, the anchor boxes are obtained by kmeans++ automatic clustering instead of the original manual setting; the feature map matrix FMap 540 is input into the region proposal network RPN (550) to obtain the candidate boxes CBox (560); finally the candidate boxes CBox undergo pooling 570 and then full connection + ReLU activation FCR (580), followed by full connection + Softmax activation FCS (590) and full connection FC (511). This greatly improves the generation quality of the anchor boxes, improves the positioning precision of the detection boxes, and further improves the overall accuracy and recall of vehicle model recognition.
In the embodiment of the application, the images to be identified are processed to obtain images of a uniform size; a first matrix corresponding to the target object is generated by a cascade network formed from a plurality of backbone networks; feature clustering is performed on the candidate box feature maps in the first matrix to obtain a second matrix; interest-based pooling is performed on the second matrix to obtain a third matrix; and finally full-connection and activation processing is performed on the third matrix to obtain the feature map corresponding to the target object, so that the type of the target object is identified based on the feature map. Through these steps the features of the target object can be obtained more comprehensively, improving the comprehensiveness and accuracy of feature recognition.
Referring to fig. 6, fig. 6 is a schematic view of an apparatus for identifying a type of an object according to an embodiment of the present application. The device 600 for identifying the type of the object may be a mobile terminal such as a smart phone or a tablet computer. The units included in the apparatus 600 for identifying an object type of the present embodiment are used to execute the steps in the embodiment corresponding to fig. 1, please refer to fig. 1 and the related description in the embodiment corresponding to fig. 1, which are not repeated herein. The apparatus 600 for identifying the type of an object of the present embodiment includes: the first matrix unit 601 is configured to input an image to be identified into a cascade network composed of at least two backbone networks, so as to obtain a first matrix corresponding to a target object in the image; a second matrix unit 602, configured to input the first matrix into a candidate area network, and obtain a second matrix corresponding to a candidate frame feature map in the first matrix in a feature clustering manner; a third matrix unit 603, configured to perform interest-based pooling on the second matrix to obtain a third matrix; a feature map unit 604, configured to perform full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object; an identifying unit 605, configured to identify a type of the target object based on the feature map.
In an embodiment of the present application, the apparatus 600 further includes: the size acquisition unit is used for acquiring the size of the image to be identified; and the size adjusting unit is used for adjusting the size of the image to be recognized based on the size and the set size to obtain the adjusted image to be recognized.
In an embodiment of the present application, the first matrix unit 601 is configured to: respectively input the image to be identified into the feature extraction network corresponding to each backbone network to obtain the output matrix corresponding to each backbone network; partition the output matrices to obtain a preset number of matrix blocks; extract the maximum pixel value from each matrix block to obtain the pooling matrix corresponding to each output matrix; and sum the pooling matrices corresponding to the backbone networks to obtain the first matrix.
In an embodiment of the present application, the second matrix unit 602 is configured to: input the first matrix into a convolution network to obtain a first convolution result, and perform non-monotonic neural activation on the first convolution result to obtain a second convolution result; perform classification-based convolution and mapping activation on the second convolution result to obtain a third convolution result, and perform regression-based convolution on the second convolution result to obtain a fourth convolution result; perform non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result; and cut the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
In an embodiment of the present application, the third matrix unit 603 is configured to: partition each second matrix into blocks to obtain the blocks corresponding to each second matrix; and perform maximum pooling on the blocks, selecting the maximum value within each set region of a block, to obtain the third matrix.
In an embodiment of the present application, the apparatus further includes: a vehicle model identification unit, configured to identify the vehicle model type based on a captured vehicle image to obtain a vehicle model type identifier; and a scene simulation unit, configured to simulate a traveling scene of the vehicle based on the vehicle model type identifier.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a schematic diagram of an apparatus for identifying a type of an object according to an embodiment of the present application. As shown in fig. 7, the apparatus 7 for identifying the type of an object of this embodiment includes: a processor 70, a memory 71 and a computer program 72 stored in said memory 71 and executable on said processor 70. The processor 70, when executing the computer program 72, implements the steps in the various method embodiments described above for identifying an object type, such as the steps shown in fig. 1. Alternatively, the processor 70 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 72.
Illustratively, the computer program 72 may be partitioned into one or more modules/units that are stored in the memory 71 and executed by the processor 70 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 72 in the apparatus for identifying an object type 7.
The device 7 for identifying the type of the object may be a computing device such as a desktop computer, a notebook, a palm computer, and a cloud server. The terminal device may include, but is not limited to, a processor 70, a memory 71. It will be appreciated by a person skilled in the art that fig. 7 is only an example of the means 7 for identifying the type of object and does not constitute a limitation of the means 7 for identifying the type of object and may comprise more or less components than those shown, or some components may be combined, or different components, e.g. the terminal device may further comprise input and output devices, network access devices, buses, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the object type identification device 7, such as a hard disk or a memory of the object type identification device 7. The memory 71 may also be an external storage device of the object type identification device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (FC), and the like, provided on the object type identification device 7. Further, the memory 71 may also comprise both an internal memory unit and an external memory device of the apparatus for identifying the type of object 7. The memory 71 is used for storing the computer program and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, the computer program comprising program instructions, which, when executed by a processor, cause the processor to execute the above-mentioned method for identifying a type of an object.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above may also be implemented by a computer program, which may be stored in a computer-readable storage medium, to instruct related hardware.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of identifying a type of object, comprising:
inputting an image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image;
inputting the first matrix into a candidate area network, and obtaining a second matrix corresponding to a candidate frame feature map in the first matrix in a feature clustering mode;
performing interest-based pooling on the second matrix to obtain a third matrix;
performing full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object;
and identifying the type of the target object based on the feature map.
2. The method for identifying the type of the object according to claim 1, wherein before inputting the image to be identified into a cascade network composed of at least two backbone networks and obtaining the first matrix corresponding to the target object in the image, the method further comprises:
acquiring the size of the image to be identified;
and adjusting the size of the image to be recognized based on the size and the set size to obtain the adjusted image to be recognized.
3. The method for identifying the type of the object according to claim 1, wherein the inputting the image to be identified into a cascade network composed of at least two backbone networks to obtain a first matrix corresponding to the target object in the image comprises:
respectively inputting the images to be identified into the feature extraction networks corresponding to the backbone networks to obtain output matrixes corresponding to the backbone networks;
partitioning the output matrices to obtain a preset number of matrix blocks;
extracting a maximum pixel value from the matrix block to obtain a pooling matrix corresponding to each output matrix;
and summing the pooling matrices corresponding to the backbone networks to obtain the first matrix.
4. The method for identifying the type of the object according to claim 1, wherein the inputting the first matrix into the candidate area network and obtaining the second matrix corresponding to the candidate box feature map in the first matrix by means of feature clustering comprises:
inputting the first matrix into a convolution network to obtain a first convolution result, and performing non-monotonic neural activation processing on the first convolution result to obtain a second convolution result;
performing classification-based convolution mapping activation on the second convolution result to obtain a third convolution result, and performing regression-based convolution processing on the second convolution result to obtain a fourth convolution result;
carrying out non-maximum suppression on the third convolution result and the fourth convolution result to obtain a suppression result;
and cutting the suppression result to obtain the matrices corresponding to a preset number of candidate box feature maps.
5. The method of identifying a type of object of claim 1, wherein the performing of interest-based pooling on the second matrix to obtain a third matrix comprises:
carrying out block processing on the second matrixes to obtain blocks corresponding to the second matrixes;
and performing maximum pooling on the blocks, and selecting the maximum value in the set area of the blocks to obtain the third matrix.
6. The method of identifying an object type according to claim 1, further comprising:
identifying the vehicle model type in a captured vehicle image to obtain a vehicle model type identifier;
and simulating a traveling scene of the vehicle based on the vehicle model type identifier.
7. An apparatus for identifying a type of an object, comprising:
the first matrix unit is used for inputting the image to be identified into a cascade network consisting of at least two backbone networks to obtain a first matrix corresponding to a target object in the image;
the second matrix unit is used for inputting the first matrix into a candidate area network and obtaining a second matrix corresponding to a candidate frame feature map in the first matrix in a feature clustering mode;
the third matrix unit is used for performing interest-based pooling on the second matrix to obtain a third matrix;
the feature map unit is used for carrying out full connection and activation processing on the third matrix to obtain a feature map corresponding to the target object;
and the identification unit is used for identifying the type of the target object based on the feature map.
8. The apparatus for identifying a type of object according to claim 7, wherein the apparatus further comprises:
the size acquisition unit is used for acquiring the size of the image to be identified;
and the size adjusting unit is used for adjusting the size of the image to be recognized based on the size and the set size to obtain the adjusted image to be recognized.
9. An apparatus for identifying a type of an object, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202011004689.7A 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type Active CN112132215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011004689.7A CN112132215B (en) 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011004689.7A CN112132215B (en) 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type

Publications (2)

Publication Number Publication Date
CN112132215A (en) 2020-12-25
CN112132215B CN112132215B (en) 2024-04-16

Family

ID=73842649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011004689.7A Active CN112132215B (en) 2020-09-22 2020-09-22 Method, device and computer readable storage medium for identifying object type

Country Status (1)

Country Link
CN (1) CN112132215B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733652A (en) * 2020-12-31 2021-04-30 深圳赛安特技术服务有限公司 Image target identification method and device, computer equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384112A (en) * 2016-09-08 2017-02-08 西安电子科技大学 Rapid image text detection method based on multi-channel and multi-dimensional cascade filter
US20180068198A1 (en) * 2016-09-06 2018-03-08 Carnegie Mellon University Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network
CN109284757A (en) * 2018-08-31 2019-01-29 湖南星汉数智科技有限公司 A kind of licence plate recognition method, device, computer installation and computer readable storage medium
CN109919045A (en) * 2019-02-18 2019-06-21 北京联合大学 Small scale pedestrian detection recognition methods based on concatenated convolutional network
CN110070536A (en) * 2019-04-24 2019-07-30 南京邮电大学 A kind of pcb board component detection method based on deep learning
CN110717481A (en) * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network
CN111199543A (en) * 2020-01-07 2020-05-26 南京航空航天大学 Refrigerator-freezer surface defect detects based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068198A1 (en) * 2016-09-06 2018-03-08 Carnegie Mellon University Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network
CN106384112A (en) * 2016-09-08 2017-02-08 西安电子科技大学 Rapid image text detection method based on multi-channel and multi-dimensional cascade filter
CN109284757A (en) * 2018-08-31 2019-01-29 湖南星汉数智科技有限公司 A kind of licence plate recognition method, device, computer installation and computer readable storage medium
CN109919045A (en) * 2019-02-18 2019-06-21 北京联合大学 Small scale pedestrian detection recognition methods based on concatenated convolutional network
CN110070536A (en) * 2019-04-24 2019-07-30 南京邮电大学 A kind of pcb board component detection method based on deep learning
CN110717481A (en) * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 Method for realizing face detection by using cascaded convolutional neural network
CN111199543A (en) * 2020-01-07 2020-05-26 南京航空航天大学 Refrigerator-freezer surface defect detects based on convolutional neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHUANGZHUANG TIAN ET AL.: "Cascaded Detection Framework Based on a Novel Backbone Network and Feature Fusion", 《IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING》, vol. 12, no. 09, 3 July 2019 (2019-07-03) *
ZHANG Wenchao et al.: "End-to-end license plate recognition with multi-level fine-grained feature fusion", Journal of Shenyang Ligong University, vol. 37, no. 05, 31 October 2018 (2018-10-31) *
HAN Ye: "Classification extraction and defect detection of components of high-speed railway catenary support devices based on computer vision", China Doctoral Dissertations Full-text Database, Engineering Science & Technology II (monthly), no. 03, 15 March 2020 (2020-03-15) *
HUANG Wanjun: "Research on intelligent recognition and sorting technology for specific targets", China Masters' Theses Full-text Database, Engineering Science & Technology I (monthly), no. 07, 15 July 2020 (2020-07-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733652A (en) * 2020-12-31 2021-04-30 深圳赛安特技术服务有限公司 Image target identification method and device, computer equipment and readable storage medium
CN112733652B (en) * 2020-12-31 2024-04-19 深圳赛安特技术服务有限公司 Image target recognition method, device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN112132215B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CA3034688A1 (en) Systems and methods for verifying authenticity of id photo
CN111950723A (en) Neural network model training method, image processing method, device and terminal equipment
CN109118519A (en) Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN110148117B (en) Power equipment defect identification method and device based on power image and storage medium
CN116403094B (en) Embedded image recognition method and system
CN111582054A (en) Point cloud data processing method and device and obstacle detection method and device
CN107704797B (en) Real-time detection method, system and equipment based on pedestrians and vehicles in security video
CN112488297B (en) Neural network pruning method, model generation method and device
CN112132216B (en) Vehicle type recognition method and device, electronic equipment and storage medium
CN111553946A (en) Method and device for removing ground point cloud and obstacle detection method and device
CN114332702A (en) Target area detection method and device, storage medium and electronic equipment
CN109871767A (en) Face identification method, device, electronic equipment and computer readable storage medium
CN114648709A (en) Method and equipment for determining image difference information
CN115223042A (en) Target identification method and device based on YOLOv5 network model
CN111145196A (en) Image segmentation method and device and server
CN112132215B (en) Method, device and computer readable storage medium for identifying object type
CN112583900B (en) Data processing method for cloud computing and related product
CN112733652B (en) Image target recognition method, device, computer equipment and readable storage medium
CN112241646A (en) Lane line recognition method and device, computer equipment and storage medium
CN112949629A (en) Target detection method and device, computer equipment and readable storage medium
CN108345007B (en) Obstacle identification method and device
CN112215188A (en) Traffic police gesture recognition method, device, equipment and storage medium
CN115830342A (en) Method and device for determining detection frame, storage medium and electronic device
CN114463512B (en) Point cloud data processing method, vectorization method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant