Detailed Description
Embodiments disclosed in the present specification are described below with reference to the accompanying drawings.
As described above, in the vehicle damage detection and identification process, it is often necessary to identify a vehicle component. For example, in one embodiment, the vehicle component identification and the damage type identification may be performed separately, and the component identification result and the damage type identification result may be finally combined as the vehicle damage detection result. For this reason, it is considered to train a special vehicle component recognition model so as to accurately perform vehicle component recognition.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The computing platform shown in fig. 1 may train a vehicle component recognition model using a training sample set formed of labeled vehicle pictures, where a labeled vehicle picture may be a picture of a damaged or an undamaged vehicle, and the label may include the component category of each component in the picture and the border region where the component is located. After the component recognition model is obtained through training, a scene picture shot by a user can be uploaded to the computing platform, and the vehicle components in the picture are automatically recognized by the component recognition model, to be combined with damage information to determine the vehicle damage condition.
Specifically, according to the embodiments disclosed in the present specification, the vehicle component recognition model is trained to recognize a plurality of vehicle components based on a certain vehicle picture, that is, to perform multi-target simultaneous detection. In order to realize multi-target detection, a plurality of candidate regions of interest, or candidate regions, are identified first, and then the category and the frame of the target object corresponding to each candidate region are determined. According to the embodiment of the specification, in the process of multi-target detection, in order to perform multi-target detection more accurately and identify a plurality of vehicle components simultaneously, space constraint information among the vehicle components is introduced by designing an Attention layer in the component identification model, and then the identification accuracy of the vehicle components is optimized.
The following describes a specific structure of a vehicle component recognition model and a specific process of recognizing a vehicle component in a vehicle picture by an embodiment, with reference to fig. 2 and 3. FIG. 2 illustrates a flow diagram of a method for identifying vehicle components, according to one embodiment. FIG. 3 illustrates a schematic diagram of a use configuration of a vehicle component identification model according to one embodiment. The execution subject of the method may be any device or system or platform with computing and processing capabilities, such as the computing platform shown in fig. 1. As shown in fig. 2, the method comprises the steps of: step S210, obtaining a vehicle picture; step S220, determining a plurality of candidate regions taking at least one vehicle component as a potential target and a plurality of first feature vectors corresponding to the candidate regions based on the vehicle picture; step S230, converting the plurality of first feature vectors into a plurality of second feature vectors; step S240, determining a component category corresponding to each candidate region in the plurality of candidate regions based on the plurality of second feature vectors. The steps are as follows:
first, in step S210, a picture of the vehicle is acquired. The vehicle picture is a vehicle picture to be identified. In one embodiment, the user uploads the vehicle picture in the damage assessment client, and accordingly, the picture uploaded by the user can be obtained in this step.
Next, for the acquired vehicle picture, in step S220, a plurality of candidate regions in which at least one vehicle component is a potential target and a plurality of first feature vectors corresponding to the candidate regions are determined.
In one embodiment, step S220 may be implemented by convolutional layer 31 and RPN network 32 in fig. 3.
Specifically, the convolution layer 31 is configured to receive a vehicle picture, perform convolution processing on it, and generate a convolution feature map corresponding to the picture. From an implementation point of view, convolution layer 31 can also be regarded as a convolutional neural network (CNN).
A convolutional neural network (CNN) is a network structure widely used in the field of image processing; it contains several convolution layers that perform convolution processing on an image. Convolution is a processing operation commonly employed to analyze images: a convolution kernel is applied, in turn, at each pixel of the image. The convolution kernel (operator) is a matrix of weights used to operate on the original pixels; it is typically a small square grid (e.g., a 3 × 3 matrix), with one weight value per cell. To perform a convolution calculation on a picture, the kernel is slid over the picture's pixel matrix; at each step, every element of the kernel is multiplied with the image pixel value it covers, and the products are summed. The matrix of new feature values obtained in this way constitutes the convolution feature map (feature map). Depending on the design of the kernel, the convolution operation can extract abstract features from the pixel matrix of the original picture, reflecting, for example, more global characteristics of a region in the original picture, such as line shape and color distribution.
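As an illustration, the sliding-window computation described above can be sketched as a minimal "valid" convolution in NumPy (no padding or stride; the Laplacian-style kernel and the 5 × 5 test image are illustrative choices, not taken from the embodiments):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply each
    kernel weight with the pixel value it covers and sum the products.
    The matrix of resulting feature values is the convolution feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    fmap = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(fmap.shape[0]):
        for c in range(fmap.shape[1]):
            fmap[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return fmap

# A linear intensity ramp; a Laplacian kernel responds with zero everywhere,
# since it extracts deviations from local linearity (e.g., edges).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]], dtype=float)
fmap = conv2d(image, kernel)
```

A 5 × 5 input convolved with a 3 × 3 kernel yields a 3 × 3 feature map; stacking several such kernels, as in convolution layer 31, yields one feature map per kernel.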
In a more specific embodiment, the convolutional layer 31 comprises one or more convolutional layers, each of which performs a convolution process on the image. After these convolution layer processes, a convolution feature map (feature map) corresponding to the original vehicle image is obtained.
In another more specific embodiment, convolution layer 31 includes a plurality of convolution layers, and between these convolution layers, or after some of them, at least one ReLU (Rectified Linear Unit) activation layer is included for non-linearly mapping the convolution layer outputs. The result of the non-linear mapping can be input into the next convolution layer for further convolution processing, or can be output as the convolution feature map.
In yet another more particular embodiment, convolutional layer 31 includes a plurality of convolutional layers, and between the plurality of convolutional layers, at least one pooling layer (pooling) is included for pooling convolutional layer output results. The result of the pooling operation may be input into the next convolution layer and the convolution operation may continue.
Those skilled in the art will appreciate that convolution layer 31 may be designed to include one or more convolution layers, with ReLU activation layers and/or pooling layers optionally added between them as desired. Convolution layer 31 performs convolution processing on the original vehicle picture and outputs a convolution feature map corresponding to the picture.
Then, based on the convolution feature map generated by convolution layer 31, the region generation network RPN 32 may determine a plurality of candidate regions that take the vehicle components as potential targets.
A candidate region RP (Region Proposal) is a region of the picture in which a target object may appear, in some cases also referred to as a region of interest (RoI). Determining the candidate regions provides the basis for the subsequent classification of each target and the regression of its bounding box.
Both the R-CNN (Region CNN) network model and the Fast R-CNN network model for target detection and identification extract the candidate regions RP by selective search. The later Faster R-CNN proposes a region proposal network (RPN) dedicated to generating, or proposing, candidate regions RP.
The region generation network RPN is a fully convolutional network; based on the convolution feature map returned by the base network (i.e., including convolution layer 31), it efficiently realizes the suggestion and generation of the candidate regions RP in a fully convolutional manner.
In a particular embodiment, determining a candidate region RP may include determining the position of its border. In one example, the determined border position may be represented by the coordinates of the four vertices of the border. In another example, it may be represented by the coordinates of the center point of the border together with the border's width and height. As exemplarily shown in fig. 3, the region generation network RPN suggests the region borders of 3 candidate regions in the convolution feature map, denoted as regions A, B and C, respectively. In another specific embodiment, while a candidate region RP is determined, a feature vector corresponding to it may also be determined; for example, the feature values of the convolution feature map covered by the region RP may be arranged in order to obtain the corresponding feature vector.
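The two border representations mentioned above are interchangeable; a minimal sketch (function names are illustrative) converting between corner coordinates, which determine the four vertices of an axis-aligned border, and the center/width/height form:

```python
def corners_to_center(x1, y1, x2, y2):
    """(top-left, bottom-right) corners -> (center_x, center_y, width, height)."""
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h

def center_to_corners(cx, cy, w, h):
    """(center_x, center_y, width, height) -> (top-left, bottom-right) corners."""
    return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0
```

Either form fully determines the border, so a model is free to regress whichever parameterization is more convenient.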
In this way, a plurality of candidate regions and a plurality of first feature vectors corresponding to the candidate regions, which are potential targets of the vehicle component in the vehicle picture, can be determined through the convolutional layer 31 and the RPN32 in fig. 3.
In the above, a plurality of candidate regions and a plurality of corresponding first feature vectors, which are potentially targeted by at least one vehicle component, may be determined based on the vehicle picture.
Next, in step S230, the plurality of first feature vectors are converted into a plurality of second feature vectors.
In this step, considering that vehicle components are subject to spatial constraints among one another, the first feature vector of the candidate region corresponding to a certain component is converted into a second feature vector that incorporates the spatial association between that component and the other components. Specifically, the conversion of the vectors is realized by introducing a self-attention mechanism. The self-attention mechanism can be used to find relationships among the elements of a sequence without the intervention of external information; in natural language processing, for example, it is used to find the internal relationships between the words of a sentence. Thus, by borrowing the idea of the self-attention mechanism, the first feature vectors can be explicitly converted into second feature vectors that capture these higher-level spatial associations.
In one embodiment, when a certain component in the vehicle picture is determined, the association between that component and all other components in the picture is considered; accordingly, when the first feature vector corresponding to a certain candidate region is converted, its degree of association with the first feature vectors corresponding to all other candidate regions is taken into account. Based on this, in the present step, converting the i-th first feature vector of the plurality of first feature vectors into the i-th second feature vector may include: first determining the degree of association between the i-th first feature vector and each of the plurality of first feature vectors, and then carrying out a weighted summation of the plurality of first feature vectors based on the determined degrees of association to obtain the i-th second feature vector, where i is a positive integer.
In a specific embodiment, the degree of association may be calculated with a method for calculating similarity, such as taking the Euclidean distance or the Manhattan distance as the similarity. In one example, the dot product between two vectors is taken as the corresponding similarity. In one specific example, the i-th first feature vector x_i may be converted into the corresponding i-th second feature vector x'_i by the following function:

x'_i = Σ_j (x_i · x_j) · x_j,  i, j = 1, …, n   (1)

In formula (1), x_j represents the j-th first feature vector, and n represents the total number of first feature vectors, that is, the total number of candidate regions.
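Formula (1) can be sketched directly in NumPy; the explicit double loop below mirrors the summation (an illustrative sketch, not an optimized implementation):

```python
import numpy as np

def convert_vectors(X):
    """Formula (1): x'_i = sum_j (x_i . x_j) x_j, where the dot product
    x_i . x_j serves as the degree of association between regions i and j.
    X holds one first feature vector per row; the result holds the second
    feature vectors, one per row."""
    n = X.shape[0]
    X_new = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            X_new[i] += (X[i] @ X[j]) * X[j]
    return X_new
```

The same result is obtained by the matrix product (X · X^T) · X, since row i of X · X^T holds exactly the dot products x_i · x_j.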
Further, the conversion of the vectors can also be realized by operating in matrix form. Specifically, a first feature matrix composed of the plurality of first feature vectors is converted into a second feature matrix comprising the plurality of second feature vectors. In a specific example, the first feature matrix X may be converted into the second feature matrix X' by the following function:

X' = (X · X^T) · X   (2)

In formula (2), X, X' ∈ R^(n×d), where n denotes the total number of first feature vectors and d denotes the dimension of the first and second feature vectors; the first and second feature vectors have the same dimension.
Further, each second feature vector contained in the second feature matrix X' may be normalized; specifically, the normalization may be implemented based on the following function:

X' = Softmax(X · X^T) · X   (3)

In formula (3), X, X' ∈ R^(n×d); n denotes the total number of first feature vectors; d denotes the dimension of the first and second feature vectors; and the Softmax() function, applied to the similarity matrix X · X^T, implements the normalization.
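Assuming the Softmax is applied row-wise to the similarity matrix X · X^T (an assumption; the specification only names the Softmax() function), the normalized conversion can be sketched as:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def convert_normalized(X):
    """Row-wise Softmax over the similarity matrix X X^T, so each second
    feature vector becomes a convex combination of the first feature vectors."""
    return softmax(X @ X.T, axis=1) @ X

X = np.arange(12, dtype=float).reshape(4, 3)
weights = softmax(X @ X.T, axis=1)  # each row sums to 1
X_new = convert_normalized(X)
```

The row-wise normalization keeps the magnitude of the second feature vectors comparable to that of the first feature vectors, regardless of the number of candidate regions n.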
In the above description, a vector conversion method is described in which, when a certain vehicle component in a picture is recognized, the relationship between all other vehicle components and the vehicle component is considered.
In another embodiment, when a certain component in the vehicle picture is determined, only the association between that component and the components around it is considered, and the association with components far away from it need not be considered. Based on this, the idea of a local attention mechanism can be borrowed: when the first feature vector corresponding to a certain candidate region is converted, only its degree of association with the first feature vectors corresponding to several candidate regions is considered. It should be understood that the surroundings of different vehicle components are often not identical. Accordingly, for the first feature vectors corresponding to different candidate regions, the first feature vectors of the surrounding candidate regions to be considered are not always identical either.
Based on this, in the present step, converting the i-th first feature vector of the plurality of first feature vectors into the i-th second feature vector may include: first, determining, from the plurality of first feature vectors, several first feature vectors corresponding to the i-th first feature vector (for convenience of description, referred to as the i-th partial first feature vectors); then, determining the degree of association between the i-th first feature vector and each first feature vector in the i-th partial first feature vectors, and carrying out a weighted summation of the i-th partial first feature vectors based on those degrees of association, where i is a positive integer. It should be noted that "several" first feature vectors may refer to one or more. In one embodiment, it may be preset that the i-th partial first feature vectors always include the i-th first feature vector itself.
Further, for the ith partial first feature vector aiming at the ith first feature vector, various rules can be introduced according to actual needs. As can be seen from the foregoing, the ith first feature vector corresponds to the ith candidate region in the plurality of candidate regions, and the ith partial first feature vector corresponds to the ith partial candidate region in the plurality of candidate regions. In a specific embodiment, the ith partial candidate region is located within a position range set for the ith candidate region.
In a more specific embodiment, the i-th partial candidate regions include: those of the plurality of candidate regions whose region center lies at a distance less than a distance threshold from the center of the i-th candidate region. In one example, the distance threshold may be set based on the width and height of the i-th candidate region. In another example, the distance threshold may be set empirically by a worker.
In another more specific embodiment, the i-th partial candidate region includes: a candidate region located on a predetermined azimuth side of the i-th candidate region among the plurality of candidate regions. In one example, wherein the predetermined azimuth side may include: upper side, lower side, left side, right side, upper left side, lower left side, upper right side, lower right side.
In yet another more specific embodiment, the i-th partial candidate regions include: among the candidate regions that are located on the predetermined azimuth side of the i-th candidate region and do not overlap it, the candidate region closest to the i-th candidate region. In one example, if the predetermined azimuth side includes the left side, the i-th partial candidate regions may include: the candidate region that is located to the left of the i-th candidate region without overlapping it and whose right border is closest to the left border of the i-th candidate region. In another example, if the predetermined azimuth side further includes the upper side, the i-th partial candidate regions may further include: the candidate region that is located above the i-th candidate region without overlapping it and whose lower border is closest to the upper border of the i-th candidate region.
On the other hand, in one embodiment, the calculation of the correlation between the ith first feature vector and the ith partial first feature vector may refer to a method for calculating the similarity, such as calculating the euclidean distance or manhattan distance as the similarity. In one example, the dot product between two vectors is calculated as the corresponding similarity.
In one specific example, the i-th first feature vector x_i may be converted into the corresponding i-th second feature vector x'_i by the following function:

x'_i = Σ_j I_{i,j} · (x_i · x_j) · x_j,  i, j = 1, …, n   (4)

In formula (4), x_j represents the j-th first feature vector; n represents the total number of first feature vectors; I_{i,j} denotes an indicator function; specifically, I_{i,j} = 1 when the j-th candidate region is within the range set for the i-th candidate region, and I_{i,j} = 0 otherwise.

More specifically, in one example, the expression of the indicator function I_{i,j} is as follows:

I_{i,j} = 1 if ||center_i − center_j||_2 < T(w, h), and I_{i,j} = 0 otherwise   (5)

In formula (5), center_i represents the center of the i-th candidate region, center_j represents the center of the j-th candidate region, || · ||_2 represents the 2-norm, w and h represent the width and height, respectively, of the i-th candidate region, and T(w, h) is a distance threshold set based on w and h.
In another example, the expression of the indicator function I_{i,j} is as follows:

I_{i,j} = 1 if j ∈ { argmax_{j: LB_i > RB_j} RB_j,  argmin_{j: LB_j > RB_i} LB_j,  argmax_{j: BB_i > TB_j} TB_j,  argmin_{j: BB_j > TB_i} BB_j }, and I_{i,j} = 0 otherwise   (6)

In formula (6), RB, LB, TB and BB are abbreviations of the English phrases Right Bound of Bounding Box, Left Bound of Bounding Box, Top Bound of Bounding Box and Bottom Bound of Bounding Box, respectively; RB_j, LB_j, TB_j and BB_j represent the right, left, top and bottom borders of the j-th candidate region, and RB_i, LB_i, TB_i and BB_i represent the right, left, top and bottom borders of the i-th candidate region. Here j = argmax RB_j, j ∈ {LB_i > RB_j} denotes the candidate region whose right border lies to the left of the left border of the i-th candidate region (i.e., it lies to the left of the i-th candidate region without overlapping it) and is farthest to the right among such regions; j = argmin LB_j, j ∈ {LB_j > RB_i} denotes the candidate region whose left border lies to the right of the right border of the i-th candidate region (i.e., it lies to the right of the i-th candidate region without overlapping it) and is farthest to the left among such regions; j = argmax TB_j, j ∈ {BB_i > TB_j} denotes the candidate region whose top border lies below the bottom border of the i-th candidate region (i.e., it lies below the i-th candidate region without overlapping it) and is highest among such regions; j = argmin BB_j, j ∈ {BB_j > TB_i} denotes the candidate region whose bottom border lies above the top border of the i-th candidate region (i.e., it lies above the i-th candidate region without overlapping it) and is lowest among such regions.
In summary, the indicator function of formula (6) selects, as the regions to be attended to, on each side of the i-th candidate region, the candidate region that does not overlap it and is closest to it.
The above formulas (5) and (6) are merely illustrative of the indicator function I_{i,j}, which may actually take many other forms.
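For illustration, the distance-based indicator of formula (5) can be sketched as follows, with the threshold taken (as an assumption) to be the diagonal length √(w² + h²) of the i-th candidate region:

```python
import numpy as np

def indicator_row(boxes, i):
    """Return the row I[i, :] of the indicator matrix for formula (5).
    `boxes` holds one candidate region per row as (cx, cy, w, h).
    The assumed threshold is the diagonal of the i-th region's border."""
    cx, cy, w, h = boxes[i]
    dists = np.linalg.norm(boxes[:, :2] - np.array([cx, cy]), axis=1)
    threshold = np.hypot(w, h)  # assumed T(w, h) = sqrt(w^2 + h^2)
    return (dists < threshold).astype(int)

boxes = np.array([[0.0, 0.0, 2.0, 2.0],     # region 0
                  [1.0, 0.0, 2.0, 2.0],     # center 1 unit from region 0
                  [10.0, 10.0, 2.0, 2.0]])  # far away
row0 = indicator_row(boxes, 0)
```

Stacking one such row per candidate region yields the full indicator matrix used in the masked conversion.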
In addition, the conversion of the vectors can likewise be realized by operating in matrix form. Specifically, a first feature matrix composed of the plurality of first feature vectors is converted into a second feature matrix comprising the plurality of second feature vectors. In a specific example, the first feature matrix X may be converted into the second feature matrix X' by the following function:

X' = A · (X · X^T) · X   (7)

In formula (7), X, X' ∈ R^(n×d); n denotes the total number of first feature vectors; d denotes the dimension of the first and second feature vectors, the first and second feature vectors having the same dimension; and A denotes the matrix formed by the values of the indicator function I_{i,j} in formula (4).
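Note that for formula (7) to agree term by term with formula (4), the indicator matrix A must act as an elementwise mask on the similarity matrix X · X^T; under that reading (an interpretive assumption), the matrix-form conversion can be sketched as:

```python
import numpy as np

def convert_masked(X, A):
    """x'_i = sum_j A[i, j] (x_i . x_j) x_j: the similarity matrix X X^T is
    masked elementwise by the indicator matrix A, then multiplied by X."""
    return (A * (X @ X.T)) @ X

X = np.arange(12, dtype=float).reshape(4, 3)
A_all = np.ones((4, 4))    # every region attends to every region
A_none = np.zeros((4, 4))  # no region attends to any region
```

With an all-ones mask this reduces to the global conversion of formula (2); with an all-zeros mask every second feature vector vanishes.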
In the above manner, the plurality of first feature vectors can be converted into the plurality of second feature vectors. Next, in step S240, the component category corresponding to each candidate region in the plurality of candidate regions is determined based on the plurality of second feature vectors.
In one embodiment, as shown in fig. 3, the plurality of second feature vectors output by the Attention layer 33 are input into the classification layer 34, and the classification layer 34 processes the plurality of second feature vectors to obtain a component category corresponding to each candidate region in the plurality of candidate regions.
In one embodiment, the classification layer 34 is a fully connected layer, and performs component classification based on the region characteristics of each region input from the previous layer. More specifically, the classification layer 34 may contain a plurality of classifiers, each trained to identify different classes of targets in the candidate region. In the context of vehicle component detection, the various classifiers are trained to identify different classes of vehicle components, such as bumpers, front doors, hoods, headlights, tail lights, and the like.
Further, the classification layer 34 may also be used for bounding box regression. In a more specific embodiment, the classification layer 34 further includes a regressor configured to perform regression on the border corresponding to each identified target, determining the smallest rectangular region surrounding the target as its bounding box. In one example, the foregoing step S220 further includes determining a plurality of first borders corresponding to the plurality of candidate regions; accordingly, in this step, the plurality of first borders may be adjusted, that is, border regression may be performed, based on the plurality of second feature vectors.
In the above manner, the component recognition result for the vehicle picture can be obtained.
In summary, the method for identifying vehicle components provided in the embodiments of the present specification explicitly introduces the positional relationships between components and can thereby improve the identification accuracy of vehicle components. Further, it is possible to consider only the spatial constraints between a single component and its surrounding components, which saves computation while still ensuring high identification accuracy.
The method for recognizing the vehicle component is described above with reference to fig. 2 and 3 from the viewpoint of use of the vehicle component recognition model. The following describes a method for recognizing a vehicle component from the viewpoint of training a vehicle component recognition model with reference to fig. 4 and 5. Fig. 4 shows a flow diagram of a method for identifying a vehicle component according to another embodiment, and fig. 5 shows a schematic diagram of a training structure of a vehicle component identification model according to an embodiment. The execution subject of the method may be any device or system or platform with computing and processing capabilities, such as the computing platform shown in fig. 1.
As shown in fig. 4, the method comprises the steps of: step S410, obtaining a vehicle picture with a label, wherein the label comprises at least one category label corresponding to at least one vehicle component; step S420, determining a plurality of candidate regions taking the at least one vehicle component as a potential target and a plurality of first feature vectors corresponding to the candidate regions based on the vehicle picture; step S430, converting the plurality of first feature vectors into a plurality of second feature vectors; step S440, inputting the plurality of first feature vectors into a first classification layer to obtain a plurality of first prediction results; step S450, inputting the plurality of second feature vectors into a second classification layer to obtain a plurality of second prediction results; step S460, training the first classification layer and the second classification layer by using the plurality of first prediction results, the plurality of second prediction results, and the at least one category label; the trained second classification layer is used for classifying the components of vehicle pictures to be recognized. The steps are as follows:
First, in step S410, a labeled vehicle picture is obtained. It should be understood that a vehicle picture may contain at least one vehicle component and, correspondingly, the labeling of the picture may include at least one component category label. In one embodiment, the at least one category label corresponds to at least one picture region containing a vehicle component.
Next, in step S420, based on the obtained annotation picture, a plurality of candidate regions with the vehicle component as a potential target and a plurality of first feature vectors corresponding to the candidate regions are determined. In one embodiment, this step may be implemented by convolutional layer 51 and RPN52 shown in fig. 5. For a description of convolutional layer 51 and RPN52, reference may be made to the previous description of convolutional layer 31 and RPN32 in fig. 3.
Furthermore, in one particular embodiment, convolutional layer 51 may be a pre-trained feature extractor (extractor), and convolutional layer 51 may not be trained again here. In a specific embodiment, step S420 may further include: training the RPN based on the feature map generated by the convolutional layer 51 and the label in the label picture, which may refer to a training mode in the prior art and is not described herein again. In another specific embodiment, the convolutional layer 51 and the RPN may be trained simultaneously based on the labeled picture, and the specific training mode may be implemented by the prior art and is not described herein again. In addition, the description of step S420 can also refer to the description of step S220.
A plurality of first feature vectors may be obtained as described above. Next, in step S430, the plurality of first feature vectors are converted into a plurality of second feature vectors.
It should be noted that, in one implementation, step S430 may be performed with reference to the examples described above for step S230; see the description of step S230 for details.
In another embodiment, for the plurality of first feature vectors corresponding to the plurality of candidate regions, the first feature vector corresponding to any candidate region that does not contain a vehicle component may be converted into a zero vector as its corresponding second feature vector. In one embodiment, for an arbitrary first feature vector, in the case that the intersection-over-union (IoU) between its candidate region and each picture region in the at least one picture region is less than a ratio threshold, the first feature vector is converted into a zero vector as the corresponding second feature vector. The intersection-over-union is the ratio of the area of the intersection (Area of Overlap) of the two regions to the area of their union (Area of Union).
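The intersection-over-union between a candidate region and a labeled picture region can be computed as follows (a standard sketch; boxes are given as (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Area of Overlap divided by Area of Union for two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A candidate region whose IoU with every labeled region stays below the ratio threshold (e.g., 0.7) is then mapped to a zero vector.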
In one specific example, the i-th first feature vector x_i may be converted into the corresponding i-th second feature vector x'_i by the following functions:

x'_i = Σ_j I_i · I_{i,j} · (x_i · x_j) · x_j,  i, j = 1, …, n   (8)

I_i = 1 if IOU_i ≥ k, and I_i = 0 otherwise   (9)

In formulas (8) and (9), x_j represents the j-th first feature vector; n represents the total number of first feature vectors; I_{i,j} denotes an indicator function, specifically, I_{i,j} = 1 when the j-th candidate region is within the range set for the i-th candidate region, and I_{i,j} = 0 otherwise; I_i denotes an indicator function; IOU_i denotes the intersection-over-union between the i-th candidate region and the picture region in the annotated picture; and k denotes the ratio threshold set for the intersection-over-union, e.g., k may be 0.7 or 0.8.
In a specific example, a first feature matrix X composed of the plurality of first feature vectors may be converted into a second feature matrix X' comprising the plurality of second feature vectors by the following function:

X' = A_1 · A_2 · (X · X^T) · X   (10)

In formula (10), X, X' ∈ R^(n×d); n denotes the total number of first feature vectors; d denotes the dimension of the first and second feature vectors, the first and second feature vectors having the same dimension; A_1 denotes the diagonal matrix formed by the values of the indicator function I_i in formula (9); and A_2 denotes the matrix formed by the values of the indicator function I_{i,j} in formula (8).
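Under the same elementwise-mask reading of the indicator matrix A_2 (an interpretive assumption, so that the matrix form agrees term by term with formula (8)), the training-time conversion of formulas (8)–(10) can be sketched as:

```python
import numpy as np

def convert_training(X, I_vec, A2):
    """A1 = diag(I_vec) zeroes out the second vectors of regions whose IoU
    with the labeled regions is below the threshold k (formula (9)); A2
    masks the similarity matrix as the indicator I_{i,j} of formula (8)."""
    A1 = np.diag(I_vec)
    return A1 @ ((A2 * (X @ X.T)) @ X)

X = np.arange(12, dtype=float).reshape(4, 3)
I_vec = np.array([1.0, 0.0, 1.0, 1.0])  # region 1 fails the IoU threshold
A2 = np.ones((4, 4))                    # illustrative: attend everywhere
X_new = convert_training(X, I_vec, A2)
```

The left multiplication by the diagonal A_1 leaves the surviving rows unchanged and replaces the failing rows with zero vectors.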
The plurality of first feature vectors may be converted into a plurality of second feature vectors in step S430. On the other hand, in step S440, a plurality of first feature vectors may be input into the first classification layer, resulting in a plurality of first prediction results.
In one embodiment, as shown in fig. 5, a plurality of first feature vectors output by the RPN network 52 are input into the first classification layer 53, and a first component category corresponding to each candidate region in a plurality of candidate regions is obtained as a plurality of first prediction results.
In a specific embodiment, the first classification layer 53 is a fully connected layer, and performs component classification based on the region characteristics of each region input from the previous layer. More specifically, the first classification layer 53 may contain a plurality of first classifiers, each trained to identify different classes of targets in the candidate region. In the context of vehicle component detection, the individual first classifiers are trained to identify different classes of vehicle components, such as bumpers, front doors, hoods, headlights, tail lights, and the like.
In one example, the ith first prediction result of the ith candidate region may be obtained by the following function:
p_i = softmax(x_i · W)  (11)
where x_i represents the ith first feature vector; W represents a parameter matrix of dimension k × N_cls, with k representing the dimension of the first feature vector and N_cls representing the total number of categories of vehicle components identifiable by the first classification layer; and p_i denotes the ith first prediction result. In one example, the vector values in p_i correspond to the probabilities that the ith candidate region belongs to each of the N_cls classes of vehicle components.
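A classification layer of this form can be sketched as a linear map followed by a softmax normalization; the parameter matrix W and the specific normalization are illustrative assumptions for producing a probability vector over the N_cls classes:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def first_prediction(x_i, W):
    """Sketch of formula (11): p_i = softmax(x_i . W).

    x_i : (k,) first feature vector
    W   : (k, N_cls) parameter matrix of the classification layer
    Returns a (N_cls,) vector of class probabilities.
    """
    return softmax(x_i @ W)
```

The output sums to 1, so each entry can be read as the probability that the candidate region belongs to the corresponding vehicle-component class.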
In this way, a plurality of first prediction results for a plurality of candidate regions may be obtained.
After step S430, step S450 may be executed to input the plurality of second feature vectors into the second classification layer, so as to obtain a plurality of second prediction results.
In one embodiment, as shown in fig. 5, a plurality of second feature vectors output by the Attention layer 54 are input into the second classification layer 55, and the second classification layer 55 processes the plurality of second feature vectors to obtain a second component category corresponding to each candidate region in the plurality of candidate regions.
In a specific embodiment, the second classification layer 55 is a fully connected layer, and performs component classification based on the region characteristics of each region input from the previous layer. More specifically, the second classification layer 55 may contain a plurality of classifiers, each trained to identify different classes of targets in the candidate region. In the context of vehicle component detection, the various classifiers are trained to identify different classes of vehicle components, such as bumpers, front doors, hoods, headlights, tail lights, and the like.
In one example, the ith second prediction result of the ith candidate region may be obtained by the following function:
p′_i = softmax(x′_i · W′)  (12)
where x′_i represents the ith second feature vector; W′ represents a parameter matrix of dimension k × N_cls, with k representing the dimension of the second feature vector and N_cls representing the total number of categories of vehicle components identifiable by the second classification layer; and p′_i denotes the ith second prediction result. In one example, the vector values in p′_i correspond to the probabilities that the ith candidate region belongs to each of the N_cls classes of vehicle components.
In this way, a plurality of second prediction results for a plurality of candidate regions may be obtained.
After the plurality of first prediction results and the plurality of second prediction results are obtained, at step S460, at least the first classification layer and the second classification layer are trained using the plurality of first prediction results, the plurality of second prediction results, and the at least one category label; the second classification layer is used for performing component classification on vehicle pictures to be recognized.
In one embodiment, this step may include: for any candidate region among the plurality of candidate regions, calculating the intersection ratio IOU between the candidate region and each picture region in the at least one picture region, and taking the category label of the picture region whose IOU value is greater than a predetermined threshold as the category label corresponding to that candidate region, thereby obtaining a plurality of category labels corresponding to the plurality of candidate regions. In a specific embodiment, it is assumed that the annotation in an annotated picture includes: picture region A and the corresponding component category a, picture region B and the corresponding component category b, and the predetermined threshold for the IOU is 0.8. Based on this, for a first candidate region among the plurality of candidate regions, if its intersection ratios with picture region A and picture region B are 0.85 and 0.2, respectively, the component category a is taken as the component label of the first candidate region; if its intersection ratios with picture region A and picture region B are 0.1 and 0.15, respectively, then non-vehicle component is taken as the component label of the first candidate region. In this way, a plurality of component labels corresponding to the plurality of candidate regions can be obtained.
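The label-assignment rule described above can be sketched as follows; the box format, helper names, and the background label string are illustrative assumptions:

```python
def iou(a, b):
    """Intersection ratio of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_label(candidate_box, annotated, threshold=0.8):
    """Assign a component label to a candidate region by IOU matching.

    annotated : list of (picture_region_box, component_label) pairs
    Returns the label of the best-overlapping annotated region if its IOU
    exceeds the threshold, otherwise the background label.
    """
    best_label, best_iou = "non-vehicle-component", 0.0
    for box, label in annotated:
        v = iou(candidate_box, box)
        if v > best_iou:
            best_label, best_iou = label, v
    return best_label if best_iou > threshold else "non-vehicle-component"
```

With the threshold of 0.8 from the example, a candidate overlapping an annotated region at 0.85 inherits that region's category, while one whose best overlap is 0.15 is labeled as a non-vehicle component.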
Further, the first classification layer and the second classification layer are trained using the plurality of first prediction results, the plurality of second prediction results, and the plurality of category labels. In a particular embodiment, as shown in FIG. 5, a loss function may be calculated based on the plurality of first component categories, the plurality of second component categories, and the plurality of category labels, so as to adjust parameters in the first classification layer and the second classification layer.
In one example, the loss function is of the form:
L = ∑_i [ ℓ(p_i, y_i) + λ · ℓ(p′_i, y_i) ]  (13)
where p_i and p′_i respectively represent the ith first prediction result and the ith second prediction result for the ith candidate region, obtained from the ith first feature vector x_i and the ith second feature vector x′_i; y_i denotes the category label of the ith candidate region; ℓ denotes a per-region classification loss term; and λ denotes a hyperparameter, which may be set to 0.6 or 0.8, for example.
The parameters in the first and second classification layers may be adjusted based on a loss function, such as the loss function of the form of equation (13) in combination with a gradient descent method.
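A loss of this form can be sketched as follows; the use of cross-entropy for the per-region classification term, and all names, are illustrative assumptions rather than the embodiment's exact choice:

```python
import numpy as np

def cross_entropy(p, y):
    """Negative log probability assigned to the true class index y."""
    return -np.log(p[y] + 1e-12)

def total_loss(P1, P2, labels, lam=0.6):
    """Sketch of a loss combining first and second prediction results.

    P1, P2 : (n, N_cls) arrays of first / second prediction results
    labels : length-n sequence of integer category labels y_i
    lam    : the weighting hyperparameter lambda (e.g. 0.6 or 0.8)
    """
    return sum(cross_entropy(P1[i], y) + lam * cross_entropy(P2[i], y)
               for i, y in enumerate(labels))
```

In practice this scalar would be minimized by gradient descent over the parameters of both classification layers, as described above.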
In another specific embodiment, in the foregoing step S420, the plurality of candidate regions and the corresponding plurality of first feature vectors are determined by using the convolutional layer 51 and the RPN network 52 shown in fig. 5. Based on this, this step may include: training the first classification layer 53, the second classification layer 55, and the convolutional layer 51 and/or the RPN network 52 using the plurality of first prediction results, the plurality of second prediction results, and the category labels, such that end-to-end training of the vehicle component identification model is achieved.
In the above manner, training of the first classification layer and the second classification layer can be achieved. It should be understood that, in correspondence with fig. 3, only the second classification layer needs to be used while the component identification model is in use, so that higher component identification accuracy can be obtained. The first classification layer, introduced in the training stage, assists in optimizing the parameters of the second classification layer, thereby improving the recognition accuracy.
In addition, compared with the original Faster R-CNN algorithm, the component detection model with the introduced Attention layer provided by the embodiments of the present specification can achieve a better training effect. Specifically, as shown in fig. 6, the classification loss function of the detection model disclosed in the embodiments of the present specification oscillates less, is more stable, and converges faster.
The above description focuses on the target identification scenario of identifying vehicle components. It should be understood that the above method for identifying vehicle components may also be extended to the identification of other target objects, especially target objects having spatial constraint relationships among them, such as parts of the human face including the eyes, nose, and mouth, or components of bicycles, airplanes, ships, and the like.
Thus, the embodiments of the present specification also provide a method for identifying a target object, which is executed by a computer, and the execution subject of the method can be any server or system or device or platform with computing and processing capabilities, and so on. As shown in fig. 7, the method may include the steps of:
step S710, obtaining a sample picture to be identified; step S720, based on the sample picture, determining a plurality of candidate regions taking at least one target object as a potential target and a plurality of first feature vectors corresponding to the candidate regions; step S730, converting the plurality of first feature vectors into a plurality of second feature vectors, wherein for an ith first feature vector, based on the correlation between the ith first feature vector and the plurality of first feature vectors, the plurality of first feature vectors are subjected to weighted summation to obtain a corresponding ith second feature vector; wherein i is a positive integer; step S740 is to determine, based on the plurality of second feature vectors, a category of the target object corresponding to each of the plurality of candidate regions.
With respect to the above steps, in one embodiment, step S720 includes: performing convolution processing on the sample picture to obtain a corresponding convolution feature map; and determining the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map.
In one embodiment, the plurality of candidate regions include an ith candidate region corresponding to the ith first feature vector and several candidate regions corresponding to several of the first feature vectors, the several candidate regions being located within a position range set for the ith candidate region.
Further, in a specific embodiment, the several candidate regions include a first candidate region, and the first candidate region being located within the position range set for the ith candidate region includes: a distance between the region center of the first candidate region and the region center of the ith candidate region being less than a distance threshold.
In a more specific implementation, the distance threshold is set based on the width and height of the ith candidate region.
In another specific embodiment, the several candidate regions include a second candidate region, and the second candidate region being located within the position range set for the ith candidate region includes: the second candidate region being located on a predetermined azimuth side of the ith candidate region.
In a more specific embodiment, the predetermined azimuth side includes at least one of: upper, lower, left and right sides.
Further, in an example, the second candidate region does not overlap with the i-th candidate region, and a distance from the i-th candidate region is shortest among candidate regions located on a predetermined azimuth side of the i-th candidate region.
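The center-distance criterion above can be sketched as follows; taking the threshold as alpha times the sum of the width and height of the ith candidate region is one assumed way of setting the threshold based on that region's width and height (alpha is an illustrative scale factor):

```python
def in_position_range(box_i, box_j, alpha=1.0):
    """Sketch: does region j fall within the position range set for region i?

    Boxes are (x1, y1, x2, y2). The distance threshold is assumed to be
    alpha * (width + height) of the ith candidate region.
    """
    cx_i, cy_i = (box_i[0] + box_i[2]) / 2, (box_i[1] + box_i[3]) / 2
    cx_j, cy_j = (box_j[0] + box_j[2]) / 2, (box_j[1] + box_j[3]) / 2
    w, h = box_i[2] - box_i[0], box_i[3] - box_i[1]
    dist = ((cx_i - cx_j) ** 2 + (cy_i - cy_j) ** 2) ** 0.5
    return dist < alpha * (w + h)
```

A predicate like this could supply the indicator values I_{i,j} used when converting first feature vectors into second feature vectors; the azimuth-side variant would add a check on the sign of the center offsets.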
In one embodiment, step S720 further includes: determining a plurality of first borders corresponding to the candidate areas; after step S730, the method further comprises: adjusting the plurality of first frames based on the plurality of second feature vectors.
In addition, for the description of the method in fig. 7, reference may also be made to the description of the method illustrated in fig. 3.
In summary, in the method for identifying a target object provided in the embodiments of the present specification, the Attention layer is introduced to implement the target identification by combining the position relationships among a plurality of target objects, so that the accuracy of identifying the target object can be improved.
According to another embodiment, another method for identifying a target object is also provided. Specifically, as shown in fig. 8, the method includes the steps of:
step S810, obtaining a sample picture with a label, wherein the label comprises at least one category label and corresponds to at least one target object; step S820, determining a plurality of candidate regions with at least one target object as a potential target and a plurality of first feature vectors corresponding to the candidate regions based on the sample picture; step S830, converting the plurality of first feature vectors into a plurality of second feature vectors, wherein for the ith first feature vector, based on the correlation between the ith first feature vector and the plurality of first feature vectors, the plurality of first feature vectors are subjected to weighted summation to obtain the corresponding ith second feature vector; wherein i is a positive integer; step 840, inputting the plurality of first feature vectors into a first classification layer to obtain a plurality of first prediction results; step S850, inputting the plurality of second feature vectors into a second classification layer to obtain a plurality of second prediction results; step S860, train the first classification layer and the second classification layer using the plurality of first prediction results, the plurality of second prediction results, and the at least one category label, where the second classification layer is used to classify a target of a sample picture to be identified.
With respect to the above steps, in one embodiment, at least one category label corresponds to at least one picture region containing the target object.
In one embodiment, step S830 includes: converting some of the plurality of first feature vectors into zero vectors as the corresponding second feature vectors, wherein the intersection ratio IOU between the candidate region corresponding to each of these first feature vectors and each picture region in the at least one picture region is less than a first predetermined threshold.
In one embodiment, step S860 includes: determining a category label corresponding to each candidate area in the plurality of candidate areas based on the at least one picture area and the at least one category label; and training the first classification layer and the second classification layer by using the plurality of first prediction results, the plurality of second prediction results and the class labels corresponding to the candidate regions.
Further, in a specific embodiment, the plurality of candidate regions include an arbitrary first candidate region, and the determining a category label corresponding to each candidate region of the plurality of candidate regions based on the at least one picture region and the at least one category label includes: calculating the intersection ratio IOU between the first candidate region and each picture region in the at least one picture region to obtain at least one IOU value; in the case that an IOU value greater than a predetermined threshold exists among the at least one IOU value, determining the category label corresponding to the corresponding picture region as the category label of the first candidate region; or, in the case that no IOU value greater than the predetermined threshold exists among the at least one IOU value, determining non-target object as the category label of the first candidate region.
In one embodiment, step S820 includes: performing convolution processing on the sample picture to obtain a corresponding convolution feature map; and determining the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map. Step S860 includes: training the first classification layer, the second classification layer, and the RPN network using the plurality of first prediction results, the plurality of second prediction results, and the at least one category label.
It should be noted that, for the description of the method in fig. 8, reference may be made to the description of the method shown in fig. 4 in the foregoing embodiment.
Corresponding to the method provided in the above embodiment, the embodiment of the present specification further discloses an apparatus for identifying a vehicle component, an apparatus for identifying a target object, as follows:
FIG. 9 illustrates a block diagram of an apparatus for identifying vehicle components, according to one embodiment. As shown in fig. 9, the apparatus 900 includes: an obtaining unit 910 configured to obtain a picture of a vehicle to be identified; a first determining unit 920, configured to determine, based on the vehicle picture, a plurality of candidate regions that are potential targets of at least one vehicle component and a plurality of first feature vectors corresponding to the candidate regions; a converting unit 930 configured to convert the plurality of first feature vectors into a plurality of second feature vectors, wherein for an ith first feature vector, the plurality of first feature vectors are subjected to weighted summation based on the association degree between the ith first feature vector and the plurality of first feature vectors to obtain a corresponding ith second feature vector; wherein i is a positive integer; a second determining unit 940, configured to determine, based on the plurality of second feature vectors, a component category corresponding to each of the plurality of candidate regions.
In an embodiment, the first determining unit 920 is specifically configured to: perform convolution processing on the vehicle picture to obtain a corresponding convolution feature map; and determine the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map.
In one embodiment, the plurality of candidate regions include an ith candidate region corresponding to the ith first feature vector and several candidate regions corresponding to several of the first feature vectors, the several candidate regions being located within a position range set for the ith candidate region.
In a more specific embodiment, the several candidate regions include a first candidate region, and the first candidate region being located within the position range set for the ith candidate region includes: a distance between the region center of the first candidate region and the region center of the ith candidate region being less than a distance threshold.
In one example, the distance threshold is set based on the width and height of the ith candidate region.
In another more specific embodiment, the several candidate regions include a second candidate region, and the second candidate region being located within the position range set for the ith candidate region includes: the second candidate region being located on a predetermined azimuth side of the ith candidate region.
In one example, the predetermined azimuth side includes at least one of: upper, lower, left and right sides.
In one example, the second candidate region does not overlap with the i-th candidate region, and a distance between the second candidate region and the i-th candidate region is shortest among candidate regions located on a predetermined azimuth side of the i-th candidate region.
In one embodiment, the first determining unit 920 is further configured to: determine a plurality of first borders corresponding to the plurality of candidate regions; and the apparatus further comprises: an adjusting unit 950 configured to adjust the plurality of first borders based on the plurality of second feature vectors.
Fig. 10 shows a block diagram of an apparatus for identifying a vehicle component according to another embodiment. As shown in fig. 10, the apparatus 1000 includes:
an obtaining unit 1010 configured to obtain a vehicle picture with a label, where the label includes at least one category tag corresponding to at least one picture area including at least one vehicle component; a determining unit 1020 configured to determine, based on the vehicle picture, a plurality of candidate regions that are potential targets of the at least one vehicle component and a plurality of first feature vectors corresponding to the candidate regions; a converting unit 1030, configured to convert the plurality of first feature vectors into a plurality of second feature vectors, wherein for an ith first feature vector, the plurality of first feature vectors are subjected to weighted summation based on a correlation degree between the ith first feature vector and the plurality of first feature vectors to obtain a corresponding ith second feature vector; wherein i is a positive integer; a first prediction unit 1040, configured to input the plurality of first feature vectors into a first classification layer, so as to obtain a plurality of first prediction results; a second prediction unit 1050 configured to input the plurality of second feature vectors into a second classification layer, so as to obtain a plurality of second prediction results; a training unit 1060 configured to train the first classification layer and the second classification layer using the plurality of first predictors, the plurality of second predictors, and the at least one class label; the second classification layer is used for classifying the components of the vehicle pictures to be recognized.
In one embodiment, the converting unit 1030 is specifically configured to: convert some of the plurality of first feature vectors into zero vectors as the corresponding second feature vectors, wherein the intersection ratio IOU between the candidate region corresponding to each of these first feature vectors and each picture region in the at least one picture region is less than a first predetermined threshold.
In one embodiment, the training unit 1060 specifically includes: a determining subunit 1061, configured to determine, based on the at least one picture area and the at least one category label, a category label corresponding to each candidate area in the multiple candidate areas; a training subunit 1062, configured to train the first classification layer and the second classification layer by using the plurality of first prediction results, the plurality of second prediction results, and the category labels corresponding to the candidate regions.
In a specific embodiment, the plurality of candidate regions includes an arbitrary first candidate region; the determination subunit 1061 is specifically configured to: calculating an intersection ratio IOU of the first candidate area and each picture area in the at least one picture area to obtain at least one IOU value; determining a category label corresponding to a corresponding picture area as a category label of the first candidate area when an IOU value larger than a predetermined threshold value exists in the at least one IOU value; or, in the absence of an IOU value of the at least one IOU value that is greater than a predetermined threshold, determining a non-vehicle component as a category label for the first candidate region.
In a specific embodiment, the determining unit 1020 is specifically configured to: carrying out convolution processing on the vehicle picture to obtain a corresponding convolution characteristic diagram; determining the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map;
the training unit 1060 is specifically configured to: training the first classification layer, the second classification layer, and the RPN network using the first plurality of predictors, the second plurality of predictors, and the at least one class label.
FIG. 11 illustrates an apparatus structure for identifying a target object, according to one embodiment. As shown in fig. 11, the apparatus 1100 includes: an obtaining unit 1110 configured to obtain a sample picture to be identified; a first determining unit 1120, configured to determine, based on the sample picture, a plurality of candidate regions that potentially target at least one target object and a plurality of first feature vectors corresponding to the candidate regions; a converting unit 1130, configured to convert the plurality of first feature vectors into a plurality of second feature vectors, wherein for an ith first feature vector, the plurality of first feature vectors are subjected to weighted summation based on the association degree between the ith first feature vector and the plurality of first feature vectors to obtain a corresponding ith second feature vector; wherein i is a positive integer; a second determining unit 1140 configured to determine the category of the target object corresponding to each of the candidate regions based on the second feature vectors.
In an embodiment, the first determining unit 1120 is specifically configured to: perform convolution processing on the sample picture to obtain a corresponding convolution feature map; and determine the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map.
In one embodiment, the plurality of candidate regions include an ith candidate region corresponding to the ith first feature vector and several candidate regions corresponding to several of the first feature vectors, the several candidate regions being located within a position range set for the ith candidate region.
In a more specific embodiment, the several candidate regions include a first candidate region, and the first candidate region being located within the position range set for the ith candidate region includes: a distance between the region center of the first candidate region and the region center of the ith candidate region being less than a distance threshold.
In one example, the distance threshold is set based on the width and height of the ith candidate region.
In another more specific embodiment, the several candidate regions include a second candidate region, and the second candidate region being located within the position range set for the ith candidate region includes: the second candidate region being located on a predetermined azimuth side of the ith candidate region.
In one example, the predetermined azimuth side includes at least one of: upper, lower, left and right sides.
In one example, the second candidate region does not overlap with the i-th candidate region, and a distance between the second candidate region and the i-th candidate region is shortest among candidate regions located on a predetermined azimuth side of the i-th candidate region.
In one embodiment, the first determining unit 1120 is further configured to: determine a plurality of first borders corresponding to the plurality of candidate regions; and the apparatus further comprises: an adjusting unit 1150 configured to adjust the plurality of first borders based on the plurality of second feature vectors.
Fig. 12 is a diagram illustrating an apparatus for recognizing a target object according to another embodiment. As shown in fig. 12, the apparatus 1200 includes:
an obtaining unit 1210 configured to obtain a sample picture with an annotation, where the annotation includes at least one category label corresponding to at least one picture region including at least one target object; a determining unit 1220, configured to determine, based on the sample picture, a plurality of candidate regions that are potential targets of the at least one target object and a plurality of first feature vectors corresponding to the candidate regions; a converting unit 1230, configured to convert the plurality of first feature vectors into a plurality of second feature vectors, wherein for an ith first feature vector, the ith first feature vectors are subjected to weighted summation based on the correlation between the ith first feature vector and the first feature vectors to obtain a corresponding ith second feature vector; wherein i is a positive integer; a first prediction unit 1240 configured to input the plurality of first feature vectors into a first classification layer, resulting in a plurality of first prediction results; a second prediction unit 1250 configured to input the plurality of second feature vectors into a second classification layer to obtain a plurality of second prediction results; a training unit 1260 configured to train the first and second classification layers using the plurality of first predictors, the plurality of second predictors, and the at least one class label; the second classification layer is used for classifying the target object of the sample picture to be identified.
In one embodiment, the converting unit 1230 is specifically configured to: convert some of the plurality of first feature vectors into zero vectors as the corresponding second feature vectors, wherein the intersection ratio IOU between the candidate region corresponding to each of these first feature vectors and each picture region in the at least one picture region is less than a first predetermined threshold.
In one embodiment, the training unit 1260 specifically includes: a determining subunit 1261, configured to determine, based on the at least one picture region and the at least one category label, a category label corresponding to each candidate region in the plurality of candidate regions; a training subunit 1262, configured to train the first classification layer and the second classification layer by using the plurality of first prediction results, the plurality of second prediction results, and the category labels corresponding to the candidate regions.
In a specific embodiment, the plurality of candidate regions includes an arbitrary first candidate region; the determining subunit 1261 is specifically configured to: calculating an intersection ratio IOU of the first candidate area and each picture area in the at least one picture area to obtain at least one IOU value; determining a category label corresponding to a corresponding picture area as a category label of the first candidate area when an IOU value larger than a predetermined threshold value exists in the at least one IOU value; or, in the absence of an IOU value greater than a predetermined threshold value in the at least one IOU value, determining a non-target object as a category label for the first candidate region.
In a specific embodiment, the determining unit 1220 is specifically configured to: carrying out convolution processing on the sample picture to obtain a corresponding convolution characteristic diagram; determining the plurality of candidate regions and the plurality of first feature vectors using a region generation network (RPN) based on the convolution feature map;
the training unit 1260 is specifically configured to: training the first classification layer, the second classification layer, and the RPN network using the first plurality of predictors, the second plurality of predictors, and the at least one class label.
As above, according to an embodiment of a further aspect, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2 or fig. 4 or fig. 7 or fig. 8.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor which, when executing the executable code, implements the method described in connection with fig. 2 or fig. 4 or fig. 7 or fig. 8.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the embodiments disclosed in the present specification are further described in detail, it should be understood that the above-mentioned embodiments are only specific embodiments of the embodiments disclosed in the present specification, and are not intended to limit the scope of the embodiments disclosed in the present specification, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the embodiments disclosed in the present specification should be included in the scope of the embodiments disclosed in the present specification.