Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a vehicle loss detection method according to an embodiment of the present invention. The present embodiment is applicable to vehicle loss detection scenarios. The method may be executed by an electronic device, which may be a computer device or a terminal, and specifically includes the following steps:
Step 110, acquiring a target image.
The target image is an image used for vehicle loss detection. A user may photograph the damaged vehicle with a handheld terminal and use the resulting picture as the target image. Alternatively, a pre-captured image may be imported into a computer device as the target image.
Step 120, inputting the target image into a network model, where the backbone network of the network model comprises a Swin Transformer network, and the backbone network is used for predicting the damage position coordinates and the damage category of the target image based on the Swin Transformer network.
The structure of the Swin Transformer network is shown in FIG. 2 and includes a patch partition layer and four stages. The first stage includes a linear embedding layer and Swin Transformer blocks; each subsequent stage performs one down-sampling step through patch merging followed by Swin Transformer blocks.
Illustratively, an input target image of size 224 × 224 is divided into a set of non-overlapping patches by the patch partition layer. Each patch has a size of 4 × 4, and since the target image has 3 color channels, each patch has a feature dimension of 4 × 4 × 3 = 48; the number of patches is H/4 × W/4.
In stage 1, a linear embedding layer first changes the feature dimension of each partitioned patch to C, and the result is then fed into the Swin Transformer blocks. Stages 2 to 4 all operate in the same way: a patch merging layer first merges adjacent 2 × 2 patches, so that after stage 2 the number of patches becomes H/8 × W/8 and the concatenated feature dimension becomes 4C, and so on. The feature vector of the target image is processed through the four stages to obtain the vehicle damage category and the damage position information. In the Swin Transformer network, the size of each patch is preset, and the number of patches is determined from that preset size.
The patch partition layer is used to divide the image into a plurality of patches and to obtain a feature vector for each patch. Stages 1 to 4 are used for image recognition based on these feature vectors, yielding the damage position coordinates and the damage category of the target image. Stage 1 processes the feature vectors of the target image patch by patch. Stage 2 merges the patches from stage 1, giving H/8 × W/8 patches, and processes the feature vectors of the target image within the merged patches. By analogy, each subsequent stage merges the patches of the previous stage and processes the feature vectors according to the merged patches. After stage 4 produces the feature vector of the target image, the feature vector is mapped to a neural network head for image recognition.
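To make the patch arithmetic above concrete, the following is a minimal PyTorch sketch, not the patented implementation; the class name and the strided-convolution realization of "partition + linear embedding" are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchPartition(nn.Module):
    """Split an image into non-overlapping 4 x 4 patches and embed them to dim C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A stride-4 convolution implements "partition + linear embedding" in one step:
        # each 4x4x3 = 48-dimensional patch is projected to C = embed_dim channels.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C)

img = torch.randn(1, 3, 224, 224)
tokens = PatchPartition()(img)
print(tokens.shape)  # torch.Size([1, 3136, 96]) -> 56*56 patches, dimension C=96
# Each 2x2 patch-merging step then halves the grid (56 -> 28 -> 14 -> 7)
# while the channel width grows C -> 2C -> 4C -> 8C across the stages.
```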
Optionally, inputting the target image into the network model includes: convolving the target image through a convolution layer to obtain convolution data; and using the convolution data as the input of the Swin Transformer network.
Alternatively, a convolution layer is provided before the patch partition layer, and the target image is convolved by this layer. Illustratively, two 3 × 3 convolution layers are configured; the target image is convolved by these two layers and converted into convolution data, which is then input to the patch partition layer.

Convolving the image first not only reduces the complexity of subsequent calculation but also improves model accuracy, and using two 3 × 3 convolution layers further improves convolution efficiency.
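A minimal sketch of such a convolution stem follows. The intermediate width of 32 channels, the stride of 1, and the return to 3 output channels (so the patch partition above still sees an RGB-shaped input) are assumptions for illustration; the patent does not fix them:

```python
import torch.nn as nn

conv_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),  # keeps H x W unchanged
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 3, kernel_size=3, stride=1, padding=1),  # back to 3 channels
)
# conv_stem(target_image) yields the "convolution data" that is then fed
# into the patch partition layer of the Swin Transformer backbone.
```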
After the convolution data is input to the patch partition layer, it is divided into a set of non-overlapping patches that serve as the input features of the Swin Transformer network.

The Swin Transformer backbone is formed by stacking Swin Transformer blocks in each stage. The input features are transformed in feature dimension by a linear embedding layer, and the network achieves feature reuse by merging adjacent patches of the input.
As shown in fig. 3, each Swin Transformer block consists of a shifted-window-based MSA (multi-head self-attention) module followed by a two-layer MLP (multi-layer perceptron). A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each MSA and MLP. The MSA module divides the input picture into non-overlapping windows and then performs self-attention within each window, so the computational complexity is linear in the image size.
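The block structure just described can be sketched as follows. This is a simplified illustration that omits window shifting and the relative position bias, so it is not the full published Swin Transformer block; all names are hypothetical:

```python
import torch
import torch.nn as nn

class WindowMSABlock(nn.Module):
    """LN -> window MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=96, num_heads=3, window=7, mlp_ratio=4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, H, W):              # x: (B, H*W, C); H, W divisible by window
        B, L, C = x.shape
        shortcut = x
        x = self.norm1(x).view(B, H, W, C)
        w = self.window                       # partition into non-overlapping w x w windows
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)           # (num_windows * B, w*w, C)
        x, _ = self.attn(x, x, x)             # self-attention inside each window only
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, L, C)
        x = shortcut + x                      # residual after MSA
        return x + self.mlp(self.norm2(x))    # residual after MLP

blk = WindowMSABlock()
out = blk(torch.randn(1, 56 * 56, 96), H=56, W=56)  # -> (1, 3136, 96)
```

Because attention never crosses window boundaries, the cost grows with the number of windows, i.e. linearly in the image size, as stated above.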
Optionally, the Swin Transformer network includes a plurality of Swin Transformer blocks, and each Swin Transformer block includes a plurality of MSA layers;
the input of the MSA layer is provided with a first convolution layer; the output of the MSA layer is provided with a second convolutional layer.
For each MSA layer, a first convolution layer is set at its input for dimensionality reduction, and a second convolution layer is set at its output for dimensionality restoration. Illustratively, both the first and the second convolution layers may be 1 × 1 convolution layers; that is, the input of the MSA layer is provided with a 1 × 1 convolution layer, and the output of the MSA layer is provided with a 1 × 1 convolution layer. Providing convolution layers at the input and output of each MSA layer improves the efficiency of the feature operations and increases the operation speed.
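A sketch of this 1 × 1 convolution sandwich; the 2× reduction ratio is an assumption chosen only for illustration:

```python
import torch.nn as nn

dim, reduced = 96, 48
reduce_conv = nn.Conv2d(dim, reduced, kernel_size=1)  # first conv: reduce feature dim
expand_conv = nn.Conv2d(reduced, dim, kernel_size=1)  # second conv: restore feature dim
# The MSA layer then attends over `reduced` channels, so the query/key/value
# projections cost roughly half as much for this choice of ratio.
```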
Optionally, the backbone network is connected to a neck network, and the neck network includes:
feature Pyramid Networks (FPN) and Balanced Feature Pyramid Networks (BFP).
The feature pyramid network extracts features from the image at each scale and can generate multi-scale feature representations, so that the feature maps at all levels carry strong semantic information, including some high-resolution feature maps.
The outputs of stages 1 to 4 are arranged by size from the bottom to the top of the feature pyramid network. The feature pyramid network extracts features from each layer to generate a multi-scale feature representation and fuses these features; the image at each layer carries certain semantic information. Feature fusion is thus performed through the feature pyramid network, while the balanced feature pyramid network is used to strengthen the multi-level features by deeply integrating the balanced semantic features.
The neck network connects the backbone network and the head network, so that the features output by the backbone can be applied to the head more efficiently, improving data processing efficiency.
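A minimal FPN sketch over the four stage outputs is given below. The 256-channel output width and the layer choices are assumptions, and the balanced feature pyramid (BFP) step is only summarized in a comment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_dims=(96, 192, 384, 768), out_dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)
        self.smooth = nn.ModuleList(nn.Conv2d(out_dim, out_dim, 3, padding=1)
                                    for _ in in_dims)

    def forward(self, feats):  # feats: stage1..stage4 outputs, high -> low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down fusion pathway
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # BFP (sketch): rescale all levels to one size, average them into a single
        # "balanced" map, refine it, then add it back to each level.
        return [s(x) for s, x in zip(self.smooth, laterals)]

feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
outs = SimpleFPN()(feats)  # four fused maps, each with 256 channels
```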
Step 130, determining a damage detection result according to the damage position coordinates and the damage category.
After the Swin Transformer network outputs the damage position coordinates and the damage category through forward propagation in step 120, the final damage detection result can be screened out by a Soft-NMS (soft non-maximum suppression) algorithm.
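For illustration, a linear-decay Soft-NMS could look like the sketch below; the thresholds are common defaults, not values taken from this embodiment:

```python
import numpy as np

def pairwise_iou(box, others):
    """IoU between one box and an array of boxes (all x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(others) - inter)

def soft_nms(boxes, scores, iou_thr=0.3, score_thr=0.001):
    """Linear-decay Soft-NMS; returns the indices of the kept boxes."""
    scores = scores.copy()
    idxs = scores.argsort()[::-1].tolist()
    keep = []
    while idxs:
        best = idxs.pop(0)
        keep.append(best)
        if not idxs:
            break
        rest = np.array(idxs, dtype=int)
        ious = pairwise_iou(boxes[best], boxes[rest])
        mask = ious > iou_thr
        scores[rest[mask]] *= (1.0 - ious[mask])  # decay overlapping scores, don't delete
        survivors = [int(i) for i in rest if scores[i] > score_thr]
        survivors.sort(key=lambda i: -scores[i])
        idxs = survivors
    return keep
```

Unlike hard NMS, overlapping boxes are down-weighted rather than discarded, which helps when several damaged regions genuinely overlap.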
According to the vehicle loss detection method provided by this embodiment of the invention, a target image is obtained; the target image is input into a network model whose backbone network comprises a Swin Transformer network used for predicting the damage position coordinates and the damage category of the target image; and a damage detection result is determined according to the damage position coordinates and the damage category. Compared with current CNN-based vehicle loss detection, which is not accurate enough, the method of this embodiment uses the Swin Transformer network as the backbone, detects more accurately than the CNN approach, and can effectively locate and identify the damaged part. By adopting the Swin Transformer as the backbone network for feature extraction, the spatial relations among image pixels and the weighted selection of features can be exploited, achieving better feature extraction and utilization. At the same time, the Swin Transformer retains CNN characteristics such as locality, translation invariance, and residual learning, so it can outperform CNN methods while avoiding the heavy computation and large memory consumption of other vision Transformer schemes. The Swin Transformer blocks make the method applicable to a wide range of vehicle types and detection settings, including field environments and complex photographing backgrounds, enabling efficient damage assessment of damaged vehicle parts, with the self-attention mechanism further optimizing damage assessment efficiency.
Example two
Fig. 4 is a flowchart of a vehicle loss detection method according to a second embodiment of the present invention, which further develops the above embodiment: before the target image is acquired in step 110, the method further includes training the Swin Transformer network. The first embodiment provides an implementation in which a Swin Transformer network serves as the backbone for vehicle loss detection; this embodiment provides the training scheme for that network. The method may be implemented by the following steps:
Step 210, marking the vehicle loss history pictures according to a marking criterion, and configuring the damage categories of the vehicle loss history pictures.
The damage categories and the marking criteria may be determined jointly by claims settlement personnel and algorithm engineers. The damage categories cover vehicle damage of varying severity that requires reimbursement. The marking criteria include rules for special cases, such as overlapping damage of several kinds, cases where it is uncertain whether something is damage, and cases where the cause of the damage is uncertain. The damage categories include: scratches, dents, wrinkles, dead folds, tears, missing parts, and the like.
History pictures of vehicle body damage are marked in batches based on the damage categories. Optionally, the marking may be performed manually: each damage instance appearing in a picture is marked with a rectangular box, and its damage category is recorded. Further, pictures in which the damage category is difficult to distinguish are removed, and a vehicle body damage database is constructed.
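For illustration only, one record in such a database might look like the following; the field names and values are hypothetical, not prescribed by this embodiment:

```python
# One labeled record: a rectangular box per damage instance plus its category.
annotation = {
    "image": "claims/2021/car_0001.jpg",
    "damages": [
        {"bbox": [412, 230, 640, 388],  # x1, y1, x2, y2 of the rectangular box
         "category": "scratch"},
        {"bbox": [120, 500, 260, 610],
         "category": "dent"},
    ],
}
```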
Step 220, training the Swin Transformer network according to the marked vehicle loss history pictures.
Optionally, one part of the images in the vehicle body damage database is used as the training set and another part as the test set.
All pictures in the training set undergo data enhancement operations such as random cropping, random rotation, and random changes of saturation, hue, and contrast, are then scaled to 896 × 896 pixels, and are input into the Swin Transformer for training. The training process takes the vehicle damage images and their damage category labels as input to train the Swin Transformer network. Testing is performed on the test set after every epoch, and the model parameters with the highest detection mAP are saved. The Swin Transformer network is optimized through multiple such iterations.
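A hedged sketch of this training loop follows; `model`, `train_one_epoch`, `evaluate_map`, and the data loaders are assumed helpers, and the augmentation parameters and epoch count are illustrative, not values fixed by this embodiment:

```python
import torch
from torchvision import transforms

# Random crop, rotation, and color jitter, then scale to 896 x 896 pixels.
augment = transforms.Compose([
    transforms.RandomResizedCrop(896),
    transforms.RandomRotation(15),
    transforms.ColorJitter(saturation=0.4, hue=0.1, contrast=0.4),
    transforms.ToTensor(),
])

num_epochs = 50                                       # assumed training length
best_map = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)   # assumed helper: one training pass
    current_map = evaluate_map(model, test_loader)    # assumed helper: test-set mAP
    if current_map > best_map:                        # keep the best checkpoint only
        best_map = current_map
        torch.save(model.state_dict(), "best_swin_damage.pth")
```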
Optionally, training the Swin Transformer network according to the marked vehicle loss history pictures includes:

in the training process, performing the regression calculation of the Swin Transformer network according to a distance penalty loss function.
IoU, also called Intersection over Union, denotes the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box. A network is usually trained with an IoU-based bounding box localization loss function, but the accuracy obtained in that way is low. Therefore, the regression calculation of the Swin Transformer network is performed according to the distance penalty loss function, which improves the localization accuracy of the prediction boxes. The DIoU loss can still provide a moving direction for the bounding box even when it does not overlap the target box. In addition, the DIoU loss converges faster than the IoU loss, and in cases where one box contains the other in the horizontal or vertical direction, the DIoU loss still achieves fast regression.
Illustratively, the distance penalty loss function (DIoU loss) is used to perform the bounding box regression calculation of the Swin Transformer network. The distance penalty loss L_DIoU can be calculated by the following formula:

L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²

where b and b^gt respectively denote the center points of the prediction box and the ground-truth box, ρ²(b, b^gt) denotes the squared Euclidean distance between the two center points, c denotes the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box, and IoU denotes the intersection-over-union of the prediction box and the ground-truth box.
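A direct implementation sketch of this formula (the standard published DIoU loss, written in PyTorch; not code from this embodiment). Boxes are tensors of shape (N, 4) in x1, y1, x2, y2 form:

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    # IoU: intersection over union of predicted and ground-truth boxes
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the two box centers
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((center_p - center_t) ** 2).sum(dim=1)
    # c^2: squared diagonal of the smallest box enclosing both boxes
    enclose_lt = torch.min(pred[:, :2], target[:, :2])
    enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enclose_rb - enclose_lt) ** 2).sum(dim=1) + eps
    return (1 - iou + rho2 / c2).mean()
```

The ρ²/c² term is what keeps the gradient informative even when the boxes do not overlap, which is why the loss still provides a moving direction in that case.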
Optionally, training the Swin Transformer network according to the marked vehicle loss history pictures includes:

in the training process, performing data enhancement on the vehicle loss history pictures; and training the Swin Transformer network with the data-enhanced vehicle loss history pictures.
In the training process, different data enhancement methods may be adopted for the vehicle loss history pictures, including trying different types of optimizers, adopting a learning-rate decay strategy, and applying regularization techniques. In addition, multi-scale training is run for enough epochs for the loss values of the model on the training set and the test set to converge, and the model parameters with the highest mAP on the test set are saved. One full forward-and-backward pass of the complete data set through the neural network is called an epoch.
In addition, a small number of targeted cases, including mosaics and dim light, are prone to false detections; therefore, mosaic and image saturation changes are randomly added to the data enhancement, as sketched below.
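A sketch of such targeted enhancement, assuming a pixelation routine to mimic mosaic artifacts and a saturation jitter for the lighting cases; the probability, region size, and block size are illustrative assumptions:

```python
import random
from PIL import Image
from torchvision import transforms

def random_mosaic_patch(img: Image.Image, p=0.2, block=16):
    """With probability p, pixelate a random region to mimic mosaic artifacts."""
    if random.random() > p:
        return img
    w, h = img.size
    x1, y1 = random.randint(0, w // 2), random.randint(0, h // 2)
    region = img.crop((x1, y1, x1 + w // 4, y1 + h // 4))
    small = region.resize((max(1, region.width // block),
                           max(1, region.height // block)), Image.NEAREST)
    img.paste(small.resize(region.size, Image.NEAREST), (x1, y1))
    return img

# Random saturation change for the dim-light / color cases mentioned above.
saturation_jitter = transforms.ColorJitter(saturation=0.5)
```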
Step 230, a target image is acquired.
Step 240, inputting the target image into a network model, where the backbone network of the network model comprises a Swin Transformer network, and the backbone network is used for predicting the damage position coordinates and the damage category of the target image based on the Swin Transformer network.
Step 250, determining a damage detection result according to the damage position coordinates and the damage category.
The vehicle loss detection method provided by the embodiment of the application can train the network more efficiently, so that the trained network is more accurate.
Example three
Fig. 5 is a schematic structural diagram of a vehicle loss detection apparatus according to a third embodiment of the present invention. The present embodiment is applicable to vehicle loss detection scenarios; the apparatus may be configured in an electronic device, which may be a computer device or a terminal, and specifically includes: an image acquisition module 310, a detection module 320, and a detection result determination module 330.
An image acquisition module 310 for acquiring a target image;
the detection module 320 is configured to input the target image into a network model, where a backbone network of the network model includes a Swin Transformer network, and the backbone network is used for predicting the damage position coordinates and the damage category of the target image based on the Swin Transformer network;
and a detection result determining module 330, configured to determine a damage detection result according to the damage position coordinate and the damage category.
On the basis of the foregoing embodiment, the detection module 320 is configured to:
convolving the image through the convolution layer to obtain convolution data;
and taking the convolution data as an input of a Swin Transformer network.
On the basis of the above embodiment, the Swin Transformer network includes a plurality of Swin Transformer blocks, and each Swin Transformer block includes a plurality of MSA layers;
the input of the MSA layer is provided with a first convolution layer;
the output of the MSA layer is provided with a second convolutional layer.
Specifically, the input of the MSA layer is provided with a 1 × 1 convolution layer, and the output of the MSA layer is provided with a 1 × 1 convolution layer.
On the basis of the above embodiment, the backbone network is connected to a neck network, and the neck network includes:
a feature map pyramid network and a balanced feature pyramid network.
On the basis of the above embodiment, the apparatus further comprises a training module. The training module is used for:
marking the vehicle loss historical picture according to a marking criterion, and configuring the damage category of the vehicle loss historical picture;
and training the Swin transform network according to the marked vehicle loss historical picture.
On the basis of the above embodiment, the training module is configured to:
and in the training process, performing regression calculation of the Swin transducer network according to the distance punishment damage function.
On the basis of the above embodiment, the training module is configured to:
in the training process, data enhancement is carried out according to the vehicle loss historical picture;
and training the Swin Transformer network by using the data-enhanced car loss historical picture.
In the vehicle loss detection apparatus provided by this embodiment of the present invention, the image acquisition module 310 acquires a target image; the detection module 320 inputs the target image into a network model whose backbone network comprises a Swin Transformer network used for predicting the damage position coordinates and the damage category of the target image; and the detection result determination module 330 determines a damage detection result according to the damage position coordinates and the damage category. Compared with current CNN-based vehicle loss detection, which is not accurate enough, the apparatus of this embodiment uses the Swin Transformer network as the backbone, detects more accurately than the CNN approach, and can effectively locate and identify the damaged part. By adopting the Swin Transformer as the backbone network for feature extraction, the spatial relations among image pixels and the weighted selection of features can be exploited, achieving better feature extraction and utilization. At the same time, the Swin Transformer retains CNN characteristics such as locality, translation invariance, and residual learning, so it can outperform CNN methods while avoiding the heavy computation and large memory consumption of other vision Transformer schemes. The Swin Transformer blocks make the apparatus applicable to a wide range of vehicle types and detection settings, including field environments and complex photographing backgrounds, enabling efficient damage assessment of damaged vehicle parts, with the self-attention mechanism further optimizing damage assessment efficiency.
The vehicle loss detection device provided by the embodiment of the invention can execute the vehicle loss detection method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 6 is a schematic structural diagram of an electronic apparatus according to a fourth embodiment of the present invention, as shown in fig. 6, the electronic apparatus includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of the processors 40 in the electronic device may be one or more, and one processor 40 is taken as an example in fig. 6; the processor 40, the memory 41, the input device 42 and the output device 43 in the electronic apparatus may be connected by a bus or other means, and the bus connection is exemplified in fig. 6.
The memory 41, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the vehicle loss detection method in the embodiment of the present invention (for example, the image acquisition module 310, the detection module 320, the detection result determination module 330, and the training module in the vehicle loss detection apparatus). The processor 40 executes various functional applications of the electronic device and data processing by executing software programs, instructions, and modules stored in the memory 41, that is, implements the vehicle loss detection method described above.
The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 41 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 41 may further include memory located remotely from processor 40, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function controls of the electronic apparatus. The output device 43 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a vehicle loss detection method, the method comprising:
acquiring a target image;
inputting the target image into a network model, wherein the backbone network of the network model comprises a Swin Transformer network, and the backbone network is used for predicting the damage position coordinates and the damage category of the target image based on the Swin Transformer network;
and determining a damage detection result according to the damage position coordinate and the damage category.
On the basis of the above embodiment, the inputting the target image into the network model includes:
convolving the image through the convolution layer to obtain convolution data;
and taking the convolution data as an input of a Swin Transformer network.
On the basis of the above embodiment, the Swin Transformer network includes a plurality of Swin Transformer blocks, and each Swin Transformer block includes a plurality of MSA layers;
the input of the MSA layer is provided with a first convolution layer; (the input of the MSA layer is provided with 1 x 1 convolution layer)
The output of the MSA layer is provided with a second convolutional layer.
Specifically, the input of the MSA layer is provided with a 1 × 1 convolution layer, and the output of the MSA layer is provided with a 1 × 1 convolution layer.
On the basis of the above embodiment, the backbone network is connected to a neck network, and the neck network includes:
a feature map pyramid network and a balanced feature pyramid network.
On the basis of the above embodiment, before acquiring the target image, the method further includes:
marking the vehicle loss historical picture according to a marking criterion, and configuring the damage category of the vehicle loss historical picture;
and training the Swin Transformer network according to the marked vehicle loss history pictures.
On the basis of the above embodiment, the training of the Swin Transformer network according to the marked vehicle loss history pictures includes:
and in the training process, performing regression calculation of the Swin transducer network according to the distance punishment damage function.
On the basis of the above embodiment, the training of the Swin Transformer network according to the marked vehicle loss history pictures includes:
in the training process, performing data enhancement on the vehicle loss history pictures;

and training the Swin Transformer network with the data-enhanced vehicle loss history pictures.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the operations of the method described above, and may also execute the relevant operations in the vehicle loss detection method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the vehicle loss detection apparatus, the included units and modules are only divided according to the functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.