CN110647853A - Computer-implemented vehicle damage assessment method and device - Google Patents

Info

Publication number
CN110647853A
Authority
CN
China
Prior art keywords
damage
component
video stream
detection information
frame
Prior art date
Legal status
Pending
Application number
CN201910923001.6A
Other languages
Chinese (zh)
Inventor
蒋晨
程远
郭昕
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201910923001.6A
Publication of CN110647853A
Priority to PCT/CN2020/093890 (published as WO2021057069A1)
Legal status: Pending

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24323: Tree-organised classifiers
    • G06F 18/253: Fusion techniques of extracted features
    • G06Q 40/08: Insurance
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/08: Detecting or categorising vehicles (indexing scheme)

Abstract

The embodiments of the present specification provide a computer-implemented vehicle damage assessment method for performing intelligent damage assessment on a video stream captured of a damaged vehicle. Specifically, preliminary target detection and feature extraction are first performed on image frames in the video stream to obtain a video stream feature matrix. Target detection is also performed again on key frames in the video stream to obtain key frame vectors. Then, for each component, the features in the video stream feature matrix and the key frame vectors are fused to generate the comprehensive damage features of that component. Meanwhile, preliminary damage assessment is performed based on the video stream feature matrix to obtain a preliminary damage assessment result. Finally, damage assessment is performed again based on the preliminary damage assessment result and the comprehensive damage features of each component to obtain a final damage assessment result for the video stream.

Description

Computer-implemented vehicle damage assessment method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of video processing technology, and more particularly, to a method and apparatus for processing video streams using machine learning for intelligent vehicle damage assessment.
Background
In a traditional vehicle insurance claim settlement scenario, an insurance company needs to send professional survey and loss-assessment personnel to the accident site to conduct an on-site survey and damage assessment, give a vehicle repair plan and a compensation amount, take photos of the scene, and keep the loss-survey photos on file for back-office reviewers to verify the damage and the price. Because manual survey and damage assessment are required, insurance companies bear substantial labor costs and training costs for professional knowledge. From the perspective of an ordinary user, the claim settlement process involves waiting for the surveyor to take photos on site, for the loss assessor to assess the damage at the repair shop, and for the loss verifier to verify the damage in the back office; the claim settlement cycle can last 1 to 3 days, the user waits a long time, and the experience is poor.
To address this industry pain point of high labor cost, it has been proposed to apply artificial intelligence and machine learning to the vehicle damage assessment scenario. The hope is to use computer vision and image recognition technology from the field of artificial intelligence to automatically recognize, from on-site damage photos taken by an ordinary user, the vehicle damage reflected in the photos, and to automatically provide a repair plan. In this way, no manual survey, damage assessment, or damage verification is needed, the cost to insurance companies is greatly reduced, and the vehicle insurance claim experience of ordinary users is improved.
However, the accuracy of vehicle damage determination in current intelligent damage assessment schemes still needs to be improved. Therefore, an improved scheme is desired that can further optimize the detection of vehicle damage and improve recognition accuracy.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and an apparatus for intelligent vehicle damage assessment based on a video stream, which can comprehensively improve the accuracy of intelligent damage assessment.
According to a first aspect, there is provided a computer-implemented method of vehicle damage assessment, the method comprising: obtaining a feature matrix of a video stream, the video stream being captured of a damaged vehicle, the feature matrix including at least N M-dimensional vectors that correspond to N image frames in the video stream, respectively, and are arranged in the time sequence of the N image frames, each M-dimensional vector including at least, for the corresponding image frame, component detection information obtained by a pre-trained first component detection model and damage detection information obtained by a pre-trained first damage detection model; obtaining K key frames in the video stream; generating, for the K key frames, corresponding K key frame vectors, each key frame vector including, for the corresponding key frame image, component detection information obtained by a pre-trained second component detection model and damage detection information obtained by a pre-trained second damage detection model; fusing the component detection information and the damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain the comprehensive damage features of each component; obtaining a preliminary damage result, the preliminary damage result including the damage result of each component obtained after the feature matrix is input into a pre-trained convolutional neural network; and inputting the comprehensive damage features of each component and the preliminary damage result into a pre-trained decision tree model to obtain a final damage assessment result for the video stream.
In one embodiment, obtaining the feature matrix for the video stream includes receiving the feature matrix from a mobile terminal.
In one embodiment, obtaining the feature matrix of the video stream comprises: acquiring the video stream; for each image frame in the N image frames, carrying out component detection through the first component detection model to obtain component detection information, and carrying out damage detection through the first damage detection model to obtain damage detection information; forming an M-dimensional vector corresponding to each image frame based on at least the component detection information and the damage detection information; and generating the feature matrix according to the respective M-dimensional vectors of the N image frames.
In one embodiment, obtaining K key frames in the video stream comprises: the K key frames are received from the mobile terminal.
In one embodiment, the second component detection model is different from the first component detection model, and the second damage detection model is different from the first damage detection model.
In one embodiment, fusing the component detection information and the damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain the comprehensive damage features of each component includes: determining at least one candidate damaged component, including a first component; for each vector in the N M-dimensional vectors and the K key frame vectors, performing intra-frame fusion on the component detection information and the damage detection information in the single vector to obtain the frame comprehensive feature of the first component; and performing inter-frame fusion on the frame comprehensive features of the first component obtained for the respective vectors to obtain the comprehensive damage feature of the first component.
In one embodiment, obtaining the preliminary damage result includes receiving the preliminary damage result from the mobile terminal.
In one embodiment, the feature matrix includes M rows and S columns, where S is not less than N, the convolutional neural network includes several one-dimensional convolution kernels, and inputting the feature matrix into the pre-trained convolutional neural network includes: performing convolution processing on the feature matrix along its row dimension using the plurality of one-dimensional convolution kernels.
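By way of illustration only, the following Python (PyTorch) sketch shows a convolutional network of this shape: the M feature dimensions are treated as channels and the one-dimensional kernels slide along the S frame slots. The layer sizes, the pooling choice, and the per-component output head are assumptions of this sketch, not the model disclosed in the embodiments.

```python
import torch
import torch.nn as nn

class PreliminaryDamageCNN(nn.Module):
    """Sketch of a 1-D CNN over an M x S feature matrix.

    Rows are the M feature dimensions, columns are the S frame slots; treating
    the M dimensions as channels and sliding along the frame axis is an
    assumption of this sketch."""

    def __init__(self, m_dims: int = 80, num_components: int = 100, num_classes: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(m_dims, 128, kernel_size=3, padding=1),  # 1-D kernels slide along the frames
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                           # collapse the frame axis
        )
        self.head = nn.Linear(128, num_components * num_classes)
        self.num_components, self.num_classes = num_components, num_classes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, S) feature matrices
        z = self.backbone(x).squeeze(-1)                       # (batch, 128)
        return self.head(z).view(-1, self.num_components, self.num_classes)
```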
In one embodiment, the convolutional neural network is trained by: obtaining a plurality of training samples, wherein each training sample comprises a sample feature matrix of each video stream and a corresponding damage result label, and the sample feature matrix of each video stream at least comprises N M-dimensional vectors which respectively correspond to N image frames in each video stream and are arranged according to the time sequence of the N image frames; training the convolutional neural network using the plurality of training samples.
In a specific embodiment, the damage result label includes at least one of: damaged material, damaged type, and damaged component type.
In one embodiment, after obtaining the final damage assessment result for the video stream, the method further comprises: determining a corresponding replacement scheme according to the final damage assessment result.
According to a second aspect, there is provided a computer-implemented vehicle damage assessment apparatus, the apparatus comprising: a first obtaining unit configured to obtain a feature matrix of a video stream, the video stream being captured for a damaged vehicle, the feature matrix including at least N M-dimensional vectors that correspond to N image frames in the video stream, respectively, and are arranged in a time sequence of the N image frames, each of the M-dimensional vectors including at least, for the corresponding image frame, component detection information obtained by a pre-trained first component detection model, and damage detection information obtained by the pre-trained first damage detection model; a second obtaining unit configured to obtain K key frames in the video stream; a generating unit configured to generate, for the K keyframes, corresponding K keyframe vectors, each keyframe vector including, for a corresponding keyframe image, component detection information obtained by a pre-trained second component detection model, and damage detection information obtained by a pre-trained second damage detection model; the fusion unit is configured to fuse component detection information and damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain comprehensive damage characteristics of each component; a third obtaining unit, configured to obtain a preliminary damage result, where the preliminary damage result includes a damage result of each component obtained after inputting the feature matrix into a pre-trained convolutional neural network; and a damage assessment unit configured to input the comprehensive damage features of the components and the preliminary damage result into a pre-trained decision tree model to obtain a final damage assessment result for the video stream.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus provided in the embodiments of the present specification, intelligent damage assessment is performed based on a video stream generated by shooting a damaged vehicle. Specifically, on one hand, the feature matrix of the video stream and the information of the key frames are fused to obtain comprehensive damage features; on the other hand, the feature matrix is input into a pre-trained convolutional neural network to obtain a preliminary damage result; the comprehensive damage features and the preliminary damage result are then input together into a decision model to obtain the final damage assessment result. In this way, the accuracy of intelligent vehicle damage assessment is comprehensively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a diagram illustrating an exemplary implementation scenario of one embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a vehicle damage assessment method according to one embodiment;
FIG. 3a shows an example of component detection information obtained for a certain image frame;
FIG. 3b shows an example of damage detection information obtained for a certain image frame;
FIG. 4 illustrates a schematic diagram of convolving a feature matrix according to a specific example;
FIG. 5 shows a schematic block diagram of a vehicle damage assessment device according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Intelligent vehicle damage assessment mainly involves automatically identifying the damage condition of a vehicle from pictures of the vehicle damage scene taken by an ordinary user. To identify the vehicle damage, an approach commonly adopted in the industry is to compare the vehicle damage picture to be identified, taken by the user, against a massive historical database to find similar pictures, and to determine the damaged components and the degree of damage in the picture to be identified based on the damage assessment results of the similar pictures. However, such approaches compromise recognition accuracy.
According to one embodiment, supervised machine learning is used to train picture-based target detection models; the models are used to detect the component objects and the damage objects of the vehicle separately, and the damage condition of the vehicle in the picture is then determined based on a comprehensive analysis of the detection results.
Furthermore, according to the conception and implementation framework of the present specification, an intelligent damage assessment method based on a video stream is provided, considering that a video stream can reflect the overall information of the vehicle more completely than isolated pictures. FIG. 1 is a schematic diagram of an exemplary implementation scenario of an embodiment disclosed in this specification. As shown in FIG. 1, a user may shoot the vehicle damage scene with a portable mobile terminal, such as a smartphone, to generate a video stream. The mobile terminal may be installed with an application or tool related to damage assessment and recognition (for example, referred to as "impairment treasure"). The application or tool may perform preliminary processing on the video stream, carrying out lightweight, preliminary target detection and feature extraction on N image frames, and the target detection result and feature extraction result of each frame may form an M-dimensional vector. The mobile terminal may thus generate, through this preliminary processing of the video stream, a feature matrix that includes at least the N M-dimensional vectors. The application in the mobile terminal may also determine key frames from the video stream, and perform preliminary damage assessment based on the generated feature matrix using a CNN damage assessment model to obtain a preliminary damage result.
Then, the mobile terminal may send the feature matrix, the key frame, and the preliminary damage result to the server.
The server generally has more powerful and reliable computing capability, so that the server can detect the target again by using a more complex and more accurate target detection model on the key frames in the video stream, and detect the component information and the damage information of the vehicle.
Then, the server side fuses the information of the feature matrix and the information detected aiming at the key frame to generate comprehensive damage features aiming at each component. It should be further understood that the preliminary damage results include damage assessment results for each component.
On this basis, the comprehensive damage features of each component and the preliminary damage result can be input into a decision tree model trained in advance to obtain the final damage assessment result for the video stream, so that intelligent damage assessment is realized.
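For orientation, the server-side flow just described can be sketched as follows. The helper `fuse_per_component` stands for the fusion procedure detailed below, `decision_tree` for the pre-trained tree model with an sklearn-style `predict` interface, and `preliminary_result` maps each component to a numeric encoding of its preliminary damage result; all of these names are illustrative assumptions, not the disclosed implementation.

```python
import numpy as np

def assess_video(feature_matrix, key_frame_vectors, preliminary_result,
                 fuse_per_component, decision_tree):
    """Sketch of the server-side flow of FIG. 1 (illustrative names only).

    feature_matrix:      the N x M matrix received from the mobile terminal
    key_frame_vectors:   vectors produced by the heavier server-side models
    preliminary_result:  {component: numeric vector encoding the preliminary damage result}
    fuse_per_component:  callable standing in for the fusion procedure described below
    decision_tree:       pre-trained tree model with an sklearn-style predict()
    """
    # Fuse the feature matrix with the key-frame vectors, per candidate component.
    comprehensive = fuse_per_component(feature_matrix, key_frame_vectors)  # {component: 1-D array}
    final_result = {}
    for component, feature in comprehensive.items():
        prelim = np.atleast_1d(preliminary_result.get(component, []))
        x = np.concatenate([feature, prelim]).reshape(1, -1)
        # The tree model sees both the comprehensive feature and the preliminary result.
        final_result[component] = decision_tree.predict(x)[0]
    return final_result
```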
The following describes a specific implementation process of the intelligent impairment assessment.
FIG. 2 shows a flow diagram of a vehicle damage assessment method according to one embodiment. The method can be executed by a server, and the server can be embodied as any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 2, the method comprises at least the following steps:
step 21, obtaining a feature matrix of a video stream, wherein the video stream is shot for a damaged vehicle, the feature matrix at least comprises N M-dimensional vectors which respectively correspond to N image frames in the video stream and are arranged according to a time sequence of the N image frames, and each M-dimensional vector at least comprises component detection information obtained by a pre-trained first component detection model and damage detection information obtained by the pre-trained first damage detection model for the corresponding image frame;
step 22, acquiring K key frames in the video stream;
step 23, generating K corresponding key frame vectors for the K key frames, where each key frame vector includes, for a corresponding key frame image, component detection information obtained through a pre-trained second component detection model, and damage detection information obtained through the pre-trained second damage detection model;
step 24, fusing the component detection information and the damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain the comprehensive damage characteristics of each component;
step 25, obtaining a preliminary damage result, wherein the preliminary damage result comprises the damage result of each component obtained after the feature matrix is input into a pre-trained convolutional neural network;
and step 26, inputting the comprehensive damage features of the components and the preliminary damage result into a pre-trained decision tree model to obtain a final damage assessment result for the video stream.
The manner in which the above steps are performed is described below.
First, in step 21, a feature matrix of a video stream is acquired.
It is understood that in the scene of vehicle damage assessment, the video stream is generated by shooting the damaged vehicle at the vehicle damage site by the user through an image acquisition device, such as a camera, in the mobile terminal. As mentioned above, the mobile terminal may perform preliminary processing on the video stream through a corresponding application or tool to generate the feature matrix.
Specifically, N image frames may be extracted from the video stream and subjected to preliminary processing. The N image frames may include each image frame in the video stream, and may also be image frames extracted at a predetermined time interval (e.g., 500ms), or image frames obtained from the video stream in other extraction manners.
For each extracted image frame, object detection and feature extraction are performed on the image frame, thereby generating an M-dimensional vector for each image frame.
As known to those skilled in the art, target detection is used to identify a specific target object from a picture and classify the target object. By training with the image samples marked with the target positions and the target types, various target detection models can be obtained. The component detection model and the damage detection model are specific applications of the target detection model. When a vehicle component is used as a target object for marking and training, a component detection model can be obtained; when a damage object on the vehicle is used as a target object for marking and training, a damage detection model can be obtained.
In this step, in order to perform the preliminary processing on each image frame, a pre-trained component detection model, referred to as a first component detection model herein, may be used to perform component detection on the image frame to obtain component detection information; and a pre-trained damage detection model, referred to herein as a first damage detection model, is used to perform damage detection on the image frames to obtain damage detection information.
It should be understood that the terms "first", "second", and the like herein are used only to distinguish similar objects and are not intended to limit their order.
In the art, target detection models are mostly implemented by various detection algorithms based on convolutional neural networks (CNN). In order to improve the computational efficiency of the conventional CNN, various lightweight network structures have been proposed, including, for example, SqueezeNet, MobileNet, ShuffleNet, Xception, and so on. These lightweight neural network structures reduce network parameters by adopting different convolution computation schemes, thereby simplifying the convolution computation of the traditional CNN and improving computational efficiency. Such lightweight network architectures are particularly suitable for running on mobile terminals with limited computing resources.
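As an illustration of why such structures are light, the following sketch shows the depthwise separable convolution block that MobileNet-style networks build on: a per-channel 3x3 convolution followed by a 1x1 pointwise convolution, which together need far fewer parameters than one standard 3x3 convolution. This is a generic sketch of the idea, not the first component or damage detection model of the embodiments.

```python
import torch.nn as nn

def depthwise_separable_conv(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """MobileNet-style building block: depthwise conv + 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),                 # depthwise: one filter per input channel
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise: mixes channels
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )
```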
Accordingly, in one embodiment, the first component detection model and the first damage detection model are implemented by using the above lightweight network structure.
By performing component detection on the image frame by using the first component detection model, component detection information in the image frame can be obtained. In general, the component detection information may include component detection frame information for framing a vehicle component from a corresponding image frame, and a component category predicted for the framed component. More specifically, the part detection frame information may include a position of the part detection frame, for example, expressed in (x, y, w, h) form, and picture convolution information corresponding to the part detection frame extracted from the image frame.
Similarly, when the first damage detection model is used to perform damage detection on an image frame, damage detection information in the image frame may be obtained, and the damage detection information may include damage detection frame information for selecting a damage object from a corresponding image frame and a damage type for predicting the damage selected by the frame.
FIG. 3a shows an example of component detection information obtained for a certain image frame. It can be seen that FIG. 3a includes several component detection boxes, each of which frames one component. The first component detection model may also output a predicted component category corresponding to each component detection box; for example, the number at the upper left corner of each rectangular box represents a component category. In FIG. 3a, for example, the numeral 101 represents the front right door, 102 represents the rear right door, 103 represents the door handle, and so on.
FIG. 3b shows an example of damage detection information obtained for a certain image frame. It can be seen that FIG. 3b includes a series of rectangular boxes, i.e., the damage detection boxes output by the first damage detection model, and each damage detection box frames one damage. The first damage detection model also outputs a predicted damage category corresponding to each damage detection box; for example, the numbers at the upper left corner of each rectangular box represent damage categories. In FIG. 3b, for example, the numeral 12 represents the damage category scratch, and other numerals may represent other damage categories, such as 10 for deformation, 11 for tearing, 13 for glass breakage, and so on.
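Putting the two kinds of detection information together, a per-frame detection record could be organized as in the following sketch; the dataclass layout and field names are assumptions for illustration, not a structure prescribed by the embodiments.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ComponentDetection:
    """One component detection box (field names are illustrative)."""
    box: Tuple[float, float, float, float]   # (x, y, w, h) position of the detection box
    category: int                            # predicted component category, e.g. 101 = front right door
    conv_features: list = field(default_factory=list)  # picture convolution features for the box

@dataclass
class DamageDetection:
    """One damage detection box (field names are illustrative)."""
    box: Tuple[float, float, float, float]   # (x, y, w, h) position of the damage detection box
    category: int                            # predicted damage category, e.g. 12 = scratch

@dataclass
class FrameDetections:
    """All detection information of a single image frame."""
    components: List[ComponentDetection] = field(default_factory=list)
    damages: List[DamageDetection] = field(default_factory=list)
```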
In one embodiment, the first component detection model is further used for performing image segmentation on the detected vehicle components to obtain the contour segmentation result for each component in the image frame.
As known to those skilled in the art, image segmentation is the segmentation or division of an image into regions that belong to/do not belong to a specific target object, the output of which may appear as a Mask (Mask) covering the specific target object region. In the art, various image segmentation models have been proposed based on various network structures and various segmentation algorithms, such as a CRF (conditional random field) -based segmentation model, a Mask R-CNN model, and the like. Component segmentation as a specific application of image segmentation may be used to divide a picture of a vehicle into regions that belong/do not belong to a particular component. The component segmentation may be implemented using any of the existing segmentation algorithms.
In one embodiment, the first component detection model is trained both to identify components (i.e., position prediction and category prediction) and to segment components. For example, a Mask R-CNN-based model that identifies components and segments components via two network branches after the base convolutional layers may be used as the first component detection model.
In another embodiment, the first component detection model comprises a first submodel for component identification and a second submodel for component segmentation. The first submodel outputs the component identification result, and the second submodel outputs the component segmentation result.
In the case where the first component detection model is also used for component segmentation, the obtained component detection information further includes the result of the component segmentation. The component segmentation result may be embodied as the contour or coverage region of the respective component.
As above, component detection information can be obtained by performing component detection on a certain image frame by the first component detection model; and carrying out damage detection on the image frame through the first damage detection model to obtain damage detection information. Then, an M-dimensional vector corresponding to the image frame may be formed based on the component detection information and the damage detection information.
For example, in one example, for each image frame extracted, a 60-dimensional vector is formed, in which elements of the first 30 dimensions represent part detection information and elements of the second 30 dimensions represent damage detection information.
According to one embodiment, in addition to target detection (including component detection and damage detection) on the image frames, feature analysis and extraction in other respects are also performed on the image frames, and the resulting features are included in the above-described M-dimensional vectors.
In one embodiment, for the extracted image frames, video continuity features of the extracted image frames are obtained, and the features can reflect changes among the image frames, further reflect the stability and continuity of the video, and can also be used for tracking targets in the image frames.
In one example, for a current image frame, optical flow variation characteristics of the image frame relative to a previous image frame may be acquired as continuity characteristics thereof. The optical flow variation can be calculated by using some existing optical flow models.
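As one concrete possibility, dense optical flow between adjacent frames can be computed with OpenCV's Farneback method and summarized into a few statistics; the choice of summary statistics here is an assumption.

```python
import cv2
import numpy as np

def optical_flow_feature(prev_frame: np.ndarray, cur_frame: np.ndarray) -> np.ndarray:
    """Dense optical flow between two frames, summarized into a small continuity feature."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)          # per-pixel motion magnitude
    return np.array([magnitude.mean(), magnitude.std(), magnitude.max()])
```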
In one example, for a current image frame, the image similarity between the image frame and the previous image frame may be obtained as its continuity feature. In a specific example, image similarity may be measured by the SSIM (Structural Similarity Index Measure) index. Specifically, the SSIM index between the current image frame and the previous image frame may be calculated based on the average gray value and gray variance of the pixels in the current image frame, the average gray value and gray variance of the pixels in the previous image frame, and the covariance of the pixels of the two frames. The maximum value of the SSIM index is 1, and a larger SSIM index indicates higher structural similarity between the images.
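The mean/variance/covariance formulation described above corresponds to the following global SSIM computation (the commonly used windowed SSIM averages this statistic over local patches):

```python
import numpy as np

def global_ssim(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Global SSIM over two equally sized grayscale frames."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2      # standard stabilizing constants
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```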
In one example, for a current image frame, the offsets of several feature points in the image frame relative to the previous image frame may be obtained, and the continuity feature may be determined based on these offsets. Specifically, a feature point of an image is a point with distinctive characteristics that effectively reflects the essential features of the image and can identify the target object in the image (for example, the upper-left corner of the left headlight). Feature points may be determined by methods such as SIFT (Scale-Invariant Feature Transform) and LBP (Local Binary Pattern). Thus, the change between two adjacent image frames can be evaluated according to the offsets of the feature points. Typically, the offsets of feature points can be described by a projection matrix. For example, assuming that the feature point set of the current image frame is Y and the feature point set of the previous image frame is X, a transformation matrix w may be solved such that the result of f(X) = Xw is as close to Y as possible, and the solved transformation matrix w may be used as the projection matrix from the previous image frame to the current image frame. Further, the projection matrix may be used as a continuity feature of the current image frame.
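A least-squares solution of the transformation matrix w can be sketched as follows; representing the fit in homogeneous coordinates (an affine transformation) is an assumption of this sketch.

```python
import numpy as np

def projection_matrix(prev_points: np.ndarray, cur_points: np.ndarray) -> np.ndarray:
    """Least-squares w with f(X) = Xw as close as possible to Y.

    prev_points / cur_points: matched feature point coordinates of shape (n, 2)."""
    n = prev_points.shape[0]
    X = np.hstack([prev_points, np.ones((n, 1))])   # (n, 3): homogeneous [x, y, 1]
    Y = cur_points                                  # (n, 2)
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)       # w has shape (3, 2)
    return w                                        # flattened, this can serve as a continuity feature
```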
It is understood that the various continuity features in the above examples may be used alone or in combination, and are not limited herein. In further embodiments, the change characteristics between image frames can also be determined as their continuity characteristics in further ways.
It should be noted that when the current image frame is the first image frame of the video stream, its continuity feature may be determined by comparing the image frame with itself as the previous frame, or the continuity feature may be directly set to a predetermined value, for example, each element of the projection matrix set to 1, or the optical flow output set to 0, and so on.
According to one embodiment, image frame quality features are also obtained for the extracted image frames; these features reflect the shooting quality of the image frames, i.e., their effectiveness for target recognition. Generally, the application in the mobile terminal may include a shooting guidance model for guiding the user's shooting, such as guidance on distance (move closer to or farther from the damaged vehicle), guidance on angle, and so on. In the course of providing shooting guidance, the shooting guidance model analyzes the image frames and generates the image frame quality features. The image frame quality features may include: a feature indicating whether the image frame is blurred, a feature indicating whether the image frame contains the target, a feature indicating whether the image frame is sufficiently illuminated, a feature indicating whether the image frame is shot at a predetermined angle, and the like. One or more of these features may also be included in the aforementioned M-dimensional vector of each image frame.
For example, in one specific example, for each extracted image frame, an 80-dimensional vector is formed, in which elements of 1-10 dimensions represent image frame quality features, elements of 11-20 dimensions represent video continuity features, elements of 21-50 dimensions represent part detection information, and elements of 51-80 dimensions represent damage detection information.
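For this 80-dimensional layout, packing the per-frame features could look like the following sketch; the padding and truncation to fixed sub-vector sizes are assumptions.

```python
import numpy as np

def frame_vector(quality, continuity, component_info, damage_info) -> np.ndarray:
    """Pack per-frame features into the 80-dimensional layout of the example:
    dims 1-10 quality, 11-20 continuity, 21-50 component detection, 51-80 damage detection."""
    def fit(x, size):
        x = np.asarray(x, dtype=np.float32).ravel()[:size]   # truncate if too long
        return np.pad(x, (0, size - x.size))                 # zero-pad if too short
    return np.concatenate([fit(quality, 10), fit(continuity, 10),
                           fit(component_info, 30), fit(damage_info, 30)])
```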
In this way, by performing object detection and feature extraction on N image frames extracted from the video stream, respectively, an M-dimensional vector is generated for each image frame, and thus N M-dimensional vectors are generated.
In one embodiment, the N M-dimensional vectors are arranged according to a time sequence of N image frames, thereby obtaining an N x M-dimensional matrix as a feature matrix of the video stream.
In one embodiment, N M-dimensional vectors corresponding to N image frames are also preprocessed as feature matrices of the video stream. The preprocessing may include a normalization operation to sort the feature matrix into fixed dimensions.
It will be appreciated that the dimensions of the feature matrix are typically predetermined, whereas image frames in a video stream are often extracted at intervals, and the length of the video stream varies and may not be known in advance. Therefore, directly combining the M-dimensional vectors of the actually extracted image frames does not always meet the dimension requirements of the feature matrix. In one embodiment, assume that the dimensions of the feature matrix are preset to S rows and M columns, i.e., S M-dimensional vectors. If the number of image frames extracted from the video stream is less than S, the M-dimensional vectors corresponding to the extracted image frames can be arranged into an S*M matrix through a padding operation, an interpolation operation, a pooling operation, or the like. In such a case, S > N, where N is the number of actually extracted image frames contained in the feature matrix. In a specific embodiment, interpolation may be used to supplement the N M-dimensional vectors with S-N additional M-dimensional vectors to obtain a feature matrix of S rows and M columns. If, on the other hand, the number of image frames actually extracted from the video stream is greater than S, some of the image frames may be discarded so that the feature matrix finally satisfies the predetermined dimensions. In a specific embodiment, the M-dimensional vectors corresponding to a portion of the image frames may be discarded randomly or at predetermined intervals to generate a feature matrix of S rows and M columns. In such a case, S = N, where N is the number of image frames finally retained in the feature matrix.
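One possible normalization consistent with the above is sketched below: linear interpolation along the time axis when fewer than S frames were extracted, and even subsampling when more than S were extracted; padding or pooling would serve the same purpose.

```python
import numpy as np

def normalize_feature_matrix(vectors: np.ndarray, s: int) -> np.ndarray:
    """Arrange N per-frame M-dimensional vectors (shape (N, M)) into a fixed S x M matrix."""
    n, m = vectors.shape
    if n == s:
        return vectors
    positions = np.linspace(0, n - 1, s)              # map S output slots onto the N time steps
    if n < s:
        out = np.empty((s, m))
        for j in range(m):
            out[:, j] = np.interp(positions, np.arange(n), vectors[:, j])
        return out
    return vectors[np.round(positions).astype(int)]   # N > S: keep evenly spaced frames
```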
As described above, through the preliminary processing of the image frames in the video stream, a feature matrix of the video stream is generated, the feature matrix including at least N M-dimensional vectors that correspond to the N image frames in the video stream, respectively, and are arranged in time sequence of the N image frames, each M-dimensional vector including at least, for the corresponding image frame, component detection information obtained by the pre-trained first component detection model, and damage detection information obtained by the pre-trained first damage detection model.
It is to be understood that in the above embodiment, the preliminary processing of the image frames and the generation of the feature matrix are performed by the mobile terminal. In such a case, the server only needs to receive the feature matrix from the mobile terminal in step 21. Such a method is suitable for a case where the mobile terminal has an application or a tool corresponding to the damage assessment identification and has a certain calculation capability. This approach is very advantageous for network transmission, since the amount of transmitted data of the feature matrix is much smaller than the video stream itself.
In another embodiment, after shooting the video stream, the mobile terminal transmits the video stream to the server, and the server processes each image frame to generate the feature matrix. In this case, as for the server, in step 21, a photographed video stream is acquired from the mobile terminal, and then image frames are extracted therefrom, and object detection and feature extraction are performed on the extracted image frames, generating M-dimensional vectors. Specifically, for each image frame, component detection may be performed through the first component detection model to obtain component detection information, and damage detection may be performed through the first damage detection model to obtain damage detection information; and forming an M-dimensional vector corresponding to each image frame based on at least the part detection information and the damage detection information. Then, the feature matrix is generated from the M-dimensional vectors of the N image frames, respectively. The above process is similar to the process executed in the mobile terminal, and is not repeated.
In addition to obtaining the feature matrix of the video stream, in step 22, the server obtains K key frames in the video stream, where K is greater than or equal to 1 and less than the number N of the image frames.
In one embodiment, K key frames in a video stream are determined by a mobile terminal and sent to a server. Thus, as far as the server is concerned, only the K key frames need to be received from the mobile terminal in step 22.
Alternatively, in another embodiment, at step 22, the server determines the key frames in the video stream.
Whether through the mobile terminal or the server, the key frames in the video stream can be determined by adopting various existing key frame determination modes.
For example, in one embodiment, the image frames with higher overall quality may be determined as the key frames according to the quality characteristics of the respective image frames; in another embodiment, an image frame with a larger change relative to the previous frame may be determined as the key frame according to the continuity characteristics of each image frame.
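A simple key frame selection combining both criteria could be sketched as follows; weighting the two criteria equally is an assumption, and either criterion can also be used alone as described above.

```python
import numpy as np

def select_key_frames(quality_scores, change_scores, k: int):
    """Choose K key frames by per-frame quality plus change relative to the previous frame."""
    score = np.asarray(quality_scores, dtype=float) + np.asarray(change_scores, dtype=float)
    return sorted(np.argsort(score)[-k:].tolist())    # indices of the K highest-scoring frames
```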
Depending on the manner of extracting the N image frames and the manner of determining the key frames, the determined key frames may be included in the N image frames or may be different from the N image frames.
Next, at step 23, the target detection is performed again on the images of the respective key frames acquired at step 22. Specifically, the key frame image may be subjected to component detection by a second component detection model trained in advance to obtain component detection information, and subjected to damage detection by a second damage detection model trained in advance to obtain damage detection information, and a key frame vector may be generated based on the component detection information and the damage detection information.
It is to be understood that the second component detection model herein may be different from the first component detection model employed for the preliminary processing of the image frames to generate the feature matrix. In general, the second component detection model is a more accurate and complex model than the first component detection model, so that the key frames in the video stream can be detected more accurately. In particular, in the case where the feature matrix is generated by the mobile terminal, the first component detection model is typically a model based on a lightweight network architecture, thus being suitable for the limited computational power and resources of the mobile terminal; the second component detection model can be a detection model which has higher requirements on computing capacity and is suitable for a server side, so that more complex operation is performed on image features, and a more accurate result is obtained. Similarly, the second impairment detection model may be more complex and accurate than the first impairment detection model, so as to perform more accurate impairment detection on the keyframes in the video stream.
In one embodiment, the second part detection model is also used for image segmentation of the part. In this case, the part detection information obtained based on the second part detection model further includes information for image segmentation of the part included in the key frame image, that is, contour information of the part.
In this way, the second component detection model and the second damage detection model detect the target again for the image of the key frame, and based on the obtained component detection information and damage detection information, a key frame vector can be formed. For K keyframes, K keyframe vectors may be formed.
In one specific example, the key frame vector is a 70-dimensional vector, where elements of 1-35 dimensions represent part detection information and elements of 36-70 dimensions represent damage detection information.
Next, in step 24, the component detection information and the damage detection information in the N M-dimensional vectors and the K key frame vectors are fused to obtain the comprehensive damage features of each component.
Specifically, the step may include: determining at least one candidate damaged component, including a first component; and for each vector in the N M-dimensional vectors and the K key frame vectors, performing intra-frame fusion on the component detection information and the damage detection information in a single vector to obtain the frame comprehensive characteristics of the first component, and performing inter-frame fusion on the frame comprehensive characteristics of the first component obtained for each vector to obtain the comprehensive damage characteristics of the first component.
With respect to the determination of the at least one candidate damaged component, in one embodiment, each component of the vehicle is considered a candidate damaged component. For example, assuming that the vehicle is divided into 100 components in advance, these 100 components can all be regarded as candidate damaged components. The advantage of this approach is that no component is missed; the disadvantage is that it involves more redundant computation and places a heavier burden on subsequent processing.
As described above, when generating both the feature matrix and the key frame vector of the video stream, component detection is performed on the image frame, and the obtained component detection information includes the component type predicted for the component. In one embodiment, the component category present in the component detection information is taken as the candidate damaged component.
More specifically, a first set of candidate components may be determined based on component detection information in the N M-dimensional vectors. It is to be understood that the component detection information of each M-dimensional vector may include several component detection boxes and corresponding predicted component categories, and a union of the predicted component categories in the N M-dimensional vectors may be used as the first candidate component set.
Similarly, a second candidate component set, i.e., the union of the predicted component categories in the respective key frame vectors, may be determined based on the component detection information in the K key frame vectors. The union of the first candidate component set and the second candidate component set is then taken as the set of candidate damaged components. In other words, if a component category appears in the feature matrix of the video stream or in a key frame vector, it means that a component of that category has been detected in the N image frames of the video stream or in the key frames, and a component of that category can therefore be used as a candidate damaged component.
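The union described above amounts to the following sketch:

```python
def candidate_damaged_components(frame_categories, key_frame_categories):
    """Union of the component categories predicted in the N M-dimensional vectors
    (first candidate set) and in the K key-frame vectors (second candidate set)."""
    first_set = set().union(*frame_categories) if frame_categories else set()
    second_set = set().union(*key_frame_categories) if key_frame_categories else set()
    return first_set | second_set

# e.g. candidate_damaged_components([[101, 102], [102, 103]], [[102]]) -> {101, 102, 103}
```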
In the above manner, a plurality of alternative damaged components may be obtained. The following processing manner will be described by taking an arbitrary one of the components (referred to as a first component for simplicity) as an example.
For any first component, such as the rear right door shown in FIG. 3a, in step 24, the feature matrix of the video stream and the key frame vectors are fused to obtain the comprehensive damage feature of the first component. To obtain the comprehensive damage feature, the fusion operation may include intra-frame fusion (or first fusion) and inter-frame fusion (or second fusion). In intra-frame fusion, the component detection information and the damage detection information in the vector corresponding to each frame are fused to obtain the frame comprehensive feature of the first component in that frame; then, combining the time sequence information, the frame comprehensive features of the first component corresponding to the respective frames are fused through inter-frame fusion to obtain the comprehensive damage feature of the first component. The procedures of intra-frame fusion and inter-frame fusion are described below.
Intra-frame fusion aims to obtain a component-level damage feature, also called a frame comprehensive feature, of the first component in a certain frame. In one embodiment, the frame comprehensive feature of the first component may include the component feature of the first component in the frame and the damage feature associated with the first component. For example, a vector corresponding to the component feature of the first component and a vector corresponding to the damage feature related to the first component may be spliced, and the resulting spliced vector may be used as the frame comprehensive feature of the first component.
In one embodiment, the N image frames include a first image frame corresponding to a first M-dimensional vector; the obtaining of the frame synthesis characteristic of the first component includes: extracting first detection information related to the first part from the part detection information in the first M-dimensional vector; determining a component feature of the first component in the first image frame as a part of a frame integration feature of the first component corresponding to the first image frame based on the first detection information.
In a specific embodiment, the N image frames further include a second image frame following the first image frame, corresponding to a second M-dimensional vector; each M-dimensional vector further comprises video continuity features; the obtaining of the frame synthesis characteristic of the first component further includes: extracting video continuity features from the second M-dimensional vector; determining second detection information of the first component in the second image frame based on the first detection information and the video continuity characteristic; determining a component feature of the first component in the second image frame based on the second detection information as part of a frame integration feature of the first component corresponding to the second image frame.
More specifically, in one example, the video continuity features include at least one of: an optical flow change feature between image frames, a similarity feature between image frames, a transformation feature determined based on a projection matrix between image frames.
On the other hand, in one embodiment, the N image frames include a first image frame corresponding to a first M-dimensional vector; the damage detection information in the first M-dimensional vector includes information of a plurality of damage detection frames framing a plurality of damaged objects from the first image frame; the obtaining of the frame synthesis characteristic of the first component includes: determining at least one damage detection frame belonging to the first component according to the component detection information in the first M-dimensional vector and the information of the plurality of damage detection frames; acquiring damage characteristics of the at least one damage detection frame; and carrying out first fusion operation on the damage features of the at least one damage detection frame to obtain the damage features related to the first component, wherein the damage features are used as a part of the frame comprehensive features of the first component corresponding to the first image frame.
In a specific embodiment, the component detection information in the first M-dimensional vector includes information of a plurality of component detection frames for framing a plurality of components from the first image frame; said determining at least one damage detection box belonging to the first component comprises: determining a first part detection frame corresponding to the first part; and determining at least one damage detection frame belonging to the first component according to the position relation between the plurality of damage detection frames and the first component detection frame.
In another specific embodiment, the part detection information in the first M-dimensional vector includes part segmentation information; said determining at least one damage detection box belonging to the first component comprises: determining a first area covered by the first component according to the component segmentation information; determining whether the damage detection frames fall into the first area or not according to the position information of the damage detection frames; determining a damage detection frame falling into the first region as the at least one damage detection frame.
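As an illustration of the positional criterion, the following sketch keeps the damage detection boxes that sufficiently overlap the component detection box; the overlap threshold is an assumption, and the mask-based variant of the preceding paragraph would test containment against the segmented component region instead.

```python
def damages_on_component(component_box, damage_boxes, min_overlap: float = 0.5):
    """Keep damage boxes whose intersection with the component box covers at least
    `min_overlap` of the damage box area. Boxes are (x, y, w, h) tuples."""
    cx, cy, cw, ch = component_box
    kept = []
    for (dx, dy, dw, dh) in damage_boxes:
        ix = max(0.0, min(cx + cw, dx + dw) - max(cx, dx))   # intersection width
        iy = max(0.0, min(cy + ch, dy + dh) - max(cy, dy))   # intersection height
        if dw * dh > 0 and (ix * iy) / (dw * dh) >= min_overlap:
            kept.append((dx, dy, dw, dh))
    return kept
```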
In a particular embodiment, the at least one impairment detection block comprises a first impairment detection block; the obtaining of the damage characteristic of the at least one damage detection frame includes obtaining a first damage characteristic corresponding to the first damage detection frame, where the first damage characteristic includes a picture convolution characteristic related to the first damage detection frame.
More specifically, in an example, the damage detection information further includes a predicted damage category corresponding to each of the plurality of damage detection frames; the obtaining of the first damage characteristic corresponding to the first damage detection frame further includes determining, according to an association relationship between the first damage detection frame and other damage detection frames in the plurality of damage detection frames, a first association characteristic as a part of the first damage characteristic, where the association relationship at least includes one or more of the following: and predicting the damage category association relationship and the frame content association relationship reflected by the picture convolution characteristics.
In a particular embodiment, the first fusion operation includes one or more of: taking a maximum, taking a minimum, averaging, summing, and taking a median.
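Element-wise pooling over the damage features of the detection boxes belonging to one component can be sketched as follows; the particular max/mean combination shown is only one illustrative choice among the operations listed above.

```python
import numpy as np

def first_fusion(damage_features: np.ndarray, ops=("max", "mean")) -> np.ndarray:
    """Fuse the damage features of one component's damage boxes by element-wise pooling.

    damage_features: array of shape (num_boxes, feature_dim)."""
    pool = {"max": np.max, "min": np.min, "mean": np.mean,
            "sum": np.sum, "median": np.median}
    return np.concatenate([pool[op](damage_features, axis=0) for op in ops])
```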
In the above, the component feature and the damage feature of the first component can be acquired separately for the first image frame. On this basis, the acquired component feature and damage feature of the first component are spliced or combined together, so that the frame comprehensive feature of the first component in the first image frame is obtained; that is, intra-frame fusion with respect to the first component is performed on the first image frame. The obtained frame comprehensive feature of the first component is the component-level damage feature of the first component in the first image frame.
Similarly, for each of the N image frames, the intra-frame fusion may be performed based on the corresponding M-dimensional vector, so as to obtain a frame comprehensive feature of the first component for that frame, denoted as a first vector. In this way, N first vectors may be obtained for the N image frames, corresponding respectively to the frame comprehensive features of the first component obtained for the N image frames.
For each of the K key frames, the intra-frame fusion may likewise be performed based on the component detection information and the damage detection information in the key frame vector, so as to obtain a frame comprehensive feature of the first component for that key frame, denoted as a second vector. In this way, K second vectors may be obtained for the K key frames, corresponding respectively to the frame comprehensive features of the first component obtained for the K key frames.
Since the dimensions of the key frame vector may be different from the aforementioned M-dimensional vector, the dimensions of the first vector and the second vector may also be different, but the concept and process of intra-frame fusion are similar.
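As a minimal illustration of this intra-frame fusion, the following Python sketch fuses one component's feature with the damage features of its attributed damage detection frames; the use of max/mean pooling as the first fusion operation and plain concatenation as the splicing step are assumptions for illustration, not the mandated implementation.

```python
import numpy as np

def intra_frame_fusion(component_feature, damage_features):
    """Fuse a component feature with the damage features of the damage detection
    frames attributed to that component in one image frame, yielding the frame
    comprehensive feature (the "first vector" or "second vector" of that frame).

    component_feature: (C,) vector for the first component in this frame
    damage_features:   (n, D) matrix, one row per damage detection frame (n >= 1)
    """
    damage_features = np.asarray(damage_features, dtype=float)
    # First fusion operation over the damage detection frames: here a max-pool
    # and a mean-pool over the n frames, concatenated.
    pooled = np.concatenate([damage_features.max(axis=0),
                             damage_features.mean(axis=0)])
    # Splice component feature and pooled damage feature into the frame
    # comprehensive feature of this component for this frame.
    return np.concatenate([np.asarray(component_feature, dtype=float), pooled])
```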
Then, inter-frame fusion is performed on the frame comprehensive features of the first component obtained above for each frame (including N image frames and K key frames), so as to obtain a comprehensive feature vector of the first component as the comprehensive damage features.
In one embodiment, the N first vectors are first subjected to a first combination to obtain a first combined vector, where the N first vectors correspond respectively to the frame comprehensive features of the first component obtained for the N image frames; the K second vectors are subjected to a second combination to obtain a second combined vector, where the K second vectors correspond respectively to the frame comprehensive features of the first component obtained for the K key frames; and the first combined vector and the second combined vector are synthesized to obtain the comprehensive feature vector.
In a specific embodiment, the first combination of the N first vectors includes: splicing the N first vectors according to the time order of the corresponding N image frames.
In another specific embodiment, the first combination of the N first vectors includes: determining weight factors of the N first vectors; and performing a weighted combination of the N first vectors according to the weight factors.
Further, in one example, the determining of the weight factors of the N first vectors includes: for each of the N image frames, determining the temporally closest key frame among the K key frames; and determining the weight factor of the first vector corresponding to that image frame according to the temporal distance between the image frame and its nearest key frame, such that the temporal distance is negatively correlated with the weight factor.
In another example, each M-dimensional vector further includes an image frame quality feature; the determining of the weight factors of the N first vectors then includes: for each of the N image frames, determining the weight factor of the first vector corresponding to that image frame according to the image frame quality feature in the M-dimensional vector corresponding to that image frame.
In a more specific example, the image frame quality characteristic includes at least one of: a feature indicating whether or not the image frame is blurred, a feature indicating whether or not the image frame includes a target, a feature indicating whether or not the image frame is sufficiently illuminated, and a feature indicating whether or not the image frame is captured at a predetermined angle.
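A hedged sketch of the inter-frame fusion described above is shown below; the reciprocal form of the weight factor, the weighted averaging of the first vectors and the simple concatenation of the second vectors are illustrative choices only, not the prescribed combination.

```python
import numpy as np

def weight_from_keyframes(frame_times, keyframe_times, alpha=1.0):
    """Weight factor per image frame, negatively correlated with the temporal
    distance to its nearest key frame (one possible choice among many)."""
    frame_times = np.asarray(frame_times, dtype=float)
    keyframe_times = np.asarray(keyframe_times, dtype=float)
    dists = np.abs(frame_times[:, None] - keyframe_times[None, :]).min(axis=1)
    return 1.0 / (1.0 + alpha * dists)

def inter_frame_fusion(first_vectors, second_vectors, frame_times, keyframe_times):
    """Combine the N first vectors (weighted average) and the K second vectors
    (spliced in time order) into the comprehensive feature vector of the component."""
    w = weight_from_keyframes(frame_times, keyframe_times)
    first_combined = (w[:, None] * np.asarray(first_vectors, dtype=float)).sum(axis=0) / w.sum()
    second_combined = np.concatenate([np.asarray(v, dtype=float) for v in second_vectors])
    return np.concatenate([first_combined, second_combined])
```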
In this way, the comprehensive damage characteristic of the first component is obtained by fusing the N M-dimensional vectors in the characteristic matrix of the video stream and the K key frame vectors, so that the overall damage characteristic of the first component in the N image frames and the K key frames of the video stream can be fully reflected.
In step 25, a preliminary damage result is obtained, where the preliminary damage result includes the damage result of each component obtained by inputting the feature matrix into a pre-trained convolutional neural network.
In one embodiment, the preliminary damage result is determined by the mobile terminal and sent to the server. In that case, as far as the server is concerned, in step 25 it only needs to receive the preliminary damage result from the mobile terminal. In another embodiment, the preliminary damage result for the video stream is determined by the server in step 25.
Specifically, the preliminary damage result, whether determined by the mobile terminal or the server, may be obtained based on a pre-trained convolutional neural network.
It will be appreciated that when a convolutional neural network processes an image, its input is typically laid out as "batch size (batch_size) × length × width × channel number". A color image usually has the three channels "R", "G" and "B", i.e. the channel number is 3. In this layout, positions along the length and width are independent of each other, whereas the channels influence one another. The two-dimensional convolution exploits this: features at different spatial positions are treated as independent, so the operation is spatially invariant. In image processing the convolution is therefore performed over the "length × width" dimensions. If "length × width" were simply replaced by the rows and columns of the feature matrix, however, the convolution would slide across the feature dimension, where features at different positions influence one another rather than being independent, which is not reasonable. For example, extracting a detail damage map requires features from several dimensions, such as the detail-image classification and the damage detection result, to act jointly. In other words, spatial invariance holds in the time dimension but not in the feature dimension, so the feature dimension here plays the role that the channel dimension plays in image processing. The input layout of the feature matrix can therefore be adjusted, for example, to "batch size (batch_size) × 1 × column number (e.g., S or N) × row number (M)". The convolution can then be performed over the "1 × column number (e.g., S)" dimensions, where each column is the feature set of one moment, and by convolving over the time dimension the associations between features can be mined.
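Purely as an illustration of this layout adjustment (PyTorch is assumed here; the tensor sizes are example values, and no specific framework is prescribed above):

```python
import torch
import torch.nn as nn

batch_size, M, S = 8, 64, 30                  # M feature rows, S time-step columns
feature_matrix = torch.randn(batch_size, M, S)

# Rearrange to "batch_size x 1 x column number (time) x row number (features)"
# so that the kernel slides only along the time dimension within each feature row.
x = feature_matrix.permute(0, 2, 1).unsqueeze(1)      # (batch, 1, S, M)

# A convolution whose kernel spans 4 time steps but a single feature row,
# in the spirit of the (1, -1, -1, 1) kernel described for FIG. 4.
temporal_conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=(4, 1))
out = temporal_conv(x)                                # (batch, 16, S - 3, M)
```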
In one embodiment, the convolutional neural network may include one or more convolution processing layers and an output layer. Each convolution processing layer may be composed of a two-dimensional convolution layer, an activation layer and a normalization layer, for example 2D convolution filter + ReLU + Batch Normalization.
The two-dimensional convolution layer may be used to convolve the feature matrix with convolution kernels corresponding to the time dimension. In a specific embodiment, the feature matrix includes M rows and S columns, where S is not less than N, the convolutional neural network includes a plurality of one-dimensional convolution kernels, and inputting the feature matrix into the pre-trained convolutional neural network includes: performing convolution processing on the feature matrix in the row dimension of the feature matrix using the plurality of one-dimensional convolution kernels. In one example, an M × S feature matrix as shown in FIG. 4 may be convolved along the time dimension with a convolution kernel such as (1, -1, -1, 1). During training of the convolutional neural network, convolution kernels can be trained in a targeted manner, for example one convolution kernel per feature; the kernel (1, -1, -1, 1) shown in FIG. 4 is, for instance, a kernel corresponding to a component damage feature in the vehicle damage detection scene, and so on. Thus, through the convolution operation of each kernel, a corresponding feature (e.g., a component damage feature in the vehicle damage detection scene) can be identified.
The activation layer may be used to non-linearly map the output of the two-dimensional convolution layer. The activation layer may be implemented by an excitation function such as Sigmoid, Tanh (hyperbolic tangent) or ReLU, which maps the output of the two-dimensional convolution layer to a non-linear value (for example, a value between 0 and 1 in the case of Sigmoid).
As the network deepens, the output after the activation layer may drift into a gradient saturation region (a region where the gradient of the excitation function changes little). The convolutional neural network may then converge slowly, or fail to converge, because the gradients shrink or vanish. Therefore, a normalization layer (Batch Normalization) can be used to pull the output of the activation layer back into a region where the gradient of the excitation function changes significantly.
The output layer is used to output the preliminary damage result for the video stream.
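The following sketch assembles such a network in the spirit of the description above; PyTorch, the number of convolution processing layers, the channel counts and the pooling before the output layer are assumptions of the sketch rather than disclosed details.

```python
import torch
import torch.nn as nn

class DamageCNN(nn.Module):
    """Sketch of the described network: convolution processing layers
    (2D convolution + ReLU + batch normalization) followed by an output layer."""

    def __init__(self, num_components, hidden_channels=16):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(1, hidden_channels, kernel_size=(4, 1)),   # temporal kernel
            nn.ReLU(),
            nn.BatchNorm2d(hidden_channels),
            nn.Conv2d(hidden_channels, hidden_channels, kernel_size=(4, 1)),
            nn.ReLU(),
            nn.BatchNorm2d(hidden_channels),
        )
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        # Output layer: one damage score per component (illustrative choice).
        self.output = nn.Linear(hidden_channels, num_components)

    def forward(self, feature_matrix):                   # (batch, M, S), S >= 7
        x = feature_matrix.permute(0, 2, 1).unsqueeze(1)  # (batch, 1, S, M)
        x = self.conv_block(x)
        x = self.pool(x).flatten(1)                       # (batch, hidden_channels)
        return self.output(x)                             # preliminary damage result
```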
In this way, the preliminary damage result can be obtained by means of the convolutional neural network. As for the training of the convolutional neural network, video streams annotated with damage recognition results can be used as training samples.
In a specific embodiment, a plurality of training samples may be obtained first, where each training sample includes the sample feature matrix of a video stream and a corresponding damage result label, and the sample feature matrix of each video stream includes at least N M-dimensional vectors that correspond respectively to N image frames in that video stream and are arranged according to the time sequence of the N image frames. It should be understood that the sample feature matrix of each video stream can be obtained based on the method described in step 21. In addition, the damage result label comprises one or more of: damaged material, damage type, and damaged component type.
The convolutional neural network is then trained using the plurality of training samples. As known to those skilled in the art, based on the sample feature matrix and the damage result label of each sample, the model parameters can be adjusted by, for example, gradient descent. The loss function used during training is, for example, the sum of squared differences between the predicted values and the label values over the plurality of samples, or the sum of the absolute values of those differences, and the like.
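A toy training loop consistent with the above might look as follows; the optimizer, the learning rate and the use of a sum-of-squared-differences loss as written here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_damage_cnn(model, samples, labels, epochs=10, lr=1e-3):
    """Train a network such as the DamageCNN sketch above.

    samples: tensor of sample feature matrices, shape (batch, M, S)
    labels:  float tensor of damage result labels, same shape as model output
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss(reduction="sum")     # sum of squared differences
    for _ in range(epochs):
        optimizer.zero_grad()
        predictions = model(samples)
        loss = loss_fn(predictions, labels)
        loss.backward()                        # gradient descent step
        optimizer.step()
    return model
```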
It should be noted that the trained convolutional neural network used to obtain the preliminary damage result may be further adapted for key frame extraction, and in particular for extracting the K key frames in the foregoing step 22. Specifically, the trained convolutional neural network may be taken, the parameters of all layers except the output layer may be fixed, and the network may be further trained with video streams annotated with key frames as training samples, adjusting only the parameters of the output layer, so as to obtain a modified convolutional neural network for extracting key frames.
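A minimal sketch of such an adaptation is given below; the attribute name output follows the network sketch above, and replacing (rather than merely unfreezing) the output layer is an assumption of the sketch.

```python
import torch.nn as nn

def adapt_for_keyframe_extraction(trained_model, num_outputs):
    """Freeze all parameters except the output layer, then swap in a fresh
    output layer to be retrained on video streams annotated with key frames."""
    for param in trained_model.parameters():
        param.requires_grad = False
    in_features = trained_model.output.in_features
    trained_model.output = nn.Linear(in_features, num_outputs)  # trainable again
    return trained_model
```

The adapted network can then be retrained with the same kind of loop as above, with only the output-layer parameters receiving gradient updates.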
Next, in step 26, the comprehensive damage features of the components and the preliminary damage result are input into a pre-trained decision tree model to obtain a final damage assessment result for the video stream.
In an embodiment, for a certain component among the at least one candidate damaged component, the damage result of that component may be extracted from the preliminary damage result obtained in step 25, and the vector corresponding to that damage result may be spliced with the vector corresponding to the comprehensive damage feature of the component to obtain a spliced vector for the component. Further, in a specific embodiment, the spliced vector of the component may be input into a pre-trained decision tree model to obtain the final damage assessment result for that component. By performing this operation for each component, the final damage assessment result of each component is obtained, which together constitute the final damage assessment result for the video stream.
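As an illustration only, the spliced vector of one component and its use with a pre-trained decision tree model might look as follows; a scikit-learn-style predict() interface is assumed here and is not prescribed above.

```python
import numpy as np

def component_splice_vector(comprehensive_feature, preliminary_result):
    """Splice a component's comprehensive damage feature with the vector form
    of its preliminary damage result extracted in step 25."""
    return np.concatenate([np.asarray(comprehensive_feature, dtype=float),
                           np.asarray(preliminary_result, dtype=float)])

def assess_components(decision_tree_model, features_by_component, results_by_component):
    """Feed one spliced vector per candidate damaged component into a
    pre-trained decision tree model (assumed to expose predict())."""
    rows = [component_splice_vector(f, r)
            for f, r in zip(features_by_component, results_by_component)]
    return decision_tree_model.predict(np.stack(rows))
```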
In one embodiment, various specific decision tree algorithms, such as the gradient boosting decision tree (GBDT) or the classification and regression tree (CRT), may be used as the decision tree model in this step.
In a specific embodiment, the decision tree model is a two-class model obtained by pre-training with two-class labelled samples, in which an annotator marks, for a given sample video stream, whether each component is damaged (damaged is one class, undamaged the other). The pre-training process may include obtaining, through the aforementioned steps 21 to 25, the comprehensive damage feature and the preliminary damage result of a component in the given sample video stream, and then using the decision tree model to predict, based on these, whether the component is damaged. The prediction is then compared with the annotation of that component, and the model parameters of the decision tree model are adjusted according to the comparison so that the prediction tends to fit the annotation. Through such training, the two-class decision tree model is obtained.
In such a case, after the comprehensive damage feature and the preliminary damage result of a component to be analysed are input into the two-class decision tree model in step 26, the model outputs one of the two classes, i.e. whether the component is damaged, from which the damage status of that component is obtained.
In another embodiment, the decision tree model is a multi-class model obtained by pre-training with multi-class labelled samples, in which an annotator marks the damage category of each component in a given sample video stream (for example, several damage categories such as scratch, deformation and fragmentation). The pre-training process may include obtaining, through the aforementioned steps 21 to 25, the comprehensive damage feature and the preliminary damage result of a component in the given sample video stream, and then using the decision tree model to predict the damage category of the component based on these. The prediction is then compared with the annotation of that component, and the model parameters of the decision tree model are adjusted according to the comparison so that the prediction tends to fit the annotation. Through such training, the multi-class decision tree model is obtained.
In such a case, in step 26, after the comprehensive damage feature and the preliminary damage result of a component to be analysed are input into the multi-class decision tree model, the model outputs a predicted damage category, from which the damage type of that component is obtained as the final damage assessment result.
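A hedged sketch of training such a multi-class model with a GBDT implementation is shown below; the use of scikit-learn and the hyper-parameter values are assumptions of the sketch, not details of the disclosure.

```python
from sklearn.ensemble import GradientBoostingClassifier

def train_multiclass_damage_tree(splice_vectors, damage_category_labels):
    """Train a multi-class GBDT on spliced vectors (comprehensive damage feature
    + preliminary damage result) against annotated damage categories such as
    scratch, deformation or fragmentation."""
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(splice_vectors, damage_category_labels)
    return model
```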
It is to be understood that the component described above is any one of the candidate damaged components. The above process may be performed for each candidate damaged component: its comprehensive damage feature is obtained in step 24, its preliminary damage result in step 25, and its damage status, based on both, in step 26. In this way, the damage status of each candidate damaged component is obtained, and together they constitute the final damage assessment result as a whole.
In a specific example, after each candidate damaged component is predicted with the multi-class decision tree model, the following damage assessment result may be obtained: right rear door: scratch; rear bumper: deformation; tail light: cracking.
In one embodiment, such a final damage assessment result is transmitted back to the mobile terminal.
On the basis of the final damage assessment result, which includes the damage status of each component, in one embodiment a repair scheme for each component can further be determined according to the final damage assessment result.
It is understood that, according to the requirements of damage assessment, a mapping table can be preset by a worker, recording the repair schemes of various types of components under various damage categories. For example, for a metal component, when the damage category is scratch the corresponding repair scheme is paint spraying, and when the damage category is deformation the corresponding repair scheme is sheet metal work; for a glass component, when the damage category is scratch the corresponding repair scheme is glass replacement; and so on.
Thus, for the first component "right rear door" exemplified above, assuming the damage category is determined to be scratch, the type of the component is first determined according to the component category "right rear door", for example a metal component, and then, according to the damage category "scratch", the corresponding repair scheme is determined to be: paint spraying.
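Purely as an illustration of such a mapping table (the entries, the component-to-type mapping and the fallback value below are example assumptions, not a prescribed table):

```python
# Illustrative mapping table: (component type, damage category) -> repair scheme.
REPAIR_SCHEMES = {
    ("metal", "scratch"): "paint spraying",
    ("metal", "deformation"): "sheet metal work",
    ("glass", "scratch"): "replace glass",
}

# Assumed component-to-type mapping for the example components above.
COMPONENT_TYPES = {
    "right rear door": "metal",
    "rear bumper": "metal",
    "tail light": "glass",
}

def repair_scheme(component_name, damage_category):
    """Look up the repair scheme for a component and its assessed damage category."""
    component_type = COMPONENT_TYPES.get(component_name)
    return REPAIR_SCHEMES.get((component_type, damage_category), "manual review")
```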
Therefore, a repair scheme can be determined for each damaged component, and the damage assessment result and the repair scheme can be transmitted to the mobile terminal together, realizing a more comprehensive intelligent damage assessment.
According to another aspect, an apparatus for vehicle damage assessment is provided, which may be deployed in a server, and the server may be implemented by any device, platform or device cluster having computing and processing capabilities. Fig. 5 shows a schematic block diagram of a vehicle damage assessment device according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
a first obtaining unit 510 configured to obtain a feature matrix of a video stream, the video stream being captured for a damaged vehicle, the feature matrix including at least N M-dimensional vectors that correspond to N image frames in the video stream, respectively, and are arranged in time series of the N image frames, each M-dimensional vector including at least, for the corresponding image frame, component detection information obtained by a pre-trained first component detection model, and damage detection information obtained by the pre-trained first damage detection model.
A second obtaining unit 520 configured to obtain K key frames in the video stream.
A generating unit 530 configured to generate, for the K keyframes, corresponding K keyframe vectors, each keyframe vector including, for a corresponding keyframe image, component detection information obtained by a pre-trained second component detection model and damage detection information obtained by a pre-trained second damage detection model.
And a fusion unit 540 configured to fuse the component detection information and the damage detection information in the N M-dimensional vectors and the K keyframe vectors to obtain a comprehensive damage characteristic of each component.
A third obtaining unit 550 is configured to obtain a preliminary damage result, where the preliminary damage result includes a damage result of each component obtained after the feature matrix is input to a pre-trained convolutional neural network.
And a loss assessment unit 560 configured to input the comprehensive damage features of the components and the preliminary damage result into a pre-trained decision tree model to obtain a final loss assessment result for the video stream.
In one embodiment, the first obtaining unit 510 is specifically configured to receive the feature matrix from a mobile terminal.
In an embodiment, the first obtaining unit 510 is specifically configured to: acquiring the video stream; for each image frame in the N image frames, carrying out component detection through the first component detection model to obtain component detection information, and carrying out damage detection through the first damage detection model to obtain damage detection information; forming an M-dimensional vector corresponding to each image frame based on at least the part detection information and the damage detection information; and generating the characteristic matrix according to the respective M-dimensional vectors of the N image frames.
In an embodiment, the second obtaining unit 520 is specifically configured to: the K key frames are received from the mobile terminal.
In one embodiment, the second component detection model is different from the first component detection model, and the second damage detection model is different from the first damage detection model.
In one embodiment, the fusion unit 540 is specifically configured to: determining at least one candidate damaged component, including a first component; for each vector in the N M-dimensional vectors and the K key frame vectors, performing intra-frame fusion on the component detection information and the damage detection information in a single vector to obtain the frame comprehensive characteristics of the first component, and performing inter-frame fusion on the frame comprehensive characteristics of the first component obtained for each vector to obtain the comprehensive damage characteristics of the first component.
In an embodiment, the third obtaining unit 550 is specifically configured to receive the preliminary damage result from the mobile terminal.
In one embodiment, the feature matrix includes M rows and S columns, where S is not less than N, the convolutional neural network includes a plurality of one-dimensional convolution kernels, and the inputting of the feature matrix into the pre-trained convolutional neural network includes: performing convolution processing on the feature matrix in the row dimension of the feature matrix using the plurality of one-dimensional convolution kernels.
In one embodiment, the convolutional neural network is obtained by pre-training a training unit, and the training unit is specifically configured to: obtaining a plurality of training samples, wherein each training sample comprises a sample feature matrix of each video stream and a corresponding damage result label, and the sample feature matrix of each video stream at least comprises N M-dimensional vectors which respectively correspond to N image frames in each video stream and are arranged according to the time sequence of the N image frames; training the convolutional neural network using the plurality of training samples.
In one embodiment, the damage result label includes at least one of: damaged material, damage type, and damaged component type.
In one embodiment, the apparatus further comprises: and the determining unit 570 is configured to determine a corresponding repair scheme according to the final damage assessment result.
By means of the above method and apparatus, intelligent damage assessment is performed on the basis of a video stream captured of the damaged vehicle.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the foregoing describes only specific embodiments of the present invention and is not intended to limit its scope; any modification, equivalent substitution, improvement or the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (24)

1. A computer-implemented vehicle damage assessment method, comprising:
acquiring a feature matrix of a video stream, wherein the video stream is shot for a damaged vehicle, the feature matrix at least comprises N M-dimensional vectors which respectively correspond to N image frames in the video stream and are arranged according to a time sequence of the N image frames, and each M-dimensional vector at least comprises part detection information obtained by a pre-trained first part detection model and damage detection information obtained by the pre-trained first damage detection model for the corresponding image frame;
acquiring K key frames in the video stream;
generating corresponding K key frame vectors aiming at the K key frames, wherein each key frame vector comprises component detection information obtained by a pre-trained second component detection model aiming at a corresponding key frame image and damage detection information obtained by the pre-trained second damage detection model;
fusing the component detection information and the damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain the comprehensive damage characteristics of each component;
acquiring a preliminary damage result, wherein the preliminary damage result comprises the damage result of each component obtained after the characteristic matrix is input into a pre-trained convolutional neural network;
and inputting the comprehensive damage characteristics of each part and the preliminary damage result into a pre-trained decision tree model to obtain a final damage assessment result aiming at the video stream.
2. The method of claim 1, wherein obtaining a feature matrix for a video stream comprises receiving the feature matrix from a mobile terminal.
3. The method of claim 1, wherein obtaining a feature matrix for a video stream comprises:
acquiring the video stream;
for each image frame in the N image frames, carrying out component detection through the first component detection model to obtain component detection information, and carrying out damage detection through the first damage detection model to obtain damage detection information;
forming an M-dimensional vector corresponding to each image frame based on at least the part detection information and the damage detection information;
and generating the feature matrix according to the respective M-dimensional vectors of the N image frames.
4. The method of claim 1, wherein obtaining K key frames in the video stream comprises: and receiving the K key frames from the mobile terminal.
5. The method of claim 1, wherein the second component detection model is different from the first component detection model, and the second damage detection model is different from the first damage detection model.
6. The method of claim 1, wherein fusing the component detection information and the damage detection information in the N M-dimensional vectors and the K keyframe vectors to obtain a composite damage signature of each component comprises:
determining at least one candidate damaged component, including a first component;
and for each vector in the N M-dimensional vectors and the K key frame vectors, performing intra-frame fusion on the component detection information and the damage detection information in a single vector to obtain the frame comprehensive characteristics of the first component, and performing inter-frame fusion on the frame comprehensive characteristics of the first component obtained for each vector to obtain the comprehensive damage characteristics of the first component.
7. The method of claim 1, wherein the obtaining a preliminary damage result comprises receiving the preliminary damage result from a mobile terminal.
8. The method of claim 1, wherein the feature matrix comprises M rows and S columns, wherein S is not less than N, the convolutional neural network comprises a plurality of one-dimensional convolution kernels, and the inputting the feature matrix into a pre-trained convolutional neural network comprises:
and performing convolution processing on the feature matrix on the row dimension of the feature matrix by using the plurality of one-dimensional convolution kernels.
9. The method of claim 1, wherein the convolutional neural network is trained by:
obtaining a plurality of training samples, wherein each training sample comprises a sample feature matrix of each video stream and a corresponding damage result label, and the sample feature matrix of each video stream at least comprises N M-dimensional vectors which respectively correspond to N image frames in each video stream and are arranged according to the time sequence of the N image frames;
training the convolutional neural network using the plurality of training samples.
10. The method of claim 9, wherein the damage result label comprises at least one of: damaged material, damage type, and damaged component type.
11. The method of claim 1, wherein after said obtaining a final impairment result for the video stream, the method further comprises:
and determining a corresponding replacement scheme according to the final damage assessment result.
12. A computer-implemented vehicle damage assessment apparatus, comprising:
a first obtaining unit configured to obtain a feature matrix of a video stream, the video stream being captured for a damaged vehicle, the feature matrix including at least N M-dimensional vectors that correspond to N image frames in the video stream, respectively, and are arranged in a time sequence of the N image frames, each M-dimensional vector including at least, for the corresponding image frame, component detection information obtained by a pre-trained first component detection model, and damage detection information obtained by the pre-trained first damage detection model;
a second obtaining unit configured to obtain K key frames in the video stream;
a generating unit configured to generate, for the K keyframes, corresponding K keyframe vectors, each keyframe vector including, for a corresponding keyframe image, component detection information obtained by a pre-trained second component detection model, and damage detection information obtained by a pre-trained second damage detection model;
the fusion unit is configured to fuse component detection information and damage detection information in the N M-dimensional vectors and the K key frame vectors to obtain comprehensive damage characteristics of each component;
a third obtaining unit, configured to obtain a preliminary damage result, where the preliminary damage result includes a damage result of each component obtained after inputting the feature matrix into a pre-trained convolutional neural network;
and a loss assessment unit configured to input the comprehensive damage features of the components and the preliminary damage result into a pre-trained decision tree model to obtain a final loss assessment result for the video stream.
13. The apparatus according to claim 12, wherein the first obtaining unit is specifically configured to receive the feature matrix from a mobile terminal.
14. The apparatus according to claim 12, wherein the first obtaining unit is specifically configured to:
acquiring the video stream;
for each image frame in the N image frames, carrying out component detection through the first component detection model to obtain component detection information, and carrying out damage detection through the first damage detection model to obtain damage detection information;
forming an M-dimensional vector corresponding to each image frame based on at least the part detection information and the damage detection information;
and generating the feature matrix according to the respective M-dimensional vectors of the N image frames.
15. The apparatus according to claim 12, wherein the second obtaining unit is specifically configured to: and receiving the K key frames from the mobile terminal.
16. The apparatus of claim 12, wherein the second component detection model is different from the first component detection model, and the second damage detection model is different from the first damage detection model.
17. The apparatus according to claim 12, wherein the fusion unit is specifically configured to:
determining at least one candidate damaged component, including a first component;
and for each vector in the N M-dimensional vectors and the K key frame vectors, performing intra-frame fusion on the component detection information and the damage detection information in a single vector to obtain the frame comprehensive characteristics of the first component, and performing inter-frame fusion on the frame comprehensive characteristics of the first component obtained for each vector to obtain the comprehensive damage characteristics of the first component.
18. The apparatus according to claim 12, wherein the third obtaining unit is specifically configured to receive the preliminary damage result from a mobile terminal.
19. The apparatus of claim 12, wherein the feature matrix comprises M rows and S columns, wherein S is not less than N, the convolutional neural network comprises a plurality of one-dimensional convolution kernels, and the inputting the feature matrix into a pre-trained convolutional neural network comprises:
and performing convolution processing on the feature matrix on the row dimension of the feature matrix by using the plurality of one-dimensional convolution kernels.
20. The apparatus of claim 12, wherein the convolutional neural network is pre-trained by a training unit, the training unit being specifically configured to:
obtaining a plurality of training samples, wherein each training sample comprises a sample feature matrix of each video stream and a corresponding damage result label, and the sample feature matrix of each video stream at least comprises N M-dimensional vectors which respectively correspond to N image frames in each video stream and are arranged according to the time sequence of the N image frames;
training the convolutional neural network using the plurality of training samples.
21. The apparatus of claim 20, wherein the damage result label comprises at least one of: damaged material, damage type, and damaged component type.
22. The apparatus of claim 12, wherein the apparatus further comprises:
and the determining unit is configured to determine a corresponding replacement scheme according to the final damage assessment result.
23. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-11.
24. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-11.
CN201910923001.6A 2019-09-27 2019-09-27 Computer-implemented vehicle damage assessment method and device Pending CN110647853A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910923001.6A CN110647853A (en) 2019-09-27 2019-09-27 Computer-implemented vehicle damage assessment method and device
PCT/CN2020/093890 WO2021057069A1 (en) 2019-09-27 2020-06-02 Computer-executable vehicle loss assessment method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910923001.6A CN110647853A (en) 2019-09-27 2019-09-27 Computer-implemented vehicle damage assessment method and device

Publications (1)

Publication Number Publication Date
CN110647853A true CN110647853A (en) 2020-01-03

Family

ID=68992913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910923001.6A Pending CN110647853A (en) 2019-09-27 2019-09-27 Computer-implemented vehicle damage assessment method and device

Country Status (2)

Country Link
CN (1) CN110647853A (en)
WO (1) WO2021057069A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361426A (en) * 2021-06-11 2021-09-07 爱保科技有限公司 Vehicle loss assessment image acquisition method, medium, device and electronic equipment
CN113553911A (en) * 2021-06-25 2021-10-26 复旦大学 Small sample face expression recognition method integrating SURF (speeded up robust features) and convolutional neural network


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10692050B2 (en) * 2016-04-06 2020-06-23 American International Group, Inc. Automatic assessment of damage and repair costs in vehicles
CN109784171A (en) * 2018-12-14 2019-05-21 平安科技(深圳)有限公司 Car damage identification method for screening images, device, readable storage medium storing program for executing and server
CN110570318B (en) * 2019-04-18 2023-01-31 创新先进技术有限公司 Vehicle loss assessment method and device executed by computer and based on video stream
CN110647853A (en) * 2019-09-27 2020-01-03 支付宝(杭州)信息技术有限公司 Computer-implemented vehicle damage assessment method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021548A (en) * 2016-05-27 2016-10-12 大连楼兰科技股份有限公司 Remote damage assessment method and system based on distributed artificial intelligent image recognition
US20180316502A1 (en) * 2017-04-27 2018-11-01 Factom Data Reproducibility Using Blockchains
CN108875648A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A method of real-time vehicle damage and component detection based on mobile video stream

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiaoguang et al.: "Extraction and Automatic Recognition of Weld Defects in Radiographic Testing", 30 October 2004, National Defense Industry Press *
01Caijing / 01 Think Tank (零壹财经•零壹智库): "Fintech Development Report, 2018 Edition", 31 January 2019, China Railway Publishing House *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021057069A1 (en) * 2019-09-27 2021-04-01 支付宝(杭州)信息技术有限公司 Computer-executable vehicle loss assessment method and apparatus
CN111627041A (en) * 2020-04-15 2020-09-04 北京迈格威科技有限公司 Multi-frame data processing method and device and electronic equipment
CN111627041B (en) * 2020-04-15 2023-10-10 北京迈格威科技有限公司 Multi-frame data processing method and device and electronic equipment
US20210334540A1 (en) * 2020-12-25 2021-10-28 Beijing Baidu Netcom Science And Technology Co., Ltd. Vehicle loss assessment
WO2023273345A1 (en) * 2021-06-29 2023-01-05 北京百度网讯科技有限公司 Image-based vehicle loss assessment method, apparatus and system
CN117557221A (en) * 2023-11-17 2024-02-13 德联易控科技(北京)有限公司 Method, device, equipment and readable medium for generating vehicle damage report

Also Published As

Publication number Publication date
WO2021057069A1 (en) 2021-04-01

Similar Documents

Publication Publication Date Title
CN110569702B (en) Video stream processing method and device
CN110647853A (en) Computer-implemented vehicle damage assessment method and device
US11144889B2 (en) Automatic assessment of damage and repair costs in vehicles
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US20210327042A1 (en) Deep learning-based system and method for automatically determining degree of damage to each area of vehicle
CN108596277B (en) Vehicle identity recognition method and device and storage medium
CN110569701B (en) Computer-implemented vehicle damage assessment method and device
CN108171112B (en) Vehicle identification and tracking method based on convolutional neural network
US20210081698A1 (en) Systems and methods for physical object analysis
CN110569700B (en) Method and device for optimizing damage identification result
CN112862702B (en) Image enhancement method, device, equipment and storage medium
CN111797653B (en) Image labeling method and device based on high-dimensional image
JP2020520512A (en) Vehicle appearance feature identification and vehicle search method, device, storage medium, electronic device
CN110569696A (en) Neural network system, method and apparatus for vehicle component identification
Hoang et al. Enhanced detection and recognition of road markings based on adaptive region of interest and deep learning
CN110264444B (en) Damage detection method and device based on weak segmentation
CN110298281B (en) Video structuring method and device, electronic equipment and storage medium
US20230274566A1 (en) Sequence recognition method and apparatus, image processing device, and storage medium
CN112633297A (en) Target object identification method and device, storage medium and electronic device
CN114241053A (en) FairMOT multi-class tracking method based on improved attention mechanism
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN110570318B (en) Vehicle loss assessment method and device executed by computer and based on video stream
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN113420660A (en) Infrared image target detection model construction method, prediction method and system
CN116503406B (en) Hydraulic engineering information management system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200103)