Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to achieve the above object, embodiments of the present invention provide a difficult-example image collection method, a model training method, an apparatus, and a storage medium, which may be applied to various electronic devices and are not particularly limited. The method for collecting difficult example images will be described in detail first.
Fig. 1 is a first flowchart of a method for collecting difficult-example images according to an embodiment of the present invention, including:
S101: acquire a video image to be processed.
For example, the video image to be processed may include various objects to be detected, such as pedestrians, vehicles, and traffic signs. Video images may be captured by a camera and used as the video images to be processed; the captured video images may include video images collected under different conditions, such as different scenes, different weather, and different illumination conditions. Publicly available video image data may also be collected and used as the video images to be processed; the specific method for acquiring the video image to be processed is not limited. The scene may include an automatic driving scene, a security scene, and the like, and the specific scene is not limited; the weather may include sunny days, cloudy days, and the like, and the specific weather is not limited; the illumination conditions may include strong light, weak light, and the like, and the specific illumination conditions are not limited.
S102: perform target detection on each image frame in the video image to be processed to obtain a detection result for each target in each image frame.
For example, the video image to be processed may be decomposed frame by frame to obtain each image frame constituting the video, and each image frame is input into the detection model to perform target detection, so as to obtain a detection result of each image frame. Referring to fig. 2, a frame of video image is input into a detection model for object detection, and four objects a, b, c, and d detected in the frame are obtained. The detected target may be a vehicle, a pedestrian, a signal light, a traffic sign, or the like, and the specific detected target is not limited.
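By way of illustration only, a minimal sketch of this step is given below, assuming OpenCV is used to read the video frame by frame; the `detect` function and its output format are placeholders and are not part of the embodiment.

```python
# Minimal sketch: decompose a video into image frames and run a detector on
# each frame. `detect` stands in for the detection model of the embodiment;
# its output format here (a list of box dictionaries) is only an assumption.
import cv2

def detect(frame):
    # Placeholder: a real implementation would run the detection model here
    # and return e.g. [{"box": (x1, y1, x2, y2), "class": "vehicle"}, ...].
    return []

def detect_video(video_path):
    cap = cv2.VideoCapture(video_path)
    per_frame_results = []          # per_frame_results[i]: detections of frame i
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        per_frame_results.append(detect(frame))
    cap.release()
    return per_frame_results
```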
In one embodiment, S102 may include: identifying each target in each image frame in the video image to be processed; matching each target in the image frame with each target in the previous image frame to obtain a matching result of each target in the image frame; and determining the identification number of each target in the image frame based on the matching result as the detection result of the image frame.
For example, an image frame is input into the detection model for target detection, and four targets are detected in the image frame, assumed to be a, b, c, and d. Assuming that three targets e, f, and g are detected in the previous image frame, the detected targets a, b, c, and d are matched with the targets e, f, and g detected in the previous image frame. Suppose the matching result is: a is successfully matched with e, b is successfully matched with f, c is successfully matched with g, and d is not successfully matched. Thus, the identification number of target a can be determined as A1; the identification number of target e as A2; the identification number of target b as B1; the identification number of target f as B2; the identification number of target c as C1; the identification number of target g as C2; and the identification number of target d as D1. The identification numbers A1 of target a, B1 of target b, C1 of target c, and D1 of target d in the image frame are taken as the detection result of the image frame.
Alternatively, in one case, determining the identification number of each target in the image frame based on the matching result may include: determining the matching result which is successfully matched as the same target, and marking the same identification number; and determining the matching result of unsuccessful matching as different targets and marking different identification numbers.
Referring to fig. 2 and fig. 3, the matching result of successful matching is determined as the same target, and the same identification number is marked; and determining the matching result of unsuccessful matching as different targets and marking different identification numbers. It is understood that the same object is labeled with the same identification number and a newly appearing object is labeled with a different identification number.
For example, if the matching result is that a is successfully matched with e, b is successfully matched with f, c is successfully matched with g, and d is not successfully matched, then a and e can be determined as the same target, which may be marked with identification number A; b and f are determined as the same target, which may be marked with identification number B; c and g are determined as the same target, which may be marked with identification number C; and d is determined as a newly appearing target, which may be marked with identification number D.
If the matching result is that a is successfully matched with e, b is successfully matched with f, and c, d, and g are not successfully matched, then a and e can be determined as the same target, which may be marked with identification number A; b and f are determined as the same target, which may be marked with identification number B; c is determined as a newly appearing target, which may be marked with identification number C; d is determined as a newly appearing target, which may be marked with identification number D; and g, the target detected only in the previous frame, may be marked with identification number E.
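The matching of targets between adjacent frames and the assignment of identification numbers can be sketched as follows; the IoU-based greedy matching and the 0.5 threshold are illustrative assumptions, since the embodiment does not prescribe a particular matching criterion.

```python
# Sketch: match the targets of the current frame against those of the previous
# frame and assign identification numbers.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with the upper-left and lower-right corners.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_ids(prev_targets, curr_boxes, next_id, iou_thr=0.5):
    """prev_targets: list of (identification number, box) from the previous frame.
    curr_boxes: boxes detected in the current frame.
    Returns the (identification number, box) list of the current frame and the
    next unused identification number."""
    curr_targets, used = [], set()
    for box in curr_boxes:
        best_id, best_iou = None, iou_thr
        for tid, prev_box in prev_targets:
            if tid in used:
                continue
            overlap = iou(box, prev_box)
            if overlap > best_iou:
                best_id, best_iou = tid, overlap
        if best_id is None:                 # unsuccessful match: new target, new id
            best_id, next_id = next_id, next_id + 1
        else:                               # successful match: same target, same id
            used.add(best_id)
        curr_targets.append((best_id, box))
    return curr_targets, next_id
```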
S103: determine, according to the detection result of each target in each image frame, whether discontinuous frames exist between the image frames in which the same target is detected. If so, S104 may be performed.
Here, a discontinuous frame may be understood as a frame in which a certain target is missing while that target appears continuously in the other image frames before and after it. For example, among the 1st to 15th image frames, if a certain target is missing in the 8th image frame but appears continuously in the 1st to 7th and 9th to 15th image frames, the 8th image frame is a discontinuous frame. If a target appears in the 1st, 2nd, and 3rd image frames and does not appear in any frame after the 3rd image frame, the target has moved out of the video image, and the frames after the 3rd image frame are not discontinuous frames. If a target does not appear in the 1st, 2nd, and 3rd image frames but appears, continuously or discontinuously, in the following image frames, the target enters the video image from the 4th image frame, and the 1st, 2nd, and 3rd image frames are not discontinuous frames.
In one embodiment, if the matching result that matches successfully is marked with the same identification number and the matching result that matches unsuccessfully is marked with a different identification number in the above one embodiment, S103 may include: determining each detection target based on the identification number of each target; determining an image frame corresponding to each detection target; and judging whether discontinuous frames exist between image frames corresponding to each detection target.
For example, the target with identification number A may be determined as detection target 1, the target with identification number B as detection target 2, the target with identification number C as detection target 3, and the target with identification number D as detection target 4. In this way, the image frames corresponding to each detection target can be determined, and for each detection target it can be judged whether a discontinuous frame exists between the image frames corresponding to that detection target. In another embodiment, the targets with identification numbers A1 and A2 may be determined as detection target 1, the targets with identification numbers B1 and B2 as detection target 2, the targets with identification numbers C1 and C2 as detection target 3, and the targets with identification numbers D1 and D2 as detection target 4.
In one case, determining, for each detection target, whether a discontinuous frame exists between the image frames corresponding to that detection target includes: judging, for each detection target, whether the frame numbers of the image frames corresponding to that detection target are continuous; if not, determining that a discontinuous frame exists between the image frames in which the detection target appears, and S104 may be performed.
For example, if the frame numbers of the image frames corresponding to detection target 1 are 1-4, 7-9, and 11-15, that is, detection target 1 appears in the image frames numbered 1-4, 7-9, and 11-15, the frame numbers of the image frames in which detection target 1 appears are not continuous, and there are discontinuous frames between them, namely the 5th, 6th, and 10th image frames. If the frame numbers of the image frames corresponding to detection target 2 are 1-15, that is, detection target 2 appears in the image frames numbered 1-15, there is no discontinuous frame between the image frames in which detection target 2 appears. If the frame numbers of the image frames corresponding to detection target 3 are 1-3, that is, detection target 3 appears in the image frames numbered 1-3, there is no discontinuous frame between the image frames in which detection target 3 appears. If the frame numbers of the image frames corresponding to detection target 4 are 4-6 and 8-15, that is, detection target 4 appears in the image frames numbered 4-6 and 8-15, the frame numbers are not continuous, and there is a discontinuous frame, namely the 7th image frame. If the frame numbers of the image frames corresponding to detection target 5 are 7-14, that is, detection target 5 appears in the image frames numbered 7-14, there is no discontinuous frame between the image frames in which detection target 5 appears. If a discontinuous frame exists between the image frames corresponding to a detection target, S104 may be performed.
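A possible implementation of this continuity check, given the frame numbers in which each detection target appears, is sketched below; the input format is an assumption, and frame numbering starts at 1 as in the example above.

```python
# Sketch: given, for each detection target, the frame numbers in which it
# appears, report the discontinuous frames (interior frames where the target
# is missing).

def find_discontinuous_frames(frames_per_target):
    """frames_per_target: dict mapping detection target -> sorted list of frame numbers."""
    discontinuous = {}
    for target, frames in frames_per_target.items():
        present = set(frames)
        missing = [f for f in range(frames[0], frames[-1] + 1) if f not in present]
        if missing:
            discontinuous[target] = missing
    return discontinuous

# Detection target 1 appears in frames 1-4, 7-9 and 11-15, so its
# discontinuous frames are 5, 6 and 10; detection target 2 has none.
print(find_discontinuous_frames({
    1: list(range(1, 5)) + list(range(7, 10)) + list(range(11, 16)),
    2: list(range(1, 16)),
}))  # -> {1: [5, 6, 10]}
```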
Alternatively, in another case, a statistical chart may be drawn based on the frame numbers of the video image to be processed and the identification numbers of the detection targets; fig. 4 is a schematic diagram of such a statistical chart according to an embodiment of the present invention. The abscissa indicates the video frame number and the ordinate indicates the identification number of the detection target. Statistical analysis can be performed on the statistical chart shown in fig. 4: the detection target with identification number A is not continuous in the 5th, 6th, and 10th frames, so discontinuous frames exist, namely the 5th, 6th, and 10th image frames; the detection target with identification number B appears continuously in the video image to be processed, so no discontinuous frame exists; the detection target with identification number C appears continuously in the video image to be processed, so no discontinuous frame exists; the detection target with identification number D is not continuous in the 7th frame, so a discontinuous frame exists, namely the 7th image frame; the detection target with identification number E appears continuously in the video image to be processed, so no discontinuous frame exists. If a discontinuous frame exists between the image frames corresponding to a detection target, S104 may be performed.
S104: determine the image frame corresponding to the discontinuous frame as a difficult example image.
It may be understood that discontinuous image frames are determined as difficult example images. For example, in the above embodiment, a statistical chart is used to perform statistical analysis and determine whether each detection target has discontinuous frames in the video image to be processed. Referring to fig. 4, the detection target with identification number A is not continuous in the 5th, 6th, and 10th frames, so discontinuous frames exist; the detection target with identification number B appears continuously in the video image to be processed, so no discontinuous frame exists; the detection target with identification number C appears continuously in the video image to be processed, so no discontinuous frame exists; the detection target with identification number D is not continuous in the 7th frame, so a discontinuous frame exists; the detection target with identification number E appears continuously in the video image to be processed, so no discontinuous frame exists. Then, the 5th, 6th, 7th, and 10th image frames of the video image to be processed may be determined as difficult example images.
In one embodiment, the detection result includes a detection frame, and the detection frame is used to identify the area occupied by a target in each image frame of the video image to be processed. The image area occupied by the detection frame in each image frame is determined as a target legend; the target legend is input into a preset target classifier to obtain a classification result, where the classification result includes a background category and a target category; it is then judged whether the classification result is the background category, and if so, the image frame in which the target legend corresponding to that classification result is located is determined as a difficult example image.
For example, the shape of the detection frame may be rectangular, circular, trapezoidal, and the like, and the specific shape is not limited. Taking a rectangular detection frame as an example, the detection frame includes four coordinate values, namely the upper-left-corner abscissa x1, the upper-left-corner ordinate y1, the lower-right-corner abscissa x2, and the lower-right-corner ordinate y2; the image area in each video frame whose vertices are determined by these four coordinate values may be used to identify the area occupied by the target in each image frame of the video image to be processed, and the image area occupied by the detection frame in each image frame is determined as a target legend.
The target legend is input into a preset target classifier, such as an SVM (Support Vector Machine) classifier, to obtain a classification result of the target legend; the classification result includes a background category and a target category. It is judged whether the classification result is the background result, and if so, the video frame in which the target legend corresponding to that classification result is located is determined as a difficult example image. The preset target classifier may be an SVM classifier, a Bayes classifier, or the like, and the specific preset target classifier is not limited.
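As an illustrative sketch only, the crop-and-classify step might look as follows, assuming a scikit-learn SVM that was trained beforehand to separate target crops from background crops; resizing and flattening the crop is just one possible way to vectorize the target legend and is not prescribed by the embodiment.

```python
# Sketch: crop the image area occupied by a rectangular detection frame (the
# "target legend") and classify it with a pre-trained SVM.
import cv2
from sklearn.svm import SVC

BACKGROUND, TARGET = 0, 1           # label convention assumed for this sketch

def crop_target_legend(frame, box):
    x1, y1, x2, y2 = box            # upper-left and lower-right corner coordinates
    return frame[y1:y2, x1:x2]

def is_false_detection(frame, box, svm: SVC, size=(64, 64)):
    legend = crop_target_legend(frame, box)
    feature = cv2.resize(legend, size).flatten().reshape(1, -1)
    # A background classification result indicates a false detection, so the
    # frame containing this target legend is a difficult example image.
    return svm.predict(feature)[0] == BACKGROUND
```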
For example, the target legend is input to the SVM classifier. If the classification result of the target legend is the background category, the classification result is the same as the background result, and the image frame in which that target legend is located is determined as a difficult example image; if the classification result of the target legend is the target category, the classification result differs from the background result, and the next target legend can be input into the preset target classifier to obtain its classification result.
In this embodiment, the target legend is input into the preset target classifier to obtain a classification result, and it is then judged whether the classification result is the background result, so that the image frame corresponding to a classification result that is the same as the background result is determined as a difficult example image. That is, an image on which the detection model produces a false detection is determined as a difficult example image, and the difficult example image is thereby collected.
In one embodiment, after the difficult example images are determined as in the above embodiments, feature extraction may be performed on the determined difficult example images to obtain difficult example image features; for any two difficult example image features, the similarity of the two features is calculated; it is judged whether the similarity satisfies a preset threshold condition; and if so, either one of the two difficult example images corresponding to the two difficult example image features is removed.
In one case, the difficult image determined in the above embodiment may be directly subjected to feature extraction to obtain the difficult image features. For example, in the above one embodiment, by performing statistical analysis on the detection result of each image frame, the discontinuous frame in which the detection target does not appear continuously is determined as a difficult example image, that is, an image in which the detection model has a missing detection phenomenon is determined as a difficult example image, and the image frame determined as the difficult example image is assumed to be the 2 nd, 3 rd and 5 th frames; in the above embodiment, the target legend is input to the preset target classifier to obtain the classification result, and then whether the classification result is the background result is determined, so as to determine the image frame corresponding to the classification result that is the same as the background result as the difficult example image, that is, the image in which the false detection phenomenon occurs in the detection model is determined as the difficult example image, and assume that the image frame determined as the difficult example image is the 3 rd, 7 th, and 9 th frames. And directly extracting the features of the determined difficulty image to obtain the difficulty image features. That is, the 3 rd image frame needs to be subjected to feature extraction twice.
Alternatively, in another case, the union of the difficult example images determined in the foregoing embodiments may be taken. For example, in the foregoing embodiment, by performing statistical analysis on the detection result of each image frame, the discontinuous frames in which a detection target does not appear continuously are determined as difficult example images, that is, images on which the detection model misses detections are determined as difficult example images; assume that the image frames so determined are the 2nd, 3rd, and 5th frames. In the other foregoing embodiment, the target legend is input into the preset target classifier to obtain a classification result, it is then judged whether the classification result is the background result, and the image frame corresponding to a classification result that is the same as the background result is determined as a difficult example image, that is, an image on which the detection model produces a false detection; assume that the image frames so determined are the 3rd, 7th, and 9th frames. The union of the determined difficult example images is then the 2nd, 3rd, 5th, 7th, and 9th frames, and feature extraction is performed on these difficult example images to obtain the difficult example image features. In this case, feature extraction does not need to be repeated for the 3rd image frame; it is performed only once.
For example, feature extraction may be performed on the difficult example images, such as extracting SURF (Speeded Up Robust Features) features of the difficult example images, to obtain the difficult example image features; for any two difficult example image features, the similarity of the two features is calculated, for example their cosine similarity; it is judged whether the similarity satisfies a preset threshold condition; and if so, either one of the two difficult example images corresponding to the two difficult example image features is removed. The extracted difficult example image features may be SURF features, color features, texture features, shape features, and the like, and the specific difficult example image features are not limited. The method for calculating the similarity of two difficult example image features may be a nearest neighbor fast matching algorithm, cosine similarity, Euclidean distance, Manhattan distance, and the like, and the specific method for calculating the similarity is not limited. The preset threshold condition may be that the similarity is greater than a preset threshold, that the similarity is less than a preset threshold, and the like, and the specific preset threshold condition is not limited.
For example, assume that the preset threshold condition is that the similarity is greater than a preset threshold of 0.6. SURF feature extraction is performed on difficult example image X to obtain difficult example image feature x, and on difficult example image Y to obtain difficult example image feature y. The similarity of x and y is calculated with a nearest neighbor fast matching algorithm; if the obtained similarity is 0.8, the preset threshold condition is satisfied, and either one of the two difficult example images X and Y corresponding to x and y is removed, for example removing difficult example image X and keeping difficult example image Y, or removing difficult example image Y and keeping difficult example image X. If the similarity is 0.3, the preset threshold condition is not satisfied, and both difficult example images X and Y corresponding to x and y can be retained. The preset threshold may be 0.6, 0.5, and the like, and the specific preset threshold is not limited.
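A rough deduplication sketch is shown below. The description above uses SURF with nearest-neighbor matching; because SURF requires OpenCV's non-free contrib module, ORB descriptors with brute-force matching are substituted here as an assumption, and the match-ratio "similarity" and the 0.6 threshold are illustrative only.

```python
# Sketch: remove near-duplicate difficult example images by descriptor matching.
import cv2

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def _descriptors(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
    _, des = orb.detectAndCompute(gray, None)
    return des

def similarity(img_a, img_b):
    des_a, des_b = _descriptors(img_a), _descriptors(img_b)
    if des_a is None or des_b is None:
        return 0.0
    matches = matcher.match(des_a, des_b)
    # Fraction of matched descriptors, used as a rough similarity in [0, 1].
    return len(matches) / max(len(des_a), len(des_b))

def deduplicate(images, threshold=0.6):
    kept = []
    for img in images:
        # Keep an image only if it is not too similar to any image already kept.
        if all(similarity(img, k) <= threshold for k in kept):
            kept.append(img)
    return kept
```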
By applying the embodiment of the present invention, a video image to be processed is acquired; target detection is performed on each image frame in the video image to be processed to obtain a detection result for each target in each image frame; it is judged, according to the detection result of each target in each image frame, whether discontinuous frames exist between the image frames in which the same target is detected; and if so, the image frames corresponding to the discontinuous frames are determined as difficult example images. Therefore, in this solution, on one hand, it is judged whether discontinuous frames exist between the image frames in which the same target is detected; if so, the discontinuous frames are determined as difficult example images, that is, images on which the detection model misses detections are determined as difficult example images, and difficult example images are thereby collected.
On the other hand, statistical analysis is performed on the detection result of each image frame to determine difficult example images; meanwhile, a preset target classifier is used to classify the target legends and determine further difficult example images; and the determined difficult example images are screened according to the similarity between different difficult example samples, which reduces the redundancy of the difficult example images. In addition, because the video images to be processed are screened using statistical analysis, the preset target classifier, and similarity screening, manual screening is avoided and labor consumption is reduced.
Fig. 5 is a second flowchart of a method for collecting difficult-example images according to an embodiment of the present invention, including:
S501: acquire a video image to be processed.
For example, the video image to be processed may be a video image containing objects such as pedestrians, vehicles, and traffic signs. Video images may be captured by a camera and used as the video images to be processed; the captured video images may include video images collected under different conditions, such as different scenes, different weather, and different illumination conditions. Publicly available video image data may also be collected and used as the video images to be processed; the specific method for acquiring the video image to be processed is not limited. The scene may include an automatic driving scene, a security scene, and the like, and the specific scene is not limited; the weather may include sunny days, cloudy days, and the like, and the specific weather is not limited; the illumination conditions may include strong light, weak light, and the like, and the specific illumination conditions are not limited.
S502: identify each target in each image frame of the video image to be processed, and match each target in the image frame with each target in the previous image frame to obtain a matching result for each target in the image frame.
For example, the video image to be processed may be decomposed frame by frame to obtain each image frame constituting the video, and each image frame is input into the detection model to perform target detection, so as to obtain a detection result of each image frame. Referring to fig. 2, a frame of video image is input into a detection model for object detection, and four objects a, b, c, and d detected in the frame are obtained. The detected target may be a vehicle, a pedestrian, a signal light, a traffic sign, or the like, and the specific detected target is not limited.
For example, an image frame is input into a detection model for object detection, and if four objects are detected in the image frame, the four objects are assumed to be a, b, c, and d, respectively. Assuming that three targets e, f and g are detected in the previous image frame of the image frame, matching the detected targets a, b, c and d with the targets e, f and g detected in the previous image frame to obtain a matching result of the target a, a matching result of the target b, a matching result of the target c and a matching result of the target d.
S503: determine the targets that are successfully matched as the same target and mark them with the same identification number; determine the targets that are not successfully matched as different targets and mark them with different identification numbers.
Referring to fig. 2 and fig. 3, the matching result of successful matching is determined as the same target, and the same identification number is marked; and determining the matching result of unsuccessful matching as different targets and marking different identification numbers. It is understood that the same object is labeled with the same identification number and a newly appearing object is labeled with a different identification number.
For example, if the matching result is that a is successfully matched with e, b is successfully matched with f, c is successfully matched with g, and d is not successfully matched, then a and e can be determined as the same target, which may be marked with identification number A; b and f are determined as the same target, which may be marked with identification number B; c and g are determined as the same target, which may be marked with identification number C; and d is determined as a newly appearing target, which may be marked with identification number D.
If the matching result is that a is successfully matched with e, b is successfully matched with f, and c, d, and g are not successfully matched, then a and e can be determined as the same target, which may be marked with identification number A; b and f are determined as the same target, which may be marked with identification number B; c is determined as a newly appearing target, which may be marked with identification number C; d is determined as a newly appearing target, which may be marked with identification number D; and g, the target detected only in the previous frame, may be marked with identification number E.
S504: determine each detection target based on the identification number of each target, and determine the image frames corresponding to each detection target.
For example, the object with the identification number a may be determined as the detection object 1, the object with the identification number B may be determined as the detection object 2, the object with the identification number C may be determined as the detection object 3, and the object with the identification number D may be determined as the detection object 4.
In one embodiment, S504 may include: drawing a statistical chart based on the frame number and the identification number of the video image to be processed; and determining an image frame corresponding to each detection target based on the statistical chart.
For example, a statistical chart may be drawn based on the frame numbers of the video image to be processed and the identification numbers of the detection targets; fig. 4 is a schematic diagram of such a statistical chart according to an embodiment of the present invention. The abscissa indicates the video frame number and the ordinate indicates the identification number of the detection target. Statistical analysis can be performed on the statistical chart shown in fig. 4: the image frames corresponding to the detection target with identification number A are frames 1-4, 7-9, and 11-15; the image frames corresponding to the detection target with identification number B are frames 1-15; the image frames corresponding to the detection target with identification number C are frames 1-3; the image frames corresponding to the detection target with identification number D are frames 4-6 and 8-15; the image frames corresponding to the detection target with identification number E are frames 7-14.
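The statistical chart of fig. 4 could be produced, for instance, with a sketch like the following, where matplotlib is assumed and the frame lists reproduce the example values given above.

```python
# Sketch: draw the statistical chart — video frame number on the abscissa,
# identification number of each detection target on the ordinate.
import matplotlib.pyplot as plt

def plot_statistics(frames_per_target):
    """frames_per_target: dict mapping identification number (e.g. 'A') to the
    list of frame numbers in which that detection target appears."""
    for label, frames in frames_per_target.items():
        plt.scatter(frames, [label] * len(frames), marker="s", s=20)
    plt.xlabel("video frame number")
    plt.ylabel("identification number of detection target")
    plt.show()

plot_statistics({
    "A": list(range(1, 5)) + list(range(7, 10)) + list(range(11, 16)),
    "B": list(range(1, 16)),
    "C": list(range(1, 4)),
    "D": list(range(4, 7)) + list(range(8, 16)),
    "E": list(range(7, 15)),
})
```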
S505: for each detection target, determine whether the frame numbers of the image frames corresponding to that detection target are continuous. If not, S506 may be performed.
It may be understood that discontinuous image frames are determined as difficult example images. For example, in the above embodiment, the image frames corresponding to each detection target are determined from the statistical chart. Referring to fig. 4, the image frames corresponding to the detection target with identification number A are frames 1-4, 7-9, and 11-15, so the detection target with identification number A is not continuous in the 5th, 6th, and 10th frames; the image frames corresponding to the detection target with identification number B are frames 1-15, so the detection target with identification number B appears continuously in the video image to be processed; the image frames corresponding to the detection target with identification number C are frames 1-3, so the detection target with identification number C appears continuously in the video image to be processed; the image frames corresponding to the detection target with identification number D are frames 4-6 and 8-15, so the detection target with identification number D is not continuous in the 7th frame; the image frames corresponding to the detection target with identification number E are frames 7-14, so the detection target with identification number E appears continuously in the video image to be processed.
S506: determine that discontinuous frames exist between the image frames in which the detection target appears, and determine the image frames corresponding to the discontinuous frames as difficult example images.
Here, a discontinuous frame may be understood as a frame in which a certain target is missing while that target appears continuously in the other image frames before and after it. For example, among the 1st to 15th image frames, if a certain target is missing in the 8th image frame but appears continuously in the 1st to 7th and 9th to 15th image frames, the 8th image frame is a discontinuous frame. If a target appears in the 1st, 2nd, and 3rd image frames and does not appear in any frame after the 3rd image frame, the target has moved out of the video image, and the frames after the 3rd image frame are not discontinuous frames. If a target does not appear in the 1st, 2nd, and 3rd image frames but appears, continuously or discontinuously, in the following image frames, the target enters the video image from the 4th image frame, and the 1st, 2nd, and 3rd image frames are not discontinuous frames.
In the above embodiment, the image frames corresponding to the detection target with identification number A are the 1st to 4th, 7th to 9th, and 11th to 15th image frames, so the detection target with identification number A is not continuous in the 5th, 6th, and 10th frames; the image frames corresponding to the detection target with identification number D are the 4th to 6th and 8th to 15th frames, so the detection target with identification number D is not continuous in the 7th frame. It can thus be determined that discontinuous frames exist between the image frames in which the detection target with identification number A appears, and that a discontinuous frame exists between the image frames in which the detection target with identification number D appears. Then, the 5th, 6th, 7th, and 10th frames of the video image to be processed may be determined as difficult example images.
S507: the detection result includes a detection frame; determine the image area occupied by the detection frame in each image frame as a target legend, and input the target legend into a preset target classifier to obtain a classification result, where the classification result includes a background category and a target category.
For example, the detection frame may be rectangular, circular, trapezoidal, and the like, and the specific shape is not limited. Taking a rectangular detection frame as an example, the detection frame includes four coordinate values, namely the upper-left-corner abscissa x1, the upper-left-corner ordinate y1, the lower-right-corner abscissa x2, and the lower-right-corner ordinate y2; the image area in each video frame whose vertices are determined by these four coordinate values may be used to identify the area occupied by a target in each image frame of the video image to be processed, and the image area occupied by the detection frame in each image frame is determined as a target legend. The target legend is input into a preset target classifier, such as an SVM (Support Vector Machine) classifier, to obtain a classification result of the target legend. The preset target classifier may be an SVM classifier, a Bayes classifier, or the like, and the specific preset target classifier is not limited.
S508: determine whether the classification result is the background result. If so, S509 may be performed.
For example, if the classification result of the target legend is the background category, the classification result is the same as the background result, and S509 may be performed; if the classification result of the target legend is the target category, the classification result differs from the background result, and the next target legend can be input into the preset target classifier to obtain its classification result.
S509: determine the image frame in which the target legend corresponding to the classification result is located as a difficult example image.
In this embodiment, the target legend is input into the preset target classifier to obtain a classification result, and it is then judged whether the classification result is the background result, so that the image frame corresponding to a classification result that is the same as the background result is determined as a difficult example image. That is, an image on which the detection model produces a false detection is determined as a difficult example image, and the difficult example image is thereby collected.
In one embodiment, after S509, the method may further include: performing feature extraction on the determined difficult example images to obtain difficult example image features; and, for any two difficult example image features, calculating the similarity of the two features.
In one case, the difficult image determined in the above embodiment may be directly subjected to feature extraction to obtain the difficult image features. For example, in the above one embodiment, by performing statistical analysis on the detection result of each image frame, the discontinuous frame in which the detection target does not appear continuously is determined as a difficult example image, that is, an image in which the detection model has a missing detection phenomenon is determined as a difficult example image, and the image frame determined as the difficult example image is assumed to be the 2 nd, 3 rd and 5 th frames; in the above embodiment, the target legend is input to the preset target classifier to obtain the classification result, and then whether the classification result is the background result is determined, so as to determine the image frame corresponding to the classification result that is the same as the background result as the difficult example image, that is, the image in which the false detection phenomenon occurs in the detection model is determined as the difficult example image, and assume that the image frame determined as the difficult example image is the 3 rd, 7 th, and 9 th frames. And directly extracting the features of the determined difficulty image to obtain the difficulty image features. That is, the 3 rd image frame needs to be subjected to feature extraction twice.
Alternatively, in another case, the union of the difficult example images determined in the foregoing embodiments may be taken. For example, in the foregoing embodiment, by performing statistical analysis on the detection result of each image frame, the discontinuous frames in which a detection target does not appear continuously are determined as difficult example images, that is, images on which the detection model misses detections are determined as difficult example images; assume that the image frames so determined are the 2nd, 3rd, and 5th frames. In the other foregoing embodiment, the target legend is input into the preset target classifier to obtain a classification result, it is then judged whether the classification result is the background result, and the image frame corresponding to a classification result that is the same as the background result is determined as a difficult example image, that is, an image on which the detection model produces a false detection; assume that the image frames so determined are the 3rd, 7th, and 9th frames. The union of the determined difficult example images is then the 2nd, 3rd, 5th, 7th, and 9th frames, and feature extraction is performed on these difficult example images to obtain the difficult example image features. In this case, feature extraction does not need to be repeated for the 3rd image frame; it is performed only once.
For example, feature extraction may be performed on the difficult example images, such as extracting SURF (Speeded Up Robust Features) features of the difficult example images, to obtain the difficult example image features; for any two difficult example image features, the similarity of the two features is calculated, for example their cosine similarity. The extracted difficult example image features may be SURF features, color features, texture features, shape features, and the like, and the specific difficult example image features are not limited. The method for calculating the similarity of two difficult example image features may be a nearest neighbor fast matching algorithm, cosine similarity, Euclidean distance, Manhattan distance, and the like, and the specific method for calculating the similarity is not limited.
For example, SURF feature extraction is performed on difficult example image X to obtain difficult example image feature x, and on difficult example image Y to obtain difficult example image feature y; similarity calculation is performed on the features x and y using a nearest neighbor fast matching algorithm, and the obtained similarity is 0.8.
For any two difficult example image features, after calculating the similarity of the two features, the method further includes: judging whether the similarity satisfies a preset threshold condition; and if so, removing either one of the two difficult example images corresponding to the two difficult example image features.
The preset threshold condition may be that the similarity is greater than a preset threshold, that the similarity is less than a preset threshold, and the like; the specific preset threshold condition is not limited. For example, suppose the preset threshold condition is that the similarity is greater than a preset threshold of 0.6. If the similarity of the two difficult example image features x and y is 0.8, the preset threshold condition is satisfied, and either one of the two difficult example images X and Y corresponding to x and y can be removed, for example removing difficult example image X and keeping difficult example image Y, or removing difficult example image Y and keeping difficult example image X. If the similarity is 0.3, the preset threshold condition is not satisfied, and both difficult example images X and Y corresponding to x and y can be retained. The preset threshold may be 0.6, 0.5, and the like, and the specific preset threshold is not limited.
By applying the embodiment of the present invention, a video image to be processed is acquired; target detection is performed on each image frame in the video image to be processed to obtain a detection result for each target in each image frame; it is judged, according to the detection result of each target in each image frame, whether discontinuous frames exist between the image frames in which the same target is detected; and if so, the image frames corresponding to the discontinuous frames are determined as difficult example images. Therefore, in this solution, on one hand, it is judged whether discontinuous frames exist between the image frames in which the same target is detected; if so, the discontinuous frames are determined as difficult example images, that is, images on which the detection model misses detections are determined as difficult example images, and difficult example images are thereby collected.
On the other hand, statistical analysis is performed on the detection result of each image frame to determine difficult example images; meanwhile, a preset target classifier is used to classify the target legends and determine further difficult example images; and the determined difficult example images are screened according to the similarity between different difficult example samples, which reduces the redundancy of the difficult example images. In addition, because the video images to be processed are screened using statistical analysis, the preset target classifier, and similarity screening, manual screening is avoided and labor consumption is reduced.
An embodiment of the present invention further provides a model training method, as shown in fig. 6, including:
S601: acquire a video image to be detected and determine the difficult example images in the video image to be detected.
For example, the video image to be detected may include various objects to be detected, such as pedestrians, vehicles, traffic signs, and so on. The video images can be collected through the camera, and the collected video images can be used as video images to be detected, wherein the collected video images can comprise video images collected under different conditions, such as video images collected under different conditions of different scenes, different weather conditions, different illumination conditions and the like; the collected public video image data can also be used as a video image to be detected, and the specific method for acquiring the video image to be detected is not limited. The scene may include an automatic driving scene, a security scene, and the like, and the specific scene is not limited; the weather may include sunny days, cloudy days, etc., and the specific weather is not limited; the lighting conditions may include strong light, weak light, etc., and the specific lighting conditions are not limited.
The process of determining the difficult example image in the video image to be detected refers to the embodiments shown in fig. 1 to 5, which are not described herein again.
S602: acquire annotation information of the targets in the difficult example images.
The annotation information may be acquired by manually annotating the targets in the difficult example images, or by collecting publicly available target annotation information corresponding to the difficult example images; the specific method for acquiring the annotation information is not limited.
S603: input the difficult example images and the annotation information into the detection model to be trained, and iteratively adjust the detection model to be trained based on its output to obtain a trained detection model, where the detection model to be trained is a model obtained by pre-training.
The detection model can be used to detect targets such as vehicles, pedestrians, and traffic signs. The detection model to be trained, which is obtained in advance from data in an image database, is trained with the difficult example images to obtain the trained detection model. The data in the image database is used to train a model capable of detecting each target, and the embodiment of the present invention does not limit the specific image database. Compared with directly training the detection model with the data in the image database, training the model with difficult example images can improve the detection accuracy of the detection model and reduce the number of false detections and missed detections, thereby improving the detection performance of the detection model. In addition, secondarily training the detection model, which was pre-trained on data in the image database, with the difficult example images can further improve the detection accuracy of the detection model.
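A hedged sketch of this secondary training step is given below, using torchvision's pre-trained Faster R-CNN purely as an example of a "detection model to be trained"; `hard_example_loader`, the optimizer settings, and the number of epochs are assumptions, and the embodiment is not tied to any particular framework or model.

```python
# Sketch: secondary training of a pre-trained detection model on the collected
# difficult example images. `hard_example_loader` is assumed to yield
# (images, targets) pairs in torchvision's detection format, where each target
# dict contains the annotated "boxes" and "labels".
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_on_hard_examples(hard_example_loader, epochs=5):
    model.train()
    for _ in range(epochs):
        for images, targets in hard_example_loader:
            # In training mode torchvision detection models return a dict of losses.
            loss_dict = model(images, targets)
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```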
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701 and a memory 702, wherein
the memory 702 is configured to store a computer program;
the processor 701 is configured to implement any of the above-described difficult example image collection and model training methods when executing the program stored in the memory 702.
The Memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the above difficult example image collection and model training methods.
In yet another embodiment, a computer program product comprising instructions is provided, which, when run on a computer, causes the computer to perform any of the above-described difficult example image collection and model training methods.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, apparatus embodiments, computer-readable storage medium embodiments, and computer program product embodiments are described for simplicity because they are substantially similar to method embodiments, as may be found in some descriptions of method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.