CN117612052A - Deep learning training data sampling method, system, device and storage medium - Google Patents

Deep learning training data sampling method, system, device and storage medium

Info

Publication number
CN117612052A
Authority
CN
China
Prior art keywords
target
state
frame
tracking
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311357056.8A
Other languages
Chinese (zh)
Inventor
肖兵
杨婉香
李正国
廖鑫
王文熹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Shixi Technology Co Ltd
Original Assignee
Zhuhai Shixi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Shixi Technology Co Ltd filed Critical Zhuhai Shixi Technology Co Ltd
Priority to CN202311357056.8A
Publication of CN117612052A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a deep learning training data sampling method, system, device, and storage medium. The method comprises the following steps: determining a target to be detected; tracking the target to be detected in each frame by means of a target detection frame to obtain state information of the target to be detected; judging, according to the state information, whether a predefined state transition event occurs during tracking; if it is determined that a predefined state transition event has occurred, marking the corresponding target video frame; and extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.

Description

Deep learning training data sampling method, system, device and storage medium
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a method, a system, a device, and a storage medium for sampling training data for deep learning.
Background
Currently, target detection algorithms based on deep learning are widely applied in many fields such as office, education, security, service robots, unmanned aerial vehicles, autonomous driving, and smart home. The performance of the model directly affects the functional experience of the product, so how to improve model performance has always been a focus of attention for research and development personnel.
In the development of deep learning algorithms, situations are often encountered in which a model performs poorly in certain scenarios. The traditional approach is to collect video data of the relevant scenes, sample the video to obtain images to be annotated, annotate those images, add the annotated data to the training set, and then retrain the model.
Video sampling typically involves extraction at fixed time intervals, at fixed frame intervals, of key frames (I-frames), or of scene transition frames. Another approach is adaptive sampling, where the sampling interval is dynamically adjusted according to the speed of target motion or the magnitude of scene change. In this process, image screening is also required to exclude frames with overly high similarity or poor image quality.
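As a point of reference, the conventional fixed-interval decimation described above is simple to implement. The sketch below is a minimal illustration using OpenCV; the function name, output layout, and default interval are illustrative assumptions rather than part of this application's method:

```python
import cv2

def sample_fixed_interval(video_path: str, out_dir: str, frame_interval: int = 30) -> int:
    """Save every `frame_interval`-th frame of a video; returns the number saved."""
    cap = cv2.VideoCapture(video_path)
    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream or read error
            break
        if index % frame_interval == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```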
However, when the model already performs well in most cases and false detections or missed detections occur only in individual frames or time periods, the traditional sampling approaches may no longer be applicable. In that situation, running the model frame by frame over the video and manually screening out the poorly performing frames may be a solution, but it is inefficient and time-consuming.
Another approach is to combine target detection with target tracking to more accurately determine when the model produces false or missed detections. However, this method sometimes makes erroneous judgments and cannot cover all cases.
Taking all this into consideration, more intelligent and efficient data sampling and training strategies need to be explored, including online learning, active learning, human-assisted labeling, data augmentation, automatic data screening, reinforcement learning, and domain adaptation, so as to improve the performance of deep learning target detection models in practical applications. The choice among these methods should be weighed against the specific problem and the available resources.
Disclosure of Invention
In order to solve the above technical problems, the application provides a deep learning training data sampling method, system, device, and storage medium. The technical solutions in the present application are described below:
the first aspect of the application provides a deep learning training data sampling method, which comprises the following steps:
determining a target to be detected;
tracking the target to be detected in each frame by utilizing a target detection technology to obtain state information of the target to be detected;
judging, according to the state information, whether a predefined state transition event occurs during the tracking process;
if it is determined that a predefined state transition event has occurred, marking the corresponding target video frame;
and extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
Optionally, the state information includes a match count, a matching state, and a tracking state; the matching state indicates whether the target to be detected in a video frame is associated with a target detection frame, and the tracking state indicates the confirmation status of the target to be detected; the tracking state includes confirmed, unconfirmed, and deleted, and the matching state includes matched and unmatched.
Optionally, the state transition event includes:
event one: in the first time period, the second time period, and the third time period, the matching state undergoes a "matched-unmatched-re-matched" transition process, and the tracking state is converted from unconfirmed to confirmed;
optionally, the state transition event includes:
event two: in the first time period, the second time period, and the third time period, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted from unconfirmed to confirmed and then to deleted;
optionally, the state transition event includes:
event three: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted directly from unconfirmed to deleted.
Optionally, the tracking the target to be detected in each frame by using the target detection technology includes:
tracking the target to be detected in each frame by utilizing a target detection frame;
the target detection frame is obtained by the following steps:
and inputting the video data to be sampled into a deep learning model to be optimized, so as to carry out logic operation on each frame in the video data, and obtaining a target detection frame for identifying the target to be detected.
Optionally, after it is determined that a predefined state transition event has occurred, the method further comprises:
analyzing the anomaly type of the state transition event.
Optionally, the analyzing the anomaly type of the state transition event includes:
if the state transition event is event one, determining that the anomaly type is missed detection.
Optionally, the analyzing the anomaly type of the state transition event includes:
if the state transition event is event two, counting the historical match count of the target to be detected;
if the historical match count is smaller than a preset threshold, determining that the anomaly type is false detection;
and if the historical match count is larger than the preset threshold, determining that the anomaly type is missed detection.
Optionally, the analyzing the anomaly type of the state transition event includes:
if the state transition event is event three, determining that the anomaly type is false detection.
Optionally, if the state transition event is event one, the target to be detected does not overlap with other targets, and the target to be detected does not move beyond the picture boundary, it is determined that the anomaly type is missed detection.
Optionally, whether the target to be detected overlaps with other targets is determined as follows:
for two targets A and B in the same frame, the overlap ratio IoM is calculated by the following equation:

IoM = Intersection(A, B) / min(S_A, S_B)

wherein Intersection(A, B) represents the overlapping area of targets A and B, and S_A and S_B represent the areas of targets A and B, respectively;
if IoM is greater than a preset threshold, it is determined that target A and target B overlap; otherwise, target A and target B do not overlap.
Optionally, extracting the marked target video frame to obtain sampling data for optimizing the deep learning model includes the following two ways:
mode one: directly extracting the marked target video frame;
mode two: in the process of identifying a target detection frame and tracking a target to be detected, caching a preset number of adjacent frame sequences;
After marking the corresponding target video frame, the marked target video frame is extracted from the buffered sequence of adjacent frames.
Optionally, when the video data to be sampled is offline data, the first mode or the second mode is adopted to perform target video frame extraction, and when the video data to be sampled is online video, the second mode is adopted to perform target video frame extraction.
A second aspect of the present application provides a deep learning training data sampling system comprising:
a determining unit for determining a target to be detected;
the tracking unit is used for tracking the target to be detected in each frame by utilizing a target detection technology to obtain the state information of the target to be detected;
the judging unit is used for judging, according to the state information, whether a predefined state transition event occurs during the tracking process;
the marking unit is used for marking the corresponding target video frame if it is determined that a predefined state transition event has occurred;
and the frame extraction unit is used for extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
A third aspect of the present application provides a deep learning training data sampling apparatus, the apparatus comprising:
A processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program that the processor invokes to perform the method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having a program stored thereon which, when run on a computer, performs the method of the first aspect or any optional implementation of the first aspect.
From the above technical solutions, the application has the following advantages:
1. Accurate abnormal-frame sampling: through the combined use of the deep learning model and the target detection frame, the method can accurately identify the video frames in which a state transition event of the target to be detected occurs. This means that frames are marked and extracted only when the target state changes or becomes abnormal, avoiding over-sampling of redundant data.
2. Adaptivity: state transition events are determined from the state information, which makes the sampling process adaptive. If the state of the target remains stable in most frames and only a few frames undergo a state change, then only those few frames will be marked and extracted, reducing the amount of sampled data.
3. Resource savings: since marking and sampling are performed only when a state transition event occurs, a large amount of computing resources and storage space can be saved, which is more efficient than conventional continuous sampling methods.
4. Targeted model optimization: the sampled data are of very high quality because they are produced exactly when critical state transition events occur. These data can be used to purposefully optimize the deep learning model, addressing missed detections, false detections, or other performance problems under specific conditions, and improving the robustness and performance of the model.
5. Wide applicability: the method is suitable not only for target detection models but also for other deep learning tasks such as human body pose estimation and image segmentation; only the definition of the target detection frame and the judgment rules for state transition events need to be adjusted appropriately.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of one embodiment of a method provided in the present application;
FIG. 2 is a schematic diagram of a target detection frame obtained when a target detection model is used in the present application;
FIG. 3 is a schematic diagram of a target detection frame obtained when a human body posture estimation model is used in the present application;
FIG. 4 is a schematic diagram of a target detection frame obtained when an image segmentation model is used in the present application;
FIG. 5 is a schematic diagram of a target detection frame obtained by AABB bounding boxes when a human body posture estimation model is used in the application;
FIG. 6 is a schematic diagram of a target detection frame obtained through a target mask when an image segmentation model is used in the present application;
FIG. 7 is a flow chart of another embodiment of the method provided in the present application;
FIGS. 8 to 11 are diagrams of the matching state changes corresponding to cases A, B, C, and D, respectively;
FIG. 12 is a block diagram of one embodiment of a system provided herein;
FIG. 13 is a block diagram of one embodiment of the device provided in the present application.
Detailed Description
It should be noted that the method provided in the present application may be applied to a terminal or a system, and may also be applied to a server. For example, the terminal may be a mobile terminal such as a smartphone, a tablet computer, a smart television, a smart watch, or a portable computer, or a fixed terminal such as a desktop computer. For convenience of explanation, the terminal is taken as the execution body in this application.
In the deep learning model optimization stage, existing video sampling/frame extraction techniques are not targeted: they either produce a large amount of redundant data while losing the more informative data, so that retraining yields only limited improvement in model performance and wastes time and labor, or they depend excessively on manual work and are extremely inefficient. Therefore, the present invention provides a deep learning training data sampling method that analyzes the detection state of the model based on target tracking results, extracts informative video frames to the greatest extent, and reduces redundant data, thereby helping to improve model performance and reducing the labeling effort caused by redundant data.
The embodiments in the present application are described in detail below:
Referring to FIG. 1, the present application first provides an embodiment of deep learning training data sampling. The embodiment includes:
s101, determining a target to be detected;
In this application, a model optimization goal needs to be determined; that is, the intention of data sampling must be clarified: to focus on optimizing the missed detection/judgment problem, to focus on optimizing the false detection/judgment problem, or to optimize both simultaneously. In addition, this also involves deciding whether to retain or exclude scenes such as targets occluding one another and targets moving out of the picture.
In this step, it is first necessary to explicitly define the deep learning model to be optimized and the object or target that the model is expected to detect. The task may be target detection, human body pose estimation, image segmentation, or the like. It is necessary to ensure that the model to be optimized has already been trained or provided, and that the target expected to be detected is well defined.
In this step, the target detection model is used to detect and identify the position of a target in each video frame image and to output a target frame containing the target position. It should be noted that the target detection model is a trained model whose detection performance is already high, so the detection results are relatively accurate under normal conditions. However, the detection accuracy of the target detection model cannot reach one hundred percent, and missed detection and/or false detection may occasionally occur in individual frames or time periods (for example, a human target is occasionally missed when the human body posture changes, which shows up as target-frame flickering). A video stream, which may be a local video file or a publicly available video dataset and typically contains video segments of different scenes and targets, is acquired together with the target detection model to be optimized. After the video stream is acquired, it is decomposed into a sequence of video frame images.
It should be noted that, besides human body detection models, the method is also applicable to human body pose estimation, image segmentation, and image recognition/classification models. Step S102 below describes the specific implementation procedure.
S102, tracking a target to be detected in each frame by utilizing a target detection technology to obtain state information of the target to be detected;
In this step, target detection technology is used to track the target to be detected. The specific tracking method may be implemented in various ways, which this application does not specifically limit; for example, the target detection frame may be used to track the target.
When tracking the target to be detected in each frame using target detection frames, a detection-based multi-target tracking algorithm such as SORT, DeepSORT, or ByteTrack may be employed. The tracking process provides state information for each target, such as whether consecutive matches were successful and whether a matching state transition occurred.
In this embodiment, the state information may include the match count, the tracking state, the matching state, and so on.
In this embodiment, target tracking is performed based on the obtained target frames, and the matching state of each tracked object is monitored during tracking. When the matching state changes, the detection anomaly is analyzed according to the type of the matching state transition and other key tracking information, so as to determine whether the detection result is a missed detection, a false detection, or normal behavior. For situations determined to be missed or false detections, the time period in which the anomaly occurs is identified and the corresponding video frame IDs are recorded.
Specifically, target tracking may employ a tracking-by-detection algorithm such as SORT, DeepSORT, or ByteTrack. Preferably, DeepSORT or an improved variant of it is employed.
S103, judging, according to the state information, whether a predefined state transition event occurs during tracking;
In this step, whether a predefined state transition event occurs during tracking is judged according to the obtained target tracking state information. These events may be abrupt changes in the target state, e.g., from normal to abnormal or vice versa, which are the critical points of concern.
To achieve this, the state transition events of interest first need to be explicitly defined; they depend on the task and the targets. For example, for target detection, a state transition event may include a target going from normally detected to lost, or vice versa.
During object tracking, status information of each tracked object needs to be continuously monitored. Such information may include tracking status, matching status, historical number of matches, etc., which may be provided by the tracking algorithm.
In the present application, the relationship among the tracking state, the matching state, and the historical match count is as follows:
For any tracked target, the initial state in the first frame image of the video data is the unconfirmed state. If, during subsequent tracking, the target is continuously matched successfully, i.e., the historical match count reaches a certain threshold, the tracking state transition is: unconfirmed state to confirmed state. This case means that the target detection frame is continuously and correctly detected and the target is continuously tracked, i.e., the tracked target is in good condition without any anomaly.
If the target fails to match in its first several consecutive video frames, the tracking state is set to the unconfirmed state.
Whether these events occur is determined by programmed logic based on the predefined state transition events. This can be done by comparing the current state information with the previous state information to detect abrupt changes in state. The following is some example decision logic:
if the tracking state of a tracked object changes from "unconfirmed" to "confirmed", normal detection may be indicated;
if a sudden transition occurs in the matching state of a tracked object, such as a transition from "matched" to "unmatched", a missed detection or false detection may be indicated.
Once a state transition event meeting the predefined criteria is detected, the associated video frames can be marked or recorded. These frames contain the occurrences of state transition events, so they can be used in subsequent steps.
If the goal is to improve the performance of the deep learning model, the marked frames can be used as feedback data for retraining or optimizing the model. This can help the model better cope with state transition events, improving detection accuracy and robustness.
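A minimal sketch of such decision logic follows. The `TrackSnapshot` structure, its field names, and the event labels are illustrative assumptions rather than the interface of any particular tracking library; the point is simply to compare each track's current state information with its previous state information:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TrackSnapshot:
    tracking_state: str  # "unconfirmed", "confirmed", or "deleted"
    matched: bool        # whether the track was associated with a detection frame

def detect_transition_events(prev: Dict[int, TrackSnapshot],
                             curr: Dict[int, TrackSnapshot]) -> List[Tuple[int, str]]:
    """Compare per-track snapshots of two consecutive frames and report abrupt changes."""
    events = []
    for track_id, now in curr.items():
        before = prev.get(track_id)
        if before is None:
            continue  # newly created track: nothing to compare against yet
        if before.tracking_state == "unconfirmed" and now.tracking_state == "confirmed":
            events.append((track_id, "confirmed"))    # normal detection
        if before.matched and not now.matched:
            events.append((track_id, "match_lost"))   # possible missed/false detection
        if before.tracking_state != "deleted" and now.tracking_state == "deleted":
            events.append((track_id, "deleted"))      # candidate for event two/three
    return events
```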
S104, if it is determined that a predefined state transition event has occurred, marking the corresponding target video frame;
If it is determined during tracking that a predefined state transition event has occurred, the corresponding target video frame is marked. This makes it known which frames contain state transition events, i.e., which frames may contain anomalies or other critical situations of the target to be detected.
S105, extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
Finally, the marked target video frames are extracted to obtain sampling data for optimizing the deep learning model. These data will be used for model retraining or optimization, to improve the performance of the model in specific situations, for example improving detection robustness and reducing false or missed detections.
The application also provides two different video frame extraction modes, suited to offline and online video respectively, which are described below:
Mode one: directly extracting the marked target video frame;
mode two: in the process of identifying a target detection frame and tracking a target to be detected, caching a preset number of adjacent frame sequences;
after marking the corresponding target video frame, the marked target video frame is extracted from the buffered sequence of adjacent frames.
When the video data to be sampled is an offline video, the target video frame may be extracted in mode one or mode two; when the video data to be sampled is an online video, the target video frame is extracted in mode two.
In this embodiment:
Offline extraction mode: after the marking of all video frames is completed, the corresponding video frames are extracted in one pass according to the marking results.
Online extraction mode: during target detection and video frame marking, a certain number of historical video frames adjacent to the current frame are cached; when a frame is judged abnormal, the corresponding video frame is extracted from the historical frame cache according to the mark.
Offline video may use either the offline or the online extraction mode, while online video must use the online extraction mode.
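For the online extraction mode, a bounded cache of recent frames suffices. The sketch below (class name, capacity, and ID scheme are illustrative assumptions) uses a deque so that, when a frame is judged abnormal, the marked historical frames can still be fetched before they are evicted:

```python
from collections import deque

class OnlineFrameCache:
    """Keep the most recent (frame_id, frame) pairs for online frame extraction."""

    def __init__(self, capacity: int = 100):
        self._buffer = deque(maxlen=capacity)  # oldest frames are evicted automatically

    def push(self, frame_id: int, frame) -> None:
        self._buffer.append((frame_id, frame))

    def extract(self, marked_ids) -> list:
        """Return the cached frames whose IDs have been marked as abnormal."""
        wanted = set(marked_ids)
        return [(fid, f) for fid, f in self._buffer if fid in wanted]
```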
Further, after video frame extraction is completed, frame-by-frame inference is performed on the extracted frames using the model to be optimized, and the small number of invalid frames is manually screened out.
This embodiment enables efficient sampling of key data for optimizing deep learning models. The advantage of this approach is that it combines target detection, tracking, and the determination of state transition events to obtain high-quality training samples, thereby improving the performance of the model.
In the foregoing embodiments, the state of the object to be detected is identified by the object detection frame, and the present application provides an embodiment of obtaining the object detection frame, which is described below:
the method comprises the following steps: inputting video data to be sampled into the deep learning model to be optimized, and performing logic operation on each frame in the video data to obtain a target detection frame for identifying the target to be detected;
the following describes the acquisition mode in detail:
In this embodiment, the video data to be sampled is input into the deep learning model to be optimized. For each frame of the video data, the deep learning model performs logical operations, such as target detection, to identify the target to be detected in that frame and generate a target detection frame. These frames represent the locations and boundaries of targets in the video frame. This step thus produces the target detection frame used to identify the target.
In practical applications, the model may also be one for human body pose estimation or image segmentation. Specifically, if the inference result of the model itself includes the target frame, the target frame is obtained directly. If the inference result of the model consists only of human body keypoints, the target frame can be obtained by computing over the keypoints. Further, if the inference result of the model is a binary-image target mask, the target frame can be obtained from the mask; specifically, the corresponding target frame is obtained by computing the AABB bounding box of the target mask.
For a target detection model, the target frame is obtained directly after inference, see FIG. 2.
For human body pose estimation or image segmentation models, the target detection frame can be obtained directly or indirectly after inference. Specifically, if the pose estimation/image segmentation model's inference result itself includes a target detection frame, referring to FIG. 3 and FIG. 4, the target detection frame is obtained directly;
if the pose estimation model's inference result consists only of human body keypoints, the target detection frame can be obtained by computing the AABB bounding box of the keypoints, see FIG. 5. If the image segmentation model's inference result is only the target mask, the target detection frame can be obtained by computing the AABB bounding box of the mask, see FIG. 6.
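Both conversions described above reduce to computing an axis-aligned bounding box (AABB). A minimal numpy sketch, under the assumption that keypoints arrive as an (N, 2) array of (x, y) coordinates and the mask as a binary (H, W) image:

```python
import numpy as np

def aabb_from_keypoints(keypoints: np.ndarray) -> tuple:
    """(N, 2) array of (x, y) pose keypoints -> (x_min, y_min, x_max, y_max)."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)

def aabb_from_mask(mask: np.ndarray) -> tuple:
    """(H, W) binary target mask -> AABB of its nonzero pixels."""
    ys, xs = np.nonzero(mask)  # row (y) and column (x) indices of mask pixels
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())
```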
For step S104 and step S105 in the above embodiments, the present application further provides a more specific embodiment, which is described below:
referring to fig. 7, this embodiment includes:
s201, determining a deep learning model to be optimized and a target to be detected;
s202, inputting video data to be sampled into the deep learning model to be optimized, and performing logic operation on each frame in the video data to obtain a target detection frame for identifying the target to be detected;
S203, tracking a target to be detected in each frame by utilizing a target detection frame to obtain state information of the target to be detected;
S204, judging, according to the state information, whether a predefined state transition event occurs during tracking;
S205, analyzing the anomaly type of the state transition event;
as mentioned in the previous embodiments, there is a need to explicitly define state transition events of interest, which are related to tasks and targets. For example, for target detection, a state transition event may include a target from normal detection to loss, or vice versa. During object tracking, status information of each tracked object needs to be continuously monitored. Such information may include tracking status, matching status, historical number of matches, etc., which may be provided by the tracking algorithm. Whether these events occur is determined by programming logic based on predefined state transition events. This can be done by comparing the current state information with the previous state information to detect abrupt changes in state.
This embodiment provides specific decision rules; a concrete example of this step is given below:
In a specific embodiment, the state information includes a match count, a matching state, and a tracking state; the matching state indicates whether the target to be detected in a video frame is associated with a target detection frame, and the tracking state indicates the confirmation status of the target to be detected; the tracking state includes confirmed, unconfirmed, and deleted, and the matching state includes matched and unmatched. The state transition events include:
event one: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-re-matched" transition process, and the tracking state is converted from unconfirmed to confirmed;
event two: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted from unconfirmed to confirmed and then to deleted;
event three: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted directly from unconfirmed to deleted.
In this embodiment, each target to be detected has a tracking state: unconfirmed (tentative), confirmed (confirmed), or deleted (deleted). The initial tracking state when a tracked object is created is tentative; if the number of consecutive successful matches exceeds a threshold (denoted n_init, generally 3), the tracked object is converted to confirmed. In the tentative state, the tracked object is set to deleted as soon as it fails to match for even 1 frame. In the confirmed state, if the number of consecutively unmatched frames (n_miss) exceeds a set threshold (max_miss, generally larger than n_init), the tracked object is set to deleted; otherwise tracking continues. In addition, the number of consecutively unmatched frames n_miss of a tracked object represents the matching state: if n_miss > 0, the tracked object is currently unmatched.
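The lifecycle rules above amount to a small state machine. The following sketch assumes DeepSORT-like semantics with n_init = 3 and an illustrative max_miss value; it is a simplified illustration of the rules just described, not the tracker's actual implementation:

```python
N_INIT = 3     # consecutive successful matches needed to confirm a track
MAX_MISS = 30  # consecutive misses tolerated by a confirmed track (MAX_MISS > N_INIT)

class Track:
    def __init__(self):
        self.state = "tentative"  # tentative -> confirmed -> deleted
        self.hits = 0             # historical match count
        self.n_miss = 0           # consecutive unmatched frames

    def update(self, matched: bool) -> str:
        if matched:
            self.hits += 1
            self.n_miss = 0
            if self.state == "tentative" and self.hits >= N_INIT:
                self.state = "confirmed"
        else:
            self.n_miss += 1
            if self.state == "tentative":
                self.state = "deleted"   # a tentative track is deleted after 1 miss
            elif self.state == "confirmed" and self.n_miss > MAX_MISS:
                self.state = "deleted"
        return self.state
```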
Then, in the present embodiment, the state transition process of the target to be detected is defined as the following cases:
Case A: as shown in FIG. 8, the tracked object is continuously matched successfully (the matching state never changes), and the tracking state changes as follows: tentative→confirmed. This is the ideal case, in which the target is continuously detected and continuously tracked.
Case B: as shown in FIG. 9, the tracked object's matching state transition process is "matched-unmatched-re-matched" (the corresponding time periods are denoted I, II, III), and the tracking state transition is: tentative→confirmed.
Case C: as shown in FIG. 10, the tracked object's matching state transition process is "matched-unmatched-deleted" (the corresponding time periods are denoted I, II), and the tracking state transition is: tentative→confirmed→deleted.
Case D: as shown in FIG. 11, the tracked object's matching state transition process is "matched-unmatched-deleted" (the corresponding time periods are denoted I, II), and the tracking state transition is: tentative→deleted.
It should be noted that although the matching state sequences of cases C and D are the same, their tracking state sequences differ, which indicates that the underlying causes of the two cases are different. If the historical match count of a target to be detected is small and the target is then deleted after failing to match (unstable tracking), it can be presumed with high probability that the target does not actually exist, and the corresponding detection is a false detection; conversely, if the historical match count of a tracked object is large (stable tracking), the target is presumed to exist. That is, in case C, a certain number of successfully matched frames actually occurred before the matching failures.
It can be seen that the matching state transition process "matched-unmatched-re-matched" corresponds to event one, and the matching state transition process "matched-unmatched-deleted" corresponds to events two and three. Case A is the ideal state: it involves no missed or false detection and contains no video frames that need to be sampled, so it falls outside the scope of the discussion of the present application's scheme.
Based on the above state transition processes, the judgment logic provided in this embodiment is to analyze the anomaly type of the state transition event: if the state transition event is event one, the anomaly type is determined to be missed detection.
Further, according to the optimization objective, if situations such as occlusion between targets and targets moving out of the picture are to be excluded, the judgment logic here may be adjusted to: if event one occurs, the target to be detected does not overlap other targets, and the target does not move beyond the edge of the picture, the target is considered to have been missed during period II, and the frame IDs corresponding to period II are recorded. The benefit of this is that the large number of invalid frames that would otherwise be extracted when targets are occluded/overlapping or have moved out of the picture can be screened out, significantly reducing redundancy. Of course, if the model optimization targets are detection problems in such critical states (occlusion/overlap between targets, targets moving out of the picture, and so on), these frames can instead be retained.
If the state transition event is event two, the historical match count of the target to be detected is examined; if the historical match count is smaller than a preset threshold, the anomaly type is determined to be false detection;
and if the historical match count is larger than the preset threshold, the anomaly type is determined to be missed detection.
That is, for event two, the historical match count (hits) of the target to be detected needs to be further confirmed:
if hits is small (less than a set threshold, such as 10), the tracked object is considered a false detection during period I, its not being detected during period II is considered normal behavior, and the frame IDs corresponding to period I are recorded;
otherwise (hits is greater than the set threshold), the tracked object is considered to be normally detected during period I and missed during period II, and the frame IDs corresponding to period II are recorded.
Further, according to the optimization requirement, one can choose to retain or screen out situations such as targets occluding one another and targets moving out of the picture.
The magnitude of the historical match count (hits) affects the discrimination result, on the following basis: for a newly appearing target to be detected, it cannot immediately be judged whether the corresponding target really exists (if it does, the detection is normal; if it does not, the detection is a false detection). However, the problem to be solved by the present invention is this: the current model to be optimized behaves normally in most cases, and false or missed detections occur only occasionally; for example, human targets are occasionally missed when the human body posture changes, showing up as detection-frame flickering. On this premise, it can be presumed that if a tracked object has a small historical match count and is deleted after failing to match, the target most likely does not exist and the corresponding detection is a false detection; conversely, if a tracked object has a large historical match count, the target is presumed to exist.
If the state transition event is event three, the anomaly type is determined to be false detection.
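Combining the three events, the classification logic of this embodiment can be sketched as follows (the threshold of 10 follows the example above; the function and label names are illustrative assumptions):

```python
HITS_THRESHOLD = 10  # example threshold for the historical match count (hits)

def classify_anomaly(event: str, hits: int = 0) -> str:
    """Map a state transition event to an anomaly type.

    event_one:   matched-unmatched-re-matched, unconfirmed -> confirmed
    event_two:   matched-unmatched-deleted,    unconfirmed -> confirmed -> deleted
    event_three: matched-unmatched-deleted,    unconfirmed -> deleted
    """
    if event == "event_one":
        return "missed_detection"  # the target was temporarily lost in period II
    if event == "event_two":
        # a stable track that disappears was probably real (missed detection);
        # a short-lived track probably never existed (false detection)
        return "false_detection" if hits < HITS_THRESHOLD else "missed_detection"
    if event == "event_three":
        return "false_detection"   # never confirmed, deleted immediately
    return "normal"
```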
More specifically, for the cases in this embodiment where it must be determined whether two targets overlap, an implementation is provided below:
for two targets A and B in the same frame, the overlap ratio IoM is calculated by the following equation:

IoM = Intersection(A, B) / min(S_A, S_B)

wherein Intersection(A, B) represents the overlapping area of targets A and B, and S_A and S_B represent the areas of targets A and B, respectively;
if IoM is greater than a preset threshold, it is determined that target A and target B overlap; otherwise, target A and target B do not overlap.
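A direct transcription of this overlap test, assuming boxes in (x_min, y_min, x_max, y_max) form and taking IoM as the intersection area over the smaller of the two areas, which matches the symbols defined above (the 0.5 threshold is only an example):

```python
def iom(box_a, box_b) -> float:
    """Intersection over the Minimum area of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # height of the intersection
    s_a = (ax2 - ax1) * (ay2 - ay1)
    s_b = (bx2 - bx1) * (by2 - by1)
    smaller = min(s_a, s_b)
    return (iw * ih) / smaller if smaller > 0 else 0.0

def targets_overlap(box_a, box_b, threshold: float = 0.5) -> bool:
    return iom(box_a, box_b) > threshold
```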
S206, marking the corresponding target video frames;
s207, extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
The steps of this embodiment that are not described in detail are similar to those of the previous embodiment and are not repeated here.
In one embodiment of the present application, step S205 optionally and specifically includes: for any target to be detected, in the case that the tracking state transition data shows a transition from the unconfirmed state to the confirmed state, judging whether the matching state contains a matching failure; if the matching state contains a matching failure, acquiring the overlap status and the target frame position of the target to be detected in the video frame image where the matching failed; judging, according to the overlap status and the target frame position, whether target overlap occurred or the target moved beyond the picture; if no target overlap occurred and no target moved beyond the picture, determining the anomaly type to be a model missed detection; and generating an abnormal frame image set from the video frame images where the matching failed in the matching state transition data.
In this embodiment, when the tracking state transition data is "unconfirmed state (initial tracking state) to confirmed state", the confirmed state covers two cases. In the first, the target to be detected is continuously tracked successfully until tracking completes; this is the ideal state. In the second, the matching of the target fails during some time period while the other periods match successfully; however, the number of failed frames is small and does not reach the deletion criterion, so the tracking state does not change. In this second case a tracking anomaly has occurred. Therefore, for this kind of tracking transition, it is judged whether the matching state transition data contains a failed matching result; if it does, a tracking anomaly is deemed to have occurred. At this point, situations such as occlusion between targets and targets moving out of the picture need to be considered; that is, the overlap status and the target frame position of the target at the point of matching failure are acquired. It should be noted that the video frame image with the matching failure is the first video frame in the matching state where matching failed. The overlap status is used to judge whether the target to be detected overlaps other targets in the video frame, and the target frame position is used to judge whether the target has moved out of the picture. If the target neither overlaps other targets nor exceeds the edge of the picture, the tracking anomaly is considered to be caused by a model detection anomaly.
Further, if, in the video frames where matching failed, the targets overlap one another or the target has moved out of the picture, the model is behaving normally. That is, the tracking anomaly of the target to be detected was not caused by the target detection model, which means that these tracking-anomaly video frames are redundant data for the target detection model; even if the model were optimized with such abnormal frames, there would be no performance improvement, so the target detection model does not need to be optimized on the basis of this tracking anomaly.
Further, when it is determined that the tracking anomaly is caused by a detection anomaly of the target detection model, the model needs to be optimized based on the abnormal frames in which the anomaly occurs. The anomaly may be caused by a model missed detection or a model false detection; to improve the optimization effect, the anomalies need to be classified by type based on the circumstances, and the model then optimized in a targeted manner according to the different anomaly types.
Specifically, when a model detection anomaly is confirmed and the tracking state transition data is "unconfirmed state to confirmed state", there is a missed detection by the model within the confirmed state, i.e., the anomaly type is a model missed detection. In this case, it can be confirmed that the confirmed state contains a small number of video frames with consecutive matching failures, and this portion of the video frames constitutes the abnormal frames missed by the model.
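A sketch of this judgment is given below; the frame margin and all names are illustrative assumptions, and `overlaps` stands for an overlap predicate such as the IoM test sketched earlier:

```python
def is_out_of_frame(box, frame_w: int, frame_h: int, margin: int = 5) -> bool:
    """True if the target box touches the picture border within `margin` pixels."""
    x1, y1, x2, y2 = box
    return x1 <= margin or y1 <= margin or x2 >= frame_w - margin or y2 >= frame_h - margin

def is_model_missed_detection(failed_box, other_boxes, frame_w, frame_h, overlaps) -> bool:
    """A matching failure counts as a model missed detection only if the target
    neither overlaps another target nor has moved out of the picture."""
    if is_out_of_frame(failed_box, frame_w, frame_h):
        return False  # target left the picture: not caused by the detection model
    if any(overlaps(failed_box, other) for other in other_boxes):
        return False  # occlusion between targets: excluded per the optimization goal
    return True
```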
The foregoing provides an embodiment of a deep learning training data sampling method in the present application, and the following describes an embodiment of a deep learning training data sampling system provided in the present application:
Referring to fig. 12, the present application provides a deep learning training data sampling system, including:
a determining unit 301 for determining an object to be detected;
the model input unit 302 is configured to input video data to be sampled into the deep learning model to be optimized, so as to perform a logic operation on each frame in the video data, and obtain a target detection frame for identifying the target to be detected;
a tracking unit 303, configured to track a target to be detected in each frame by using a target detection technology, so as to obtain state information of the target to be detected;
a judging unit 304, configured to judge, according to the state information, whether a predefined state transition event occurs during the tracking process;
a marking unit 305, configured to mark the corresponding target video frame if it is determined that a predefined state transition event has occurred;
and a frame extraction unit 306, configured to extract the marked target video frame, and obtain sampling data for optimizing the deep learning model.
Optionally, the state information includes a match count, a matching state, and a tracking state; the matching state indicates whether the target to be detected in a video frame is associated with a target detection frame, and the tracking state indicates the confirmation status of the target to be detected; the tracking state includes confirmed, unconfirmed, and deleted, and the matching state includes matched and unmatched. The state transition events include:
event one: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-re-matched" transition process, and the tracking state is converted from unconfirmed to confirmed;
event two: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted from unconfirmed to confirmed and then to deleted;
event three: in the first, second, and third time periods, the matching state undergoes a "matched-unmatched-deleted" transition process, and the tracking state is converted directly from unconfirmed to deleted.
Optionally, an anomaly analysis unit 307 is further included for analyzing the anomaly type of the state transition event.
Optionally, the anomaly analysis unit 307 is specifically configured to:
and if the state transition event is event one, determining that the anomaly type is missed detection.
Optionally, the anomaly analysis unit 307 is specifically configured to:
if the state transition event is event two, counting the historical match count of the target to be detected;
if the historical match count is smaller than a preset threshold, determining that the anomaly type is false detection;
and if the historical match count is larger than the preset threshold, determining that the anomaly type is missed detection.
The anomaly analysis unit 307 is further configured to: if the state transition event is event three, determine that the anomaly type is false detection.
Optionally, the anomaly analysis unit 307 is specifically configured to:
and if the state transition event is event one, the target to be detected does not overlap with other targets, and the target to be detected does not move beyond the picture boundary, determining that the anomaly type is missed detection.
Optionally, the anomaly analysis unit 307 is specifically configured to:
judging whether the target to be detected overlaps with another target in the following manner:
for two targets A and B in the same frame, the overlap ratio IoM is calculated by the following equation:

IoM(A, B) = Intersection(A, B) / min(S_A, S_B)

where Intersection(A, B) represents the overlapping area of targets A and B, and S_A and S_B represent the areas of targets A and B, respectively;
if IoM is greater than a preset threshold, it is determined that target A and target B overlap; otherwise, target A and target B do not overlap.
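A sketch of this overlap test in code, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a representation the patent text does not specify):

```python
def iom(box_a, box_b):
    """Intersection-over-Minimum of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)        # Intersection(A, B)
    s_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])  # S_A
    s_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])  # S_B
    return inter / min(s_a, s_b) if min(s_a, s_b) > 0 else 0.0

# targets A and B are deemed overlapping when IoM exceeds a preset threshold:
overlap = iom((0, 0, 100, 100), (50, 50, 120, 120)) > 0.5  # 2500/4900 ≈ 0.51 -> True
```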
Referring to fig. 13, the present application further provides a deep learning training data sampling device, including:
a processor 401, a memory 402, an input/output unit 403, and a bus 404;
the processor 401 is connected to the memory 402, the input/output unit 403, and the bus 404;
The memory 402 stores a program, and the processor 401 calls the program to perform any of the methods described above.
The present application also relates to a computer-readable storage medium having a program stored thereon, wherein the program, when run on a computer, causes the computer to perform any of the methods described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for deep learning training data sampling, the method comprising:
determining a target to be detected;
tracking the target to be detected in each frame by utilizing a target detection technology to obtain state information of the target to be detected;
judging, according to the state information, whether the tracking process generates a state transition event conforming to a predefined state;
if it is determined that a state transition event conforming to the predefined state occurs, marking the corresponding target video frame;
and extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
2. The deep learning training data sampling method of claim 1, wherein the state information includes a matching count, a matching state, and a tracking state, the matching state being whether the target to be detected in a video frame is associated with a target detection frame, and the tracking state being the confirmation status of the target to be detected, the tracking state including confirmed, unconfirmed, and deleted, and the matching state including matched and unmatched.
3. The deep learning training data sampling method of claim 2, wherein the state transition event comprises:
Event one: in the first, second, and third time periods, the matching state undergoes a matched-unmatched-rematched state transition process, and the tracking state changes from unconfirmed to confirmed.
4. The deep learning training data sampling method of claim 2, wherein the state transition event comprises:
Event two: in the first, second, and third time periods, the matching state undergoes a matched-unmatched-deleted state transition process, and the tracking state changes from unconfirmed to confirmed and then to deleted.
5. The deep learning training data sampling method of claim 2, wherein the state transition event comprises:
Event three: in the first, second, and third time periods, the matching state undergoes a matched-unmatched-deleted state transition process, and the tracking state changes directly from unconfirmed to deleted.
6. The deep learning training data sampling method of claim 1, wherein the tracking the object to be detected in each frame using an object detection technique comprises:
tracking the target to be detected in each frame by utilizing a target detection frame;
the target detection frame is obtained by the following steps:
inputting the video data to be sampled into the deep learning model to be optimized, so as to perform a logic operation on each frame of the video data and obtain a target detection frame identifying the target to be detected.
7. The deep learning training data sampling method of claim 1, further comprising, after it is determined that a state transition event conforming to a predefined state occurs:
and analyzing the abnormal type of the state transition event.
8. A deep learning training data sampling system, comprising:
a determining unit for determining a target to be detected;
a tracking unit for tracking the target to be detected in each frame by utilizing a target detection technology to obtain state information of the target to be detected;
a judging unit for judging, according to the state information, whether the tracking process generates a state transition event conforming to a predefined state;
a marking unit for marking the corresponding target video frame if it is determined that a state transition event conforming to the predefined state occurs;
and a frame extraction unit for extracting the marked target video frame to obtain sampling data for optimizing the deep learning model.
9. A deep learning training data sampling device, the device comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 7.
CN202311357056.8A 2023-10-18 2023-10-18 Deep learning training data sampling method, system, device and storage medium Pending CN117612052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311357056.8A CN117612052A (en) 2023-10-18 2023-10-18 Deep learning training data sampling method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN117612052A 2024-02-27

Family

ID=89946824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311357056.8A Pending CN117612052A (en) 2023-10-18 2023-10-18 Deep learning training data sampling method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117612052A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination