WO2020156361A1 - Training sample obtaining method and apparatus, electronic device and storage medium - Google Patents
Training sample obtaining method and apparatus, electronic device and storage medium
- Publication number
- WO2020156361A1 (PCT/CN2020/073396)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- frame
- scene
- feature information
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
Definitions
- the present invention relates to the field of machine learning technology, and in particular to a method, device, electronic device, and computer-readable storage medium for obtaining training samples.
- the purpose of the present invention is to provide a method, a device, an electronic device and a computer-readable storage medium for obtaining training samples to solve the problems of low efficiency and high cost of obtaining image training samples in the prior art.
- the present invention provides a method for obtaining training samples, including:
- obtaining a scene segment in a video;
- selecting a video frame containing a target object in the scene segment as an initial frame, and marking the target area where the target object is located in the initial frame;
- extracting the feature information of the target area marked in the initial frame;
- taking the initial frame as a reference, performing a feature search on the forward and/or backward video frames in the scene segment, determining the area in each searched frame whose feature information matches that of the target area, and automatically marking the area determined in each searched frame;
- extracting the image of each marked video frame in the scene segment as a training sample.
- obtaining the scene segment in the video includes:
- if the video is a single-scene video, using the video as a scene segment;
- if the video is a multi-scene video, using scene switching detection technology to divide the video into multiple scene segments.
- the scene switching detection technology includes: a pixel domain-based detection algorithm and/or a compressed domain-based detection algorithm.
- before extracting the feature information of the target area marked in the initial frame, the method further includes:
- Image preprocessing is performed on the initial frame to make the feature information of the target region in the initial frame more obvious.
- the feature information of the target area includes one or more of color features, texture features, and shape features.
- the step of performing a feature search on the forward and/or backward video frames in the scene segment includes:
- performing the feature search on the forward and/or backward video frames in the scene segment using a mean shift algorithm, a Kalman filter algorithm, or a particle filter algorithm.
- the method further includes:
- if a searched frame contains no area whose feature information matches that of the target area, acquiring target feature information, determining the area in the searched frame whose feature information matches the target feature information, and automatically marking that area;
- wherein the target feature information is: the feature information of the marked areas in a preset number of frames adjacent to the searched frame.
- the present invention also provides a training sample obtaining device, including:
- an obtaining module, configured to obtain a scene segment in the video;
- a first labeling module, configured to select a video frame containing a target object in the scene segment as an initial frame, and to label the target area where the target object is located in the initial frame;
- a first extraction module, configured to extract feature information of the target area marked in the initial frame;
- a second labeling module, configured to perform a feature search on the forward and/or backward video frames in the scene segment with the initial frame as a reference, determine the area in each searched frame whose feature information matches that of the target area, and automatically mark the area determined in each searched frame;
- a second extraction module, configured to extract the images of the marked video frames in the scene segment as training samples.
- the obtaining module obtains the scene segment in the video by:
- if the video is a single-scene video, using the video as a scene segment;
- if the video is a multi-scene video, using scene switching detection technology to divide the video into multiple scene segments.
- the scene switching detection technology includes: a pixel domain-based detection algorithm and/or a compressed domain-based detection algorithm.
- the device further includes:
- a preprocessing module, configured to perform image preprocessing on the initial frame before the first extraction module extracts the feature information of the marked target area, so that the feature information of the target area in the initial frame is more distinct.
- the feature information of the target area includes one or more of color features, texture features, and shape features.
- the second extraction module performs the feature search on the forward and/or backward video frames in the scene segment by:
- using a mean shift algorithm, a Kalman filter algorithm, or a particle filter algorithm to perform the feature search on the forward and/or backward video frames in the scene segment.
- the second extraction module is further configured to: if a searched frame contains no area whose feature information matches that of the target area, acquire target feature information, determine the area in the searched frame whose feature information matches the target feature information, and automatically mark that area;
- wherein the target feature information is: the feature information of the marked areas in a preset number of frames adjacent to the searched frame.
- the present invention also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein,
- the memory is used to store computer programs
- the processor is configured to implement the training sample obtaining method described in any one of the above when executing the computer program stored in the memory.
- the present invention also provides a computer-readable storage medium having a computer program stored in the computer-readable storage medium, and when the computer program is executed by a processor, the training sample obtaining method described in any one of the above is implemented.
- the solution provided by the present invention first annotates the initial frame in a scene segment of the video, and then uses target tracking technology to automatically annotate the other video frames in the entire scene segment, thereby obtaining a large number of annotated images to serve as training samples for a subsequent target recognition model.
- in the prior art, manual annotation is performed on a large number of individually acquired pictures, so the cost of image acquisition and annotation is relatively high.
- with the present invention, one only needs to shoot a video, so acquiring annotation material is much more convenient; a large number of automatically marked samples can then be collected from the video, which reduces the cost of sample labeling and improves the efficiency of the labeling process.
- FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention;
- FIG. 2 is a schematic structural diagram of a training sample obtaining apparatus according to an embodiment of the present invention;
- FIG. 3 is a structural block diagram of an electronic device provided by an embodiment of the present invention.
- the embodiments of the present invention provide a method, device, electronic device, and computer-readable storage medium for obtaining training samples.
- the training sample obtaining method of the embodiment of the present invention can be applied to the training sample obtaining device of the embodiment of the present invention, and the training sample obtaining device can be configured on an electronic device.
- the electronic device may be a personal computer, a mobile terminal, etc.
- the mobile terminal may be a hardware device such as a mobile phone or a tablet computer running any of various operating systems.
- FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention; please refer to FIG. 1.
- a method for obtaining training samples may include the following steps:
- S101: Obtain a scene segment in the video. A video is generally composed of one or more scene segments, and each scene segment is composed of multiple video frames.
- the video on which the present invention is based can be a single-scene video or a multi-scene video. If the video is a single-scene video, it contains only one scene segment, so the video itself can be used directly as the obtained scene segment and the subsequent processing steps are executed on it.
- if the video is a multi-scene video, scene switching detection technology can be used to divide the video into multiple scene segments. After the division, the subsequent processing steps may be applied to just one of the scene segments to obtain the images of its marked video frames as training samples, or they may be applied uniformly to every scene segment, which further increases the number of training samples obtained.
- Scene switching detection technology refers to finding the frames, and the positions of those frames, at which scene switches occur in a video.
- the obtained frame positions can be used for fast and accurate video editing or further processing, and the sequence of obtained frames can serve as a rough description of the entire video content.
- traditional video scene switching detection methods generally rely on hand-crafted features, such as computing the color-histogram similarity of adjacent frames, directly computing the frame difference, or using the degree of change of the high-frequency subband coefficients of each frame in the video scene as a feature.
- computing the high-frequency subband coefficients requires algorithms such as the three-dimensional wavelet transform.
- these techniques compute a feature value and compare it with a threshold; if the value is greater (or smaller) than the threshold, the frame is determined to be a switching frame.
- there are also adaptive-threshold algorithms built on the above techniques, such as a video scene change detection method based on adaptive thresholds, but the sliding-window size and the preset thresholds in such methods still need to be set manually.
- in the embodiment of the present invention, the scene switching detection technology can adopt a pixel-domain-based detection algorithm or a compressed-domain-based detection algorithm, and corresponding scene switching thresholds can be set for different scenes, which improves the speed and accuracy of scene switching detection.
- for the pixel-domain-based or compressed-domain-based detection algorithms, reference can be made to the prior art, and details are not repeated here.
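- for illustration only, a minimal pixel-domain scene-cut detector of the kind described above can be sketched as follows, assuming an OpenCV environment; the Bhattacharyya-distance threshold of 0.5 and the histogram bin counts are illustrative choices, not values specified by the present disclosure.

```python
# A minimal sketch of a pixel-domain scene-cut detector (assumes OpenCV).
import cv2

def split_into_scene_segments(video_path, threshold=0.5):
    """Split a video into scene segments by comparing HSV histograms of
    consecutive frames; a large histogram distance marks a scene cut."""
    cap = cv2.VideoCapture(video_path)
    segments, current, prev_hist = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance in [0, 1]; near 1 means very dissimilar.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:  # scene switch detected at this frame
                segments.append(current)
                current = []
        current.append(frame)
        prev_hist = hist
    if current:
        segments.append(current)
    cap.release()
    return segments
```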
- S102: Select a video frame containing a target object in the scene segment as an initial frame, and mark the target area where the target object is located in the initial frame.
- the target object may be an object of interest.
- the scene segment can be examined frame by frame, and a video frame containing the target object can be selected as the initial frame for labeling.
- generally, the first frame in which the target object appears can be selected as the initial frame; if the features of the target object are not distinct in that frame, a subsequent video frame in which the target object's features are more distinct can be chosen instead. The requirements for this step are not strict: any reasonably good video frame can serve as the initial frame.
- the purpose of this step is to mark the target area where the target object is located so that the feature information of the target area can be extracted, allowing the subsequent processing to automatically mark, via feature search, the matching areas in the preceding or following video frames.
- before the feature information is extracted, image preprocessing such as image denoising and contrast enhancement can be performed on the initial frame to make the feature information of the target area more distinct.
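- as an illustration of such preprocessing, the following sketch applies denoising and contrast enhancement with OpenCV; the specific choices (non-local-means denoising, CLAHE on the luminance channel) are assumptions for the example, not steps mandated by the disclosure.

```python
# A sketch of denoising + contrast enhancement preprocessing (assumes OpenCV).
import cv2

def preprocess_frame(frame):
    # Remove sensor noise while preserving edges.
    denoised = cv2.fastNlMeansDenoisingColored(frame, None, 10, 10, 7, 21)
    # Enhance contrast on the L channel only, so colors are not distorted.
    lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```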
- S103: Extract the feature information of the target area marked in the initial frame. The feature information of the target area may include one or more of color features, texture features, and shape features.
- the color feature is a global feature that describes the surface properties of the scene corresponding to an image or image area. Color features are generally defined on individual pixels, so every pixel belonging to the image or image area contributes to them. The color histogram is the most commonly used way to express color features: it simply describes the global distribution of colors in an image, that is, the proportion of each color in the entire image. It is especially suited to images that are difficult to segment automatically and to images where the spatial position of objects need not be considered; it is unaffected by image rotation and translation, and, after normalization, also unaffected by changes in image scale. The most commonly used color spaces are the RGB color space and the HSV color space. Color-histogram feature matching methods include: the histogram intersection method, the distance method, the center-distance method, the reference color table method, and the cumulative color histogram method.
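- a minimal sketch of the histogram intersection method named above, assuming OpenCV; the HSV bin counts are illustrative choices.

```python
# Color-histogram features and histogram-intersection matching (assumes OpenCV).
import cv2

def color_histogram(region_bgr, bins=(50, 60)):
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    cv2.normalize(hist, hist, 1.0, 0.0, cv2.NORM_L1)  # make bins sum to 1
    return hist

def histogram_intersection(h1, h2):
    # 1.0 for identical normalized histograms, toward 0.0 for disjoint ones.
    return cv2.compareHist(h1, h2, cv2.HISTCMP_INTERSECT)
```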
- the texture feature is also a global feature; it likewise describes the surface properties of the scene corresponding to an image or image area.
- since texture is only a characteristic of an object's surface and cannot fully reflect the object's essential attributes, high-level image content cannot be obtained from texture features alone.
- unlike color features, texture features are not pixel-based: they require statistical computation over a region containing multiple pixels. In pattern matching, this regional character is a significant advantage, as matching does not fail because of local deviations.
- texture features often have rotation invariance and strong robustness to noise.
- methods for extracting texture features include statistical methods, geometric methods, model-based methods, and signal-processing methods.
- the typical representative of the statistical methods is a texture feature analysis method called the gray-level co-occurrence matrix.
- based on studies of the various statistical features of the co-occurrence matrix, Gotlieb and Kreyszig obtained, through experiments, four key gray-level co-occurrence matrix features: energy, inertia, entropy, and correlation.
- another typical statistical method extracts texture features from the image's autocorrelation function (that is, the image's energy spectrum function): characteristic parameters such as texture coarseness and directionality are computed from the energy spectrum function.
- the geometric methods are texture feature analysis methods based on the theory of texture primitives (basic texture elements).
- texture primitive theory holds that a complex texture can be composed of a number of simple texture primitives arranged in some regular form.
- typical geometric methods include the Voronoi checkerboard feature method and the structural method.
- the model-based methods build on a structural model of the image and use the model's parameters as texture features.
- typical methods are random-field model methods, such as the Markov random field (MRF) model method and the Gibbs random field model method.
- the extraction and matching of texture features mainly involve: the gray-level co-occurrence matrix, Tamura texture features, the autoregressive texture model, the wavelet transform, etc.
- the feature extraction and matching of gray-level co-occurrence matrix mainly rely on four parameters: energy, inertia, entropy and correlation.
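- as a concrete illustration of those four parameters, a NumPy sketch follows; the quantization to 16 gray levels and the single horizontal pixel offset are illustrative assumptions.

```python
# Gray-level co-occurrence matrix features: energy, inertia (contrast),
# entropy, and correlation, for one horizontal offset (assumes NumPy).
import numpy as np

def glcm_features(gray, levels=16):
    # Quantize to a few gray levels to keep the co-occurrence matrix dense.
    q = (gray.astype(np.float64) / 256.0 * levels).astype(np.int64)
    # Count co-occurring pairs (i, j) one pixel step to the right.
    glcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1.0)
    p = glcm / glcm.sum()
    i, j = np.indices((levels, levels))
    energy = np.sum(p ** 2)
    inertia = np.sum(p * (i - j) ** 2)  # also called contrast
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum(p * (i - mu_i) ** 2))
    sd_j = np.sqrt(np.sum(p * (j - mu_j) ** 2))
    correlation = np.sum(p * (i - mu_i) * (j - mu_j)) / (sd_i * sd_j)
    return energy, inertia, entropy, correlation
```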
- Tamura texture features, based on psychological studies of human visual perception of texture, comprise six attributes: coarseness, contrast, directionality, line-likeness, regularity, and roughness.
- the autoregressive texture model (simultaneous auto-regressive, SAR) is an application instance of the Markov random field (MRF) model.
- an advantage of shape features is that retrieval methods based on them can make effective use of the target of interest in the image.
- there are two types of representation methods for shape features: contour features and regional features.
- the contour features of an image concern the outer boundary of an object, while the regional features relate to the entire shape area.
- Boundary feature method: this method obtains the shape parameters of the image by describing boundary features.
- the Hough transform method for detecting parallel lines and the boundary direction histogram method are classic examples.
- the Hough transform uses the global characteristics of an image to connect edge pixels into a closed region boundary; its basic idea is point-line duality. The boundary direction histogram method first differentiates the image to obtain its edges, then builds a histogram of edge magnitude and direction, typically by constructing an image gray-gradient direction matrix.
- the basic idea of the Fourier shape descriptor method is to use the Fourier transform of the object boundary as the shape description, exploiting the closedness and periodicity of the region boundary to reduce a two-dimensional problem to a one-dimensional one.
- Three shape expressions are derived from boundary points, which are curvature function, centroid distance, and complex coordinate function.
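- a minimal sketch of the centroid-distance form of the Fourier shape descriptor follows, assuming NumPy; the number of retained coefficients is an illustrative choice.

```python
# Fourier shape descriptor from the centroid-distance signal (assumes NumPy).
import numpy as np

def fourier_shape_descriptor(boundary_xy, n_coeffs=16):
    pts = np.asarray(boundary_xy, dtype=np.float64)  # (N, 2) closed boundary
    centroid = pts.mean(axis=0)
    r = np.linalg.norm(pts - centroid, axis=1)       # 1-D centroid distance
    spectrum = np.abs(np.fft.fft(r))
    # Dividing by the DC term makes the descriptor scale-invariant; keeping
    # magnitudes only makes it invariant to the boundary starting point.
    return spectrum[1:n_coeffs + 1] / spectrum[0]
```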
- the geometric parameter method is a simpler regional-feature description method used for shape expression and matching.
- for example, the shape factor method uses quantitative measures of shape (such as moments, area, and perimeter) as descriptors.
- in QBIC, a content-based image retrieval system, geometric parameters such as roundness, eccentricity, principal-axis direction, and algebraic invariant moments are used for image retrieval based on shape features.
- the shape invariant moment method uses the moments of the area occupied by the target as shape description parameters.
- representation and matching of shape features also include methods such as the finite element method (FEM), the turning function, and the wavelet descriptor.
- there is also a method based on wavelets and relative moments: it first uses wavelet-transform modulus maxima to obtain multi-scale edge images, then computes seven invariant moments at each scale and converts them into ten relative moments; the relative moments over all scales are used as the image feature vector, handling regions and both closed and unclosed structures uniformly.
- S104: Using the initial frame as a reference, perform a feature search on the forward and/or backward video frames in the scene segment, determine the area in each searched frame whose feature information matches the feature information of the target area, and automatically mark the area determined in each searched frame.
- each searched video frame can also be preprocessed (for example, by image denoising and contrast enhancement) to make the feature information of the matching area in each searched frame more distinct.
- the mean shift algorithm is a non-parametric method based on density gradient ascent; it finds the target position through iterative computation to achieve target tracking.
- the so-called tracking is to find the position of the target in the next frame from its known position in the current image frame.
- the significant advantages of the mean shift algorithm are its small computational cost and its simplicity, which make it well suited to real-time tracking; experiments using a kernel histogram to model the target distribution have shown that the mean shift algorithm has good real-time characteristics.
- Mean shift has a wide range of applications in clustering, image smoothing, segmentation and tracking.
- the mean shift algorithm locks onto a local maximum of the probability function iteratively. For example, given a rectangular window framing part of an image, the principle is to find the center of gravity (the weighted average) of the data points within the predefined window, move the center of the window to that center of gravity, and repeat the process until the window's center of gravity converges to a stable point. The quality of the iteration result therefore depends on the input probability map (over the predefined window) and on the window's initial position.
- the complete tracking procedure of the mean shift algorithm is: set the initial tracking target, that is, frame the target to be tracked; compute the histogram of the hue (H) channel of the target's HSV image; normalize this histogram; back-project the histogram into each newly obtained frame; and perform the mean shift to update the tracking position.
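- a minimal sketch of this tracking loop, assuming OpenCV's cv2.meanShift; the termination criteria and histogram settings are illustrative choices.

```python
# Mean-shift tracking loop: hue histogram -> back-projection -> meanShift.
import cv2

def track_with_meanshift(frames, init_window):
    """frames: list of BGR images; init_window: (x, y, w, h) of the marked
    target area in the first frame. Yields an updated window per frame."""
    x, y, w, h = init_window
    roi = frames[0][y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    # Histogram of the H (hue) channel of the target, then normalized.
    roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = init_window
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Back-project the target histogram into the new frame.
        back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        _, window = cv2.meanShift(back_proj, window, term)
        yield window
```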
- the Kalman filter overcomes the Wiener filter's shortcomings of requiring an infinite amount of past data and struggling to guarantee real-time performance. The filtered result can never exactly equal the true result, only approximate it; the Kalman filter takes the minimum mean-square error as its criterion and introduces a state-space model for recursive estimation. Kalman filters are often used in navigation, radar, surveillance, and other fields involving target tracking. The basic procedure is: adopt a state-space model of the signal and noise, proceed recursively in the order "prediction - measurement - correction", use the information from the previous moment to estimate the state variables at the current moment, and use the actual observation to correct the model from the previous moment.
- a typical application of the Kalman filter is predicting the target's state at the next moment from a limited set of observations containing the target position and noise.
- target tracking is the process of selecting, from the multiple foreground blocks detected in the current frame, the one corresponding to the established target, thereby obtaining the target's trajectory.
- Kalman-filter target tracking uses the Kalman filter to predict the change of the target's position and center, and then locates the target precisely through multi-feature matching.
- tracking a target with the Kalman filter is divided into four main steps: first, compute feature points such as the target center, SIFT features, and the color histogram from the target detection result; second, set a prediction region around the Kalman-predicted position in the next frame and match the eligible candidate targets in this region one by one; third, define similarity functions over the SIFT features, color histogram, target center, and other features, and select the best-matching target; fourth, tune the Kalman filter parameters according to the target state (such as normal tracking, tracking loss, merging and splitting, and targets entering or leaving the scene).
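- a sketch of the constant-velocity Kalman filter commonly used for this predict-measure-correct loop, via OpenCV's cv2.KalmanFilter; the state layout (x, y, vx, vy) for the target center and the noise levels are illustrative assumptions.

```python
# Constant-velocity Kalman filter for tracking a target center (assumes OpenCV).
import cv2
import numpy as np

def make_center_tracker():
    kf = cv2.KalmanFilter(4, 2)  # 4 state dims (x, y, vx, vy), 2 measured
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

# Usage, in the "prediction - measurement - correction" order described above:
# kf = make_center_tracker()
# predicted = kf.predict()                       # predicted (x, y, vx, vy)
# kf.correct(np.array([[cx], [cy]], np.float32)) # correct with matched center
```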
- particle filtering is a non-parametric Monte Carlo simulation method for realizing recursive Bayesian filtering. It is applicable to any nonlinear system that can be described by a state-space model, and its accuracy can approach that of the optimal estimate.
- particle filters are simple and easy to implement; they provide an effective solution for analyzing nonlinear dynamic systems, and are widely used in target tracking, signal processing, and automatic control.
- the core idea of the particle filter algorithm is to approximate the posterior probability density function by a weighted sum of a series of random samples, replacing integration with summation. The algorithm derives from the Monte Carlo idea: the frequency of an event is used as a stand-in for its probability.
- Prediction stage: the particle filter first generates a large number of samples, called particles, according to the state-transition function; the weighted sum of these particles approximates the posterior probability density.
- Correction stage: as observations arrive in sequence, an importance weight is computed for each particle; this weight represents the probability of obtaining the observation when the predicted pose takes the value of the i-th particle. All particles are evaluated in this way, and the particles more likely to produce the observation receive higher weights.
- Resampling stage: the sampled particles are redistributed according to their weight ratios. Since the number of particles approximating the continuous distribution is finite, this step is very important. In the next round of filtering, the resampled particle set is fed into the state-transition equation to obtain new predicted particles.
- Map estimation: for each sampled particle, the corresponding map estimate is computed from the sampled trajectory and the observations.
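- a minimal sketch of one predict/weight/resample cycle for a 2-D target position, assuming NumPy; the random-walk motion noise and the observe() likelihood function stand in for whatever feature-match score is actually used.

```python
# One particle-filter step: predict, weight, estimate, resample (assumes NumPy).
import numpy as np

def particle_filter_step(particles, weights, observe, motion_std=5.0):
    """particles: (N, 2) positions; observe(particles) -> per-particle
    likelihoods. Returns resampled particles, reset weights, and an estimate."""
    n = len(particles)
    # Prediction: propagate each particle through the motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Correction: weight particles by how well they explain the observation.
    weights = weights * observe(particles)
    weights /= weights.sum()
    # Estimate: the weighted mean approximates the posterior expectation.
    estimate = np.average(particles, axis=0, weights=weights)
    # Resampling: draw particles in proportion to their weights.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n), estimate
```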
- if a searched frame contains no area whose feature information matches that of the target area, target feature information is acquired, and the area in the searched frame whose feature information matches the target feature information is determined and automatically marked.
- if a searched frame does not match the features extracted from the initial frame, the feature change of the target object in the current frame (that is, the searched frame) has exceeded the threshold and cannot be matched. In that case, a frame that successfully matched the initial frame's features can be selected from the previous frame or previous few frames of the current frame, and feature matching and automatic labeling can be performed on the current frame again using the feature information of the marked area in the selected frame. If the feature information of the marked areas in the previous few frames still cannot be matched to the current frame, a frame that successfully matched the initial frame's features can instead be selected from the next frame or next few frames of the current frame, and the current frame can again be feature-matched and automatically labeled from it.
- if the current frame is the last frame of the current scene segment, video frames of the next scene segment can be used for feature matching; if the current frame is the first frame of the current scene segment, feature matching can be performed in the previous scene segment.
- if a matching feature is still not found, the median of the feature-point coordinates of the frames before and after the current frame can be used as the feature-point coordinates of the current frame, and the marked area in the current frame can then be adjusted and labeled manually.
- if several consecutive frames cannot be matched, the feature-point coordinates of the middle frame of these consecutive frames can be estimated first, and then the coordinates of the frames between the already-estimated frames and their neighbors can be estimated in turn from the medians of the surrounding frames until all frames are estimated; the areas in these consecutive frames are then adjusted and labeled manually. Alternatively, after the middle frame's feature-point coordinates are estimated, its area can be manually adjusted and labeled, the feature information of the newly labeled area extracted, and the preceding and following frames then matched and labeled automatically from it.
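- a sketch of the coordinate fallback just described: for each frame that failed to match, take the element-wise median of the boxes in the nearest successfully labeled frames on either side. The (x, y, w, h) box format is an illustrative assumption.

```python
# Fill unmatched frames with the median of neighboring boxes (assumes NumPy).
import numpy as np

def interpolate_missing_boxes(boxes):
    """boxes: list of (x, y, w, h) tuples, or None for frames that failed
    to match. Returns a list with the gaps filled where neighbors exist."""
    out = list(boxes)
    for i, b in enumerate(out):
        if b is not None:
            continue
        # Nearest labeled frame before and after the current frame.
        prev_b = next((out[j] for j in range(i - 1, -1, -1)
                       if out[j] is not None), None)
        next_b = next((out[j] for j in range(i + 1, len(out))
                       if out[j] is not None), None)
        candidates = [c for c in (prev_b, next_b) if c is not None]
        if candidates:
            out[i] = tuple(np.median(np.array(candidates), axis=0))
    return out
```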
- S105: Extract the image of each marked video frame as a training sample. Since a scene segment contains a large number of video frames, a large number of labeled image training samples can be obtained from each scene segment.
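- a sketch of this final extraction step, saving each labeled frame and its box in a simple "image + text label" layout; the file layout and label format are illustrative assumptions, not part of the disclosure.

```python
# Export labeled frames as training samples (assumes OpenCV).
import os
import cv2

def export_training_samples(frames, boxes, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for i, (frame, box) in enumerate(zip(frames, boxes)):
        if box is None:  # frame could not be labeled; skip it
            continue
        x, y, w, h = map(int, box)
        cv2.imwrite(os.path.join(out_dir, f"sample_{i:06d}.jpg"), frame)
        with open(os.path.join(out_dir, f"sample_{i:06d}.txt"), "w") as f:
            f.write(f"{x} {y} {w} {h}\n")  # one target box per sample
```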
- in summary, the solution provided by the present invention first annotates the initial frame in a scene segment of the video, and then uses target tracking technology to automatically annotate the other video frames in the entire scene segment, thereby obtaining a large number of annotated images to serve as training samples for a subsequently built target recognition model.
- in the prior art, manual annotation is performed on a large number of individually acquired pictures, and the cost of image acquisition and annotation is relatively high.
- with the present invention, one only needs to shoot a video, so acquiring annotation material is much more convenient; a large number of automatically marked samples can then be collected from the video, which reduces the cost of sample labeling and improves the efficiency of the labeling process.
- the present invention also provides a device for obtaining training samples.
- as shown in FIG. 2, the device includes:
- the obtaining module 201, configured to obtain a scene segment in the video;
- the first labeling module 202, configured to select a video frame containing a target object in the scene segment as an initial frame, and to label the target area where the target object is located in the initial frame;
- the first extraction module 203, configured to extract feature information of the target area marked in the initial frame;
- the second labeling module 204, configured to perform a feature search on the forward and/or backward video frames in the scene segment with the initial frame as a reference, determine the area in each searched frame whose feature information matches that of the target area, and automatically mark the area determined in each searched frame;
- the second extraction module 205, configured to extract the images of the marked video frames in the scene segment as training samples.
- the obtaining module 201 is specifically configured to:
- if the video is a single-scene video, use the video as a scene segment;
- if the video is a multi-scene video, use scene switching detection technology to divide the video into multiple scene segments.
- the scene switching detection technology includes: a pixel-domain-based detection algorithm and/or a compressed-domain-based detection algorithm.
- the device further includes:
- a preprocessing module, configured to perform image preprocessing on the initial frame before the first extraction module 203 extracts the feature information of the marked target area, so that the feature information of the target area in the initial frame is more distinct.
- the feature information of the target area includes one or more of color features, texture features, and shape features.
- the second extraction module 205 performs the feature search on the forward and/or backward video frames in the scene segment specifically by:
- using a mean shift algorithm, a Kalman filter algorithm, or a particle filter algorithm to perform the feature search on the forward and/or backward video frames in the scene segment.
- the second extraction module 205 is further configured to: if a searched frame contains no area whose feature information matches that of the target area, acquire target feature information, determine the area in the searched frame whose feature information matches the target feature information, and automatically mark that area;
- wherein the target feature information is: the feature information of the marked areas in a preset number of frames adjacent to the searched frame.
- the present invention also provides an electronic device, as shown in FIG. 3, including a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 communicate with one another through the communication bus 304;
- the memory 303 is configured to store a computer program;
- the processor 301 is configured to implement the following steps when executing the program stored in the memory 303: obtaining a scene segment in a video; selecting a video frame containing a target object in the scene segment as an initial frame, and marking the target area where the target object is located; extracting the feature information of the marked target area; taking the initial frame as a reference, performing a feature search on the forward and/or backward video frames in the scene segment, determining the area in each searched frame whose feature information matches that of the target area, and automatically marking the determined areas; and extracting the image of each marked video frame in the scene segment as a training sample.
- the communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
- the communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
- the communication interface is used for communication between the aforementioned electronic device and other devices.
- the memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
- the memory may also be at least one storage device located far away from the foregoing processor.
- the foregoing processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the present invention also provides a computer-readable storage medium in which a computer program is stored, and when the computer program is executed by a processor, the method steps of the above-mentioned training sample obtaining method are realized.
Abstract
Description
Claims (16)
- A method for obtaining training samples, comprising: obtaining a scene segment in a video; selecting a video frame containing a target object in the scene segment as an initial frame, and marking the target area where the target object is located in the initial frame; extracting feature information of the target area marked in the initial frame; taking the initial frame as a reference, performing a feature search on the forward and/or backward video frames in the scene segment, determining the area in each searched frame whose feature information matches the feature information of the target area, and automatically marking the area determined in each searched frame; and extracting the image of each marked video frame in the scene segment as a training sample.
- The method for obtaining training samples according to claim 1, wherein obtaining the scene segment in the video comprises: if the video is a single-scene video, using the video as a scene segment; if the video is a multi-scene video, using scene switching detection technology to divide the video into multiple scene segments.
- The method for obtaining training samples according to claim 2, wherein the scene switching detection technology comprises: a pixel-domain-based detection algorithm and/or a compressed-domain-based detection algorithm.
- The method for obtaining training samples according to claim 1, wherein before extracting the feature information of the target area marked in the initial frame, the method further comprises: performing image preprocessing on the initial frame to make the feature information of the target area in the initial frame more distinct.
- The method for obtaining training samples according to claim 1, wherein the feature information of the target area comprises one or more of: color features, texture features, and shape features.
- The method for obtaining training samples according to claim 1, wherein the step of performing a feature search on the forward and/or backward video frames in the scene segment comprises: performing the feature search on the forward and/or backward video frames in the scene segment using a mean shift algorithm, a Kalman filter algorithm, or a particle filter algorithm.
- The method for obtaining training samples according to claim 1, further comprising: if a searched frame contains no area whose feature information matches the feature information of the target area, acquiring target feature information, determining the area in the searched frame whose feature information matches the target feature information, and automatically marking the area determined in the searched frame; wherein the target feature information is the feature information of the marked areas in a preset number of frames adjacent to the searched frame.
- A device for obtaining training samples, comprising: an obtaining module, configured to obtain a scene segment in a video; a first labeling module, configured to select a video frame containing a target object in the scene segment as an initial frame, and to label the target area where the target object is located in the initial frame; a first extraction module, configured to extract feature information of the target area marked in the initial frame; a second labeling module, configured to perform a feature search on the forward and/or backward video frames in the scene segment with the initial frame as a reference, determine the area in each searched frame whose feature information matches the feature information of the target area, and automatically mark the area determined in each searched frame; and a second extraction module, configured to extract the image of each marked video frame in the scene segment as a training sample.
- The device for obtaining training samples according to claim 8, wherein the obtaining module obtains the scene segment in the video by: if the video is a single-scene video, using the video as a scene segment; if the video is a multi-scene video, using scene switching detection technology to divide the video into multiple scene segments.
- The device for obtaining training samples according to claim 9, wherein the scene switching detection technology comprises: a pixel-domain-based detection algorithm and/or a compressed-domain-based detection algorithm.
- The device for obtaining training samples according to claim 8, further comprising: a preprocessing module, configured to perform image preprocessing on the initial frame before the first extraction module extracts the feature information of the target area marked in the initial frame, so that the feature information of the target area in the initial frame is more distinct.
- The device for obtaining training samples according to claim 8, wherein the feature information of the target area comprises one or more of: color features, texture features, and shape features.
- The device for obtaining training samples according to claim 8, wherein the second extraction module performs the feature search on the forward and/or backward video frames in the scene segment by: using a mean shift algorithm, a Kalman filter algorithm, or a particle filter algorithm to perform the feature search on the forward and/or backward video frames in the scene segment.
- The device for obtaining training samples according to claim 8, wherein the second extraction module is further configured to: if a searched frame contains no area whose feature information matches the feature information of the target area, acquire target feature information, determine the area in the searched frame whose feature information matches the target feature information, and automatically mark the area determined in the searched frame; wherein the target feature information is the feature information of the marked areas in a preset number of frames adjacent to the searched frame.
- An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method according to any one of claims 1-7 when executing the computer program stored in the memory.
- A computer-readable storage medium storing a computer program, wherein the method according to any one of claims 1-7 is implemented when the computer program is executed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910107568.6A CN109753975B (en) | 2019-02-02 | 2019-02-02 | Training sample obtaining method and device, electronic equipment and storage medium |
CN201910107568.6 | 2019-02-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020156361A1 true WO2020156361A1 (en) | 2020-08-06 |
Family
ID=66407340
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/073396 WO2020156361A1 (en) | 2019-02-02 | 2020-01-21 | Training sample obtaining method and apparatus, electronic device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109753975B (en) |
WO (1) | WO2020156361A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112233171A (en) * | 2020-09-03 | 2021-01-15 | 上海眼控科技股份有限公司 | Target labeling quality inspection method and device, computer equipment and storage medium |
CN112257659A (en) * | 2020-11-11 | 2021-01-22 | 四川云从天府人工智能科技有限公司 | Detection tracking method, apparatus and medium |
CN112801940A (en) * | 2020-12-31 | 2021-05-14 | 深圳市联影高端医疗装备创新研究院 | Model evaluation method, device, equipment and medium |
CN113254703A (en) * | 2021-05-12 | 2021-08-13 | 北京百度网讯科技有限公司 | Video matching method, video processing device, electronic equipment and medium |
CN114347030A (en) * | 2022-01-13 | 2022-04-15 | 中通服创立信息科技有限责任公司 | Robot vision following method and vision following robot |
CN115620210A (en) * | 2022-11-29 | 2023-01-17 | 广东祥利科技有限公司 | Method and system for determining performance of electronic wire based on image processing |
CN115499666B (en) * | 2022-11-18 | 2023-03-24 | 腾讯科技(深圳)有限公司 | Video compression method, video decompression method, video compression device, video decompression device, and storage medium |
CN117237418A (en) * | 2023-11-15 | 2023-12-15 | 成都航空职业技术学院 | Moving object detection method and system based on deep learning |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109753975B (en) * | 2019-02-02 | 2021-03-09 | 杭州睿琪软件有限公司 | Training sample obtaining method and device, electronic equipment and storage medium |
CN110503074B (en) * | 2019-08-29 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Information labeling method, device and equipment of video frame and storage medium |
CN110796041B (en) * | 2019-10-16 | 2023-08-18 | Oppo广东移动通信有限公司 | Principal identification method and apparatus, electronic device, and computer-readable storage medium |
CN110796098B (en) * | 2019-10-31 | 2021-07-27 | 广州市网星信息技术有限公司 | Method, device, equipment and storage medium for training and auditing content auditing model |
CN110826509A (en) * | 2019-11-12 | 2020-02-21 | 云南农业大学 | Grassland fence information extraction system and method based on high-resolution remote sensing image |
CN111191708A (en) * | 2019-12-25 | 2020-05-22 | 浙江省北大信息技术高等研究院 | Automatic sample key point marking method, device and system |
CN111428589B (en) * | 2020-03-11 | 2023-05-30 | 新华智云科技有限公司 | Gradual transition identification method and system |
CN111497847B (en) * | 2020-04-23 | 2021-11-16 | 江苏黑麦数据科技有限公司 | Vehicle control method and device |
CN112307908B (en) * | 2020-10-15 | 2022-07-26 | 武汉科技大学城市学院 | Video semantic extraction method and device |
CN112784750B (en) * | 2021-01-22 | 2022-08-09 | 清华大学 | Fast video object segmentation method and device based on pixel and region feature matching |
CN113225461A (en) * | 2021-02-04 | 2021-08-06 | 江西方兴科技有限公司 | System and method for detecting video monitoring scene switching |
CN115482426A (en) * | 2021-06-16 | 2022-12-16 | 华为云计算技术有限公司 | Video annotation method, device, computing equipment and computer-readable storage medium |
CN113378958A (en) * | 2021-06-24 | 2021-09-10 | 北京百度网讯科技有限公司 | Automatic labeling method, device, equipment, storage medium and computer program product |
CN113610030A (en) * | 2021-08-13 | 2021-11-05 | 北京地平线信息技术有限公司 | Behavior recognition method and behavior recognition device |
CN113762286A (en) * | 2021-09-16 | 2021-12-07 | 平安国际智慧城市科技股份有限公司 | Data model training method, device, equipment and medium |
CN114697702B (en) * | 2022-03-23 | 2024-01-30 | 咪咕文化科技有限公司 | Audio and video marking method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100202660A1 (en) * | 2005-12-29 | 2010-08-12 | Industrial Technology Research Institute | Object tracking systems and methods |
CN107886105A (en) * | 2016-09-30 | 2018-04-06 | 法乐第(北京)网络科技有限公司 | A kind of annotation equipment of image |
CN107886104A (en) * | 2016-09-30 | 2018-04-06 | 法乐第(北京)网络科技有限公司 | A kind of mask method of image |
CN109753975A (en) * | 2019-02-02 | 2019-05-14 | 杭州睿琪软件有限公司 | Training sample obtaining method and device, electronic equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103218603B (en) * | 2013-04-03 | 2016-06-01 | 哈尔滨工业大学深圳研究生院 | A kind of face automatic marking method and system |
CN103559237B (en) * | 2013-10-25 | 2017-02-15 | 南京大学 | Semi-automatic image annotation sample generating method based on target tracking |
CN103970906B (en) * | 2014-05-27 | 2017-07-04 | 百度在线网络技术(北京)有限公司 | The method for building up and device of video tab, the display methods of video content and device |
CN108229285B (en) * | 2017-05-27 | 2021-04-23 | 北京市商汤科技开发有限公司 | Object classification method, object classifier training method and device and electronic equipment |
CN108520218A (en) * | 2018-03-29 | 2018-09-11 | 深圳市芯汉感知技术有限公司 | A kind of naval vessel sample collection method based on target tracking algorism |
CN108596958B (en) * | 2018-05-10 | 2021-06-04 | 安徽大学 | Target tracking method based on difficult positive sample generation |
CN108986134B (en) * | 2018-08-17 | 2021-06-18 | 浙江捷尚视觉科技股份有限公司 | Video target semi-automatic labeling method based on related filtering tracking |
- 2019-02-02: CN application CN201910107568.6A filed; granted as patent CN109753975B (active)
- 2020-01-21: PCT application PCT/CN2020/073396 filed as WO2020156361A1 (application filing)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100202660A1 (en) * | 2005-12-29 | 2010-08-12 | Industrial Technology Research Institute | Object tracking systems and methods |
CN107886105A (en) * | 2016-09-30 | 2018-04-06 | 法乐第(北京)网络科技有限公司 | A kind of annotation equipment of image |
CN107886104A (en) * | 2016-09-30 | 2018-04-06 | 法乐第(北京)网络科技有限公司 | A kind of mask method of image |
CN109753975A (en) * | 2019-02-02 | 2019-05-14 | 杭州睿琪软件有限公司 | Training sample obtaining method and device, electronic equipment and storage medium |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112233171A (en) * | 2020-09-03 | 2021-01-15 | 上海眼控科技股份有限公司 | Target labeling quality inspection method and device, computer equipment and storage medium |
CN112257659A (en) * | 2020-11-11 | 2021-01-22 | 四川云从天府人工智能科技有限公司 | Detection tracking method, apparatus and medium |
CN112257659B (en) * | 2020-11-11 | 2024-04-05 | 四川云从天府人工智能科技有限公司 | Detection tracking method, device and medium |
CN112801940A (en) * | 2020-12-31 | 2021-05-14 | 深圳市联影高端医疗装备创新研究院 | Model evaluation method, device, equipment and medium |
CN113254703A (en) * | 2021-05-12 | 2021-08-13 | 北京百度网讯科技有限公司 | Video matching method, video processing device, electronic equipment and medium |
CN114347030A (en) * | 2022-01-13 | 2022-04-15 | 中通服创立信息科技有限责任公司 | Robot vision following method and vision following robot |
CN115499666B (en) * | 2022-11-18 | 2023-03-24 | 腾讯科技(深圳)有限公司 | Video compression method, video decompression method, video compression device, video decompression device, and storage medium |
CN115620210A (en) * | 2022-11-29 | 2023-01-17 | 广东祥利科技有限公司 | Method and system for determining performance of electronic wire based on image processing |
CN115620210B (en) * | 2022-11-29 | 2023-03-21 | 广东祥利科技有限公司 | Method and system for determining performance of electronic wire material based on image processing |
CN117237418A (en) * | 2023-11-15 | 2023-12-15 | 成都航空职业技术学院 | Moving object detection method and system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN109753975A (en) | 2019-05-14 |
CN109753975B (en) | 2021-03-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020156361A1 (en) | Training sample obtaining method and apparatus, electronic device and storage medium | |
CN110400332B (en) | Target detection tracking method and device and computer equipment | |
CN105512683B (en) | Object localization method and device based on convolutional neural networks | |
Lu et al. | Robust and efficient saliency modeling from image co-occurrence histograms | |
CN110807473B (en) | Target detection method, device and computer storage medium | |
CN110334762B (en) | Feature matching method based on quad tree combined with ORB and SIFT | |
US20160307057A1 (en) | Fully Automatic Tattoo Image Processing And Retrieval | |
WO2019071976A1 (en) | Panoramic image saliency detection method based on regional growth and eye movement model | |
Thalji et al. | Iris Recognition using robust algorithm for eyelid, eyelash and shadow avoiding | |
KR20190082593A (en) | System and Method for Reidentificating Object in Image Processing | |
Jung et al. | Eye detection under varying illumination using the retinex theory | |
Meher et al. | Efficient method of moving shadow detection and vehicle classification | |
Song et al. | Feature extraction and target recognition of moving image sequences | |
CN108765463B (en) | Moving target detection method combining region extraction and improved textural features | |
Yaru et al. | Algorithm of fingerprint extraction and implementation based on OpenCV | |
Elashry et al. | Feature matching enhancement using the graph neural network (gnn-ransac) | |
CN111768436B (en) | Improved image feature block registration method based on fast-RCNN | |
Yan et al. | Saliency detection based on superpixel correlation and cosine window filtering | |
CN114119952A (en) | Image matching method and device based on edge information | |
Dey et al. | An efficient approach for pupil detection in iris images | |
Zhang et al. | RGB-D saliency detection with multi-feature-fused optimization | |
Kerdvibulvech | Hybrid model of human hand motion for cybernetics application | |
Zhang et al. | Oil tank detection based on linear clustering saliency analysis for synthetic aperture radar images | |
Wang et al. | Image saliency detection for multiple objects | |
Makandar et al. | Comparison and Analysis of Different Feature Extraction Methods versus Noisy Images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20749758 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20749758 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.03.2022) |
|