CN113627342B - Method, system, equipment and storage medium for video depth feature extraction optimization - Google Patents

Method, system, equipment and storage medium for video depth feature extraction optimization

Info

Publication number
CN113627342B
CN113627342B
Authority
CN
China
Prior art keywords
video
frame
feature
invalid
frames
Prior art date
Legal status
Active
Application number
CN202110918450.9A
Other languages
Chinese (zh)
Other versions
CN113627342A
Inventor
游强
王坚
李兵
余昊楠
胡卫明
Current Assignee
Renmin Zhongke Jinan Intelligent Technology Co ltd
Original Assignee
Renmin Zhongke Jinan Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Renmin Zhongke Jinan Intelligent Technology Co., Ltd.
Priority to CN202110918450.9A
Publication of CN113627342A
Application granted
Publication of CN113627342B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for video depth feature extraction optimization, relates to the technical field of computer machine vision, and aims to solve the problem that the prior art is not robust enough for videos of complex scenes. The method comprises the following steps: acquiring video invalid-frame seeds; constructing an invalid feature base; updating the invalid feature base and acquiring a video valid feature set; training a frame validity binary classification model from the updated invalid feature base and the video valid feature set; and extracting video valid features with the frame validity binary classification model. The system comprises a unit for acquiring video invalid-frame seeds, a unit for constructing the invalid feature base, an updating unit, a training unit and a unit for extracting video valid features. Because the invention performs its filtering in the feature vector space rather than in the spatio-temporal domain of the original video frames, it can optimize videos of complex scenes in a targeted way.

Description

Method, system, equipment and storage medium for video depth feature extraction optimization
Technical Field
The invention relates to the technical field of computer machine vision, in particular to a video depth feature extraction optimization method, a system, equipment and a storage medium based on feature space screening.
Background
A video can be seen as a sequence of temporally consecutive video frames (pictures). During encoding, in order to eliminate redundant inter-frame information and reduce storage pressure, only the necessary content is stored, in the form of key frames plus inter-frame differences. To express a storyline in a more layered way, videos also often contain deliberately inserted solid-color frames (for example black or white) or transition frames produced by empty shots, which complete scene transitions and changes. In practical detection or retrieval applications, the decoded video frame sequence must be processed further. The decoded sequence contains a large amount of redundant information; in particular, for videos with little motion the differences between frames are very small, and removing such redundant frames directly has almost no influence on subsequent processing. In addition, solid-color frames and transition frames, which carry no specific meaning, not only waste computing resources but also degrade subsequent visual processing tasks such as retrieval; these frames are called invalid frames.
Vision-based deep feature extraction is currently dominated by convolutional neural networks (Convolutional Neural Network, CNN). CNN features are built on texture, which means that features extracted from images with poor or uniformly distributed texture often fail to meet requirements; in retrieval-related scenarios, introducing such features can cause large-scale mismatches. The reason is that the lower layers of a CNN learn texture distributions, while the upper layers, as the network gets deeper, can be regarded as a distributed semantic description of the image (Distributed Representation). Images with different semantic descriptions show large weight differences across feature dimensions and lie far apart in the feature space, whereas the features of a texture-poor image can be seen as a superposition of the features of many texture-rich images and are likely to lie close to a large number of image features in the feature space. The extracted features therefore need to be screened before the visual processing tasks are executed, which not only improves efficiency but also improves the effect of those tasks.
At present, the screening of invalid and redundant frames in a video is usually defined in the original spatio-temporal domain: the video is segmented into shots and scenes, individual frames are finally obtained, and the judgment is made by processing each frame in the spatio-temporal domain in which it lies. Invalid frames are judged from apparent statistics displayed by the video frame, such as brightness, contrast and degree of blur, and are then filtered out directly. For redundant frames, the frame sequence within each shot is obtained by shot segmentation, the sequence is clustered or its mean is computed directly, frames that differ little from the cluster centre or the mean are judged redundant, and screening is performed on that basis. Screening video frames in the original spatio-temporal domain suffers from the following problems:
(1) It is not robust enough for videos of complex scenes. Invalid and redundant frames can only be filtered against preset apparent-feature thresholds; for example, an invalid-frame threshold tuned on videos with little motion will, when applied to videos with rich motion, judge a large number of frames to be invalid. In videos with rich motion almost every frame is somewhat blurred, so a threshold built on apparent blur marks a large number of frames as invalid, and judging redundancy from cluster centres or means is also highly susceptible to noise.
(2) Video frame screening and the subsequent depth feature extraction task are two relatively independent stages, which leaves the screening disconnected from the downstream task to a certain degree: frames screened out as invalid or redundant may in fact have helped the subsequent retrieval based on depth features, while frames that were kept may not let the retrieval run efficiently.
Disclosure of Invention
The invention provides a method, a system, equipment and a storage medium for extracting and optimizing video depth characteristics, which are used for solving the problem that the prior art is not robust to video of a complex scene.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, a method for video depth feature extraction optimization according to an embodiment of the invention comprises the following steps. S1, acquiring video invalid-frame seeds; acquiring video invalid-frame seeds means acquiring monochrome video frames, globally blurred video frames, single-texture video frames or single-scene video frames in the video. S2, constructing an invalid feature base; specifically, features of the expanded video invalid-frame seeds are extracted by a feature extraction model to build the invalid feature base: with the video invalid-frame seed set denoted {I_j^rgb, j=1, …, M}, the invalid feature set after the mapping transformation is {V_j, j=1, …, M}, and this invalid feature set constitutes the initial invalid feature base. S3, updating the invalid feature base and acquiring a video valid feature set. S4, training a frame validity binary classification model from the updated invalid feature base and the video valid feature set; specifically, each video invalid-frame seed and each valid frame in the valid frame set is fed into a CNN two-class classification model, which is trained to obtain the frame validity binary classification model; before the invalid-frame seeds and the valid frames are input into the CNN two-class classification model, an augmentation operation is applied to each of them, including brightness transformation, Gaussian blur, motion blur, translation-rotation transformation or/and superimposed salt-and-pepper noise. S5, extracting video valid features with the frame validity binary classification model.
Preferably, step S5 further includes: updating the video valid feature set with the extracted video valid features; and extracting invalid features with the frame validity binary classification model and updating the invalid feature base with the extracted invalid features.
Preferably, step S5 is further followed by the step: S6, screening video redundant features out of the video valid feature set to obtain a valid key feature set.
Further, step S6 is further followed by the step: S7, setting a corresponding threshold according to the task and optimizing the valid key feature set.
Further, acquiring a monochrome video frame in the video, when no reference image set is available, specifically comprises:
converting I_rgb to I_gray;
calculating the color uniformity index with the following K-L divergence formula:
Uniformity(I_gray || μ_gray) = Σ_{b=1}^{B} hist(I_gray)[b] · log( hist(I_gray)[b] / hist(μ_gray)[b] );
setting U_thresh, and if Uniformity(I_gray || μ_gray) ≤ U_thresh, judging the video frame to be a monochrome video frame and acquiring it;
where I_rgb denotes the original video frame, I_gray the grayscale image, hist(I_gray) the normalized gray-level histogram of the video frame, B the number of histogram bins, hist(μ_gray) the histogram of the corresponding image uniformly filled with the mean gray value, and U_thresh the K-L divergence threshold for monochrome video frames.
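As a rough illustration of this rule, the following Python sketch (an assumption-laden sketch, not the patented implementation: the bin count, the threshold value and the smoothing constant eps are illustrative) computes the normalized gray histogram, builds the histogram of a constant image filled with the mean gray value, and compares their K-L divergence against U_thresh:

```python
import cv2
import numpy as np

def is_monochrome(frame_bgr, bins=64, u_thresh=0.5, eps=1e-8):
    """Hedged sketch: flag a frame as a monochrome invalid-frame seed when the K-L
    divergence between its normalized gray histogram and the histogram of a constant
    image at the mean gray level falls below u_thresh."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    hist = hist / hist.sum()                       # normalized gray histogram hist(I_gray)
    mu_img = np.full_like(gray, int(gray.mean()))  # constant image at the mean gray value
    mu_hist = cv2.calcHist([mu_img], [0], None, [bins], [0, 256]).ravel()
    mu_hist = mu_hist / mu_hist.sum()              # hist(mu_gray)
    kl = float(np.sum(hist * np.log((hist + eps) / (mu_hist + eps))))  # K-L divergence
    return kl <= u_thresh
```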
Further, acquiring a monochrome video frame in the video, when a reference image set is available, specifically comprises: calculating the normalized gray-level histogram of each reference video frame in the reference image set; sorting the histogram bins in descending order of gray-level frequency; and taking the accumulated mass of the first x% of bins as the color uniformity threshold for screening and acquiring monochrome video frames, calculated as
U_thresh = Σ_{b=1}^{⌈x·B/100⌉} hist↓(I_gray)[b];
where I_gray denotes the grayscale image, hist(I_gray) the normalized gray-level histogram of the video frame, hist↓ that histogram sorted in descending order, B the number of histogram bins, and x the set percentage value.
Further, acquiring a globally blurred video frame in the video specifically comprises: converting I_rgb to I_gray; selecting original video frames by sharpness, calculated by the following formula:
Sharpness(I_gray) = Σ [ (Δ_x I_gray)² + (Δ_y I_gray)² ];
setting S_thresh, and if Sharpness(I_gray) ≤ S_thresh, judging the video frame to be globally blurred and acquiring it;
where I_rgb denotes the original video frame, I_gray the grayscale image, S_thresh the sharpness threshold, and Δ_x and Δ_y the gray-level gradients in the two orthogonal directions used by the sharpness.
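A minimal sketch of this check, assuming the sharpness index is the squared gray gradient summed over two orthogonal directions and averaged over the image (the averaging and the threshold value are implementation assumptions, not values fixed by the text):

```python
import cv2
import numpy as np

def sharpness(gray):
    """Squared gray-level gradients along two orthogonal directions, averaged over
    the image (the averaging is an implementation assumption)."""
    g = gray.astype(np.float32) / 255.0
    dy, dx = np.gradient(g)            # delta_y, delta_x: gradients along the two axes
    return float(np.mean(dx ** 2 + dy ** 2))

def is_globally_blurred(frame_bgr, s_thresh=1e-4):
    """s_thresh is an illustrative placeholder."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return sharpness(gray) <= s_thresh
```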
Further, acquiring a single-texture video frame in the video by simulation generation specifically comprises: transforming the original video frame, or a part cut out of it, by translation, rotation or/and scaling, and placing the transformed image into a monochrome image to form the single-texture video frame.
Further, acquiring a single-texture video frame in the video by gradient-distribution-histogram threshold screening comprises: converting I_rgb to I_gray; taking any axis of I_gray as the rotation axis with rotation angle a ∈ [0, 180°), so that the mean of the directional gradient is
μ(ΔI_gray) = (1/180) Σ_a mean|Δ_a I_gray|   (formula 4);
the variance of the directional gradient is
δ²(ΔI_gray) = (1/180) Σ_a ( mean|Δ_a I_gray| − μ(ΔI_gray) )²   (formula 5);
and the sharpness is
Sharpness(I_gray) = Σ [ (Δ_x I_gray)² + (Δ_y I_gray)² ]   (formula 6);
if Sharpness(I_gray) ≥ S_thresh and δ²(ΔI_gray) ≤ δ²_thresh, the video frame is judged to be a single-texture video frame and acquired;
where I_rgb denotes the original video frame, I_gray the grayscale image, S_thresh the sharpness threshold, δ²_thresh the directional-gradient variance threshold, and Δ_x and Δ_y the gray-level gradients in the two orthogonal directions used by the sharpness.
Further, the original video frame, or a part cut out of it, is transformed by translation, rotation or/and scaling and the transformed image is placed into a monochrome image; δ²(ΔI_gray) of the monochrome image containing the transformed image is computed with formula 4 and formula 5, and its Sharpness(I_gray) is computed with formula 6.
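A hedged sketch of the single-texture test, assuming the directional gradient Δ_a is the projection of the (Δ_x, Δ_y) gradient onto direction a and that the statistics of formulas (4)–(6) are taken over the mean gradient magnitudes of the sampled directions (neither detail is fixed by the text; the thresholds are placeholders):

```python
import numpy as np

def directional_gradient_stats(gray, step_deg=5):
    """Sketch of formulas (4)-(6): per-direction mean gradient magnitude for
    a in [0, 180) degrees, then its mean/variance, plus the sharpness index."""
    g = gray.astype(np.float32) / 255.0
    dy, dx = np.gradient(g)
    magnitudes = []
    for a in range(0, 180, step_deg):
        rad = np.deg2rad(a)
        grad_a = dx * np.cos(rad) + dy * np.sin(rad)   # directional gradient delta_a I_gray
        magnitudes.append(float(np.mean(np.abs(grad_a))))
    magnitudes = np.array(magnitudes)
    mean_grad = float(magnitudes.mean())               # formula (4): mean over directions
    var_grad = float(magnitudes.var())                 # formula (5): variance over directions
    sharp = float(np.mean(dx ** 2 + dy ** 2))          # formula (6): sharpness
    return sharp, var_grad, mean_grad

def is_single_texture(gray, s_thresh=1e-3, var_thresh=1e-5):
    """Thresholds here are illustrative placeholders, not values from the patent."""
    sharp, var_grad, _ = directional_gradient_stats(gray)
    return sharp >= s_thresh and var_grad <= var_thresh
```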
Further, the single-scene video frames acquired in the video are specifically generated by simulating salt-and-pepper noise on a monochrome image.
Preferably, in step S1, an augmentation operation is performed on the video invalid-frame seeds, including: brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise.
Preferably, updating the invalid feature base and acquiring the video valid feature set in step S3 specifically comprises: mapping the video to be analyzed into a candidate feature set through the feature extraction model; and comparing each feature in the candidate feature set with each feature in the invalid feature base by similarity, calculated as
s_ij = <f_i, V_j> / ( ||f_i|| · ||V_j|| );
if s_ij > s_1, f_i is judged to be an invalid feature and added to the invalid feature base; if s_ij ≤ s_2, f_i is judged to be a valid feature and added to the video valid feature set;
where f_i denotes a feature in the candidate feature set, V_j a feature in the invalid feature base, s_1 the first similarity threshold, and s_2 the second similarity threshold.
Further, if s_ij > s_1, the original video frame corresponding to f_i is also listed as a video invalid-frame seed; if s_ij ≤ s_2, a valid frame set is also constructed and the original video frame corresponding to f_i is added to the valid frame set.
Further, if s_2 < s_ij ≤ s_1, f_i is judged to be a candidate valid feature, a candidate valid frame set is constructed, and the original video frame corresponding to f_i is added to the candidate valid frame set.
Further, the candidate valid frame set is input into the frame validity binary classification model; frames judged invalid are listed as video invalid-frame seeds, and frames judged valid are added to the valid frame set.
Further, after the frames judged invalid by the frame validity binary classification model pass through the feature extraction model, their invalid features are added to the invalid feature base; after the frames judged valid pass through the feature extraction model, their valid features are added to the video valid feature set.
Preferably, clustering operation is carried out on each feature in the invalid feature base.
Further, screening video redundant features out of the video valid feature set in step S6 to obtain the valid key feature set specifically comprises: S61, comparing the similarity between the current feature in the video valid feature set and the feature at the next time step; S62, if the comparison result is smaller than the third similarity threshold s_3, marking the current feature as a valid key feature, adding it to the valid key feature set, assigning the later feature as the new current feature, and returning to S61; otherwise, going to S63; S63, if the comparison result is greater than or equal to the third similarity threshold s_3, filtering out the later feature and returning to S61.
In a second aspect, a system for video depth feature extraction optimization according to an embodiment of the present invention comprises: a video invalid-frame seed acquiring unit, configured to acquire video invalid-frame seeds, where acquiring video invalid-frame seeds means acquiring monochrome video frames, globally blurred video frames, single-texture video frames or single-scene video frames in the video; an invalid feature base constructing unit, configured to construct the invalid feature base, where, specifically, features of the expanded video invalid-frame seeds are extracted by a feature extraction model to build the invalid feature base: with the video invalid-frame seed set denoted {I_j^rgb, j=1, …, M}, the invalid feature set after the mapping transformation is {V_j, j=1, …, M}, and this invalid feature set constitutes the initial invalid feature base; an updating unit, configured to update the invalid feature base and acquire the video valid feature set; a training unit, configured to train the frame validity binary classification model from the updated invalid feature base and the video valid feature set, where, specifically, each video invalid-frame seed and each valid frame in the valid frame set is fed into a CNN two-class classification model, which is trained to obtain the frame validity binary classification model, and before the invalid-frame seeds and the valid frames are input into the CNN two-class classification model an augmentation operation is applied to each of them, including brightness transformation, Gaussian blur, motion blur, translation-rotation transformation or/and superimposed salt-and-pepper noise; and a video valid feature extracting unit, configured to extract video valid features with the frame validity binary classification model.
Preferably, the system further comprises: a video redundant feature screening unit, configured to screen video redundant features out of the video valid feature set to obtain a valid key feature set.
Further, the system further comprises: an optimizing unit, configured to set a corresponding threshold according to the task and optimize the valid key feature set.
In a third aspect, a computer device according to an embodiment of the present invention comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method for video depth feature extraction optimization according to any embodiment of the invention when executing the computer program.
In a fourth aspect, a storage medium according to an embodiment of the present invention contains computer-executable instructions which, when executed by a computer processor, perform the method for video depth feature extraction optimization of any embodiment of the present invention.
The method, the system, the equipment and the storage medium for extracting and optimizing the video depth features, disclosed by the invention, are used for filtering in a feature vector space instead of a time-space domain of an original video frame, and can be used for purposefully optimizing the video of a complex scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method for video depth feature extraction optimization according to embodiment 1 of the present invention;
FIG. 2 is a flow chart of a method for video depth feature extraction optimization according to embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of a system for optimizing video depth feature extraction according to embodiment 3 of the present invention;
FIG. 4 is a schematic diagram of a system for optimizing video depth feature extraction according to embodiment 4 of the present invention;
fig. 5 is a schematic structural diagram of a computer device in embodiment 5 of the present invention.
Detailed Description
Through research, the inventors provide a method, a system, a device and a storage medium for video depth feature extraction optimization based on feature-space screening, which are described in detail below through embodiments.
Embodiment 1. The method for video depth feature extraction optimization in this embodiment, as shown in FIG. 1, comprises the following main steps:
110. Acquire video invalid-frame seeds.
In this step, the video invalid frame seed is specifically a monochrome video frame, a global blurred video frame, a single texture video frame, or a single scene video frame in the acquired video.
Monochrome video frames, such as white, black, gray or other colors, carry no meaning in a video and may exist only as cuts between several different shot transitions, which is common in videos of the film type. Because texture information is missing, a monochrome image makes the weights of the dimensions of the feature extracted by a CNN model roughly equal; this has little influence on classification and detection tasks, which attend more to local information, but in retrieval tasks that address global information it lowers the accuracy.
Assume that the original video frame is denoted I_rgb and is converted to the grayscale image
I_gray = 0.299·I_r + 0.587·I_g + 0.114·I_b;
There are two methods for judging monochrome video frames, one is a method based on no reference image and the other is a method based on a reference image set.
When there is no reference image set, the color uniformity is defined by a K-L divergence index computed against the uniform image filled with the mean gray value. With the normalized gray-level histogram of the video frame denoted hist(I_gray), the number of histogram bins B, and the histogram of the corresponding mean-gray image hist(μ_gray), the color uniformity index can be calculated by the K-L divergence
Uniformity(I_gray || μ_gray) = Σ_{b=1}^{B} hist(I_gray)[b] · log( hist(I_gray)[b] / hist(μ_gray)[b] ).
The selection of monochrome video frames is decided by setting a K-L divergence threshold for monochrome frames; the basic logic is to set the threshold U_thresh empirically, and if Uniformity(I_gray || μ_gray) ≤ U_thresh the candidate video frame is regarded as a monochrome video frame seed, otherwise it is not.
Based on a reference image set, whether a candidate video frame falls within the top-ranked portion of the reference set in terms of relative color uniformity is judged, and the relative color uniformity index of those images is calculated as the threshold to complete the screening of monochrome video frames. Assuming the reference image set is a certain batch of video frames, their normalized gray-level histograms are first calculated and sorted in descending order of gray-level frequency; in this embodiment, the accumulated mass of the first 5% of bins is calculated as the color uniformity threshold and used as the basis for screening monochrome frames against this reference set.
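A sketch of deriving this reference-based threshold, under the assumption that the "accumulated distribution over the first 5% of bins" is the histogram mass concentrated in the top 5% of bins of each reference frame and that the per-frame values are aggregated by their mean (the aggregation rule is not specified in the text):

```python
import cv2
import numpy as np

def uniformity_concentration(gray, bins=64, top_frac=0.05):
    """Mass of the normalized gray histogram concentrated in the top `top_frac`
    of bins (bins sorted in descending order)."""
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    hist = hist / hist.sum()
    hist_sorted = np.sort(hist)[::-1]                  # descending order
    k = max(1, int(np.ceil(top_frac * bins)))
    return float(hist_sorted[:k].sum())

def reference_uniformity_threshold(reference_grays, bins=64, top_frac=0.05):
    """Hedged: average the per-frame concentration over a reference batch; the mean
    is an assumption -- the text only says the batch is used as the screening basis."""
    vals = [uniformity_concentration(g, bins, top_frac) for g in reference_grays]
    return float(np.mean(vals))
```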
Globally blurred video frames arise, for example, from the motion in a video or, when the camera is shooting, from inaccurate focusing; such blur can be simulated by Gaussian filtering along certain directions. Compared with monochrome video frames these frames retain a small portion of the texture, but on the whole they still contain very little texture information and should still be regarded as invalid frames of the video.
Globally blurred video frames can be found from gradient information by selecting candidate frames with a sharpness index, where sharpness can be seen as the sum of the squared gray-level gradients along two orthogonal directions, for example the x-axis and the y-axis, with Δ_x and Δ_y denoting the gray-level gradients in those two orthogonal directions.
Globally blurred video frames are screened out by setting a threshold S_thresh: when Sharpness(I_gray) ≤ S_thresh, the frame is regarded as globally blurred. However, a higher sharpness value may also reflect globally monotonous texture, such as the plain text, forests or sand mentioned above, so a single-texture video frame selection is still necessary.
Single-texture video frames include, for example, pure-subtitle scene frames, frames completely covered by bullet-screen text, or full-frame leaves, sand and the like. The texture information of this type is monotonous and can be simulated by repeatedly transforming a single local image; for some detection tasks it can cause false detections of text and similar content in local regions of the scene.
There are two methods for selecting single-texture video frames: simulated image generation, and gradient-distribution-histogram threshold screening. The former forms a single-texture video frame from a single object or a part cut out of a real image: the object or image part is transformed by operations such as translation, rotation and scaling, and then placed into a blank or monochrome image. The latter selects frames by formula: in this embodiment, taking the x-axis of the image (which may itself have undergone translation, rotation, scaling and similar operations) as the rotation axis (any axis of the image may serve as the rotation axis) and the counter-clockwise rotation angle a ∈ [0, 180°), the mean μ(ΔI_gray) and the variance δ²(ΔI_gray) of the directional gradient are computed as in formulas (4) and (5) above.
A video frame with larger sharpness but smaller directional-gradient variance is relatively likely to be a single-texture video frame. In practice, a batch of single-texture video frames is first generated by translation, rotation, scaling and similar operations, the sharpness and the directional-gradient variance of this batch are calculated, and the two indices are used as the reference screening thresholds for single-texture frames of real videos. If such a simulated batch is not constructed, thresholds on sharpness and directional-gradient variance can be set directly as the screening scheme, for example Sharpness(I_gray) ≥ 0.8 and directional-gradient variance δ²(ΔI_gray) ≤ 0.1. In practice the first scheme is generally adopted: a batch of single-texture video frames is generated by simulation, and the means of the sharpness and the directional-gradient variance of that batch are taken as the reference for selecting single-texture frames from real videos.
Single-scene video frames include, for example, the very monotonous frames of starry skies, point light sources and similar scenes in some night-time videos; such frames can be partially simulated by superimposing salt-and-pepper noise on a monochrome image. They also suffer from scarce and monotonous texture information: under the convolutions of a CNN model, the salt-and-pepper-like noise is partially filtered out, so the frame degenerates into a situation similar to a monochrome video frame, with the weights of the extracted feature dimensions roughly equal, which affects subsequent retrieval tasks and lowers the retrieval accuracy. For example, some videos have fixed openings and endings that express no particular meaning and serve only as brand marks; they should also be placed among the invalid-frame seeds.
In a specific implementation, single-scene video frames can be generated by simulating salt-and-pepper noise on a monochrome image; for example, a blue sky with white clouds can be simulated with a sky-blue image plus diffuse salt noise, and a starry night sky can be simulated by superimposing white salt noise on a black image. Since a single scene may already have been selected in the preceding processing flow and its features after the CNN model resemble those of a monochrome frame, in practice a small number of such video frame samples can simply be placed into the video invalid-frame seeds. Some openings and endings carry higher-level semantics that low-level indices can hardly measure; the corresponding opening and ending frames can then be collected directly as single-scene video frame seeds.
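A small sketch of this simulation; the frame size, colors and noise density are illustrative values, not values taken from the patent:

```python
import numpy as np

def simulate_single_scene(height=360, width=640, base_color=(0, 0, 0),
                          salt_color=(255, 255, 255), density=0.002, seed=None):
    """Hedged sketch: a black frame with sparse white 'salt' pixels approximates a
    starry night sky; a sky-blue base with white salt can approximate clouds."""
    rng = np.random.default_rng(seed)
    frame = np.full((height, width, 3), base_color, dtype=np.uint8)
    mask = rng.random((height, width)) < density      # positions of the salt noise
    frame[mask] = salt_color
    return frame
```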
In summary, the basic principle of acquiring video invalid-frame seeds is to select video frames whose texture information is monotonous or not rich, so as to prevent the video features extracted in the next feature extraction step from harming subsequent tasks, and to optimize the finally obtained features through the screening of invalid features. The above acquisition process yields a batch of video invalid-frame seeds. Further, the video invalid-frame seeds can be expanded by, but not limited to, brightness transformation, Gaussian blur, motion blur, translation-rotation transformation or/and superimposed salt-and-pepper noise, so that the seeds adapt to more complex video environments and their number is increased, which helps guarantee the model's ability to extract features from valid or invalid frames; a sketch of such an augmentation pipeline is given below. An invalid feature base is then constructed from this batch of video invalid-frame seeds, updated during the actual screening process, and finally the valid features of the video are screened out.
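A sketch of the seed-expansion (augmentation) operations listed above, implemented with OpenCV and NumPy; every parameter range here is an illustrative assumption rather than a value from the patent:

```python
import cv2
import numpy as np

def augment_seed(frame, rng=None):
    """Hedged sketch of the named expansion operations: brightness change, Gaussian
    blur, motion blur, translation-rotation, and salt-and-pepper noise."""
    rng = rng or np.random.default_rng()
    out = frame.copy()
    out = cv2.convertScaleAbs(out, alpha=rng.uniform(0.6, 1.4),
                              beta=rng.uniform(-30, 30))          # brightness/contrast change
    out = cv2.GaussianBlur(out, (5, 5), sigmaX=rng.uniform(0.5, 2.0))  # Gaussian blur
    k = np.zeros((9, 9), np.float32)
    k[4, :] = 1.0 / 9.0                                           # horizontal motion-blur kernel
    out = cv2.filter2D(out, -1, k)                                # motion blur
    h, w = out.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), 1.0)  # rotation
    m[:, 2] += rng.uniform(-10, 10, size=2)                       # translation
    out = cv2.warpAffine(out, m, (w, h))
    noise = rng.random((h, w))
    out[noise < 0.002] = 0                                        # pepper
    out[noise > 0.998] = 255                                      # salt
    return out
```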
120. Construct the invalid feature base.
The inventors consider that a classification model for validity judgment could be trained directly in a specific implementation, but at this step there is only one batch of video invalid-frame seeds, so the quantity is small; enlarging the batch would consume a large amount of resources, and there is as yet no sample data for valid frames. Therefore, based on the idea of semi-supervised learning, this embodiment extracts the features of the existing video invalid-frame seeds with an existing feature extraction model to form an invalid feature base. When features are extracted from new video frames, frame features similar to the invalid features are found by computing feature similarity; the similar frame features are added to the invalid feature base to complete its update, and the corresponding video frames are listed as video invalid-frame seeds, thereby preparing data samples for the discrimination model. Once the data samples reach a certain scale, the frame validity binary classification model can be trained and updated, after which invalid features are screened out by this model and the valid features of the video are selected.
Based on the above considerations, the invalid feature base is constructed. Video invalid-frame seeds are a series of video frames whose appearance varies widely but whose texture is generally not rich and rather monotonous. Before the CNN feature extraction model is applied, the video invalid-frame seeds need to be expanded and enhanced, at least covering the enhancement operations involved in the seed extraction process such as brightness change, Gaussian blur, motion blur, translation-rotation transformation and superimposed salt-and-pepper noise, so that the features of the video invalid-frame seeds can be captured better. The construction flow of the invalid feature base is as follows:
Assume that the input of the CNN feature extraction model is an arbitrary image I_rgb and its output is the feature representation v ∈ R^K of that image, where K is the extracted feature dimension; the CNN feature extraction process can then be represented by the following mapping f:
f: I_rgb → v   (formula 1-7);
assume that the video invalid-frame seed set is {I_j^rgb, j=1, …, M} and the invalid feature set after the mapping transformation is {V_j, j=1, …, M}; this set constitutes the initial invalid feature base.
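A minimal sketch of the mapping f: I_rgb → v ∈ R^K, assuming a pretrained ResNet-50 from torchvision (0.13 or later) with its classification head removed as the feature extraction model; the patent does not fix a particular backbone, and K = 2048 simply follows from that choice. Unit-normalizing v lets the later cosine similarities reduce to dot products.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Hedged sketch of f: I_rgb -> v in R^K (formula 1-7); ResNet-50 without its
# classifier is an assumption, not a backbone named by the patent.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classification head, keep pooled features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(frame_rgb):
    """Map one decoded RGB frame (H x W x 3 uint8) to its K-dimensional feature v."""
    x = preprocess(frame_rgb).unsqueeze(0)
    v = backbone(x).squeeze(0)
    return torch.nn.functional.normalize(v, dim=0)   # unit-normalize for cosine similarity
```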
130. Update the invalid feature base and acquire the video valid feature set.
The decoded frame sequence obtained after the video is decoded passes through the mapping f given by the CNN feature extraction model to obtain the candidate feature set {f_i, i=1, 2, …, N}, where N is the total number of frames of the video. The similarity between a candidate feature and a feature in the invalid feature base can be represented by the cosine similarity between the two:
s_ij = <f_i, V_j> / ( ||f_i|| · ||V_j|| )   (formula 1-8).
In this embodiment a first threshold s_1 = 0.9 is defined for the invalid-feature comparison similarity: when s_ij > s_1, the candidate feature f_i is judged to be an invalid feature, it is added to the invalid feature base, and the original video frame corresponding to f_i is listed as a video invalid-frame seed. A second threshold s_2 = 0.5 is defined for the valid-feature comparison similarity: when s_ij ≤ s_2, the candidate feature f_i is judged to be a valid feature, it is added to the video valid feature set, and its corresponding original video frame is put into the valid frame set. If s_2 < s_ij ≤ s_1, the feature is judged to be a candidate valid feature, and its original video frame is put into the candidate valid frame set as a candidate valid frame, awaiting secondary confirmation by the discrimination model trained later.
140. Train the frame validity binary classification model from the updated invalid feature base and the video valid feature set.
Once the video invalid-frame seeds and the valid frame set reach a certain scale, for example P ≥ 10000, a frame validity binary classification model can be trained to complete the screening of valid video features. The image set corresponding to the video invalid-frame seeds is the negative sample set and the confirmed valid frame set is the positive sample set; after data enhancement operations including at least brightness transformation, Gaussian blur, motion blur, translation-rotation transformation and superimposed salt-and-pepper noise, they are fed into a CNN two-class classification model (for example ResNet50 with the output layer modified to two classes) and trained to obtain the frame validity binary classification model:
m_p: I_rgb → label ∈ {0, 1}   (formula 1-9).
Then the candidate valid frame set that needs secondary confirmation is further input into the frame validity binary classification model; the frames judged invalid are listed as video invalid-frame seeds and, after passing through the feature extraction model, their invalid features are added to the invalid feature base; the frames judged valid are added to the valid frame set and, after passing through the feature extraction model, their valid features are added to the video valid feature set.
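A sketch of setting up and updating m_p, keeping the ResNet50-with-two-class-output choice named above; the loss, optimizer and learning rate are standard assumptions rather than details taken from the patent:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hedged sketch of m_p: I_rgb -> label in {0, 1} (formula 1-9).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)     # 0 = invalid frame, 1 = valid frame

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One optimization step on a batch of augmented seed / valid-frame images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return float(loss)
```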
150. Extract video valid features with the frame validity binary classification model.
As the data set scale p grows, the model m_p can be trained and updated incrementally; the video valid features extracted with the frame validity binary classification model are used to update the video valid feature set, and the extracted invalid features are used to update the invalid feature base. As the data are updated and the scale M of the invalid feature base grows, comparing every feature of a new candidate video frame feature set to be discriminated (of size N) with every feature in the invalid feature base becomes expensive, with a complexity on the order of N·M·K. Since the feature dimension K is fixed, a clustering operation can be performed on the existing invalid feature base; after good clustering training, candidates only need to be compared against the C cluster centres (C ≪ M), so the complexity of valid-frame screening is reduced to the order of N·C·K and the speed of valid-feature screening can be greatly improved.
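A sketch of this clustering speed-up, using scikit-learn's KMeans as one possible clustering choice (the patent does not name a specific algorithm); candidates are then compared against the C cluster centres instead of all M base features:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_index(invalid_base, n_clusters=256, seed=0):
    """Cluster the invalid feature base so that candidates are screened against
    C cluster centres instead of all M base features."""
    base = np.stack(invalid_base)
    km = KMeans(n_clusters=min(n_clusters, len(base)), n_init=10, random_state=seed)
    km.fit(base)
    centres = km.cluster_centers_
    return centres / np.linalg.norm(centres, axis=1, keepdims=True)   # renormalize

def max_similarity_to_base(feature, centres):
    """Approximate max cosine similarity of a unit-normalized feature to the base."""
    return float(np.max(centres @ feature))
```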
In this method, representative invalid frames are selected and added to the video invalid-frame seeds, an invalid feature base is constructed with a depth feature extraction method, and similarity-threshold screening against the invalid feature base markedly reduces the influence of invalid frames on subsequent processing. Because representative noise models are added to the training data during deep-learning model training for contrastive learning, the features extracted by the depth model are fairly robust to noise, which reduces the influence of invalid frames even further.
Embodiment 2. The method for video depth feature extraction optimization in this embodiment, as shown in FIG. 2, comprises the following main steps:
210. Acquire video invalid-frame seeds.
In this step, the video invalid frame seed is specifically a monochrome video frame, a global blurred video frame, a single texture video frame, or a single scene video frame in the acquired video.
Monochrome video frames, such as white, black, gray or other colors, carry no meaning in a video and may exist only as cuts between several different shot transitions, which is common in videos of the film type. Because texture information is missing, a monochrome image makes the weights of the dimensions of the feature extracted by a CNN model roughly equal; this has little influence on classification and detection tasks, which attend more to local information, but in retrieval tasks that address global information it lowers the accuracy.
Assume that the original video frame is denoted I_rgb and is converted to the grayscale image
I_gray = 0.299·I_r + 0.587·I_g + 0.114·I_b;
There are two methods for judging monochrome video frames, one is a method based on no reference image and the other is a method based on a reference image set.
When there is no reference image set, the color uniformity is defined by a K-L divergence index computed against the uniform image filled with the mean gray value. With the normalized gray-level histogram of the video frame denoted hist(I_gray), the number of histogram bins B, and the histogram of the corresponding mean-gray image hist(μ_gray), the color uniformity index can be calculated by the K-L divergence
Uniformity(I_gray || μ_gray) = Σ_{b=1}^{B} hist(I_gray)[b] · log( hist(I_gray)[b] / hist(μ_gray)[b] ).
The selection of monochrome video frames is decided by setting a K-L divergence threshold for monochrome frames; the basic logic is to set the threshold U_thresh empirically, and if Uniformity(I_gray || μ_gray) ≤ U_thresh the candidate video frame is regarded as a monochrome video frame seed, otherwise it is not.
Based on a reference image set, whether a candidate video frame falls within the top-ranked portion of the reference set in terms of relative color uniformity is judged, and the relative color uniformity index of those images is calculated as the threshold to complete the screening of monochrome video frames. Assuming the reference image set is a certain batch of video frames, their normalized gray-level histograms are first calculated and sorted in descending order of gray-level frequency; in this embodiment, the accumulated mass of the first 5% of bins is calculated as the color uniformity threshold and used as the basis for screening monochrome frames against this reference set.
Globally blurred video frames arise, for example, from the motion in a video or, when the camera is shooting, from inaccurate focusing; such blur can be simulated by Gaussian filtering along certain directions. Compared with monochrome video frames these frames retain a small portion of the texture, but on the whole they still contain very little texture information and should still be regarded as invalid frames of the video.
Globally blurred video frames can be found from gradient information by selecting candidate frames with a sharpness index, where sharpness can be seen as the sum of the squared gray-level gradients along two orthogonal directions, for example the x-axis and the y-axis, with Δ_x and Δ_y denoting the gray-level gradients in those two orthogonal directions.
Globally blurred video frames are screened out by setting a threshold S_thresh: when Sharpness(I_gray) ≤ S_thresh, the frame is regarded as globally blurred. However, a higher sharpness value may also reflect globally monotonous texture, such as the plain text, forests or sand mentioned above, so a single-texture video frame selection is still necessary.
Single-texture video frames include, for example, pure-subtitle scene frames, frames completely covered by bullet-screen text, or full-frame leaves, sand and the like. The texture information of this type is monotonous and can be simulated by repeatedly transforming a single local image; for some detection tasks it can cause false detections of text and similar content in local regions of the scene.
There are two methods for selecting single-texture video frames: simulated image generation, and gradient-distribution-histogram threshold screening. The former forms a single-texture video frame from a single object or a part cut out of a real image: the object or image part is transformed by operations such as translation, rotation and scaling, and then placed into a blank or monochrome image. The latter selects frames by formula: in this embodiment, taking the x-axis of the image (which may itself have undergone translation, rotation, scaling and similar operations) as the rotation axis (any axis of the image may serve as the rotation axis) and the counter-clockwise rotation angle a ∈ [0, 180°), the mean μ(ΔI_gray) and the variance δ²(ΔI_gray) of the directional gradient are computed as in formulas (4) and (5) above.
A video frame with larger sharpness but smaller directional-gradient variance is relatively likely to be a single-texture video frame. In practice, a batch of single-texture video frames is first generated by translation, rotation, scaling and similar operations, the sharpness and the directional-gradient variance of this batch are calculated, and the two indices are used as the reference screening thresholds for single-texture frames of real videos. If such a simulated batch is not constructed, thresholds on sharpness and directional-gradient variance can be set directly as the screening scheme, for example Sharpness(I_gray) ≥ 0.8 and directional-gradient variance δ²(ΔI_gray) ≤ 0.1. In practice the first scheme is generally adopted: a batch of single-texture video frames is generated by simulation, and the means of the sharpness and the directional-gradient variance of that batch are taken as the reference for selecting single-texture frames from real videos.
Single-scene video frames include, for example, the very monotonous frames of starry skies, point light sources and similar scenes in some night-time videos; such frames can be partially simulated by superimposing salt-and-pepper noise on a monochrome image. They also suffer from scarce and monotonous texture information: under the convolutions of a CNN model, the salt-and-pepper-like noise is partially filtered out, so the frame degenerates into a situation similar to a monochrome video frame, with the weights of the extracted feature dimensions roughly equal, which affects subsequent retrieval tasks and lowers the retrieval accuracy. For example, some videos have fixed openings and endings that express no particular meaning and serve only as brand marks; they should also be placed among the invalid-frame seeds.
In a specific implementation, single-scene video frames can be generated by simulating salt-and-pepper noise on a monochrome image; for example, a blue sky with white clouds can be simulated with a sky-blue image plus diffuse salt noise, and a starry night sky can be simulated by superimposing white salt noise on a black image. Since a single scene may already have been selected in the preceding processing flow and its features after the CNN model resemble those of a monochrome frame, in practice a small number of such video frame samples can simply be placed into the video invalid-frame seeds. Some openings and endings carry higher-level semantics that low-level indices can hardly measure; the corresponding opening and ending frames can then be collected directly as single-scene video frame seeds.
In summary, the basic principle of acquiring video invalid-frame seeds is to select video frames whose texture information is monotonous or not rich, so as to prevent the video features extracted in the next feature extraction step from harming subsequent tasks, and to optimize the finally obtained features through the screening of invalid features. The above acquisition process yields a batch of video invalid-frame seeds. Further, the video invalid-frame seeds can be expanded by, but not limited to, brightness transformation, Gaussian blur, motion blur, translation-rotation transformation or/and superimposed salt-and-pepper noise, so that the seeds adapt to more complex video environments and their number is increased, which helps guarantee the model's ability to extract features from valid or invalid frames. An invalid feature base is then constructed from this batch of video invalid-frame seeds, updated during the actual screening process, and finally the valid features of the video are screened out.
220. Construct the invalid feature base.
The inventors consider that a classification model for validity judgment could be trained directly in a specific implementation, but at this step there is only one batch of video invalid-frame seeds, so the quantity is small; enlarging the batch would consume a large amount of resources, and there is as yet no sample data for valid frames. Therefore, based on the idea of semi-supervised learning, this embodiment extracts the features of the existing video invalid-frame seeds with an existing feature extraction model to form an invalid feature base. When features are extracted from new video frames, frame features similar to the invalid features are found by computing feature similarity; the similar frame features are added to the invalid feature base to complete its update, and the corresponding video frames are listed as video invalid-frame seeds, thereby preparing data samples for the discrimination model. Once the data samples reach a certain scale, the frame validity binary classification model can be trained and updated, after which invalid features are screened out by this model and the valid features of the video are selected.
Based on the above considerations, the invalid feature base is constructed. Video invalid-frame seeds are a series of video frames whose appearance varies widely but whose texture is generally not rich and rather monotonous. Before the CNN feature extraction model is applied, the video invalid-frame seeds need to be expanded and enhanced, at least covering the enhancement operations involved in the seed extraction process such as brightness change, Gaussian blur, motion blur, translation-rotation transformation and superimposed salt-and-pepper noise, so that the features of the video invalid-frame seeds can be captured better. The construction flow of the invalid feature base is as follows:
Assume that the input of the CNN feature extraction model is an arbitrary image I_rgb and its output is the feature representation v ∈ R^K of that image, where K is the extracted feature dimension; the CNN feature extraction process can then be represented by the following mapping f:
f: I_rgb → v   (formula 2-7);
assume that the video invalid-frame seed set is {I_j^rgb, j=1, …, M} and the invalid feature set after the mapping transformation is {V_j, j=1, …, M}; this set constitutes the initial invalid feature base.
230. Update the invalid feature base and acquire the video valid feature set.
The decoded frame sequence obtained after the video is decoded passes through the mapping f given by the CNN feature extraction model to obtain the candidate feature set {f_i, i=1, 2, …, N}, where N is the total number of frames of the video. The similarity between a candidate feature and a feature in the invalid feature base can be represented by the cosine similarity between the two:
s_ij = <f_i, V_j> / ( ||f_i|| · ||V_j|| )   (formula 2-8).
In this embodiment a first threshold s_1 = 0.9 is defined for the invalid-feature comparison similarity: when s_ij > s_1, the candidate feature f_i is judged to be an invalid feature, it is added to the invalid feature base, and the original video frame corresponding to f_i is listed as a video invalid-frame seed. A second threshold s_2 = 0.5 is defined for the valid-feature comparison similarity: when s_ij ≤ s_2, the candidate feature f_i is judged to be a valid feature, it is added to the video valid feature set, and its corresponding original video frame is put into the valid frame set. If s_2 < s_ij ≤ s_1, the feature is judged to be a candidate valid feature, and its original video frame is put into the candidate valid frame set as a candidate valid frame, awaiting secondary confirmation by the discrimination model trained later.
240. Train the frame validity binary classification model from the updated invalid feature base and the video valid feature set.
Once the video invalid-frame seeds and the valid frame set reach a certain scale, which can for example be set to P ≥ 10000, a frame validity binary classification model can be trained to complete the screening of valid video features. The image set corresponding to the video invalid-frame seeds is the negative sample set and the confirmed valid frame set is the positive sample set; after data enhancement operations including at least brightness transformation, Gaussian blur, motion blur, translation-rotation transformation and superimposed salt-and-pepper noise, they are fed into a CNN two-class classification model (for example ResNet50 with the output layer modified to two classes) and trained to obtain the frame validity binary classification model:
m_p: I_rgb → label ∈ {0, 1}   (formula 2-9).
Then the candidate valid frame set that needs secondary confirmation is further input into the frame validity binary classification model; the frames judged invalid are listed as video invalid-frame seeds and, after passing through the feature extraction model, their invalid features are added to the invalid feature base; the frames judged valid are added to the valid frame set and, after passing through the feature extraction model, their valid features are added to the video valid feature set.
250. Extract video valid features with the frame validity binary classification model.
As the data set scale p grows, the model m_p can be trained and updated incrementally; the video valid features extracted with the frame validity binary classification model are used to update the video valid feature set, and the extracted invalid features are used to update the invalid feature base. As the data are updated and the scale M of the invalid feature base grows, comparing every feature of a new candidate video frame feature set to be discriminated (of size N) with every feature in the invalid feature base becomes expensive, with a complexity on the order of N·M·K. Since the feature dimension K is fixed, a clustering operation can be performed on the existing invalid feature base; after good clustering training, candidates only need to be compared against the C cluster centres (C ≪ M), so the complexity of valid-frame screening is reduced to the order of N·C·K and the speed of valid-feature screening can be greatly improved.
260. Screening video redundant features from the video valid feature set to obtain a valid key feature set.
The inventor observes that a video contains a great deal of redundant information; in particular, adjacent frames within the same scene are extremely similar. This redundancy is usually unnecessary for subsequent classification or retrieval tasks: on the one hand it consumes computing resources and reduces processing performance, and on the other hand it may degrade accuracy through mismatches. Traditional time-space-domain methods generally compute the inter-frame difference directly and treat frames as redundant once the difference falls below a certain threshold, but this is easily disturbed by noise. In this method, redundant features are screened in the feature space based on the features extracted by the CNN, which makes full use of the properties of the CNN model and of the data enhancement operations applied during training, improving robustness to noise in the original spatial domain.
Specifically, the similarity between the current feature in the video valid feature set and the feature of the next time step is compared; if the comparison result is smaller than the third similarity threshold s_3, the next feature is retained, otherwise the next feature is screened out, and this continues until all features in the video valid feature set have been processed.
In a specific implementation, for example, the candidate feature set after validity screening becomes {f_t : t = 1, 2, …, T}, where the subscript t is a temporal sequence number that indicates only the temporal order, not the original frame number, because after valid frame screening some frame features have been determined to be invalid and discarded. The redundant feature screening process based on inter-frame feature similarity is as follows:
calculate
s_pq = (f_p · f_q) / (||f_p|| · ||f_q||),
where p = 1, …, T−1 and q = 1, …, T. Define the third threshold of inter-frame feature similarity as s_3. If s_pq ≥ s_3, the features at time steps p and q are highly similar, the feature at q is redundant with respect to the earlier feature at p, the redundant later frame feature is removed, and q ← q + 1; otherwise, the feature at time step p is marked as a valid key feature, the q following the current feature is assigned to p, i.e., p ← q, and the operation is repeated until p reaches the end T of the sequence. In actual use, based on the obtained valid key feature set, the key features are further optimized according to the specific application requirements described below.
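A compact sketch of this screening loop is given below; it assumes cosine similarity for s_pq, keeps the final representative at the end of the sequence, and uses a placeholder value for s_3, all of which are implementation choices rather than details fixed by the text.

```python
import numpy as np

def filter_redundant(features, s3=0.8):
    """Keep one representative feature per run of highly similar frames.

    `features` is a T x K array ordered by time; s3 is the third (inter-frame)
    similarity threshold, whose value here is only a placeholder.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    keep, p = [], 0
    for q in range(1, len(f)):
        if f[p] @ f[q] >= s3:
            continue          # q is redundant with respect to p: drop it, q <- q + 1
        keep.append(p)        # p is marked as a valid key feature
        p = q                 # p <- q
    keep.append(p)            # keep the last representative as well
    return features[keep]
```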
270. Setting corresponding thresholds according to the task and optimizing the valid key feature set.
All of the threshold hyper-parameters involved in the above flow can be obtained according to the indexes of the specific task to which the valid key features are finally applied. Through grid search, the threshold parameters are divided into a grid over their possible value ranges; each grid point of hyper-parameters is then used to obtain the corresponding key feature combination of the video; finally, statistics are gathered on the specific task, such as classification, detection, or retrieval, the indexes of the corresponding task are evaluated, and the video feature screening and the final valid key features are optimized accordingly, so that the hyper-parameters best suited to the specific task are selected and the optimal valid key features are screened out.
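A generic form of this grid search is sketched below; the pipeline and the task metric are passed in as callables, and the parameter names and candidate values in the comment are merely examples.

```python
from itertools import product

def grid_search(thresholds, extract_key_features, task_metric, videos):
    """Pick the threshold combination that maximises the downstream task metric.

    `thresholds` maps names (e.g. 's1', 's2', 's3') to lists of candidate values;
    `extract_key_features` and `task_metric` stand in for the screening pipeline
    above and for the classification / detection / retrieval evaluation.
    """
    best_score, best_params = float("-inf"), None
    names = list(thresholds)
    for values in product(*(thresholds[n] for n in names)):
        params = dict(zip(names, values))
        feats = [extract_key_features(v, **params) for v in videos]
        score = task_metric(feats)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# e.g. grid_search({'s1': [0.85, 0.9, 0.95], 's2': [0.4, 0.5], 's3': [0.7, 0.8]}, ...)
```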
Compared with directly screening invalid frames and redundant frames in the original time-space domain, the method of this embodiment selects representative invalid frames to add to the video invalid frame seeds, constructs an invalid feature base through depth feature extraction, and screens by the similarity threshold of the invalid feature base, which significantly reduces the influence of invalid frames on subsequent processing; if the training data used in the deep learning model are additionally contrast-learned with representative noise models added, the influence of invalid frames is reduced even further. The filtering of redundant frames is performed in the feature space and can be unified with the retrieval task required by the subsequent features, so that joint optimization can be achieved: the redundant frame filtering process is combined with the specific task and the retrieval effect is improved in a targeted manner.
Embodiment 3. The system for video depth feature extraction optimization in this embodiment, as shown in Fig. 3, includes: a video invalid frame seed acquiring unit 310, an invalid feature base construction unit 320, an updating unit 330, a training unit 340, and a video valid feature extracting unit 350.
The video invalid frame seed acquiring unit 310 is used for acquiring video invalid frame seeds. Specifically, a video invalid frame seed is a monochrome video frame, a globally blurred video frame, a single texture video frame, or a single scene video frame acquired from the video.
Acquiring monochrome video frames: frames of white, black, gray, or another solid color carry no meaning in the video and may exist only as cuts between different shot transitions, which is common in movie-type videos. Because texture information is lost, a monochrome image causes the weights of every dimension of the feature extracted by the CNN model to become nearly uniform. Such uniformity has little influence on classification and detection tasks, which pay more attention to local information, but in retrieval tasks that rely on global information it reduces accuracy.
Assume the original video frame is represented by I_rgb and is converted to a gray-scale image:
I_gray = 0.299·I_r + 0.587·I_g + 0.114·I_b;
There are two methods for judging monochrome video frames, one is a method based on no reference image and the other is a method based on a reference image set.
Without a reference image set, color uniformity is defined by a K-L divergence index relative to a uniform gray-level distribution: the normalized gray-level histogram of the video frame is denoted hist(I_gray), the histogram has B bins, and the corresponding uniform gray-level distribution is denoted hist(μ_gray); the color uniformity index, denoted Uniformity(I_gray || μ_gray), is computed as the K-L divergence between the two.
Monochrome video frames are then selected by setting a K-L divergence threshold for monochrome frames; the basic logic is to set a threshold U_thresh empirically, and if Uniformity(I_gray || μ_gray) ≤ U_thresh, the candidate video frame is considered a monochrome video frame seed, otherwise it is not.
Based on a reference image set, the judgment is whether the candidate video frame falls within the top-ranked portion of the reference image set in relative color uniformity, and the relative color uniformity index of the images is calculated as the threshold to complete the screening of monochrome video frames. Assuming the reference image set is a certain batch of video frames, the normalized gray-level histograms of the reference image set are first calculated and then sorted in descending order of gray level; in this embodiment, the definite integral value of the cumulative gray-level distribution over the first 5% of bins is calculated as the color uniformity threshold and used as the basis for screening monochrome image frames of the reference image set.
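For illustration, a minimal version of the no-reference histogram test might look like this; the direction of the K-L divergence (frame histogram against a uniform distribution), the decision rule, and the bin count are assumptions made here, not values fixed by the text.

```python
import numpy as np

def gray_histogram(frame_rgb, bins=64):
    # I_gray = 0.299 R + 0.587 G + 0.114 B, then a normalized gray-level histogram.
    gray = (frame_rgb[..., 0] * 0.299 +
            frame_rgb[..., 1] * 0.587 +
            frame_rgb[..., 2] * 0.114)
    hist, _ = np.histogram(gray, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def uniformity(hist, eps=1e-12):
    # K-L divergence of the frame histogram from a uniform gray-level distribution;
    # the divergence direction and any threshold applied to it are assumptions.
    uniform = np.full_like(hist, 1.0 / len(hist))
    return float(np.sum(hist * np.log((hist + eps) / (uniform + eps))))
```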
Acquiring globally blurred video frames: such blur comes, for example, from the motion information of the video or from inaccurate focusing when the camera shoots the video, and these frames can be simulated by Gaussian filtering along certain directions. Compared with monochrome video frames they retain a small portion of texture features, but they still generally belong to video frames containing little texture information and should still be regarded as invalid frames of the video.
Globally blurred video frames can be found through gradient information: candidate video frames are selected based on a sharpness index, where sharpness can be regarded as the sum of squares of the gray-level gradients along two orthogonal directions, e.g., the x-axis and the y-axis, with Δ_x and Δ_y denoting the gray gradients of the sharpness in the two orthogonal directions, i.e., Sharpness(I_gray) = Σ[(Δ_x I_gray)² + (Δ_y I_gray)²].
Video frames with global blur can then be screened out by setting a threshold S_thresh: when Sharpness(I_gray) ≤ S_thresh, the frame is considered a globally blurred video frame. However, higher sharpness values may also reflect globally single texture information such as the plain text, forests, and sand mentioned above, so single texture video frame selection is still necessary.
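A small sketch of this sharpness screen, using numpy.gradient for the two orthogonal gray gradients; averaging over pixels and the example threshold are implementation choices rather than values from the text.

```python
import numpy as np

def sharpness(gray):
    # Sum of squared gray-level gradients along two orthogonal directions (x and y),
    # averaged over pixels so the score is independent of resolution.
    dy, dx = np.gradient(gray.astype(np.float64))
    return float(np.mean(dx ** 2 + dy ** 2))

def is_globally_blurred(gray, s_thresh=5.0):
    # s_thresh is a placeholder; the text sets the threshold empirically.
    return sharpness(gray) <= s_thresh
```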
Acquiring single texture video frames: examples include pure subtitle scene frames in a video, frames completely covered by bullet-screen text, or full-frame leaves, sand, and the like. The texture information in such frames is of a single type and can be simulated by repeatedly transforming a single local image; for some detection tasks, such scenes can cause false detection of text and other content in local regions.
There are two methods for selecting single texture video frames: a simulated image generation method and a gradient distribution histogram threshold screening method. In the former, a single texture video frame is formed from a single object or from a part captured from a real image, and the transformed object or image part is placed into a blank or monochrome image through operations such as translation, rotation, and scaling. In the latter, selection is made by formula: in this embodiment, the x-axis of the image (which may itself have undergone operations such as translation, rotation, and scaling) is taken as the rotation axis (any axis of the image may serve as the rotation axis), the counterclockwise rotation angle is a ∈ [0, 180°), and the mean and variance of the directional gradients are computed over these rotation directions.
A video frame with larger sharpness but smaller directional gradient variance has a relatively high likelihood of being a single texture video frame. In practical application, a portion of single texture video frames is first generated based on operations such as translation, rotation, and scaling, and the sharpness and directional gradient variance of this batch of single texture video frames are then calculated and used as the reference screening thresholds for single texture frames of real videos. If such a simulation scheme is not constructed, thresholds on sharpness and on directional gradient variance can be set directly as the screening scheme, for example Sharpness(I_gray) ≥ 0.8 and directional gradient variance δ²(ΔI_gray) ≤ 0.1. In practical applications the first scheme is generally adopted, i.e., a batch of single texture video frames is generated through simulation, and the mean values of the sharpness and of the directional gradient variance of this batch of video frames are counted as the reference for selecting single texture frames of real videos.
Acquiring single scene video frames: examples include highly monotonous scenes such as stars or point lights at night in certain videos; such frames can be partially simulated by superimposing impulse noise on a monochrome image. The problem is again that the texture information of these frames is sparse and monotonous: under the convolution processing of the CNN model, the salt-and-pepper-like noise is partially filtered out, so the frame turns into a case similar to a monochrome video frame, the weights of each dimension of the extracted feature become relatively close, subsequent retrieval tasks are affected, and retrieval accuracy drops. For example, some videos have fixed opening and closing title frames that express no specific meaning and serve only as brand marks; these should also be placed into the invalid frame seeds.
In a specific implementation, single scene video frames can be generated by simulating salt-and-pepper noise on a monochrome image; for example, a blue sky with white clouds can be simulated with a sky-blue image plus scattered salt noise, and a starry night sky can be simulated by superimposing white salt noise on a black image. Considering that single scene frames may already have been selected in the preceding processing flow, and that their features after the CNN model are similar to those of monochrome video frames, in practical application it is sufficient to place a small number of such video frame samples into the video invalid frame seeds. The opening and closing titles of some videos carry higher-level semantics that low-level indexes may fail to measure; the corresponding opening and closing frames can be collected directly as single scene video frame seeds.
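As an illustration of this simulation, the sketch below builds a single scene frame from a solid base color plus sparse "salt" pixels; the colors, image size, and noise ratio are arbitrary placeholders.

```python
import numpy as np

def simulate_single_scene_frame(size=(224, 224, 3), base_color=(0, 0, 0),
                                salt_ratio=0.002, salt_color=(255, 255, 255),
                                rng=None):
    """E.g. a starry night sky: a black image with sparse white salt pixels;
    a blue-sky frame would use a sky-blue base color instead."""
    rng = rng or np.random.default_rng()
    frame = np.full(size, base_color, dtype=np.uint8)   # monochrome base image
    mask = rng.random(size[:2]) < salt_ratio            # sparse salt-noise mask
    frame[mask] = salt_color
    return frame
```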
In summary, the basic principle of acquiring video invalid frame seeds is to select video frames whose texture information is monotonous or not rich, so as to prevent the video features extracted in the subsequent feature extraction step from affecting downstream tasks, and to optimize the finally obtained features through invalid feature screening. Through the above acquisition process, a batch of video invalid frame seeds is obtained. Further, the video invalid frame seeds can be expanded by, but not limited to, brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise, so that the video invalid frame seeds adapt to more complex video environments; increasing their number helps ensure the model's ability to extract features of valid and invalid frames. An invalid feature base is then constructed based on this batch of video invalid frame seeds, the invalid feature base is updated during the actual screening process, and finally the valid features of the video are screened out.
The invalid feature base construction unit 320 is used for constructing the invalid feature base. A classification model for validity judgment could in principle be trained directly, but at this stage only one batch of video invalid frame seeds exists: their number is small, enlarging the batch would consume a large amount of resources, and no sample data of valid frames is yet available. Therefore, this embodiment follows the idea of semi-supervised learning: the features of the existing video invalid frame seeds are extracted with an existing feature extraction model to form an invalid feature base; then, as features are extracted from new video frames, video frame features similar to the invalid features are found through feature similarity calculation, the similar features are added to the invalid feature base to complete its update, and the corresponding video frames are listed as video invalid frame seeds. In this way data samples are accumulated for a discrimination model; once the scale of the data samples reaches a certain level, the validity discrimination model can be trained and updated, after which invalid features are screened out by the validity discrimination model so that the valid features of the video are selected.
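For reference, the existing feature extraction model referred to above can be sketched with any CNN backbone whose classification head is removed; the snippet below uses torchvision's ResNet50 purely as an illustrative assumption (the text does not fix the extractor's architecture), giving a K = 2048-dimensional feature per frame.

```python
import torch
import torch.nn as nn
from torchvision import models

# A sketch of a frame-level feature extractor: a ResNet50 backbone with its
# final classification layer removed, so each RGB frame maps to a K-dim vector.
backbone = models.resnet50(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
feature_extractor.eval()

@torch.no_grad()
def extract_feature(image_tensor):
    # image_tensor: a (1, 3, H, W) RGB tensor, already resized and normalised.
    v = feature_extractor(image_tensor)   # shape (1, 2048, 1, 1) after global pooling
    return v.flatten(1).squeeze(0)        # the feature vector v with K = 2048
```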
Based on the above considerations, the invalid feature base is constructed. Video invalid frame seeds are a series of video frames that vary widely in appearance but whose texture is generally sparse and monotonous. Before training the CNN feature extraction model, the video invalid frame seeds need to be expanded and enhanced, at least adapting the enhancement operations involved in the seed acquisition process, such as brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, and superimposed salt-and-pepper noise, so that the features of the video invalid frame seeds can be captured better. The construction flow of the invalid feature base is as follows:
Assume the input of the CNN feature extraction model is an arbitrary image I_rgb and the output is a feature representation v ∈ R^K of the image, where K is the extracted feature dimension; the process by which the CNN extracts features can then be represented by the following mapping f:
f : I_rgb → v    (Formula 307);
Assume the video invalid frame seed set has been collected as described above; the invalid feature set obtained after the mapping transformation is {V_j : j = 1, …, M}, and these features constitute the initial invalid feature base.
The updating unit 330 is used for updating the invalid feature base and obtaining the video valid feature set. Specifically, the decoded frame sequence obtained after video decoding is passed through the mapping f of the CNN feature extraction model to obtain a candidate feature set {f_i : i = 1, 2, …, N}, where N is the total number of frames of the video. The similarity between a candidate feature and a feature in the invalid feature base can be represented by the cosine similarity between the two:
s_ij = (f_i · V_j) / (||f_i|| · ||V_j||).
In this embodiment, a first threshold of invalid-feature comparison similarity is defined as s_1 = 0.9: when s_ij > s_1, the candidate feature f_i is determined to be an invalid feature, it is added to the invalid feature base, and the original video frame corresponding to f_i is listed as a video invalid frame seed. A second threshold of valid-feature comparison similarity is defined as s_2 = 0.5: when s_ij ≤ s_2, the candidate feature f_i is determined to be a valid feature, it is added to the video valid feature set, and its corresponding original video frame is put into the valid frame set. If s_2 < s_ij ≤ s_1, the feature is judged to be a candidate valid feature, its corresponding original video frame is put into a candidate valid frame set as a candidate valid frame, and it awaits secondary confirmation by the discrimination model trained subsequently.
The training unit 340 is used for training the frame validity binary discrimination model according to the updated invalid feature base and the video valid feature set. Specifically, when the scale of the video invalid frame seeds and of the valid frame set reaches a certain level, for example P ≥ 10000, the frame validity binary discrimination model can be trained to complete the video valid feature screening process. The image set corresponding to the video invalid frame seeds serves as the negative sample set and the confirmed valid frame set as the positive sample set; after data enhancement operations including at least brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, and superimposed salt-and-pepper noise, the samples are fed into a CNN two-class classification model (for example ResNet50 with the output layer modified to two-class output), and the frame validity binary discrimination model is obtained through training:
m_p : I_rgb → label ∈ {0, 1}    (Formula 309);
Then the candidate valid frame set that requires secondary confirmation is fed into the frame validity binary discrimination model. Frames judged invalid are listed as video invalid frame seeds, and their features, obtained through the feature extraction model, are added to the invalid feature base; frames judged valid are added to the valid frame set, and their features, obtained through the feature extraction model, are added to the video valid feature set.
The video valid feature extracting unit 350 extracts video valid features using the frame validity binary discrimination model. As the dataset size P increases, the model m_p can be trained and updated incrementally; the video valid features extracted by the frame validity binary discrimination model are used to update the video valid feature set, and the extracted invalid features are used to update the invalid feature base. As the data are updated and the scale M of the invalid feature base grows, comparing every feature of a new candidate video frame feature set to be discriminated (of set size N) against every feature in the invalid feature base becomes expensive, with complexity on the order of O(N·M·K). Since the feature dimension K is a fixed value, a clustering operation can be performed on the existing invalid feature base; once the clusters are well trained, the complexity of valid frame screening can be reduced to roughly O(N·C·K), where C is the number of cluster centers and C ≪ M, which greatly improves the speed of valid feature screening.
In the system of this embodiment, representative invalid frames are selected and added to the video invalid frame seeds, an invalid feature base is constructed through depth feature extraction, and screening by the similarity threshold of the invalid feature base significantly reduces the influence of invalid frames on subsequent processing, because the features extracted by the depth model are fairly robust to noise; if the training data used in the deep learning model are additionally contrast-learned with representative noise models added, the influence of invalid frames is reduced even further.
Embodiment 4. The system for video depth feature extraction optimization in this embodiment, as shown in Fig. 4, includes: a video invalid frame seed acquiring unit 410, an invalid feature base construction unit 420, an updating unit 430, a training unit 440, a video valid feature extracting unit 450, a video redundant feature screening unit 460, and an optimizing unit 470.
The video invalid frame seed acquiring unit 410 is used for acquiring video invalid frame seeds. Specifically, a video invalid frame seed is a monochrome video frame, a globally blurred video frame, a single texture video frame, or a single scene video frame acquired from the video.
Acquiring monochrome video frames: frames of white, black, gray, or another solid color carry no meaning in the video and may exist only as cuts between different shot transitions, which is common in movie-type videos. Because texture information is lost, a monochrome image causes the weights of every dimension of the feature extracted by the CNN model to become nearly uniform. Such uniformity has little influence on classification and detection tasks, which pay more attention to local information, but in retrieval tasks that rely on global information it reduces accuracy.
Assume the original video frame is represented by I_rgb and is converted to a gray-scale image:
I_gray = 0.299·I_r + 0.587·I_g + 0.114·I_b;
There are two methods for judging monochrome video frames, one is a method based on no reference image and the other is a method based on a reference image set.
Without a reference image set, color uniformity is defined by a K-L divergence index relative to a uniform gray-level distribution: the normalized gray-level histogram of the video frame is denoted hist(I_gray), the histogram has B bins, and the corresponding uniform gray-level distribution is denoted hist(μ_gray); the color uniformity index, denoted Uniformity(I_gray || μ_gray), is computed as the K-L divergence between the two.
Monochrome video frames are then selected by setting a K-L divergence threshold for monochrome frames; the basic logic is to set a threshold U_thresh empirically, and if Uniformity(I_gray || μ_gray) ≤ U_thresh, the candidate video frame is considered a monochrome video frame seed, otherwise it is not.
Based on a reference image set, the judgment is whether the candidate video frame falls within the top-ranked portion of the reference image set in relative color uniformity, and the relative color uniformity index of the images is calculated as the threshold to complete the screening of monochrome video frames. Assuming the reference image set is a certain batch of video frames, the normalized gray-level histograms of the reference image set are first calculated and then sorted in descending order of gray level; in this embodiment, the definite integral value of the cumulative gray-level distribution over the first 5% of bins is calculated as the color uniformity threshold and used as the basis for screening monochrome image frames of the reference image set.
Acquiring globally blurred video frames: such blur comes, for example, from the motion information of the video or from inaccurate focusing when the camera shoots the video, and these frames can be simulated by Gaussian filtering along certain directions. Compared with monochrome video frames they retain a small portion of texture features, but they still generally belong to video frames containing little texture information and should still be regarded as invalid frames of the video.
Globally blurred video frames can be found through gradient information: candidate video frames are selected based on a sharpness index, where sharpness can be regarded as the sum of squares of the gray-level gradients along two orthogonal directions, e.g., the x-axis and the y-axis, with Δ_x and Δ_y denoting the gray gradients of the sharpness in the two orthogonal directions, i.e., Sharpness(I_gray) = Σ[(Δ_x I_gray)² + (Δ_y I_gray)²].
Video frames with global blur can then be screened out by setting a threshold S_thresh: when Sharpness(I_gray) ≤ S_thresh, the frame is considered a globally blurred video frame. However, higher sharpness values may also reflect globally single texture information such as the plain text, forests, and sand mentioned above, so single texture video frame selection is still necessary.
Acquiring single texture video frames: examples include pure subtitle scene frames in a video, frames completely covered by bullet-screen text, or full-frame leaves, sand, and the like. The texture information in such frames is of a single type and can be simulated by repeatedly transforming a single local image; for some detection tasks, such scenes can cause false detection of text and other content in local regions.
There are two methods for selecting single texture video frames: a simulated image generation method and a gradient distribution histogram threshold screening method. In the former, a single texture video frame is formed from a single object or from a part captured from a real image, and the transformed object or image part is placed into a blank or monochrome image through operations such as translation, rotation, and scaling. In the latter, selection is made by formula: in this embodiment, the x-axis of the image (which may itself have undergone operations such as translation, rotation, and scaling) is taken as the rotation axis (any axis of the image may serve as the rotation axis), the counterclockwise rotation angle is a ∈ [0, 180°), and the mean and variance of the directional gradients are computed over these rotation directions.
A video frame with larger sharpness but smaller directional gradient variance has a relatively high likelihood of being a single texture video frame. In practical application, a portion of single texture video frames is first generated based on operations such as translation, rotation, and scaling, and the sharpness and directional gradient variance of this batch of single texture video frames are then calculated and used as the reference screening thresholds for single texture frames of real videos. If such a simulation scheme is not constructed, thresholds on sharpness and on directional gradient variance can be set directly as the screening scheme, for example Sharpness(I_gray) ≥ 0.8 and directional gradient variance δ²(ΔI_gray) ≤ 0.1. In practical applications the first scheme is generally adopted, i.e., a batch of single texture video frames is generated through simulation, and the mean values of the sharpness and of the directional gradient variance of this batch of video frames are counted as the reference for selecting single texture frames of real videos.
Acquiring single scene video frames: examples include highly monotonous scenes such as stars or point lights at night in certain videos; such frames can be partially simulated by superimposing impulse noise on a monochrome image. The problem is again that the texture information of these frames is sparse and monotonous: under the convolution processing of the CNN model, the salt-and-pepper-like noise is partially filtered out, so the frame turns into a case similar to a monochrome video frame, the weights of each dimension of the extracted feature become relatively close, subsequent retrieval tasks are affected, and retrieval accuracy drops. For example, some videos have fixed opening and closing title frames that express no specific meaning and serve only as brand marks; these should also be placed into the invalid frame seeds.
In a specific implementation, single scene video frames can be generated by simulating salt-and-pepper noise on a monochrome image; for example, a blue sky with white clouds can be simulated with a sky-blue image plus scattered salt noise, and a starry night sky can be simulated by superimposing white salt noise on a black image. Considering that single scene frames may already have been selected in the preceding processing flow, and that their features after the CNN model are similar to those of monochrome video frames, in practical application it is sufficient to place a small number of such video frame samples into the video invalid frame seeds. The opening and closing titles of some videos carry higher-level semantics that low-level indexes may fail to measure; the corresponding opening and closing frames can be collected directly as single scene video frame seeds.
In summary, the basic principle of acquiring video invalid frame seeds is to select video frames whose texture information is monotonous or not rich, so as to prevent the video features extracted in the subsequent feature extraction step from affecting downstream tasks, and to optimize the finally obtained features through invalid feature screening. Through the above acquisition process, a batch of video invalid frame seeds is obtained. Further, the video invalid frame seeds can be expanded by, but not limited to, brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise, so that the video invalid frame seeds adapt to more complex video environments; increasing their number helps ensure the model's ability to extract features of valid and invalid frames. An invalid feature base is then constructed based on this batch of video invalid frame seeds, the invalid feature base is updated during the actual screening process, and finally the valid features of the video are screened out.
The invalid feature base construction unit 420 is used for constructing the invalid feature base. A classification model for validity judgment could in principle be trained directly, but at this stage only one batch of video invalid frame seeds exists: their number is small, enlarging the batch would consume a large amount of resources, and no sample data of valid frames is yet available. Therefore, this embodiment follows the idea of semi-supervised learning: the features of the existing video invalid frame seeds are extracted with an existing feature extraction model to form an invalid feature base; then, as features are extracted from new video frames, video frame features similar to the invalid features are found through feature similarity calculation, the similar features are added to the invalid feature base to complete its update, and the corresponding video frames are listed as video invalid frame seeds. In this way data samples are accumulated for a discrimination model; once the scale of the data samples reaches a certain level, the validity discrimination model can be trained and updated, after which invalid features are screened out by the validity discrimination model so that the valid features of the video are selected.
Based on the above considerations, the invalid feature base is constructed. Video invalid frame seeds are a series of video frames that vary widely in appearance but whose texture is generally sparse and monotonous. Before training the CNN feature extraction model, the video invalid frame seeds need to be expanded and enhanced, at least adapting the enhancement operations involved in the seed acquisition process, such as brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, and superimposed salt-and-pepper noise, so that the features of the video invalid frame seeds can be captured better. The construction flow of the invalid feature base is as follows:
Assume the input of the CNN feature extraction model is an arbitrary image I_rgb and the output is a feature representation v ∈ R^K of the image, where K is the extracted feature dimension; the process by which the CNN extracts features can then be represented by the following mapping f:
f : I_rgb → v    (Formula 407);
Assume the video invalid frame seed set has been collected as described above; the invalid feature set obtained after the mapping transformation is {V_j : j = 1, …, M}, and these features constitute the initial invalid feature base.
The updating unit 430 is used for updating the invalid feature base and obtaining the video valid feature set. Specifically, the decoded frame sequence obtained after video decoding is passed through the mapping f of the CNN feature extraction model to obtain a candidate feature set {f_i : i = 1, 2, …, N}, where N is the total number of frames of the video. The similarity between a candidate feature and a feature in the invalid feature base can be represented by the cosine similarity between the two:
s_ij = (f_i · V_j) / (||f_i|| · ||V_j||).
In this embodiment, a first threshold of invalid-feature comparison similarity is defined as s_1 = 0.9: when s_ij > s_1, the candidate feature f_i is determined to be an invalid feature, it is added to the invalid feature base, and the original video frame corresponding to f_i is listed as a video invalid frame seed. A second threshold of valid-feature comparison similarity is defined as s_2 = 0.5: when s_ij ≤ s_2, the candidate feature f_i is determined to be a valid feature, it is added to the video valid feature set, and its corresponding original video frame is put into the valid frame set. If s_2 < s_ij ≤ s_1, the feature is judged to be a candidate valid feature, its corresponding original video frame is put into a candidate valid frame set as a candidate valid frame, and it awaits secondary confirmation by the discrimination model trained subsequently.
The training unit 440 is used for training the frame validity binary discrimination model according to the updated invalid feature base and the video valid feature set. Specifically, when the scale of the video invalid frame seeds and of the valid frame set reaches a certain level, for example P ≥ 10000, the frame validity binary discrimination model can be trained to complete the video valid feature screening process. The image set corresponding to the video invalid frame seeds serves as the negative sample set and the confirmed valid frame set as the positive sample set; after data enhancement operations including at least brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, and superimposed salt-and-pepper noise, the samples are fed into a CNN two-class classification model (for example ResNet50 with the output layer modified to two-class output), and the frame validity binary discrimination model is obtained through training:
m_p : I_rgb → label ∈ {0, 1}    (Formula 409);
Then the candidate valid frame set that requires secondary confirmation is fed into the frame validity binary discrimination model. Frames judged invalid are listed as video invalid frame seeds, and their features, obtained through the feature extraction model, are added to the invalid feature base; frames judged valid are added to the valid frame set, and their features, obtained through the feature extraction model, are added to the video valid feature set.
The video valid feature extracting unit 450 extracts video valid features using the frame validity binary discrimination model. As the dataset size P increases, the model m_p can be trained and updated incrementally; the video valid features extracted by the frame validity binary discrimination model are used to update the video valid feature set, and the extracted invalid features are used to update the invalid feature base. As the data are updated and the scale M of the invalid feature base grows, comparing every feature of a new candidate video frame feature set to be discriminated (of set size N) against every feature in the invalid feature base becomes expensive, with complexity on the order of O(N·M·K). Since the feature dimension K is a fixed value, a clustering operation can be performed on the existing invalid feature base; once the clusters are well trained, the complexity of valid frame screening can be reduced to roughly O(N·C·K), where C is the number of cluster centers and C ≪ M, which greatly improves the speed of valid feature screening.
The video redundant feature screening unit 460 is used for screening video redundant features from the video valid feature set to obtain a valid key feature set. The inventor observes that a video contains a great deal of redundant information; in particular, adjacent frames within the same scene are extremely similar. This redundancy is usually unnecessary for subsequent classification or retrieval tasks: on the one hand it consumes computing resources and reduces processing performance, and on the other hand it may degrade accuracy through mismatches. Traditional time-space-domain schemes generally compute the inter-frame difference directly and treat frames as redundant once the difference falls below a certain threshold, but this is easily disturbed by noise. In this method, redundant features are screened in the feature space based on the features extracted by the CNN, which makes full use of the properties of the CNN model and of the data enhancement operations applied during training, improving robustness to noise in the original spatial domain.
Specifically, the similarity between the current feature in the video valid feature set and the feature of the next time step is compared; if the comparison result is smaller than the third similarity threshold s_3, the next feature is retained, otherwise the next feature is screened out, and this continues until all features in the video valid feature set have been processed.
In a specific implementation, for example, the candidate feature set after validity screening becomes {f_t : t = 1, 2, …, T}, where the subscript t is a temporal sequence number that indicates only the temporal order, not the original frame number, because after valid frame screening some frame features have been determined to be invalid and discarded. The redundant feature screening process based on inter-frame feature similarity is as follows:
calculate
s_pq = (f_p · f_q) / (||f_p|| · ||f_q||),
where p = 1, …, T−1 and q = 1, …, T. Define the third threshold of inter-frame feature similarity as s_3. If s_pq ≥ s_3, the features at time steps p and q are highly similar, the feature at q is redundant with respect to the earlier feature at p, the redundant later frame feature is removed, and q ← q + 1; otherwise, the feature at time step p is marked as a valid key feature, the q following the current feature is assigned to p, i.e., p ← q, and the operation is repeated until p reaches the end T of the sequence. In actual use, based on the obtained valid key feature set, the key features are further optimized according to the specific application requirements described below.
The optimizing unit 470 is used for setting corresponding thresholds according to the task and optimizing the valid key feature set. All of the threshold hyper-parameters involved in this embodiment can be obtained according to the indexes of the specific task to which the valid key features are finally applied. Through grid search, the threshold parameters are divided into a grid over their possible value ranges; each grid point of hyper-parameters is then used to obtain the corresponding key feature combination of the video; finally, statistics are gathered on the specific task, such as classification, detection, or retrieval, the indexes of the corresponding task are evaluated, and the video feature screening and the final valid key features are optimized accordingly, so that the hyper-parameters best suited to the specific task are selected and the optimal valid key features are screened out.
Compared with directly screening invalid frames and redundant frames in the original time-space domain, the system of this embodiment selects representative invalid frames to add to the video invalid frame seeds, constructs an invalid feature base through depth feature extraction, and screens by the similarity threshold of the invalid feature base, which significantly reduces the influence of invalid frames on subsequent processing, because the features extracted by the depth model are clearly robust to noise; if the training data used in the deep learning model are additionally contrast-learned with representative noise models added, the influence of invalid frames is reduced even further. The filtering of redundant frames is performed in the feature space and can be unified with the retrieval task required by the subsequent features, so that joint optimization can be achieved: the redundant frame filtering process is combined with the specific task and the retrieval effect is improved in a targeted manner.
Embodiment 5, computer device of the present embodiment, referring to fig. 5, the computer device 500 shown is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present invention.
As shown in fig. 5, the computer device 500 is in the form of a general purpose computing device. The components of computer device 500 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that connects the various system components (including the system memory 502 and processing units 501).
Bus 503 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, micro channel architecture (MAC) bus, enhanced ISA bus, video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 500 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 500 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 502 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 504 and/or cache memory 505. The computer device 500 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 506 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 503 through one or more data medium interfaces. The system memory 502 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 508 having a set (at least one) of program modules 507 may be stored in, for example, system memory 502, such program modules 507 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 507 typically perform the functions and/or methods of the described embodiments of the invention.
The computer device 500 may also communicate with a display 510 or a plurality of external devices 509 (e.g., a keyboard, a pointing device, etc.), with one or more devices that enable a user to interact with the computer device 500, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 500 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 511. Moreover, the computer device 500 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, by way of the network adapter 512, which, as shown in Fig. 5, communicates with the other modules of the computer device 500 via the bus 503. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the computer device 500, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 501 executes programs stored in the system memory 502 to perform various functional applications and data processing, for example, to implement the method for video depth feature extraction optimization provided by the embodiment of the present invention, and includes the following main steps: acquiring a video invalid frame seed; constructing an invalid feature base; updating the invalid feature base and acquiring a video valid feature set; training a frame effectiveness bipartite judging model according to the updated invalid feature base and the video effective feature set; and extracting video effective features by using the frame effectiveness bipartite discriminant model.
Embodiment 6, a storage medium containing computer executable instructions of the present embodiment, storing a computer program therein, which when executed by a processor, implements a method for video depth feature extraction optimization as provided by the embodiments of the present invention, including the following main steps: acquiring a video invalid frame seed; constructing an invalid feature base; updating the invalid feature base and acquiring a video valid feature set; training a frame effectiveness bipartite judging model according to the updated invalid feature base and the video effective feature set; and extracting video effective features by using the frame effectiveness bipartite discriminant model.
The storage media containing computer-executable instructions of the present embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (24)

1. A method for video depth feature extraction optimization, comprising the steps of:
S1, acquiring video invalid frame seeds; the acquiring of video invalid frame seeds is acquiring monochrome video frames, globally blurred video frames, single texture video frames, or single scene video frames in the video;
S2, constructing an invalid feature base;
specifically, the features of the expanded video invalid frame seeds are extracted through a feature extraction model to construct the invalid feature base; given the video invalid frame seed set, the invalid feature set after the mapping transformation is {V_j : j = 1, …, M}, and this invalid feature set constitutes the initial invalid feature base;
S3, updating the invalid feature base and acquiring a video valid feature set;
S4, training a frame validity binary judgment model according to the updated invalid feature base and the video valid feature set;
specifically, each video invalid frame seed and each valid frame in the valid frame set are fed into a CNN two-class classification model, and the frame validity binary judgment model is obtained through training; before the video invalid frame seeds and the valid frames in the valid frame set are input into the CNN two-class classification model, an expansion operation is performed on the valid frames and the invalid frames, including: brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise;
s5, extracting video effective features by using the frame effectiveness bipartite discriminant model.
2. The method of video depth feature extraction optimization of claim 1, further comprising in step S5: updating the video effective feature set by the extracted video effective features; and extracting invalid features by using the frame validity bipartite discriminant model, and updating the invalid feature base by using the extracted invalid features.
3. The method of video depth feature extraction optimization of claim 1, further comprising the step of, after step S5:
s6, screening video redundant features from the video effective feature set to obtain an effective key feature set.
4. The method of video depth feature extraction optimization of claim 3, further comprising the step after step S6 of:
and S7, setting a corresponding threshold according to the task, and optimizing the effective key feature set.
5. The method for optimizing video depth feature extraction according to claim 1, wherein the acquiring monochrome video frames in the video, without a reference image set, specifically comprises:
converting I_rgb into I_gray;
calculating the color uniformity index by the following K-L divergence formula:
setting U_thresh; if Uniformity(I_gray || μ_gray) ≤ U_thresh, judging the video frame to be a monochrome video frame and acquiring the monochrome video frame;
wherein I_rgb represents the original video frame, I_gray represents the gray-scale image, hist(I_gray) represents the normalized gray-level histogram of the video frame, B represents the number of histogram bins, hist(μ_gray) represents the corresponding uniform gray-level distribution, and U_thresh represents the K-L divergence threshold for a monochrome video frame.
6. The method for optimizing video depth feature extraction according to claim 1, wherein the acquiring monochrome video frames in the video, when there is a reference image set, specifically comprises:
calculating a normalized gray-level histogram of each reference video frame in the reference image set;
ordering according to descending order of gray level;
calculating the definite integral value of the cumulative gray-level distribution over the first x% of bins as the color uniformity threshold, and screening and acquiring the monochrome video frames, wherein the calculation formula is as follows:
wherein I_gray represents the gray-scale image, hist(I_gray) represents the normalized gray-level histogram of the video frame, and x represents the set percentage value.
7. The method for optimizing video depth feature extraction according to claim 1, wherein the acquiring the globally blurred video frame in the video specifically comprises:
converting I_rgb into I_gray;
selecting the original video frame by sharpness, which is calculated by the following formula:
setting S_thresh; if Sharpness(I_gray) ≤ S_thresh, judging the video frame to be globally blurred and acquiring it;
wherein I_rgb represents the original video frame, I_gray represents the gray-scale image, S_thresh represents the sharpness threshold, and Δ_x and Δ_y represent the gray gradients of the sharpness in two orthogonal directions.
8. The method of video depth feature extraction optimization of claim 1, wherein acquiring single texture video frames in the video, generated by simulation, specifically comprises:
transforming the original video frame, or a cropped part of it, by translation, rotation or/and scaling, and placing the transformed image into a monochrome image to form the single texture video frame.
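A short sketch of this simulation: paste a rotated and scaled crop of an original frame onto a monochrome canvas. The canvas color, crop size, paste location, and transform parameters are illustrative assumptions.

```python
# Sketch of claim 8: synthesize a single texture frame by warping a crop of the
# original frame onto a monochrome canvas. All parameter values are assumptions.
import cv2
import numpy as np

def make_single_texture_frame(frame_bgr: np.ndarray,
                              canvas_color=(128, 128, 128),
                              scale: float = 0.4, angle: float = 20.0) -> np.ndarray:
    h, w = frame_bgr.shape[:2]
    crop = frame_bgr[: h // 2, : w // 2]                       # take a part of the frame
    m = cv2.getRotationMatrix2D((crop.shape[1] / 2, crop.shape[0] / 2), angle, scale)
    warped = cv2.warpAffine(crop, m, (crop.shape[1], crop.shape[0]))
    canvas = np.zeros_like(frame_bgr)
    canvas[:] = canvas_color                                   # monochrome background
    canvas[h // 4: h // 4 + warped.shape[0], w // 4: w // 4 + warped.shape[1]] = warped
    return canvas
```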
9. The method of video depth feature extraction optimization of claim 1, wherein acquiring single texture video frames in the video, by gradient distribution histogram threshold screening, comprises:
converting I_rgb into I_gray;
taking any axis of I_gray as the rotation axis, with the rotation angle α ∈ [0, 180°), then:
calculating the mean of the directional gradient (formula 4);
calculating the variance δ²(ΔI_gray) of the directional gradient (formula 5);
calculating the sharpness Sharpness(I_gray) (formula 6);
if Sharpness(I_gray) ≥ S_thresh and δ²(ΔI_gray) ≤ δ²_thresh, judging the video frame to be a single texture video frame and acquiring it;
wherein I_rgb represents the original video frame, I_gray represents the gray-scale image, S_thresh represents the sharpness threshold, δ²_thresh represents the directional gradient variance threshold, and Δ_x and Δ_y represent the gray gradients in the two orthogonal directions used for the sharpness.
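One possible realization of formulas 4-6 of claim 9 is sketched below. Since the published text does not reproduce those formulas, the directional gradient definition (Δ_x·cosα + Δ_y·sinα), the variance over the angle range, the gradient-magnitude sharpness, and the threshold values are all assumptions.

```python
# Illustrative realization of claim 9's tests: a sharp frame whose directional
# gradient statistics vary little with the rotation angle is treated as a single
# texture frame. All definitions and thresholds here are assumptions.
import cv2
import numpy as np

def single_texture_scores(gray: np.ndarray, num_angles: int = 180):
    dx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    dy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    angles = np.deg2rad(np.arange(0, 180, 180 / num_angles))   # alpha in [0, 180)
    # mean absolute directional gradient per angle (assumed form of formula 4)
    dir_means = np.array([np.mean(np.abs(dx * np.cos(a) + dy * np.sin(a))) for a in angles])
    dir_variance = float(np.var(dir_means))                    # assumed form of formula 5
    sharp = float(np.mean(np.sqrt(dx * dx + dy * dy)))         # assumed form of formula 6
    return sharp, dir_variance

def is_single_texture(gray: np.ndarray, s_thresh: float = 5.0, var_thresh: float = 1.0) -> bool:
    sharp, var = single_texture_scores(gray)
    return sharp >= s_thresh and var <= var_thresh
```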
10. The method of video depth feature extraction optimization of claim 9, wherein:
the original video frame, or a cropped part of it, is transformed by translation, rotation or/and scaling, and the transformed image is placed into a monochrome image;
the monochrome image containing the transformed image is calculated by formula 4 and formula 5 to obtain δ²(ΔI_gray);
the monochrome image containing the transformed image is calculated by formula 6 to obtain Sharpness(I_gray).
11. The method of video depth feature extraction optimization of claim 1, wherein the single scene video frames in the video are acquired by simulation, by superimposing salt-and-pepper noise on a monochrome image.
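A short sketch of this simulation follows; the canvas size, gray value, and noise ratio are illustrative assumptions.

```python
# Sketch of claim 11: superimpose salt-and-pepper noise on a monochrome image to
# synthesize a single scene frame. Parameter values are assumptions.
import numpy as np

def make_single_scene_frame(height: int = 360, width: int = 640,
                            gray_value: int = 128, noise_ratio: float = 0.02) -> np.ndarray:
    frame = np.full((height, width, 3), gray_value, dtype=np.uint8)   # monochrome canvas
    mask = np.random.rand(height, width)
    frame[mask < noise_ratio / 2] = 0            # pepper
    frame[mask > 1 - noise_ratio / 2] = 255      # salt
    return frame
```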
12. The method of video depth feature extraction optimization of claim 1, wherein expanding the video invalid frame seeds in step S1 comprises: brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise.
13. The method of video depth feature extraction optimization of claim 1, wherein updating the invalid feature base and obtaining the video valid feature set in step S3 specifically comprises:
mapping the video to be analyzed into a candidate feature set through the feature extraction model;
calculating the similarity s_ij between each feature in the candidate feature set and each feature in the invalid feature base, one by one;
if s_ij > s_1, judging f_i to be an invalid feature and adding it to the invalid feature base; if s_ij ≤ s_2, judging f_i to be a valid feature and adding it to the video valid feature set;
wherein f_i represents a feature in the candidate feature set, V_j represents a feature in the invalid feature base, s_1 is the first similarity threshold, and s_2 is the second similarity threshold.
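A sketch of this routing step follows. The patent's similarity formula is not reproduced in the published text; cosine similarity, the use of the maximum similarity over the base, and the threshold values are assumptions.

```python
# Sketch of claim 13 (and the middle band of claims 14-16): route a candidate
# feature f_i by its similarity to the invalid feature base {V_j}. Cosine
# similarity and the thresholds s1 > s2 are illustrative assumptions.
import numpy as np

def route_feature(f_i: np.ndarray, invalid_base: list,
                  s1: float = 0.9, s2: float = 0.7) -> str:
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    s_max = max((cosine(f_i, v_j) for v_j in invalid_base), default=0.0)
    if s_max > s1:
        return "invalid"     # add f_i to the invalid feature base (and its frame to the seeds)
    if s_max <= s2:
        return "valid"       # add f_i to the video valid feature set
    return "candidate"       # s2 < s_ij <= s1: left to the frame validity binary discriminant model
```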
14. The method of video depth feature extraction optimization of claim 13, wherein:
if s_ij > s_1, the original video frame corresponding to f_i is also listed as a video invalid frame seed;
if s_ij ≤ s_2, a valid frame set is also constructed, and the original video frame corresponding to f_i is added to the valid frame set.
15. The method of video depth feature extraction optimization of claim 14, wherein if s_2 < s_ij ≤ s_1, f_i is judged to be a candidate valid feature, a candidate valid frame set is constructed, and the original video frame corresponding to f_i is added to the candidate valid frame set.
16. The method of claim 15, wherein the candidate valid frame set is input into the frame validity binary discriminant model, the frames judged invalid are listed as video invalid frame seeds, and the frames judged valid are added to the valid frame set.
17. The method of video depth feature extraction optimization of claim 16, wherein an invalid frame judged by the frame validity binary discriminant model is further passed through the feature extraction model, and the resulting invalid feature is added to the invalid feature base;
and a valid frame judged by the frame validity binary discriminant model is passed through the feature extraction model, and the resulting valid feature is added to the video valid feature set.
18. The method of video depth feature extraction optimization of claim 1, wherein each feature in the invalid feature base is clustered.
19. The method of video depth feature extraction optimization of claim 3, wherein screening video redundant features out of the video valid feature set in step S6 to obtain the valid key feature set specifically comprises:
S61, comparing the similarity between the current feature in the video valid feature set and the next feature in the time sequence;
S62, if the comparison result is smaller than the third similarity threshold s_3, marking the current feature as a valid key feature, adding it to the valid key feature set, taking the next feature as the new current feature, and returning to S61; otherwise, going to S63;
S63, if the comparison result is greater than or equal to the third similarity threshold s_3, filtering out the next feature and returning to S61.
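A minimal sketch of this screening loop follows. Cosine similarity, the value of s_3, and keeping the final surviving feature at the end of the sequence are assumptions not fixed by the claim.

```python
# Sketch of claim 19: walk the valid features in time order and keep a feature
# only when the next feature is sufficiently dissimilar. Similarity measure and
# s3 are assumptions.
import numpy as np

def screen_key_features(features: list, s3: float = 0.85) -> list:
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    key_features = []
    current = features[0] if features else None
    for nxt in features[1:]:
        if cosine(current, nxt) < s3:
            key_features.append(current)   # S62: current is a valid key feature
            current = nxt                  # the next feature becomes the current feature
        # S63: otherwise nxt is redundant and is filtered out; current is kept for comparison
    if current is not None:
        key_features.append(current)       # keep the last surviving feature (assumption)
    return key_features
```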
20. A system for video depth feature extraction optimization, comprising:
a video invalid frame seed acquiring unit, configured to acquire video invalid frame seeds, wherein acquiring video invalid frame seeds means acquiring monochrome video frames, globally blurred video frames, single texture video frames or single scene video frames in the video;
an invalid feature base constructing unit, configured to construct an invalid feature base; specifically, features of the expanded video invalid frame seeds are extracted through a feature extraction model to construct the invalid feature base; the invalid feature set obtained after mapping the video invalid frame seed set is {V_j, j=1, …, M}, and this invalid feature set constitutes the initial invalid feature base;
an updating unit, configured to update the invalid feature base and obtain a video valid feature set;
a training unit, configured to train a frame validity binary discriminant model according to the updated invalid feature base and the video valid feature set; specifically, each video invalid frame seed and each valid frame in the valid frame set are fed into a CNN two-class classification model, and training is carried out to obtain the frame validity binary discriminant model; before the video invalid frame seeds and the valid frames in the valid frame set are input into the CNN two-class classification model, an expansion operation is performed on the valid frames and the invalid frames, including: brightness transformation, Gaussian blur, motion blur, translation-rotation transformation, or/and superimposed salt-and-pepper noise;
and a video valid feature extraction unit, configured to extract video valid features by using the frame validity binary discriminant model.
21. The system for video depth feature extraction optimization of claim 20, further comprising:
a video redundant feature screening unit, configured to screen video redundant features out of the video valid feature set to obtain a valid key feature set.
22. The video depth feature extraction optimization system of claim 21, further comprising:
an optimizing unit, configured to set a corresponding threshold according to the task and optimize the valid key feature set.
23. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of video depth feature extraction optimization of any one of claims 1-19 when the computer program is executed by the processor.
24. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the method of video depth feature extraction optimization of any one of claims 1-19.
CN202110918450.9A 2021-08-11 2021-08-11 Method, system, equipment and storage medium for video depth feature extraction optimization Active CN113627342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918450.9A CN113627342B (en) 2021-08-11 2021-08-11 Method, system, equipment and storage medium for video depth feature extraction optimization


Publications (2)

Publication Number Publication Date
CN113627342A CN113627342A (en) 2021-11-09
CN113627342B true CN113627342B (en) 2024-04-12

Family

ID=78384341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918450.9A Active CN113627342B (en) 2021-08-11 2021-08-11 Method, system, equipment and storage medium for video depth feature extraction optimization

Country Status (1)

Country Link
CN (1) CN113627342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116228764B (en) * 2023-05-08 2023-07-18 聊城市东昌府区妇幼保健院 Neonate disease screening blood sheet acquisition quality detection method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218831A (en) * 2013-04-21 2013-07-24 北京航空航天大学 Video moving target classification and identification method based on outline constraint
CN106296728A (en) * 2016-07-27 2017-01-04 昆明理工大学 A kind of Segmentation of Moving Object method in unrestricted scene based on full convolutional network
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN110826491A (en) * 2019-11-07 2020-02-21 北京工业大学 Video key frame detection method based on cascading manual features and depth features
CN111259701A (en) * 2018-12-03 2020-06-09 杭州海康威视数字技术股份有限公司 Pedestrian re-identification method and device and electronic equipment
CN111339369A (en) * 2020-02-25 2020-06-26 佛山科学技术学院 Video retrieval method, system, computer equipment and storage medium based on depth features
CN112906631A (en) * 2021-03-17 2021-06-04 南京邮电大学 Dangerous driving behavior detection method and detection system based on video
CN113095295A (en) * 2021-05-08 2021-07-09 广东工业大学 Fall detection method based on improved key frame extraction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Abnormal Event Detection Method in Surveillance Video Based on Temporal CNN and Sparse Optical Flow; Xia, Hongxia; ICCDE 2019: Proceedings of the 2019 5th International Conference on Computing and Data Engineering; full text *
A new adaptive video key frame extraction method; 王宇; 汪荣贵; 杨娟; Journal of Hefei University of Technology (Natural Science Edition) (11); full text *
Human action recognition method based on a key-frame two-stream convolutional network; 张聪聪; 何宁; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 20191128 (06); full text *
A survey of content-based video retrieval; 胡志军; 徐勇; Computer Science (01); full text *
Video image super-resolution reconstruction method based on convolutional neural networks; 刘村; 李元祥; 周拥军; 骆建华; Application Research of Computers; 20180209 (04); full text *
Research on visual feature attribution networks based on adversarial feature pairs; 张宪; 史沧红; 李孝杰; Journal of Computer Research and Development (03); full text *
Anomaly recognition method for smart community video surveillance under deep learning; 张海民; Journal of Xi'an Polytechnic University (02); full text *
Research on abnormal behavior detection algorithms in video; 张俊阳; China Master's Theses Full-text Database (Information Science and Technology); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant