CN114463679A - Video feature construction method, device and equipment - Google Patents

Video feature construction method, device and equipment

Info

Publication number
CN114463679A
CN114463679A (application CN202210102087.8A)
Authority
CN
China
Prior art keywords
video
feature
frame number
product
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210102087.8A
Other languages
Chinese (zh)
Inventor
李虎
李睿之
熊博颖
郑邦东
吴昀蓁
吴松霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202210102087.8A
Publication of CN114463679A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An embodiment of the invention provides a method, an apparatus, and a device for constructing features of a video. The video feature construction method comprises the following steps: slicing the video to obtain N video segments; acquiring a first feature and a second feature of each individual video segment; and superimposing the N first features and the N second features in forward order and in reverse order, respectively, to obtain a feature sequence of the video. The method provided by the invention can improve the temporal balance of a feature sequence constructed from a video.

Description

Video feature construction method, device and equipment
Technical Field
The present invention relates to the technical field of video features, and in particular to a video feature construction method, a video feature construction apparatus, a video feature construction device, and a corresponding storage medium.
Background
Common methods of tagging short videos include manual review and machine identification. Manual review has obvious drawbacks. First, it is costly: a person must watch and understand each video before applying labels, which is labor-intensive. Second, it is not real-time: because videos must be browsed manually, tags cannot be applied promptly, and when many videos are uploaded simultaneously the feedback delay grows, degrading the user experience. Machine identification uses artificial intelligence to tag short videos automatically; it can overcome the high cost and lack of real-time performance of the manual method, so machine identification is the current trend for solving the short-video tagging problem. The advantage of manual review is its accuracy, so the problem now becomes how to improve the accuracy of machine identification.
Data and features determine the upper limit of machine learning; models and algorithms merely approximate that limit. Because acquiring additional data is very expensive, how features are engineered from existing data is the key to further improving machine learning performance. A common feature construction approach for this problem is the two-stream convolutional network, whose feature collection is unidirectional. Video frames are sequence features, and sequence features are subject to forgetting: after they are fed into the network, the features of earlier frames are easily forgotten and the result relates more to later frames, so features in the video are easily missed.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a video feature construction method, apparatus, and device, so as to address at least the above technical problems.
To achieve the above object, a first aspect of the present invention provides a method for constructing features of a video, the method comprising: slicing the video to obtain N video segments; separately acquiring a first feature and a second feature of each individual video segment; and superimposing the N first features and the N second features in forward order and in reverse order, respectively, to obtain a feature sequence of the video.
Preferably, the first feature comprises a processing result of a visual processing algorithm on the single video segment; the second feature comprises a subset of the content of the single video segment.
Preferably, the first feature includes: an optical flow graph derived from the single video segment based on an optical flow extraction algorithm in OpenCV; the second feature includes: one picture frame in the single video clip.
Preferably, the method further comprises: judging whether the frame number of the video is greater than the product of N and a preset frame number threshold; and if the frame number of the video is not greater than that product, copying and splicing the video so that the frame number of the video exceeds the product of N and the preset frame number threshold.
Preferably, the method further comprises: after N video clips are obtained by slicing the video, preprocessing picture frames in the video clips into a preset specification.
In a second aspect of the present invention, there is also provided a video feature construction apparatus, including: the video slicing module is used for slicing the video to obtain N video segments; the segment feature module is used for respectively acquiring a first feature and a second feature of a single video segment; and the feature superposition module is used for respectively carrying out positive sequence superposition and negative sequence superposition on the N first features and the N second features to obtain a feature sequence of the video.
Preferably, the first feature comprises a processing result of a visual processing algorithm on the single video segment; the second feature comprises a subset of the content of the single video segment.
Preferably, the first feature includes: an optical flow graph derived from the single video segment based on an optical flow extraction algorithm in OpenCV; the second feature includes: one picture frame in the single video clip.
Preferably, the device further comprises a video length processing module; the video length processing module is used for: judging whether the frame number of the video is larger than the product of N and a preset frame number threshold value; and if the frame number of the video is not more than the product of N and a preset frame number threshold, copying and splicing the video to ensure that the frame number of the video is more than the product of N and the preset frame number threshold.
Preferably, the device further comprises a pre-processing module; the preprocessing module is used for: after the video slicing module slices a video to obtain N video clips, preprocessing picture frames in the video clips into a preset specification.
In a third aspect of the present invention, there is provided a video feature construction device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the aforementioned video feature construction method when executing the computer program.
In a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the aforementioned method of characterizing a video.
In a fifth aspect of the invention, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the aforementioned method of feature construction of a video.
The above technical solution has the following beneficial effects:
the feature sequence constructed by the above embodiment avoids the following defect of the existing two-stream convolutional network: its feature collection is unidirectional, while video frames are sequence features, and sequence features are subject to forgetting; that is, after input into the network, the features of earlier frames are easily forgotten and the result relates more to later frames, so features in the video are easily missed. The feature construction method proposed by the embodiment, a bidirectional feature splicing method, extracts features a second time in the reverse direction, starting from the end of the video, and splices them with the features extracted in the forward direction. More comprehensive video features are thereby obtained, improving the labeling accuracy of the machine learning module.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
fig. 1 schematically shows a step diagram of a feature construction method of a video according to an embodiment of the present application;
FIG. 2 schematically illustrates a structural diagram of a bidirectional cyclic long-term neural network according to an embodiment of the present application;
fig. 3 schematically shows an overall architecture diagram of a feature construction method of a video according to an embodiment of the present application;
fig. 4 schematically shows a structural diagram of a feature construction apparatus of a video according to an embodiment of the present application.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.
Fig. 1 schematically shows a step diagram of a feature construction method of a video according to an embodiment of the present application. As shown in fig. 1, in an embodiment of the present application, a method for constructing features of a video includes:
101. slicing the video to obtain N video segments;
the value of N is a natural number, and the value of N determines the size of the characteristic sequence. The slicing mode here may be an average slicing mode or a slicing mode according to a preset rule. Through this step, one video is divided into N video segments.
102. Respectively acquiring a first feature and a second feature of each individual video segment. The features of a video segment include, but are not limited to, the attributes of the segment itself, a snapshot of its content, a slice of its content, pixel information, motion information, and the inter-frame relationships it contains. This method extracts two of these features from each individual video segment, denoted the first feature and the second feature respectively.
103. And respectively carrying out positive sequence superposition and negative sequence superposition on the N first characteristics and the N second characteristics to obtain a characteristic sequence of the video.
The forward superposition order may follow the order of the segments in the video, and the reverse order is simply the opposite. The combination may be plain sequential concatenation. The forward and reverse superposition results may be kept separate or merged into a single feature sequence, depending on the input format of the downstream model.
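The forward/reverse superposition with sequential concatenation described above might look like the following sketch (the feature arrays, names, and shapes are illustrative placeholders, not the patent's actual data):

```python
import numpy as np

# Hypothetical per-segment features: N segments, each contributing a
# first feature (e.g. an optical flow map) and a second feature (a frame).
N = 4
first_feats = [np.full((2, 2), i) for i in range(N)]        # stand-ins
second_feats = [np.full((2, 2), 10 + i) for i in range(N)]  # stand-ins

def superpose(features, reverse=False):
    """Stack per-segment features in slice order (or reversed)."""
    ordered = list(reversed(features)) if reverse else list(features)
    return np.stack(ordered)

# Forward-order and reverse-order superposition of both feature sets.
# Whether the two results stay separate or are merged into one sequence
# depends on the input setting of the downstream model.
forward_seq = np.concatenate([superpose(first_feats), superpose(second_feats)])
reverse_seq = np.concatenate([superpose(first_feats, reverse=True),
                              superpose(second_feats, reverse=True)])
print(forward_seq.shape, reverse_seq.shape)  # (8, 2, 2) (8, 2, 2)
```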
The feature sequence constructed by the above embodiment avoids the following defect of the existing two-stream convolutional network: its feature collection is unidirectional, while video frames are sequence features, and sequence features are subject to forgetting; that is, after input into the network, the features of earlier frames are easily forgotten and the result relates more to later frames, so features in the video are easily missed.
The feature construction method proposed in this embodiment, a bidirectional feature splicing method, extracts features a second time in the reverse direction, starting from the end of the video, and then splices them with the features extracted in the forward direction. More comprehensive video features are thereby obtained, improving the labeling accuracy of the machine learning module.
In some embodiments provided herein, the first feature comprises the result of processing the single video segment with a visual processing algorithm; the second feature comprises a subset of the content of the single video segment. The visual processing algorithm includes, but is not limited to, functions in machine vision image processing software such as OpenCV, Halcon, or VisionPro. Using an existing visual processing algorithm to process a single video segment has the advantage of simplicity; the resulting output differs according to which algorithm is selected. A subset of the content of a single video segment is a portion of that segment's content; the portion is chosen according to a preset selection rule and may be, for example, a multi-frame fusion value or a single frame picture.
In some embodiments provided herein, the first feature comprises an optical flow map derived from the single video segment based on an optical flow extraction algorithm in OpenCV, and the second feature comprises one picture frame of the single video segment. Optical flow is the instantaneous velocity of pixel motion of spatially moving objects on the observation imaging plane. The optical flow method uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find correspondences between the previous frame and the current frame, and thereby computes the motion of objects between adjacent frames. This embodiment provides only one way of extracting an optical flow map, namely extraction based on an optical flow algorithm in OpenCV (the open source computer vision library): specifically, frames are extracted from the video using Python and OpenCV, and TVL1 optical flow is computed. The main steps are: read frames from the video, obtain the relevant attributes, and decide which frames to store; then extract the TVL1 optical flow from consecutive frames.
The picture frame may be selected from a video segment randomly or at a fixed position, for example by taking the same frame index in every segment. The order in which the optical flow map and the picture frame are acquired is not limited by the order recited.
In some embodiments provided herein, the method further comprises: judging whether the frame number of the video is greater than the product of N and a preset frame number threshold; and if not, copying and splicing the video so that its frame number exceeds that product. For example, with a preset frame number threshold of 5, each video segment must have at least 5 frames; if N is 20, the video must have at least 100 frames. If the video does not meet the 100-frame length requirement, it is copied and spliced with itself until it exceeds 100 frames. This embodiment guarantees that every video segment has enough frames for optical flow extraction.
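The copy-and-splice length check can be sketched as follows (function and parameter names are this sketch's own; the 20 x 5 = 100 figure is the example from the text):

```python
import numpy as np

def ensure_min_frames(frames, n_segments, frames_per_segment):
    """Copy-splice the video with itself until its frame count exceeds
    n_segments * frames_per_segment (the text's example: 20 * 5 = 100).
    """
    required = n_segments * frames_per_segment
    out = np.asarray(frames)
    while len(out) <= required:
        out = np.concatenate([out, out])  # append a copy of the whole video
    return out

short_video = np.random.rand(30, 4, 4, 3)  # only 30 frames, too short
padded = ensure_min_frames(short_video, n_segments=20, frames_per_segment=5)
print(len(padded))  # 120: 30 -> 60 -> 120, first count above 100
```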
In some embodiments provided herein, the method further comprises: after slicing the video into N video segments, preprocessing the picture frames in the video segments to a preset specification. For example, the picture of each frame in a video segment is resized to (224, 224, 3), which improves the computational efficiency of subsequent tasks.
In some embodiments provided by the present invention, the feature sequence of the video is used as input for training or inference with a bidirectional long short-term memory recurrent neural network (Bi-directional LSTM RNN). LSTM (Long Short-Term Memory) is a special kind of RNN designed mainly to address the vanishing and exploding gradient problems that arise when training on long sequences; in short, an LSTM performs better than an ordinary RNN on longer sequences. An attention mechanism may be introduced into the LSTM. The basic idea of the attention mechanism is to remove the traditional encoder-decoder structure's reliance on a single fixed-length internal vector during encoding and decoding: the intermediate outputs of the LSTM encoder over the input sequence are retained, and the model is trained to attend selectively to these inputs and relate them to the output sequence as output is produced. Fig. 2 schematically shows the structure of a bidirectional recurrent long short-term memory network according to an embodiment of the present application. As shown in Fig. 2, the network comprises an input layer, a forward layer, a backward layer, and an output layer; w1-w6 in the figure denote weights.
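A toy NumPy forward pass can illustrate the input/forward/backward/output structure of Fig. 2. Heavy hedging applies: a plain tanh RNN cell stands in for the LSTM cell, the mapping of w1-w6 to specific connections is an assumed reading of the figure, and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 2   # sequence length and layer sizes

# Weights named after Fig. 2: w1/w3 input->forward/backward layer,
# w2/w5 recurrent connections, w4/w6 forward/backward->output layer
# (an assumption about the figure's labeling, for illustration only).
w1, w3 = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_in, d_h))
w2, w5 = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
w4, w6 = rng.normal(size=(d_h, d_out)), rng.normal(size=(d_h, d_out))

x = rng.normal(size=(T, d_in))     # input feature sequence

def run(seq, w_in, w_rec):
    """One recurrent layer (plain tanh RNN stands in for the LSTM cell)."""
    h, out = np.zeros(d_h), []
    for x_t in seq:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        out.append(h)
    return np.stack(out)

h_fwd = run(x, w1, w2)              # forward layer: t = 0 .. T-1
h_bwd = run(x[::-1], w3, w5)[::-1]  # backward layer: t = T-1 .. 0
y = h_fwd @ w4 + h_bwd @ w6         # output layer combines both directions
print(y.shape)  # (5, 2): one output per time step
```

In practice a framework's bidirectional LSTM (e.g. a Keras `Bidirectional(LSTM(...))` wrapper) would replace this hand-rolled loop; the sketch only shows why each time step's output sees both past and future context.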
Fig. 3 schematically shows an overall architecture diagram of a feature construction method of a video according to an embodiment of the present application. As shown in fig. 3, taking a short video as an example, the method includes the following steps:
1. acquiring short video to be processed, and slicing the short video into N segments to obtain video segments 1-N;
2. sequentially reading the video segments for preprocessing, such as resizing; the preprocessed video segments then undergo the processing of step 3 and step 4 respectively;
3. randomly selecting an original frame picture from each video segment as the first feature;
4. computing an optical flow map for each video segment as the second feature; this step and step 3 may run in parallel;
5. temporarily storing the results of step 3 and step 4 in slice order until all N video segments have been read and processed;
6. superimposing the optical flow maps and the original frame pictures in forward and reverse slice order respectively, thereby obtaining the feature sequence of the short video. The feature sequence can serve as input for subsequent algorithms such as Attention-LSTM.
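Steps 1-6 above can be sketched end to end as follows (a hedged illustration: simple frame differencing stands in for real optical flow extraction, and all names and shapes are this sketch's assumptions):

```python
import numpy as np

def build_feature_sequence(video, n_segments, rng=None):
    """Sketch of the overall flow of Fig. 3: slice the video, pick one
    random frame and one optical-flow stand-in per segment, then stack
    everything in forward and reverse slice order."""
    rng = rng or np.random.default_rng(0)
    segments = np.array_split(video, n_segments)         # step 1
    frames, flows = [], []
    for seg in segments:                                  # steps 2-5
        frames.append(seg[rng.integers(len(seg))])        # random frame
        flows.append(np.diff(seg, axis=0).mean(axis=0))   # flow stand-in
    forward = np.stack(flows + frames)                    # step 6: forward
    reverse = np.stack(flows[::-1] + frames[::-1])        # and reverse
    return np.concatenate([forward, reverse])

video = np.random.rand(40, 8, 8)   # toy short video: 40 grayscale frames
seq = build_feature_sequence(video, n_segments=4)
print(seq.shape)  # (16, 8, 8): 2 directions x 2 feature types x 4 segments
```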
Through this implementation, a bidirectional splicing method is adopted: on the basis of the original features, features are extracted a second time in the reverse direction starting from the end of the video and then spliced with the features extracted in the forward direction. More comprehensive video features can thereby be obtained, improving the accuracy of short video labeling.
Based on the same inventive concept, the embodiment of the invention also provides a video feature construction device. Fig. 4 schematically shows a structural diagram of a feature construction apparatus of a video according to an embodiment of the present application, as shown in fig. 4. An apparatus for feature construction of a video, the apparatus comprising: the video slicing module is used for slicing the video to obtain N video segments; the segment feature module is used for respectively acquiring a first feature and a second feature of a single video segment; and the feature superposition module is used for respectively carrying out positive sequence superposition and negative sequence superposition on the N first features and the N second features to obtain a feature sequence of the video.
In some alternative embodiments, the first characteristic comprises a result of processing of the single video segment by a visual processing algorithm; the second feature comprises a subset of the content of the single video segment.
In some alternative embodiments, the first feature comprises: an optical flow graph derived from the single video segment based on an optical flow extraction algorithm in OpenCV; the second feature includes: one picture frame in the single video clip.
In some optional embodiments, the apparatus further comprises a video length processing module; the video length processing module is used for: judging whether the frame number of the video is larger than the product of N and a preset frame number threshold value; and if the frame number of the video is not more than the product of N and a preset frame number threshold, copying and splicing the video to ensure that the frame number of the video is more than the product of N and the preset frame number threshold.
In some optional embodiments, the apparatus further comprises a pre-processing module; the preprocessing module is used for: after the video slicing module slices a video to obtain N video clips, preprocessing picture frames in the video clips into a preset specification.
For the specific details of each implementation step in the above video feature construction apparatus, reference may be made to the description of the video feature construction method, which is not repeated here. Its beneficial effects can likewise be derived from the foregoing video feature construction method.
The embodiment of the application provides equipment, which comprises a processor, a memory and a program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the characteristic construction method of the video.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to carry out a program comprising the steps of the above video feature construction method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. A method for feature construction of a video, the method comprising:
slicing the video to obtain N video segments;
respectively acquiring a first feature and a second feature of a single video segment; and
superposing the N first features and the N second features in forward order and in reverse order, respectively, to obtain a feature sequence of the video.
2. The method of claim 1, wherein the first feature comprises a result of processing the single video segment by a visual processing algorithm; the second feature comprises a subset of the content of the single video segment.
3. The method of claim 2, wherein the first feature comprises: an optical flow graph derived from the single video segment based on an optical flow extraction algorithm in OpenCV; the second feature includes: one picture frame in the single video clip.
4. The method of claim 1, further comprising:
judging whether the frame number of the video is larger than the product of N and a preset frame number threshold value;
and if the frame number of the video is not greater than the product of N and the preset frame number threshold, copying and splicing the video so that the frame number of the video is greater than the product of N and the preset frame number threshold.
5. The method of claim 1, further comprising:
after N video clips are obtained by slicing the video, preprocessing picture frames in the video clips into a preset specification.
6. An apparatus for feature construction of a video, the apparatus comprising:
a video slicing module, configured to slice the video to obtain N video segments;
a segment feature module, configured to acquire a first feature and a second feature of each single video segment; and
a feature superposition module, configured to superpose the N first features and the N second features in at least two superposition orders to obtain a feature sequence of the video.
7. The apparatus of claim 6, wherein the first feature comprises a result of processing the single video segment by a visual processing algorithm; the second feature comprises a subset of the content of the single video segment.
8. The apparatus of claim 7, wherein the first feature comprises: an optical flow graph derived from the single video segment based on an optical flow extraction algorithm in OpenCV; the second feature includes: one picture frame in the single video clip.
9. The apparatus of claim 6, further comprising a video length processing module, configured to:
judge whether the frame number of the video is greater than the product of N and a preset frame number threshold; and
if the frame number of the video is not greater than the product, copy and splice the video so that the frame number of the video exceeds the product of N and the preset frame number threshold.
10. The apparatus of claim 6, further comprising a preprocessing module, configured to: after the video slicing module slices the video to obtain the N video segments, preprocess the picture frames in the video segments to a preset specification.
11. A video feature construction device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the video feature construction method of any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the video feature construction method of any one of claims 1 to 5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method of feature construction of a video according to any of claims 1 to 5.
CN202210102087.8A 2022-01-27 2022-01-27 Video feature construction method, device and equipment Pending CN114463679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210102087.8A CN114463679A (en) 2022-01-27 2022-01-27 Video feature construction method, device and equipment


Publications (1)

Publication Number Publication Date
CN114463679A true CN114463679A (en) 2022-05-10

Family

ID=81410982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210102087.8A Pending CN114463679A (en) 2022-01-27 2022-01-27 Video feature construction method, device and equipment

Country Status (1)

Country Link
CN (1) CN114463679A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111277892A (en) * 2020-01-20 2020-06-12 北京百度网讯科技有限公司 Method, apparatus, server and medium for selecting video clip
CN111368142A (en) * 2020-04-15 2020-07-03 华中科技大学 Video intensive event description method based on generation countermeasure network
CN111369438A (en) * 2020-02-28 2020-07-03 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112287771A (en) * 2020-10-10 2021-01-29 北京沃东天骏信息技术有限公司 Method, apparatus, server and medium for detecting video event
CN113298728A (en) * 2021-05-21 2021-08-24 中国科学院深圳先进技术研究院 Video optimization method and device, terminal equipment and storage medium


Similar Documents

Publication Publication Date Title
CN109740670B (en) Video classification method and device
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN112183456B (en) Multi-scene moving object detection method and device based on sample generation and domain adaptation
CN110866936A (en) Video labeling method, tracking method, device, computer equipment and storage medium
CN104967848A (en) Scene analysis algorithm applied in network video monitoring system
CN110598095B (en) Method, device and storage medium for identifying article containing specified information
US11509836B1 (en) Dynamically configured processing of a region of interest dependent upon published video data selected by a runtime configuration file
CN112465029A (en) Instance tracking method and device
CN109857878B (en) Article labeling method and device, electronic equipment and storage medium
CN113572976A (en) Video processing method and device, electronic equipment and readable storage medium
CN114708287A (en) Shot boundary detection method, device and storage medium
CN110852103A (en) Named entity identification method and device
US12108024B2 (en) Method and system for preprocessing optimization of streaming video data
CN114463679A (en) Video feature construction method, device and equipment
CN110298229B (en) Video image processing method and device
CN111814594A (en) Logistics violation identification method, device, equipment and storage medium
CN115131826B (en) Article detection and identification method, and network model training method and device
CN111464865B (en) Video generation method and device, electronic equipment and computer readable storage medium
CN114550129A (en) Machine learning model processing method and system based on data set
CN115019138A (en) Video subtitle erasing, model training and interaction method, device and storage medium
CN114677623A (en) Model training method, video processing method, computer device, and medium
CN113810782B (en) Video processing method and device, server and electronic device
US20240040115A1 (en) Scheduled scene modification for extraction, preprocessing, and publishing of streaming video data
CN115294506B (en) Video highlight detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination