CN113536823A - Video scene label extraction system and method based on deep learning and application thereof - Google Patents
- Publication number
- CN113536823A (application CN202010281542.6A)
- Authority
- CN
- China
- Prior art keywords
- deep learning
- video
- module
- picture
- preprocessing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a video scene label extraction system based on deep learning, and a method thereof comprising the following steps: step 1, a sample construction module collects picture samples and applies a second round of scene-label annotation; step 2, a data preprocessing module preprocesses the picture samples obtained in step 1 and divides them into a training set and a verification set; step 3, the deep learning model in the deep learning model module is trained with the training set obtained in step 2 and then verified with the verification set to obtain a recognition model; step 4, the identification and processing module extracts key frames from the video to be processed, records the time point of each key frame in the video, and preprocesses the extracted key frames; the preprocessed key-frame pictures are input into the recognition model to obtain the recognized candidate scene labels. The invention addresses the insufficient extraction of video content information, and a client can use the recognized information to optimize video recommendation and search.
Description
Technical Field
The invention relates to the technical field of video processing, in particular to a system and a method for extracting video scene labels based on deep learning and application thereof.
Background
In the field of video apps, the user bases of short-video and mini-video products are growing rapidly, and a large amount of video data needs to be analyzed and processed. Video platforms need to analyze videos effectively, yet the content labels currently extracted from videos are single-dimensional: general video search relies mainly on matching keywords against video titles, so the extracted labels deviate from the actual video content.
Disclosure of Invention
The invention aims to provide a deep learning-based video scene tag extraction system, addressing the prior-art problem that video tags extracted from keywords deviate from the video content.
Another object of the present invention is to provide an extraction method of the video scene tag extraction system.
Another object of the present invention is to provide an application of the video scene tag extraction system and the extraction method.
The technical scheme adopted for realizing the purpose of the invention is as follows:
a video scene label extraction system based on deep learning is characterized by comprising a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
In the above technical solution, the identification and processing module can be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; step 2, the server opens a port to handle requests arriving over the internet.
In the above technical solution, the deep learning model adopts an EfficientNet network structure.
In another aspect of the invention, the deep learning-based video scene label extraction system is applied to short video feature extraction and search strategy optimization.
In another aspect of the present invention, an extraction method of a deep learning-based video scene tag extraction system includes the following steps:
step 1, a sample construction module collects picture samples, wherein the picture samples comprise picture samples in a public data set and picture samples obtained through keyword search, all the picture samples are secondarily labeled with scene labels according to scenes needing to be mined, and pictures which do not accord with the scene labels in related categories are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
inputting the preprocessed key-frame picture into the recognition model to obtain a recognized candidate scene label x and a corresponding score; if the score is above a threshold a, the picture is considered to have the label x. Preferably, the score is between 0 and 1 and the value of a is 0.5.
In the above technical solution, during the preprocessing in step 2, picture samples narrower than 200 pixels are first filtered out, and the remaining samples are resampled and black-filled into 446 × 446 pixel pictures; the preprocessing in step 4 likewise resamples and black-fills into 446 × 446 pixel pictures.
In the above technical solution, in step 2, the ratio of the number of samples in the training set to the number of samples in the verification set is (3-5):1, more preferably 4:1.
In the above technical solution, in the step 2, after the preprocessing, the number of samples is increased by using a sample enhancement technique, including translating random pixels, rotating random angles or left and right mirroring.
In the above technical solution, in step 4, the ffmpeg tool is used to extract key frames from the video to be processed;
in step 4, key-frame pictures are extracted every 2-5 s, and after each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated.
In another aspect of the invention, the extraction method is applied to short video feature extraction and search strategy optimization.
Compared with the prior art, the invention has the beneficial effects that:
the method can extract and identify the scene labels of the video through a deep learning technology, and based on the sequence information, the efficiency and the accuracy of extracting the video characteristics can be improved. The method can be further applied to a short video platform, and richness and accuracy of search and recommendation results can be improved.
Drawings
FIG. 1 is an example of a potential application scenario of the method
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
A video scene label extraction system based on deep learning comprises a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
The identification and processing module can be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; the cloud platform may be, for example, Aliyun. Step 2, open a port on the server, such as 8080, to handle requests arriving over the internet; these requests use the HTTP protocol.
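The two deployment steps above can be sketched as follows. The patent names only Flask, an HTTP service, and port 8080; the route name, request shape, and response shape below are illustrative assumptions.

```python
# Minimal sketch of the cloud deployment in Example 1: a Flask back end
# exposing an HTTP service. The /predict route and JSON payload shape are
# assumptions for illustration, not taken from the patent.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # In the real system the request would carry a video or key-frame image;
    # here we echo a placeholder result with an empty label list.
    payload = request.get_json(silent=True) or {}
    return jsonify({"video": payload.get("video", ""), "labels": []})

# To serve requests from the internet, open a port such as 8080:
#   app.run(host="0.0.0.0", port=8080)
```

In production the service would sit behind a WSGI server rather than Flask's built-in development server.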
Example 2
The extraction method of the deep learning-based video scene tag extraction system according to embodiment 1 includes the following steps:
step 1, the sample construction module collects picture samples, including picture samples from public data sets and picture samples obtained through keyword search; all picture samples are given a second round of scene-label annotation according to the scenes to be mined (scene label categories include, for example, indoor, street, park, home, office, field, seaside, mountain, car, ship, snow mountain and desert), and pictures that do not match the scene label of their category are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
The preprocessed key-frame pictures are input into the recognition model to obtain a recognized candidate scene label x and a corresponding score between 0 and 1; if the score is above a threshold a, the picture is considered to have the label x. Preferably, a is 0.5.
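A minimal sketch of this thresholding rule, assuming the model's output is a mapping from scene label to score:

```python
# Keep every scene label whose score (between 0 and 1) is above the
# threshold a, with a = 0.5 as the patent prefers. The dict-of-scores
# input shape is an assumption about the model's output format.
def labels_above_threshold(scores: dict, a: float = 0.5) -> list:
    """Return the scene labels x whose score exceeds threshold a."""
    return [label for label, score in scores.items() if score > a]
```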
In order to optimize recognition, the deep learning model adopts the EfficientNet network structure. The usual ways of scaling up a deep learning classification network are widening the network, deepening the network, and increasing the input resolution; EfficientNet treats network width, depth and resolution as joint optimization parameters and finds the best width-depth-resolution combination under a given model complexity, which improves the recognition accuracy of the extraction system.
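The joint width/depth/resolution trade-off can be illustrated with the compound-scaling rule from the EfficientNet paper, where a single coefficient phi scales all three together. The base factors below are the paper's published values, not figures from this patent.

```python
# EfficientNet compound scaling: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, with alpha * beta^2 * gamma^2 ≈ 2 so that
# FLOPs roughly double with each increment of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi: int) -> dict:
    """Return the depth/width/resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }
```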
In order to unify the picture samples, during the preprocessing in step 2, picture samples narrower than 200 pixels are first filtered out, and the remaining samples are resampled and black-filled into 446 × 446 pixel pictures; the preprocessing in step 4 likewise resamples and black-fills into 446 × 446 pixel pictures.
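The filtering and black-fill ("letterbox") step can be sketched as pure geometry. The actual resampling would use an image library such as Pillow; centering the scaled image on the canvas is an assumption, as the patent does not specify the padding placement.

```python
# Geometry of step 2's preprocessing: drop pictures narrower than 200 px,
# scale the rest to fit a 446x446 canvas, and pad the remainder with black.
TARGET = 446

def letterbox_geometry(width: int, height: int):
    """Return (new_w, new_h, pad_left, pad_top) for the 446x446 canvas,
    or None if the picture is filtered out (width < 200 px)."""
    if width < 200:
        return None
    scale = TARGET / max(width, height)       # fit the longer side
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (TARGET - new_w) // 2          # centering is an assumption
    pad_top = (TARGET - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```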
In step 1, relevant samples are downloaded from public network data sets, such as ImageNet, to obtain the picture samples from public data sets.
In step 2, the ratio of the number of samples in the training set to the number of samples in the validation set is (3-5):1, more preferably 4:1.
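A simple sketch of the preferred 4:1 split; shuffling before splitting is an assumption, as the patent only fixes the ratio.

```python
import random

# Split preprocessed picture samples into a training set and a
# verification set at ratio:1 (4:1 preferred in the patent).
def split_samples(samples, ratio: int = 4, seed: int = 0):
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)                      # shuffling is an assumption
    cut = len(shuffled) * ratio // (ratio + 1)
    return shuffled[:cut], shuffled[cut:]
```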
In order to improve the generalization ability of the deep learning model, in step 2 the number of samples is increased after preprocessing using sample enhancement techniques, including translation by random pixel offsets, rotation by random angles, and left-right mirroring; for example, translating up, down, left or right by 1-50 pixels and rotating by a random angle between -20 and 20 degrees. All pictures are enhanced in this way; applying different enhancement methods to one picture yields several pictures, all of which are used as training samples.
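The enhancement parameters above can be sketched as a random sampler. Applying them would again need an image library; only the parameter sampling is shown, and the 50% mirroring chance is an assumption.

```python
import random

# Sample one set of augmentation parameters per enhanced copy of a picture:
# translation of 1-50 px in each axis, rotation of -20..20 degrees, and an
# optional left-right mirror.
def sample_augmentation(rng: random.Random) -> dict:
    return {
        "dx": rng.choice([-1, 1]) * rng.randint(1, 50),  # horizontal shift, px
        "dy": rng.choice([-1, 1]) * rng.randint(1, 50),  # vertical shift, px
        "angle": rng.uniform(-20.0, 20.0),               # rotation, degrees
        "mirror": rng.random() < 0.5,                    # left-right flip
    }
```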
In order to obtain key-frame pictures, the ffmpeg tool is used in step 4 to extract key frames from the video to be processed.
In order to optimize label extraction, key-frame pictures are extracted every 2-5 s in step 4. After each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated. Specific uses of this information are detailed in Example 3.
Example 3
This embodiment exemplifies an application scenario of the extraction system of embodiment 1 or the extraction method of embodiment 2.
3.1
The extraction system and extraction method can be applied to short-video feature extraction and search strategy optimization, for example by computing which main scene labels a short video contains and using them as features in video search: the search no longer relies on the video title alone but directly searches the content tags of the video. Current video search relies mainly on title information; as shown in FIG. 1, adding content information increases the richness of search results.
3.2
The extraction system and extraction method can be applied to short-video recommendation strategy optimization: mine which labels correspond to higher completion rates and like rates among users, and then increase the recommendation weight of videos of those types.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also fall within the protection scope of the present invention.
Claims (10)
1. A video scene label extraction system based on deep learning is characterized by comprising a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
2. The deep learning-based video scene tag extraction system of claim 1, wherein the identification and processing module is configured to be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; step 2, the server opens a port to handle requests arriving over the internet.
3. The deep learning based video scene tag extraction system of claim 1, wherein the deep learning model employs an EfficientNet network structure.
4. Use of a deep learning based video scene tag extraction system according to any of claims 1-3 in short video feature extraction and search strategy optimization.
5. The extraction method of the video scene label extraction system based on deep learning is characterized by comprising the following steps:
step 1, a sample construction module collects picture samples, wherein the picture samples comprise picture samples in a public data set and picture samples obtained through keyword search, all the picture samples are secondarily labeled with scene labels according to scenes needing to be mined, and pictures which do not accord with the scene labels in related categories are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
and inputting the preprocessed key-frame picture into the recognition model to obtain a recognized candidate scene label x and a corresponding score; if the score is above a threshold a, the picture is considered to have the label x; preferably, the score is between 0 and 1 and the value of a is 0.5.
6. The extraction method according to claim 5, wherein in the step 2 preprocessing, the picture samples with the width less than 200 pixels are filtered, and then the picture samples are resampled and black-filled to be processed into a 446 × 446 pixel picture; and when the preprocessing is performed in the step 4, resampling and black filling are performed, and a 446 × 446 pixel picture is processed.
7. The extraction method according to claim 5, wherein in step 2, the ratio of the number of samples in the training set to the number of samples in the validation set is (3-5):1, more preferably 4:1.
8. The extraction method according to claim 5, wherein in the step 2, after the preprocessing, the number of samples is increased by using a sample enhancement technique, including translating random pixels, rotating random angles or mirroring left and right.
9. The extraction method as claimed in claim 5, wherein in step 4, the key frames of the video to be processed are extracted by using ffmpeg tool;
in step 4, key-frame pictures are extracted every 2-5 s, and after each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated.
10. Use of the extraction method of any one of claims 5-9 in short video feature extraction and search strategy optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010281542.6A CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010281542.6A CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113536823A true CN113536823A (en) | 2021-10-22 |
Family
ID=78087766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010281542.6A Pending CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113536823A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187558A (en) * | 2021-12-20 | 2022-03-15 | 深圳万兴软件有限公司 | Video scene recognition method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105187795A (en) * | 2015-09-14 | 2015-12-23 | 博康云信科技有限公司 | Video label positioning method and device based on view library |
CN108777815A (en) * | 2018-06-08 | 2018-11-09 | Oppo广东移动通信有限公司 | Method for processing video frequency and device, electronic equipment, computer readable storage medium |
CN109284784A (en) * | 2018-09-29 | 2019-01-29 | 北京数美时代科技有限公司 | A kind of content auditing model training method and device for live scene video |
CN110674345A (en) * | 2019-09-12 | 2020-01-10 | 北京奇艺世纪科技有限公司 | Video searching method and device and server |
CN110827250A (en) * | 2019-10-29 | 2020-02-21 | 浙江明峰智能医疗科技有限公司 | Intelligent medical image quality evaluation method based on lightweight convolutional neural network |
- 2020-04-10: CN application CN202010281542.6A filed (publication CN113536823A, status pending)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105187795A (en) * | 2015-09-14 | 2015-12-23 | 博康云信科技有限公司 | Video label positioning method and device based on view library |
CN108777815A (en) * | 2018-06-08 | 2018-11-09 | Oppo广东移动通信有限公司 | Method for processing video frequency and device, electronic equipment, computer readable storage medium |
CN109284784A (en) * | 2018-09-29 | 2019-01-29 | 北京数美时代科技有限公司 | A kind of content auditing model training method and device for live scene video |
CN110674345A (en) * | 2019-09-12 | 2020-01-10 | 北京奇艺世纪科技有限公司 | Video searching method and device and server |
CN110827250A (en) * | 2019-10-29 | 2020-02-21 | 浙江明峰智能医疗科技有限公司 | Intelligent medical image quality evaluation method based on lightweight convolutional neural network |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187558A (en) * | 2021-12-20 | 2022-03-15 | 深圳万兴软件有限公司 | Video scene recognition method and device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110297943B (en) | Label adding method and device, electronic equipment and storage medium | |
CN109657552B (en) | Vehicle type recognition device and method for realizing cross-scene cold start based on transfer learning | |
WO2021082589A1 (en) | Content check model training method and apparatus, video content check method and apparatus, computer device, and storage medium | |
US8315430B2 (en) | Object recognition and database population for video indexing | |
CN107103314B (en) | A kind of fake license plate vehicle retrieval system based on machine vision | |
CN111191695A (en) | Website picture tampering detection method based on deep learning | |
CN106844685B (en) | Method, device and server for identifying website | |
CN113407886A (en) | Network crime platform identification method, system, device and computer storage medium | |
CN107992937B (en) | Unstructured data judgment method and device based on deep learning | |
CN112488222B (en) | Crowdsourcing data labeling method, system, server and storage medium | |
CN109684511A (en) | A kind of video clipping method, video aggregation method, apparatus and system | |
Chen et al. | Text area detection from video frames | |
CN111601179A (en) | Network advertisement promotion method based on video content | |
CN115775363A (en) | Illegal video detection method based on text and video fusion | |
CN109408671A (en) | The searching method and its system of specific objective | |
CN111914649A (en) | Face recognition method and device, electronic equipment and storage medium | |
CN113536823A (en) | Video scene label extraction system and method based on deep learning and application thereof | |
CN113536032A (en) | Video sequence information mining system, method and application thereof | |
CN114037886A (en) | Image recognition method and device, electronic equipment and readable storage medium | |
CN113762034A (en) | Video classification method and device, storage medium and electronic equipment | |
CN115983873B (en) | User data analysis management system and method based on big data | |
CN112685510B (en) | Asset labeling method, computer program and storage medium based on full flow label | |
CN111860222B (en) | Video behavior recognition method, system, computer device and storage medium based on dense-segmented frame sampling | |
CN110163043B (en) | Face detection method, device, storage medium and electronic device | |
CN114267084A (en) | Video identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2021-10-22