CN113536823A - Video scene label extraction system and method based on deep learning and application thereof - Google Patents
- Publication number
- CN113536823A (application CN202010281542.6A)
- Authority
- CN
- China
- Prior art keywords
- deep learning
- video
- module
- picture
- preprocessing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a video scene label extraction system based on deep learning, and a method thereof comprising the following steps: step 1, a sample construction module collects picture samples and applies a second round of scene-label annotation; step 2, a data preprocessing module preprocesses the picture samples obtained in step 1 and divides them into a training set and a verification set; step 3, the deep learning model in the deep learning model module is trained with the training set obtained in step 2 and then verified with the verification set to obtain a recognition model; step 4, the identification and processing module extracts key frames from the video to be processed, records the time point of each key frame in the video, and preprocesses the extracted key frames; the preprocessed key-frame pictures are input into the recognition model to obtain the recognized candidate scene labels. The invention addresses the insufficient extraction of video content information, and a client can use the recognized information to optimize video recommendation and search.
Description
Technical Field
The invention relates to the technical field of video processing, in particular to a system and a method for extracting video scene labels based on deep learning and application thereof.
Background
In the field of video apps, the user bases of short-video and mini-video products are growing rapidly, and a large amount of video data needs to be analyzed and processed. Video platforms need to analyze videos effectively, yet the content labels currently extracted from videos are single-dimensional: general video search relies mainly on matching keywords against video titles, so the extracted labels deviate from the actual video content.
Disclosure of Invention
The invention aims to provide a deep learning-based video scene tag extraction system, addressing the prior-art problem that video tags extracted from keywords deviate from the video content.
Another object of the present invention is to provide an extraction method of the video scene tag extraction system.
Another object of the present invention is to provide an application of the video scene tag extraction system and the extraction method.
The technical scheme adopted for realizing the purpose of the invention is as follows:
a video scene label extraction system based on deep learning is characterized by comprising a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
In the above technical solution, the identification and processing module can be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; step 2, the server opens a port to handle requests arriving over the internet.
In the above technical solution, the deep learning model adopts an EfficientNet network structure.
In another aspect of the invention, the deep learning-based video scene label extraction system is applied to short video feature extraction and search strategy optimization.
In another aspect of the present invention, an extraction method of a deep learning-based video scene tag extraction system includes the following steps:
step 1, a sample construction module collects picture samples, wherein the picture samples comprise picture samples in a public data set and picture samples obtained through keyword search, all the picture samples are secondarily labeled with scene labels according to scenes needing to be mined, and pictures which do not accord with the scene labels in related categories are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
inputting the preprocessed key-frame picture into the recognition model to obtain a recognized candidate scene label x and a corresponding score; if the score is above a threshold a, the picture is considered to have the label x. Preferably, the score is between 0 and 1 and the value of a is 0.5.
In the above technical solution, during the preprocessing in step 2, picture samples narrower than 200 pixels are first filtered out, and the remaining samples are resampled and black-filled into 446 × 446 pixel pictures; the preprocessing in step 4 likewise resamples and black-fills into 446 × 446 pixel pictures.
In the above technical solution, in step 2, the ratio of the number of samples in the training set to the number of samples in the verification set is (3-5):1, more preferably 4:1.
In the above technical solution, in the step 2, after the preprocessing, the number of samples is increased by using a sample enhancement technique, including translating random pixels, rotating random angles or left and right mirroring.
In the above technical solution, in step 4, the ffmpeg tool is used to extract key frames from the video to be processed;
in step 4, key-frame pictures are extracted every 2-5 s, and after each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated.
In another aspect of the invention, the extraction method is applied to short video feature extraction and search strategy optimization.
Compared with the prior art, the invention has the beneficial effects that:
the method can extract and identify the scene labels of the video through a deep learning technology, and based on the sequence information, the efficiency and the accuracy of extracting the video characteristics can be improved. The method can be further applied to a short video platform, and richness and accuracy of search and recommendation results can be improved.
Drawings
FIG. 1 is an example of a potential application scenario of the method
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
A video scene label extraction system based on deep learning comprises a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
The identification and processing module can be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; the cloud platform may be, for example, Aliyun. Step 2, open a port on the server, such as 8080, to handle requests arriving over the internet; these requests use the HTTP protocol.
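The two deployment steps above can be sketched as follows. The patent names only Flask, an HTTP service, and port 8080; the route name, request shape, and response shape below are illustrative assumptions.

```python
# Minimal sketch of the cloud deployment in Example 1: a Flask back end
# exposing an HTTP service. The /predict route and JSON payload shape are
# assumptions for illustration, not taken from the patent.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # In the real system the request would carry a video or key-frame image;
    # here we echo a placeholder result with an empty label list.
    payload = request.get_json(silent=True) or {}
    return jsonify({"video": payload.get("video", ""), "labels": []})

# To serve requests from the internet, open a port such as 8080:
#   app.run(host="0.0.0.0", port=8080)
```

In production the service would sit behind a WSGI server rather than Flask's built-in development server.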
Example 2
The extraction method of the deep learning-based video scene tag extraction system according to embodiment 1 includes the following steps:
step 1, the sample construction module collects picture samples, including picture samples from public data sets and picture samples obtained through keyword search; all picture samples are given a second round of scene-label annotation according to the scenes to be mined (scene label categories include, for example, indoor, street, park, home, office, field, seaside, mountain, car, ship, snow mountain and desert), and pictures that do not match the scene label of their category are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
The preprocessed key-frame pictures are input into the recognition model to obtain a recognized candidate scene label x and a corresponding score between 0 and 1; if the score is above a threshold a, the picture is considered to have the label x. Preferably, a is 0.5.
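A minimal sketch of this thresholding rule, assuming the model's output is a mapping from scene label to score:

```python
# Keep every scene label whose score (between 0 and 1) is above the
# threshold a, with a = 0.5 as the patent prefers. The dict-of-scores
# input shape is an assumption about the model's output format.
def labels_above_threshold(scores: dict, a: float = 0.5) -> list:
    """Return the scene labels x whose score exceeds threshold a."""
    return [label for label, score in scores.items() if score > a]
```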
In order to optimize recognition, the deep learning model adopts the EfficientNet network structure. The usual ways of scaling up a deep learning classification network are widening the network, deepening the network, and increasing the input resolution; EfficientNet treats network width, depth and resolution as joint optimization parameters and finds the best width-depth-resolution combination under a given model complexity, which improves the recognition accuracy of the extraction system.
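The joint width/depth/resolution trade-off can be illustrated with the compound-scaling rule from the EfficientNet paper, where a single coefficient phi scales all three together. The base factors below are the paper's published values, not figures from this patent.

```python
# EfficientNet compound scaling: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, with alpha * beta^2 * gamma^2 ≈ 2 so that
# FLOPs roughly double with each increment of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi: int) -> dict:
    """Return the depth/width/resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }
```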
In order to unify the picture samples, during the preprocessing in step 2, picture samples narrower than 200 pixels are first filtered out, and the remaining samples are resampled and black-filled into 446 × 446 pixel pictures; the preprocessing in step 4 likewise resamples and black-fills into 446 × 446 pixel pictures.
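The filtering and black-fill ("letterbox") step can be sketched as pure geometry. The actual resampling would use an image library such as Pillow; centering the scaled image on the canvas is an assumption, as the patent does not specify the padding placement.

```python
# Geometry of step 2's preprocessing: drop pictures narrower than 200 px,
# scale the rest to fit a 446x446 canvas, and pad the remainder with black.
TARGET = 446

def letterbox_geometry(width: int, height: int):
    """Return (new_w, new_h, pad_left, pad_top) for the 446x446 canvas,
    or None if the picture is filtered out (width < 200 px)."""
    if width < 200:
        return None
    scale = TARGET / max(width, height)       # fit the longer side
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (TARGET - new_w) // 2          # centering is an assumption
    pad_top = (TARGET - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```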
In step 1, relevant samples are downloaded from public network data sets, such as ImageNet, to obtain the picture samples from public data sets.
In step 2, the ratio of the number of samples in the training set to the number of samples in the validation set is (3-5):1, more preferably 4:1.
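A simple sketch of the preferred 4:1 split; shuffling before splitting is an assumption, as the patent only fixes the ratio.

```python
import random

# Split preprocessed picture samples into a training set and a
# verification set at ratio:1 (4:1 preferred in the patent).
def split_samples(samples, ratio: int = 4, seed: int = 0):
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)                      # shuffling is an assumption
    cut = len(shuffled) * ratio // (ratio + 1)
    return shuffled[:cut], shuffled[cut:]
```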
In order to improve the generalization ability of the deep learning model, in step 2 the number of samples is increased after preprocessing using sample enhancement techniques, including translation by random pixel offsets, rotation by random angles, and left-right mirroring; for example, translating up, down, left or right by 1-50 pixels and rotating by a random angle between -20 and 20 degrees. All pictures are enhanced in this way; applying different enhancement methods to one picture yields several pictures, all of which are used as training samples.
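The enhancement parameters above can be sketched as a random sampler. Applying them would again need an image library; only the parameter sampling is shown, and the 50% mirroring chance is an assumption.

```python
import random

# Sample one set of augmentation parameters per enhanced copy of a picture:
# translation of 1-50 px in each axis, rotation of -20..20 degrees, and an
# optional left-right mirror.
def sample_augmentation(rng: random.Random) -> dict:
    return {
        "dx": rng.choice([-1, 1]) * rng.randint(1, 50),  # horizontal shift, px
        "dy": rng.choice([-1, 1]) * rng.randint(1, 50),  # vertical shift, px
        "angle": rng.uniform(-20.0, 20.0),               # rotation, degrees
        "mirror": rng.random() < 0.5,                    # left-right flip
    }
```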
In order to obtain key-frame pictures, the ffmpeg tool is used in step 4 to extract key frames from the video to be processed.
In order to optimize label extraction, key-frame pictures are extracted every 2-5 s in step 4. After each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated. Specific uses of this information are detailed in Example 3.
Example 3
This embodiment exemplifies an application scenario of the extraction system of embodiment 1 or the extraction method of embodiment 2.
3.1
The extraction system and extraction method can be applied to short-video feature extraction and search strategy optimization, for example by computing which main scene labels a short video contains and using them as features in video search: the search no longer relies on the video title alone but directly searches the content tags of the video. Current video search relies mainly on title information; as shown in FIG. 1, adding content information increases the richness of search results.
3.2
The extraction system and extraction method can be applied to short-video recommendation strategy optimization: mine which labels correspond to higher completion rates and like rates among users, and then increase the recommendation weight of videos of those types.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also fall within the protection scope of the present invention.
Claims (10)
1. A video scene label extraction system based on deep learning is characterized by comprising a sample construction module, a data preprocessing module, a deep learning model module and an identification and processing module;
the system comprises a sample construction module, a data preprocessing module, a deep learning model module, an identification and processing module, a video processing module and a data processing module, wherein the sample construction module is used for collecting picture samples and labeling the picture samples with scene labels, the data preprocessing module is used for preprocessing the picture samples in a filtering and standardization mode, the deep learning model in the deep learning model module is trained by the preprocessed picture samples to obtain an identification model, the identification and processing module is used for extracting pictures from videos and then standardizing the pictures, and the identification model is used for identifying the standardized extracted pictures and outputting the video scene labels.
2. The deep learning-based video scene tag extraction system of claim 1, wherein the identification and processing module is configured to be used locally or deployed in the cloud, and cloud deployment proceeds as follows:
step 1, deploy the server back end with the Python Flask framework and build an HTTP service; step 2, the server opens a port to handle requests arriving over the internet.
3. The deep learning based video scene tag extraction system of claim 1, wherein the deep learning model employs an EfficientNet network structure.
4. Use of a deep learning based video scene tag extraction system according to any of claims 1-3 in short video feature extraction and search strategy optimization.
5. The extraction method of the video scene label extraction system based on deep learning is characterized by comprising the following steps:
step 1, a sample construction module collects picture samples, wherein the picture samples comprise picture samples in a public data set and picture samples obtained through keyword search, all the picture samples are secondarily labeled with scene labels according to scenes needing to be mined, and pictures which do not accord with the scene labels in related categories are deleted;
step 2, preprocessing the picture samples obtained in the step 1 by a data preprocessing module, sorting the picture samples according to categories, and dividing the picture samples into a training set and a verification set;
step 3, training the deep learning model in the deep learning model module by using the training set obtained in the step 2, verifying the deep learning model by using the verification set, and storing the deep learning model with the optimal effect on the verification set to obtain an identification model;
step 4, the identification and processing module extracts key frames of the video to be processed, records corresponding time points of the key frames in the video, and preprocesses the extracted key frames;
and inputting the preprocessed key-frame picture into the recognition model to obtain a recognized candidate scene label x and a corresponding score; if the score is above a threshold a, the picture is considered to have the label x; preferably, the score is between 0 and 1 and the value of a is 0.5.
6. The extraction method according to claim 5, wherein in the step 2 preprocessing, the picture samples with the width less than 200 pixels are filtered, and then the picture samples are resampled and black-filled to be processed into a 446 × 446 pixel picture; and when the preprocessing is performed in the step 4, resampling and black filling are performed, and a 446 × 446 pixel picture is processed.
7. The extraction method according to claim 5, wherein in step 2, the ratio of the number of samples in the training set to the number of samples in the validation set is (3-5):1, more preferably 4:1.
8. The extraction method according to claim 5, wherein in the step 2, after the preprocessing, the number of samples is increased by using a sample enhancement technique, including translating random pixels, rotating random angles or mirroring left and right.
9. The extraction method as claimed in claim 5, wherein in step 4, the key frames of the video to be processed are extracted by using ffmpeg tool;
in step 4, key-frame pictures are extracted every 2-5 s, and after each key frame in the video is identified, the scene label corresponding to each time point, namely the scene label contained in the key frame at that time point, can be generated.
10. Use of the extraction method of any one of claims 5-9 in short video feature extraction and search strategy optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010281542.6A CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010281542.6A CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113536823A true CN113536823A (en) | 2021-10-22 |
Family
ID=78087766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010281542.6A Pending CN113536823A (en) | 2020-04-10 | 2020-04-10 | Video scene label extraction system and method based on deep learning and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113536823A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187558A (en) * | 2021-12-20 | 2022-03-15 | 深圳万兴软件有限公司 | Video scene recognition method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105187795A (en) * | 2015-09-14 | 2015-12-23 | 博康云信科技有限公司 | Video label positioning method and device based on view library |
CN108777815A (en) * | 2018-06-08 | 2018-11-09 | Oppo广东移动通信有限公司 | Method for processing video frequency and device, electronic equipment, computer readable storage medium |
CN109284784A (en) * | 2018-09-29 | 2019-01-29 | 北京数美时代科技有限公司 | A kind of content auditing model training method and device for live scene video |
CN110674345A (en) * | 2019-09-12 | 2020-01-10 | 北京奇艺世纪科技有限公司 | Video searching method and device and server |
CN110827250A (en) * | 2019-10-29 | 2020-02-21 | 浙江明峰智能医疗科技有限公司 | Intelligent medical image quality evaluation method based on lightweight convolutional neural network |
- 2020-04-10: CN application CN202010281542.6A filed (publication CN113536823A, status pending)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105187795A (en) * | 2015-09-14 | 2015-12-23 | 博康云信科技有限公司 | Video label positioning method and device based on view library |
CN108777815A (en) * | 2018-06-08 | 2018-11-09 | Oppo广东移动通信有限公司 | Method for processing video frequency and device, electronic equipment, computer readable storage medium |
CN109284784A (en) * | 2018-09-29 | 2019-01-29 | 北京数美时代科技有限公司 | A kind of content auditing model training method and device for live scene video |
CN110674345A (en) * | 2019-09-12 | 2020-01-10 | 北京奇艺世纪科技有限公司 | Video searching method and device and server |
CN110827250A (en) * | 2019-10-29 | 2020-02-21 | 浙江明峰智能医疗科技有限公司 | Intelligent medical image quality evaluation method based on lightweight convolutional neural network |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187558A (en) * | 2021-12-20 | 2022-03-15 | 深圳万兴软件有限公司 | Video scene recognition method and device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110297943B (en) | Label adding method and device, electronic equipment and storage medium | |
CN109657552B (en) | Vehicle type recognition device and method for realizing cross-scene cold start based on transfer learning | |
WO2021082589A1 (en) | Content check model training method and apparatus, video content check method and apparatus, computer device, and storage medium | |
US8315430B2 (en) | Object recognition and database population for video indexing | |
CN107103314B (en) | A kind of fake license plate vehicle retrieval system based on machine vision | |
CN111191695A (en) | Website picture tampering detection method based on deep learning | |
CN106844685B (en) | Method, device and server for identifying website | |
CN113407886A (en) | Network crime platform identification method, system, device and computer storage medium | |
CN107992937B (en) | Unstructured data judgment method and device based on deep learning | |
CN112488222B (en) | Crowdsourcing data labeling method, system, server and storage medium | |
CN109684511A (en) | A kind of video clipping method, video aggregation method, apparatus and system | |
Chen et al. | Text area detection from video frames | |
CN111601179A (en) | Network advertisement promotion method based on video content | |
CN115775363A (en) | Illegal video detection method based on text and video fusion | |
CN109408671A (en) | The searching method and its system of specific objective | |
CN111914649A (en) | Face recognition method and device, electronic equipment and storage medium | |
CN113536823A (en) | Video scene label extraction system and method based on deep learning and application thereof | |
CN113536032A (en) | Video sequence information mining system, method and application thereof | |
CN114037886A (en) | Image recognition method and device, electronic equipment and readable storage medium | |
CN113762034A (en) | Video classification method and device, storage medium and electronic equipment | |
CN115983873B (en) | User data analysis management system and method based on big data | |
CN112685510B (en) | Asset labeling method, computer program and storage medium based on full flow label | |
CN111860222B (en) | Video behavior recognition method, system, computer device and storage medium based on dense-segmented frame sampling | |
CN110163043B (en) | Face detection method, device, storage medium and electronic device | |
CN114267084A (en) | Video identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2021-10-22