CN110222234B - Video classification method and device - Google Patents

Video classification method and device

Info

Publication number
CN110222234B
Authority
CN
China
Prior art keywords
data
classification result
video
classification
text data
Prior art date
Legal status
Active
Application number
CN201910516934.3A
Other languages
Chinese (zh)
Other versions
CN110222234A (en)
Inventor
谷满昌
张弛
陈相男
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910516934.3A
Publication of CN110222234A
Application granted
Publication of CN110222234B

Classifications

    • G06F 16/75: Information retrieval of video data; Clustering; Classification
    • G06F 16/783: Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/7867: Information retrieval of video data; Retrieval characterised by using metadata; using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F 18/24143: Pattern recognition; Classification techniques based on distances to training or reference patterns; Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06V 20/41: Scenes; Scene-specific elements in video content; Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The embodiments of the present application disclose a video classification method and apparatus for improving the efficiency of video classification. The method in the embodiments of the present application comprises the following steps: acquiring title data and a cover image of a video to be classified; performing word segmentation on the title data to generate feature words, and inputting the feature words into a wide network model to obtain a first classification result; converting the title data into an input sample for a deep network; inputting the input sample into the deep network to obtain a second classification result; performing feature extraction on the cover image to obtain feature data; inputting the feature data into the deep network to obtain a third classification result; and determining a target classification result according to the first classification result, the second classification result and the third classification result.

Description

Video classification method and device
Technical Field
The present application relates to the field of computers, and in particular, to a video classification method and apparatus.
Background
Video classification is an important research topic in the fields of computer vision and natural language processing, and remains a challenging and active problem. With the rapid growth of video data, video classification has attracted considerable attention. To meet the requirements of different users, video content needs to be classified.
Prior-art video classification is mainly based on visual information: features are extracted from each frame of a video, and the video is then classified with traditional machine learning methods such as support vector machines and naive Bayes models. Specifically, during video classification, a key region of each video frame needs to be determined, classification prediction is then performed on the key region to obtain a classification result of the video to be classified, and the categories of different video frames are finally compared to determine the final classification result.
As a result, a large number of video frames must be identified, and classification efficiency is low.
Disclosure of Invention
The embodiments of the present application provide a video classification method and apparatus for improving the efficiency of video classification.
In a first aspect, an embodiment of the present application provides a video classification method, which specifically includes: the video classification apparatus acquires title data and a cover image of a video to be classified; the video classification apparatus then performs word segmentation on the title data to generate feature words, and inputs the feature words into a wide network model to obtain a first classification result; the video classification apparatus then converts the title data into an input sample for a deep network, and inputs the input sample into the deep network to obtain a second classification result; the video classification apparatus then performs feature extraction on the cover image to obtain feature data, and inputs the feature data into the deep network to obtain a third classification result; finally, the video classification apparatus determines a target classification result according to the first classification result, the second classification result and the third classification result.
In the embodiments of the present application, the video classification apparatus extracts features from the title data and the cover image of the video to be classified, and then achieves cross-complementation of features by using a video classification model that combines a wide network and a deep network, thereby improving both the accuracy and the efficiency of video classification.
Optionally, when inputting the input sample into the deep network to obtain the second classification result, the video classification apparatus may adopt the following technical solution: the video classification apparatus sequentially performs feature extraction and feature compression on the input sample to obtain a semantic representation vector; it then activates a preset part of the semantic representation vector through an attention mechanism to obtain a text representation vector; and finally it performs softmax classification on the text representation vector to obtain the second classification result.
Optionally, when inputting the feature data into the deep network to obtain the third classification result, the video classification apparatus may adopt the following technical solution: the video classification apparatus passes the feature data sequentially through at least two feed-forward neural networks and then performs softmax classification to obtain the third classification result.
Optionally, when performing word segmentation on the title data to generate the feature words, the video classification apparatus may specifically adopt the following technical solution: the video classification apparatus removes non-text data from the title data to obtain text data; unifies the format of the text data and generates text data to be recognized according to a preset length; and segments the text data to be recognized to generate n-gram features, which serve as the feature words.
Similarly and optionally, when converting the title data into the input sample for the deep network, the video classification apparatus may specifically adopt the following technical solution: the video classification apparatus removes non-text data from the title data to obtain text data; unifies the format of the text data and generates the text data to be recognized according to a preset length; and converts the text data to be recognized into the input sample for the deep network.
Optionally, the video classification apparatus may convert the text data to be recognized into the input sample for the deep network by using a word embedding method.
It will be appreciated that the video classification apparatus may extract image features of the cover image with a pre-trained ResNet-50 model to generate the feature data. Of course, in practical applications, various other models can also be used to extract image features.
In this embodiment, unifying the format of the text data may specifically mean unifying the character format: for example, all English letters are converted to lower case or to upper case, all Chinese characters are converted to simplified or to traditional form, and all characters are converted to full-width or to half-width. The video classification apparatus generates the text data to be recognized according to a fixed length as follows: the apparatus presets the length of the text data to be recognized (that is, the preset length is the fixed length), and then truncates or pads the text data until the fixed length is reached. For example, if the fixed length of the text data to be recognized is set to 25 characters and the extracted text data is 26 characters long, the video classification apparatus discards the last character of the text data to generate the text data to be recognized; if the extracted text data is 20 characters long, the video classification apparatus pads the last 5 positions with 0 to complete 25 characters and generate the text data to be recognized.
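As an illustration of this step, the following sketch unifies the character format and forces a title to the fixed length. The 25-character length, the lower-casing, the full-width-to-half-width mapping and the '0' padding follow the example in this embodiment; the helper name and the omission of simplified/traditional conversion are assumptions of the sketch.

```python
def to_fixed_length(text: str, fixed_len: int = 25, pad_char: str = "0") -> str:
    """Unify the character format of a cleaned title and force it to a fixed length."""
    normalized = []
    for ch in text.lower():                    # unify English letters to lower case
        code = ord(ch)
        if code == 0x3000:                     # full-width space -> half-width space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:         # other full-width ASCII forms -> half-width
            code -= 0xFEE0
        normalized.append(chr(code))
    text = "".join(normalized)

    # Discard characters beyond the preset length, or pad with "0" up to it.
    if len(text) >= fixed_len:
        return text[:fixed_len]
    return text + pad_char * (fixed_len - len(text))
```

For instance, to_fixed_length('ABC', 5) returns 'abc00'.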
Based on the above solution, when segmenting the text data to be recognized to generate the n-gram features, the video classification apparatus may adopt the following technical solution: the video classification apparatus performs word segmentation on the text data to be recognized to obtain continuous word strings, and statistically extracts continuous word strings of preset length n to obtain the n-gram features. It can be understood that the video classification apparatus performs word segmentation on the text data to be recognized according to semantics, so that the text data to be recognized is split into continuous word strings. For example, if the title of a video is 'Do the following four steps, and finishing computer repair in half an hour is not a dream', word segmentation splits this title into a sequence of consecutive words. A continuous word string of preset length n consists of n consecutive words; for example, when n is 3, (finish-computer-repair) is one such continuous word string.
Optionally, when performing feature extraction on the cover image to obtain the feature data, the video classification apparatus may specifically adopt the following technical solution: the video classification apparatus resizes the cover image to a preset value to obtain a cover image of a preset size, and then extracts image features from the cover image of the preset size to obtain the feature data.
In a second aspect, an embodiment of the present application provides a video classification apparatus.
In one possible implementation, the video classification apparatus includes:
an acquisition module, configured to acquire title data and a cover image of a video to be classified;
a processing module, configured to perform word segmentation on the title data to generate feature words, and input the feature words into a wide network model to obtain a first classification result; convert the title data into an input sample for a deep network; input the input sample into the deep network to obtain a second classification result; perform feature extraction on the cover image to obtain feature data; input the feature data into the deep network to obtain a third classification result; and determine a target classification result according to the first classification result, the second classification result and the third classification result.
Optionally, the processing module is specifically configured to sequentially perform feature extraction and feature compression on the input sample to obtain a semantic representation vector; activate a preset part of the semantic representation vector through an attention mechanism to obtain a text representation vector; and perform softmax classification on the text representation vector to obtain the second classification result.
Optionally, the processing module is specifically configured to pass the feature data sequentially through at least two feed-forward neural networks and then perform softmax classification to obtain the third classification result.
Optionally, the processing module is specifically configured to remove non-text data from the title data to obtain text data; unify the format of the text data and generate text data to be recognized according to a preset length; and segment the text data to be recognized to generate n-gram features, which serve as the feature words.
Optionally, the processing module is specifically configured to remove non-text data from the title data to obtain text data; unify the format of the text data and generate the text data to be recognized according to a preset length; and convert the text data to be recognized into the input sample for the deep network.
Optionally, the processing module is specifically configured to perform word segmentation on the text data to be recognized to obtain continuous word strings, and statistically extract continuous word strings of preset length n to obtain the n-gram features.
Optionally, the processing module is specifically configured to convert the text data to be recognized into the input sample for the deep network by using a word embedding method.
Optionally, the processing module is specifically configured to resize the cover image to a preset value to obtain a cover image of a preset size, and extract image features from the cover image of the preset size to generate the feature data.
In another possible implementation, the video classification apparatus includes a processor and a memory, wherein the memory stores a computer-readable program, and the processor is configured to execute the program in the memory to perform any of the above methods.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions for executing the method described in any one of the above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the methods described above.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages: the video classification apparatus extracts features from the title data and the cover image of the video to be classified, and then achieves cross-complementation of features by using a video classification model that combines a wide network and a deep network, thereby improving both the accuracy and the efficiency of video classification.
Drawings
FIG. 1 is a schematic diagram of a system architecture of a video classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a video classification method in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a sub-network of the deep network in the embodiment of the present application;
FIG. 4 is a flow chart of another sub-network of the deep network in the embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of a video classification apparatus in an embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of the video classification apparatus in the embodiment of the present application.
Detailed Description
The embodiments of the present application provide a video classification method and apparatus for improving the efficiency of video classification.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Video classification is an important research topic in the fields of computer vision and natural language processing, and remains a challenging and active problem. With the rapid growth of video data, video classification has attracted considerable attention, and video content needs to be classified to meet the requirements of different users. Prior-art video classification is mainly based on visual information: features are extracted from each frame of a video, and the video is then classified with traditional machine learning methods such as support vector machines and naive Bayes models. Specifically, during video classification, a key region of each video frame needs to be determined, classification prediction is then performed on the key region to obtain a classification result of the video to be classified, and the categories of different video frames are finally compared to determine the final classification result. As a result, a large number of video frames must be identified, and classification efficiency is low.
To solve this problem, the embodiments of the present application provide the following technical solution: the video classification apparatus acquires title data and a cover image of a video to be classified; the video classification apparatus then performs word segmentation on the title data to generate feature words, and inputs the feature words into a wide network model to obtain a first classification result; the video classification apparatus then converts the title data into an input sample for a deep network, and inputs the input sample into the deep network to obtain a second classification result; the video classification apparatus then performs feature extraction on the cover image to obtain feature data, and inputs the feature data into the deep network to obtain a third classification result; finally, the video classification apparatus determines a target classification result according to the first classification result, the second classification result and the third classification result.
The following description is given with reference to the system architecture shown in fig. 1. After the video classification apparatus receives a set of videos to be classified, it acquires the title and the cover image of each video to be classified in the set. The video classification apparatus then generates an input text from the title with a word embedding method and obtains classification result 2 by passing the input text through sub-network 1 of the deep network; at the same time, it generates n-gram features from the title and inputs them into a wide network model to obtain classification result 1; it also extracts image features of the cover image with a pre-trained ResNet-50 model to generate feature data, which is passed through sub-network 2 of the deep network to obtain classification result 3. The final predicted classification result is obtained from classification result 1, classification result 2 and classification result 3.
Specifically, referring to fig. 2, an embodiment of a video classification method in the embodiment of the present application includes:
201. The video classification apparatus acquires title data and a cover image of a video to be classified.
When the uploaded video to be classified is received, the video classification apparatus acquires the title of the video to be classified as the title data and acquires the cover image of the video to be classified.
It can be understood that the cover image of the video to be classified may be a representative video frame selected by the user from the video to be classified, or it may be a static picture uploaded by the user; the specific form is not limited here, that is, the cover image is simply a picture set by the user to describe the content of the video to be classified. The title data may be a title obtained directly from the user, or it may be obtained from the HTML file of the web page. For example, if the title set by the user is "Find the computer fault in one second, please watch carefully", the title in the HTML file may be "<head>Find the computer fault in one second, please watch carefully</head>".
202. The video classification apparatus preprocesses the title data to generate text data to be recognized, and preprocesses the cover image to generate feature data.
The video classification apparatus removes non-text data from the title data to obtain text data, where the non-text data includes HTML tags and special symbols; the video classification apparatus unifies the format of the text data and generates the text data to be recognized according to a preset length; and the video classification apparatus extracts image features of the cover image with a pre-trained ResNet-50 model to generate the feature data.
In this embodiment, unifying the format of the text data may specifically mean unifying the character format: for example, all English letters are converted to lower case or to upper case, all Chinese characters are converted to simplified or to traditional form, and all characters are converted to full-width or to half-width. The video classification apparatus generates the text data to be recognized according to a fixed length as follows: the apparatus presets the length of the text data to be recognized (that is, the preset length is the fixed length), and then truncates or pads the text data until the fixed length is reached. For example, if the fixed length of the text data to be recognized is set to 25 characters and the extracted text data is 26 characters long, the video classification apparatus discards the last character of the text data to generate the text data to be recognized; if the extracted text data is 20 characters long, the video classification apparatus pads the last 5 positions with 0 to complete 25 characters and generate the text data to be recognized. When the video classification apparatus extracts image features of the cover image with the pre-trained ResNet-50 model to generate the feature data, the image features may be 2048-dimensional image features. When extracting the image features of the cover image, the video classification apparatus may first resize the cover image to a preset size and then extract the image features with the pre-trained ResNet-50 model. It will be appreciated that the video classification apparatus may use a pre-trained ResNet-50 model to extract image features of the cover image and generate the feature data; of course, in practical applications, various other models can also be used to extract image features.
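A sketch of this cover-image feature extraction is given below, using a pre-trained torchvision ResNet-50 with its classification layer removed so that the 2048-dimensional pooled feature is returned. The 224x224 preset size, the ImageNet normalization constants and the helper name cover_features are illustrative assumptions rather than values fixed by this embodiment.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pre-trained ResNet-50 and drop its final classification layer so that
# the network outputs the 2048-dimensional pooled feature mentioned above.
resnet = models.resnet50(weights="IMAGENET1K_V1")   # torchvision >= 0.13 weights API
resnet.fc = torch.nn.Identity()                     # keep the 2048-d global-pooled vector
resnet.eval()

# Resize the cover image to a preset size (224x224 assumed here) and normalize
# with the ImageNet statistics the model was trained with.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cover_features(path: str) -> torch.Tensor:
    """Return a 2048-dimensional feature vector for the cover image at `path`."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return resnet(preprocess(img).unsqueeze(0)).squeeze(0)   # shape: (2048,)
```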
203. The video classification apparatus generates n-gram features from the text data to be recognized, and inputs the n-gram features into a wide network model to obtain a first classification result.
The video classification apparatus may generate the n-gram features from the text data to be recognized as follows: the video classification apparatus performs word segmentation on the text data to be recognized to obtain continuous word strings, and statistically extracts continuous word strings of preset length n to obtain the n-gram features; finally, the video classification apparatus inputs the n-gram features into the wide network to obtain the first classification result.
It can be understood that the video classification apparatus performs word segmentation on the text data to be recognized according to semantics, so that the text data to be recognized is split into continuous word strings. For example, if the title of a video is "Do the following four steps, and finishing computer repair in half an hour is not a dream", word segmentation splits this title into a sequence of consecutive words. A continuous word string of preset length n consists of n consecutive words; for example, when n is 3, (finish-computer-repair) is one such continuous word string. An n-gram feature also records the word frequency of the continuous word string over all videos. When n is 3, the n-gram feature is called a 3-gram; one example of a 3-gram is (finish-computer-repair, 6), where 6 indicates how often (finish-computer-repair) occurs over all videos. In addition, the video classification apparatus may set a preset threshold indicating a word-frequency threshold for the features: when the word frequency of a feature is lower than the preset threshold, the video classification apparatus deletes that feature.
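The n-gram statistics above can be sketched as follows. Word segmentation is assumed to have been performed upstream (for Chinese titles this would typically use an external segmenter), and the values n = 3 and min_count = 2 for the word-frequency threshold are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple, Dict

def ngram_features(segmented_titles: Iterable[List[str]], n: int = 3,
                   min_count: int = 2) -> Dict[Tuple[str, ...], int]:
    """Count n-grams over all segmented titles and drop those below the threshold."""
    counts: Counter = Counter()
    for words in segmented_titles:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    # Features whose word frequency over all videos is below the threshold are deleted.
    return {gram: freq for gram, freq in counts.items() if freq >= min_count}
```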
204. The video classification apparatus converts the text data to be recognized into an input sample for a deep network, and inputs the input sample into the deep network to obtain a second classification result.
Because the length of the title data is limited, it contains few words and limited semantic information, so the video classification apparatus expands the semantic information of the text data to be recognized with a word embedding method to generate an input sample. The video classification apparatus then inputs the input sample into the deep network: the input sample passes sequentially through a convolutional layer (for feature extraction) and a pooling layer (for feature compression) to obtain a semantic representation vector; the video classification apparatus activates a preset part of the semantic representation vector through an attention mechanism to obtain a text representation vector; and the video classification apparatus performs softmax classification on the text representation vector to obtain the second classification result. Specifically, the deep network may include two sub-networks, one of which is shown in fig. 3: the video classification apparatus passes the text data to be recognized through an input layer (i.e., a word embedding layer) of the sub-network to obtain the input sample; the input sample is then convolved by the convolutional layer of the sub-network (feature extraction) and processed by the pooling layer (feature compression) to obtain the corresponding semantic representation vector; an attention module activates a specific part of the semantic representation vector to obtain the final representation vector of the input sample; finally, softmax classification yields the second classification result.
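A minimal PyTorch sketch of this text sub-network is given below. The embodiment fixes only the order of the stages (embedding, convolution, pooling, attention, softmax), so the layer sizes, the single convolution/pooling pair and the additive form of the attention are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextSubNetwork(nn.Module):
    """Embedding -> convolution (feature extraction) -> pooling (feature
    compression) -> attention -> softmax, following the order described above."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 channels: int = 64, num_classes: int = 20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # word embedding layer
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)                     # feature compression
        self.attention = nn.Linear(channels, 1)                     # simple additive attention
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:     # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)               # (batch, embed_dim, seq_len)
        x = F.relu(self.conv(x))                                    # feature extraction
        x = self.pool(x).transpose(1, 2)                            # semantic representation vectors
        weights = torch.softmax(self.attention(x), dim=1)           # activate a part of the vector
        text_vector = (weights * x).sum(dim=1)                      # text representation vector
        return torch.softmax(self.classifier(text_vector), dim=-1)  # second classification result
```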
205. The video classification apparatus inputs the feature data into the deep network to obtain a third classification result.
The video classification apparatus inputs the feature data extracted from the cover image into the deep network, passes it sequentially through m feed-forward neural networks, and then performs softmax classification to obtain the third classification result, where m is greater than or equal to 2. Specifically, the deep network may include two sub-networks, one of which is shown in fig. 4: the video classification apparatus passes the feature data sequentially through at least two feed-forward neural networks of the sub-network and finally performs softmax classification to obtain the third classification result.
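The image sub-network of fig. 4 can be sketched in the same style; two feed-forward layers are used here to satisfy m >= 2, and all layer widths and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageSubNetwork(nn.Module):
    """Pass the 2048-dimensional cover-image features through m >= 2 feed-forward
    networks and a final softmax to produce the third classification result."""

    def __init__(self, feature_dim: int = 2048, hidden_dim: int = 512,
                 num_classes: int = 20):
        super().__init__()
        self.feed_forward = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),   # feed-forward network 1
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),    # feed-forward network 2
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:      # (batch, 2048)
        return torch.softmax(self.classifier(self.feed_forward(features)), dim=-1)
```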
It should be understood that, in the embodiments of the present application, no temporal order is imposed among step 203, step 204 and step 205; the three steps may occur simultaneously or in any order, which is not limited here.
206. The video classification apparatus determines a target classification result according to the first classification result, the second classification result and the third classification result.
The video classification apparatus comprehensively considers the first classification result, the second classification result and the third classification result, and finally determines the target classification result.
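The embodiment does not specify how the three classification results are combined; one simple possibility, shown purely as an assumption, is a weighted combination of the three class-probability vectors followed by an argmax.

```python
import torch

def fuse_results(p_wide: torch.Tensor, p_text: torch.Tensor, p_image: torch.Tensor,
                 weights=(1.0, 1.0, 1.0)) -> int:
    """Combine the three class-probability vectors and return the target category index."""
    w1, w2, w3 = weights
    combined = w1 * p_wide + w2 * p_text + w3 * p_image   # weighted combination of the results
    return int(torch.argmax(combined))                    # index of the target classification result
```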
In the embodiments of the present application, the video classification apparatus extracts features from the title data and the cover image of the video to be classified, and then achieves cross-complementation of features by using a video classification model that combines a wide network and a deep network, thereby improving both the accuracy and the efficiency of video classification.
The video classification method in the embodiments of the present application has been described above; the video classification apparatus in the embodiments of the present application is described below.
Specifically, referring to fig. 5, an embodiment of a video classification apparatus in the embodiment of the present application includes:
an acquisition module 501, configured to acquire title data and a cover image of a video to be classified;
a processing module 502, configured to perform word segmentation on the title data to generate feature words, and input the feature words into a wide network model to obtain a first classification result; convert the title data into an input sample for a deep network; input the input sample into the deep network to obtain a second classification result; perform feature extraction on the cover image to obtain feature data; input the feature data into the deep network to obtain a third classification result; and determine a target classification result according to the first classification result, the second classification result and the third classification result.
Optionally, the processing module 502 is specifically configured to sequentially perform feature extraction and feature compression on the input sample to obtain a semantic representation vector; activate a preset part of the semantic representation vector through an attention mechanism to obtain a text representation vector; and perform softmax classification on the text representation vector to obtain the second classification result.
Optionally, the processing module 502 is specifically configured to pass the feature data sequentially through at least two feed-forward neural networks and then perform softmax classification to obtain the third classification result.
Optionally, the processing module 502 is specifically configured to remove non-text data from the title data to obtain text data; unify the format of the text data and generate text data to be recognized according to a preset length; and segment the text data to be recognized to generate n-gram features, which serve as the feature words.
Optionally, the processing module 502 is specifically configured to remove non-text data from the title data to obtain text data; unify the format of the text data and generate the text data to be recognized according to a preset length; and convert the text data to be recognized into the input sample for the deep network.
Optionally, the processing module 502 is specifically configured to perform word segmentation on the text data to be recognized to obtain continuous word strings, and statistically extract continuous word strings of preset length n to obtain the n-gram features.
Optionally, the processing module 502 is specifically configured to convert the text data to be recognized into the input sample for the deep network by using a word embedding method.
Optionally, the processing module 502 is specifically configured to resize the cover image to a preset value to obtain a cover image of a preset size, and extract image features from the cover image of the preset size to generate the feature data.
Referring to fig. 6, another embodiment of a video classification apparatus according to an embodiment of the present application includes:
a transceiver 601, a processor 602, a bus 603;
the transceiver 601 and the processor 602 are connected via the bus 603;
the bus 603 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The processor 602 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 602 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Referring to fig. 6, the video classification apparatus may further include a memory 604. The memory 604 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 604 may also comprise a combination of the above types of memory.
Optionally, the memory 604 may also be used to store program instructions; the processor 602 calls the program instructions stored in the memory 604 to perform one or more steps of the above embodiments or, in alternative embodiments, to implement the functions of the video classification apparatus in the above methods.
The processor 602 performs the following steps:
acquiring title data and a cover image of a video to be classified; performing word segmentation on the title data to generate feature words, and inputting the feature words into a wide network model to obtain a first classification result; converting the title data into an input sample for a deep network; inputting the input sample into the deep network to obtain a second classification result; performing feature extraction on the cover image to obtain feature data; inputting the feature data into the deep network to obtain a third classification result; and determining a target classification result according to the first classification result, the second classification result and the third classification result.
Optionally, the processor 602 specifically executes the following steps: sequentially performing feature extraction and feature compression on the input sample to obtain a semantic representation vector; activating a preset part of the semantic representation vector through an attention mechanism to obtain a text representation vector; and performing softmax classification on the text representation vector to obtain the second classification result.
Optionally, the processor 602 specifically executes the following steps: passing the feature data sequentially through at least two feed-forward neural networks and then performing softmax classification to obtain the third classification result.
Optionally, the processor 602 specifically executes the following steps: removing non-text data from the title data to obtain text data; unifying the format of the text data, and generating text data to be recognized according to a preset length; and segmenting the text data to be recognized to generate n-gram features, which serve as the feature words.
Optionally, the processor 602 specifically executes the following steps: removing non-text data from the title data to obtain text data; unifying the format of the text data, and generating the text data to be recognized according to a preset length; and converting the text data to be recognized into the input sample for the deep network.
Optionally, the processor 602 specifically executes the following steps: performing word segmentation on the text data to be recognized to obtain continuous word strings; and statistically extracting continuous word strings of preset length n to obtain the n-gram features.
Optionally, the processor 602 specifically executes the following steps: converting the text data to be recognized into the input sample for the deep network by using a word embedding method.
Optionally, the processor 602 specifically executes the following steps: resizing the cover image to a preset value to obtain a cover image of a preset size; and extracting image features from the cover image of the preset size to generate the feature data.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A method of video classification, comprising:
acquiring title data and a cover image of a video to be classified;
performing word segmentation on the title data to generate feature words, and inputting the feature words into a wide network model to obtain a first classification result;
converting the title data into an input sample for a deep network;
inputting the input sample into the deep network to obtain a second classification result;
performing feature extraction on the cover image to obtain feature data;
inputting the feature data into the deep network to obtain a third classification result;
and determining a target classification result according to the first classification result, the second classification result and the third classification result.
2. The method of claim 1, wherein inputting the input sample into the deep network to obtain a second classification result comprises:
sequentially performing feature extraction and feature compression on the input sample to obtain a semantic representation vector;
activating a preset part in the semantic representation vector through an attention mechanism so as to obtain a text representation vector;
and performing softmax classification on the text representation vector to obtain the second classification result.
3. The method of claim 1, wherein inputting the feature data into the deep network to obtain a third classification result comprises:
and sequentially passing the characteristic data through at least two feedforward neural networks and then carrying out softmax classification to obtain the third classification result.
4. The method according to any one of claims 1 to 3, wherein performing word segmentation on the title data to generate feature words comprises:
removing non-text data from the title data to obtain text data;
unifying the format of the text data, and generating text data to be recognized according to a preset length; segmenting the text data to be recognized to generate n-gram features, wherein the n-gram features serve as the feature words;
the converting the title data into an input sample for a deep network comprises:
removing non-text data from the title data to obtain text data;
unifying the format of the text data, and generating the text data to be recognized according to a preset length; and converting the text data to be recognized into the input sample for the deep network.
5. The method of claim 4, wherein the segmenting the text data to be recognized to generate n-gram features comprises:
performing word segmentation on the text data to be recognized to obtain continuous word strings;
and statistically extracting continuous word strings of preset length n to obtain the n-gram features.
6. The method of claim 4, wherein converting the text data to be recognized into the input sample for the deep network comprises:
converting the text data to be recognized into the input sample for the deep network by using a word embedding method.
7. A video classification apparatus, comprising:
an acquisition module, configured to acquire title data and a cover image of a video to be classified;
a processing module, configured to perform word segmentation on the title data to generate feature words, and input the feature words into a wide network model to obtain a first classification result; convert the title data into an input sample for a deep network; input the input sample into the deep network to obtain a second classification result; perform feature extraction on the cover image to obtain feature data; input the feature data into the deep network to obtain a third classification result; and determine a target classification result according to the first classification result, the second classification result and the third classification result.
8. A video classification apparatus, comprising: a processor and a memory, wherein the memory has a computer readable program stored therein, and the processor is configured to execute the program in the memory to perform the method of any of claims 1 to 6.
9. A computer-readable storage medium having stored thereon computer instructions for performing the method of any one of claims 1-6.
CN201910516934.3A 2019-06-14 2019-06-14 Video classification method and device Active CN110222234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910516934.3A CN110222234B (en) 2019-06-14 2019-06-14 Video classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910516934.3A CN110222234B (en) 2019-06-14 2019-06-14 Video classification method and device

Publications (2)

Publication Number Publication Date
CN110222234A CN110222234A (en) 2019-09-10
CN110222234B true CN110222234B (en) 2021-07-23

Family

ID=67817179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910516934.3A Active CN110222234B (en) 2019-06-14 2019-06-14 Video classification method and device

Country Status (1)

Country Link
CN (1) CN110222234B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125435B (en) * 2019-12-17 2023-08-11 北京百度网讯科技有限公司 Video tag determination method and device and computer equipment
CN112100438A (en) * 2020-09-21 2020-12-18 腾讯科技(深圳)有限公司 Label extraction method and device and computer readable storage medium
CN112418215A (en) * 2020-11-17 2021-02-26 峰米(北京)科技有限公司 Video classification identification method and device, storage medium and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120278336A1 (en) * 2011-04-29 2012-11-01 Malik Hassan H Representing information from documents
US8311973B1 (en) * 2011-09-24 2012-11-13 Zadeh Lotfi A Methods and systems for applications for Z-numbers
CN107679227A (en) * 2017-10-23 2018-02-09 柴建华 Video index label setting method, device and server
CN108924385B (en) * 2018-06-27 2020-11-03 华东理工大学 Video de-jittering method based on width learning
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN111428088B (en) * 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 Video classification method and device and server

Also Published As

Publication number Publication date
CN110222234A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN108874776B (en) Junk text recognition method and device
CN109446404B (en) Method and device for analyzing emotion polarity of network public sentiment
CN110222234B (en) Video classification method and device
CN109726657B (en) Deep learning scene text sequence recognition method
CN104735468B (en) A kind of method and system that image is synthesized to new video based on semantic analysis
CN111274394A (en) Method, device and equipment for extracting entity relationship and storage medium
CN111475622A (en) Text classification method, device, terminal and storage medium
CN105574156B (en) Text Clustering Method, device and calculating equipment
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN114596566B (en) Text recognition method and related device
CN110298041B (en) Junk text filtering method and device, electronic equipment and storage medium
CN108536868B (en) Data processing method and device for short text data on social network
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
Zhai et al. Chinese image text recognition with BLSTM-CTC: a segmentation-free method
CN115130038A (en) Webpage classification method and device
CN113255331B (en) Text error correction method, device and storage medium
CN114881169A (en) Self-supervised contrast learning using random feature corruption
WO2013097072A1 (en) Method and apparatus for recognizing a character of a video
US9672819B2 (en) Linguistic model database for linguistic recognition, linguistic recognition device and linguistic recognition method, and linguistic recognition system
Kawade et al. Content-based SMS spam filtering using machine learning technique
CN114565751A (en) OCR recognition model training method, OCR recognition method and related device
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system
KR102331440B1 (en) System for text recognition using neural network and its method
CN109947947B (en) Text classification method and device and computer readable storage medium
CN110020120A (en) Feature word treatment method, device and storage medium in content delivery system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant