CN114648713A - Video classification method and device, electronic equipment and computer-readable storage medium - Google Patents

Video classification method and device, electronic equipment and computer-readable storage medium Download PDF

Info

Publication number
CN114648713A
Authority
CN
China
Prior art keywords
video
video frame
classification
category
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011509800.8A
Other languages
Chinese (zh)
Inventor
毛永波
孙文胜
韦晓全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202011509800.8A priority Critical patent/CN114648713A/en
Publication of CN114648713A publication Critical patent/CN114648713A/en
Pending legal-status Critical Current

Links

Images

Abstract

The embodiment of the disclosure discloses a video classification method, a video classification device, electronic equipment and a computer-readable storage medium. The video classification method comprises the following steps: acquiring a plurality of video frames of a video to be classified; classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame; identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame; identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame; and determining a classification result of the video to be classified according to the first classification vector. The method solves the technical problem of low recall rate of video classification by combining the object external video frame and the object internal video frame.

Description

Video classification method and device, electronic equipment and computer-readable storage medium
Technical Field
The present disclosure relates to the field of video classification, and in particular, to a video classification method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of the mobile internet, the short-video industry has risen rapidly; its advantages of fast propagation, a low production threshold, and strong social attributes have made it popular with a large number of users and creators. In order to recommend related content to a user more accurately, each video needs to be labeled with a category; for example, a car video needs to be labeled with the car series it describes. For car-series classification of videos in a user-generated content (UGC) scenario, the conventional technical scheme is to extract a video frame, use a detection model to extract the target area with the largest area ratio in the picture, and return the car series of that target area.
For the problem of identifying multiple car-series labels in a video, the prior art mainly has the following defects: 1. automobile video samples contain a large amount of content introducing the car interior, while the existing scheme identifies the car series only through the exterior appearance, so the overall recall rate is not high; 2. multiple vehicles sometimes appear in a single frame, and the existing scheme returns only the target with the largest area, which may cause misjudgment; 3. the final result depends only on a single frame, so if that frame is misrecognized, the accuracy drops sharply.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In order to solve the above technical problem, the embodiments of the present disclosure propose the following technical solutions.
In a first aspect, an embodiment of the present disclosure provides a video classification method, including:
acquiring a plurality of video frames of a video to be classified;
classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame;
identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame;
identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
and determining a classification result of the video to be classified according to the first classification vector.
In a second aspect, an embodiment of the present disclosure provides a video classification apparatus, including:
the video frame acquisition module is used for acquiring a plurality of video frames of the video to be classified;
a first classification module, configured to classify the plurality of video frames to obtain a first class of the plurality of video frames, where the first class includes an object external video frame and an object internal video frame;
the external first classification vector acquisition module is used for identifying the object according to the external video frame of the object to obtain a first classification vector of the external video frame of the object;
the internal first classification vector acquisition module is used for identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
and the classification result determining module is used for determining the classification result of the video to be classified according to the first classification vector.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding first aspects.
In a fourth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method of any one of the foregoing first aspects.
The embodiment of the disclosure discloses a video classification method, a video classification device, electronic equipment and a computer-readable storage medium. The video classification method comprises the following steps: acquiring a plurality of video frames of a video to be classified; classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame; identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame; identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame; and determining a classification result of the video to be classified according to the first classification vector. The method solves the technical problem of low recall rate of video classification by combining the object external video frame and the object internal video frame.
The foregoing is a summary of the present disclosure, provided to promote a clear understanding of its technical means; the present disclosure may be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flowchart of a video classification method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a video classification method provided in an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of a video classification method according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart of a video classification method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a video classification method according to an embodiment of the disclosure;
Fig. 6 is a schematic flow chart of a video classification method according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart of a video classification method according to an embodiment of the present disclosure;
fig. 8 is a schematic view of an application scenario of a video classification method according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an embodiment of a video classification apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart of an embodiment of a video classification method provided in this disclosure, where the video classification method provided in this embodiment may be executed by a video classification apparatus, and the video classification apparatus may be implemented as software, or implemented as a combination of software and hardware, and the video classification apparatus may be integrated in some device in a video classification system, such as a video classification server or a video classification terminal device. As shown in fig. 1, the method comprises the steps of:
step S101, acquiring a plurality of video frames of a video to be classified;
the video to be classified can be any type of video. And if the classification is the type of the object to be classified in the video to be classified, if the classification is the type of the automobile, the classification indicates that the automobile system of the automobile is included in the video to be classified, or the content of the video to be classified is related to the automobile system.
It can be understood that the video to be classified may be read from a preset location, or received through a preset interface, for example, the video to be classified is read from a preset storage location or a network location, or the video to be classified uploaded by a user is received through a human-computer interaction interface, and so on, which is not described herein again.
Optionally, the obtaining of the plurality of video frames of the video to be classified includes: performing frame extraction on the video to be classified according to a frame extraction frequency to obtain the plurality of video frames. Exemplarily, the video to be classified is decimated at a frequency of 2 fps (frames per second) to obtain a video frame sequence I = {I_1, I_2, ..., I_n}, where n denotes the number of video frames.
Optionally, in order to facilitate the subsequent identification of the object in the video frames, this step may further include preprocessing. Illustratively, the video frames are normalized so that their height and width are M and N, respectively, to facilitate subsequent processing. It is understood that the preprocessing may include any preprocessing method, which is not described herein again.
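As a concrete illustration of this step, the following is a minimal Python sketch of frame extraction and normalization, assuming OpenCV is available; the 2 fps rate follows the example above, while the function name and the M = N = 224 defaults are illustrative assumptions rather than part of the disclosure.

```python
import cv2

def extract_frames(video_path, fps_out=2.0, M=224, N=224):
    """Decimate a video to fps_out frames per second and resize each frame to N x M."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    step = max(int(round(src_fps / fps_out)), 1)  # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, (N, M)))  # dsize is (width, height)
        idx += 1
    cap.release()
    return frames  # the video frame sequence I = {I_1, ..., I_n}
```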
Returning to fig. 1, the video classification method further includes, in step S102, classifying the plurality of video frames to obtain a first category of the plurality of video frames, where the first category includes an object external video frame and an object internal video frame;
The first category classifies the video frames into object external video frames, which include external features of the object, and object internal video frames, which include internal features of the object. For example, if the object is an automobile, this step divides the extracted video frames into automobile appearance video frames and automobile interior video frames; the automobile interior video frames may further include automobile central control video frames, because the central control area of the automobile interior can more accurately reflect the car series of an automobile.
In order to prevent false recognition caused by too large proportion of non-target objects when multiple objects appear in the video frame, optionally, as shown in fig. 2, the step S102 further includes:
step S201, carrying out target detection on the video frame to obtain at least one target detection frame;
step S202, calculating the comprehensive confidence of the target frame according to the confidence of the target detection frame, the distance between the target detection frame and the center point of the video frame and the proportion of the area of the target detection frame in the video frame;
step S203, the category corresponding to the target detection box with the maximum integrated confidence is used as the first category of the video frame.
Step S201 may be performed by a pre-trained target detection model. The target detection model detects two types of content in a video frame, namely the object exterior and the object interior. In particular, if the detection result of a certain video frame is null, the frame does not include the object and is therefore discarded. When the detection result of a frame is not null, the target detection model outputs at least one target detection box; exemplarily, the output of the target detection model is represented as B = {B_1, B_2, ..., B_m}, where m denotes the number of target detection boxes obtained in the frame. The k-th target detection box is defined as B_k = [x_k, y_k, w_k, h_k, c_k, s_k], where x_k and y_k respectively represent the abscissa and ordinate of the upper-left corner of the target detection box, w_k and h_k respectively represent the width and height of the target detection box, c_k represents the first category, and s_k represents the confidence of the target detection box.
Further, in step S202, the comprehensive confidence is calculated using the following formula (1):

S_k = s_k * d_k * a_k    (1)

where d_k ∈ (0, 1] represents the position score of the target detection box; the closer the center point of the target detection box is to the center point of the video frame, the higher the score. Illustratively, d_k is calculated according to the following formula (2):

d_k = max(|x_k + w_k/2 - N/2|, |y_k + h_k/2 - M/2|)    (2)

and a_k ∈ (0, 1] represents the ratio of the area of the target detection box to that of the video frame. Illustratively, a_k is calculated according to the following formula (3):

a_k = w_k * h_k / (M * N)    (3)
in step S202, a comprehensive confidence of each target detection box output by the model is calculated to obtain at least one comprehensive confidence.
In step S203, the at least one integrated confidence level is sorted according to size, and a category corresponding to the target detection frame with the highest integrated confidence level is used as the first category of the video frame.
In the above steps S201 to S203, by adding the position score and the area ratio during the target detection, the weight of the target detection frame having a larger area near the middle position of the video frame is made larger, so that when a plurality of target detection frames are detected, the video frame can be classified more accurately.
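The following sketch illustrates steps S201 to S203, implementing equations (1) to (3) exactly as printed; the box tuple format follows the B_k definition above, while the detector itself and the function name are assumptions for illustration.

```python
def first_category(boxes, M, N):
    """Pick the first category of a frame from its target detection boxes.

    Each box is (x, y, w, h, c, s): upper-left corner, size, category, confidence.
    Returns None when the detector found nothing, so the frame can be discarded.
    """
    if not boxes:
        return None
    best_c, best_S = None, float("-inf")
    for (x, y, w, h, c, s) in boxes:
        d = max(abs(x + w / 2 - N / 2), abs(y + h / 2 - M / 2))  # position term, eq. (2) as printed
        a = (w * h) / (M * N)                                    # area ratio, eq. (3)
        S = s * d * a                                            # comprehensive confidence, eq. (1)
        if S > best_S:
            best_c, best_S = c, S
    return best_c  # category of the box with the maximum comprehensive confidence
```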
Returning to fig. 1, the video classification method further includes, in step S103, identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame; and step S104, identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame.
Optionally, in step S103, the video frames classified as object external video frames in step S102 are input into a pre-trained object external classification model, which classifies video frames by the exterior of the object; for example, the appearance video frame of an automobile is classified to obtain the car series corresponding to the video frame.
Optionally, in step S104, the video frames classified as object internal video frames in step S102 are input into a pre-trained object internal classification model, which classifies video frames by the interior of the object; for example, the central control video frame of an automobile is classified to obtain the car series corresponding to the video frame.
It is understood that, for faster processing speed, the video frame input into the object external classification model or the object internal classification model may be the image within the object detection frame obtained in step S102.
Each first element in the first classification vector corresponds to a second category, and the value of the first element represents the confidence that the video frame is of the second category corresponding to that first element. Illustratively, the first classification vector is a normalized one-dimensional vector output by the object external classification model or the object internal classification model. Let V_i denote the first classification vector of the i-th video frame; then:

V_i = {V_i1, V_i2, ..., V_ic}

where V_ij, j ∈ [1, c], is a first element of the first classification vector and c represents the number of second categories; that is, V_ij represents the confidence that the i-th frame belongs to the j-th category, and V_i1 + V_i2 + ... + V_ic = 1.
Through the steps S103 and S104, the object external video frame and the object internal video frame are respectively identified and classified to obtain the first classification vector, and the identification of the object internal video frame is added on the basis of the object external video frame, so that the overall recall rate of the classification of the video is improved.
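A minimal sketch of steps S103 and S104 follows: each frame crop is routed to the exterior or interior classifier and a normalized first classification vector V_i is collected. The two model objects, the routing labels, and the assumption that each model returns c raw logits are all illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def classify_frames(routed_frames, exterior_model, interior_model):
    """Return one normalized first classification vector per frame (sums to 1)."""
    vectors = []
    for crop, first_cat in routed_frames:  # crop: image inside the target detection box
        model = exterior_model if first_cat == "exterior" else interior_model
        logits = model(crop)               # c scores, one per second category
        vectors.append(softmax(logits))    # V_i = {V_i1, ..., V_ic}
    return vectors
```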
Returning to fig. 1, the video classification method further includes, in step S105, determining a classification result of the video to be classified according to the first classification vector.
In steps S103 and S104, the video frames of the video to be classified are classified and identified to obtain at least one first classification vector (in general, a plurality of first classification vectors). In step S105, the plurality of first classification vectors are combined to determine the classification result of the video to be classified, that is, the second category of the video to be classified.
Optionally, as shown in fig. 3, the step S105 further includes:
step S301, classifying the video frame corresponding to the first classification vector into at least one second class according to the first classification vector; wherein each second category corresponds to a set of video frames;
step S302, calculating the confidence coefficient of the video to be classified as the second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
step S303, determining the classification result of the video to be classified according to the confidence of the second category.
For each object external video frame and object internal video frame obtained in step S102, a corresponding first classification vector is obtained in step S103 or step S104, and the video frame can be classified into at least one second category by this first classification vector; for example, the second category of the video frame is determined by the value of the largest first element in the first classification vector. In order to make the classification more accurate, a clustering threshold may be preset and the video frames classified according to it. Optionally, step S301 further includes:
step S401, acquiring a clustering threshold;
step S402, comparing the value of each first element in the first classification vector with the size of the clustering threshold;
step S403, in response to that the value of the first element is greater than the clustering threshold, classifying the video frame corresponding to the first classification vector into a second category corresponding to the first element.
Let the clustering threshold be th_1, with th_1 ∈ [0, 1]. Steps S402 and S403 can be realized by the following formula (4):

C_ij = 1, if V_ij > th_1;  C_ij = 0, otherwise    (4)

where C_ij indicates whether the i-th frame belongs to the j-th category. Through steps S402 and S403, the video frames obtained in step S103 or step S104 are classified into one or more second categories. It is understood that in the above steps a certain frame may not belong to any second category, in which case the video frame may be discarded.
Through the steps, the video frame corresponding to each of the c second categories can be obtained, that is, each second category corresponds to a video frame set, and the video frame set includes the object external video frame and/or the object internal video frame.
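The grouping of formula (4) can be sketched as follows, assuming each first classification vector is indexable; the default th_1 = 0.5 is an illustrative assumption.

```python
from collections import defaultdict

def group_by_second_category(vectors, th1=0.5):
    """Return {category index j: list of frame indices i with V_ij > th1}."""
    frame_sets = defaultdict(list)
    for i, V in enumerate(vectors):
        for j, v in enumerate(V):
            if v > th1:              # C_ij = 1, formula (4)
                frame_sets[j].append(i)
    return frame_sets                # frames matching no category are implicitly discarded
```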
Returning to fig. 3, in step S302, a comprehensive first classification vector of each second category is first calculated. Optionally, the comprehensive first classification vector is calculated by summing the first classification vectors of the video frames in the video frame set corresponding to the second category and then normalizing the sum. Taking the second category of class l as an example, the comprehensive first classification vector is calculated by the following formula (5):

Vl_result = f( Σ_{q∈A} V_q )    (5)

where A denotes the set of video frames belonging to class l, V_q represents the first classification vector of the q-th frame, and f() is Softmax(). From this, the comprehensive first classification vector Vl_result is obtained, and the value Vl_result_l of the l-th element in Vl_result is the confidence that the video to be classified is of the l-th second category.
For each second category, the confidence level of the second category may be calculated, and then in step S303, the second category corresponding to the maximum confidence level of the second category may be determined as the classification result of the video to be classified.
In order to reduce the amount of calculation, before step S302 the method further includes: filtering the second categories. Optionally, the filtering includes filtering out second categories with few video frames. Optionally, a quantity threshold parameter th_2 is set, with th_2 ∈ [0, 1], from which the quantity threshold n * th_2 is derived. Second categories whose corresponding video frame sets contain fewer video frames than this quantity threshold are filtered out, and the remaining second categories participate in the subsequent steps of calculating the comprehensive first classification vector and determining the confidence of the second category. Thereby, clearly incorrect second categories can be filtered out to reduce the amount of computation when determining the confidence of the second categories.
In step S303, each second category participating in the calculation of the comprehensive first classification vector obtains a corresponding Vl_result_l; therefore, in this step, all Vl_result_l values can be compared, and the second category with the largest Vl_result_l is used as the classification result of the video to be classified.

Alternatively, when the second category with the largest Vl_result_l is used as the classification result, it is possible that none of the Vl_result_l values is large, making the classification result inaccurate. Therefore, for the accuracy of the final classification result, step S303 further includes:
obtaining a classification threshold value;
comparing the confidence level of the second class to the classification threshold;
and determining the second category with the confidence coefficient larger than the classification threshold value as the classification result of the video to be classified.
The classification threshold is a preset threshold th_3. When Vl_result_l > th_3, the classification result of the video to be classified is determined to be class l. In order to prevent the video to be classified from being classified into a plurality of second categories, th_3 can be set to a larger value, e.g., th_3 = 0.8.
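Steps S302 and S303, together with the optional filters, can be sketched as below: small frame sets are dropped against the quantity threshold n * th_2, the summed vectors are Softmax-normalized per formula (5), and a category is accepted only when its confidence exceeds th_3. The th_2 default is an illustrative assumption; th_3 = 0.8 follows the example above.

```python
import numpy as np

def classify_video(vectors, frame_sets, n_frames, th2=0.1, th3=0.8):
    """Return the second categories whose confidence exceeds th_3."""
    results = {}
    for j, frame_ids in frame_sets.items():
        if len(frame_ids) < n_frames * th2:    # quantity-threshold filter, n * th_2
            continue
        summed = np.sum([vectors[q] for q in frame_ids], axis=0)
        e = np.exp(summed - summed.max())      # f() = Softmax, formula (5)
        results[j] = (e / e.sum())[j]          # Vl_result_l for category j
    return [j for j, conf in results.items() if conf > th3]
```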
Through the steps S101-S105, when videos are classified, besides the external features of the object, the internal features of the object are added, so that the classification of the videos including the inside and the outside of the object is more accurate, the technical problem of low recall rate caused by only classifying the videos through the video frames outside the object is solved, and the problem of low accuracy rate caused by single-frame misrecognition is also solved; in addition, when the target identification is carried out on the single frame, the position and area information of the target detection frame is combined, so that the misjudgment is reduced.
Further, in some scenes, objects in some videos carry distinctive identifiers that allow the video to be classified accurately; for example, in car-series identification, the brand of an automobile can be identified from its logo, or a specific model within the brand can be identified from the badging on the tail of the automobile. Thus, when an object identifier is recognized in a frame, the weight of that frame can be increased when calculating the comprehensive first classification vector. Therefore, optionally, as shown in fig. 5, the video classification method further includes:
step S501, identifying the identification of the object according to the external video frame of the object to obtain the identification confidence of the identification of the object in the external video frame of the object;
step S502, calculating a weight value of the external video frame of the object according to the identification confidence;
at this time, the step S302 further includes:
step S503, according to the weight value of the object external video frame, carrying out weighted calculation on a first classification vector of the object external video frame in a video frame set to obtain a weighted first classification vector;
step S504, calculating a confidence of the video to be classified as a second category corresponding to the video frame set according to the weighted first classification vector.
In step S501, the object external video frame may be input into a pre-trained detection model for the object identifier, to identify whether the object external video frame includes the identifier of the object. The output of the object identifier detection model is similar to the output of the model in step S102: one or more target detection boxes are output, each represented by the abscissa and ordinate of its upper-left corner, its width and height, its category, and its confidence. The identification confidences of these target detection boxes can then be compared to obtain the maximum identification confidence.
Optionally, as shown in fig. 6, in step S502, calculating a weight value of the video frame outside the object according to the identification confidence includes:
step S601, acquiring a weight threshold;
step S602, when the identification confidence is greater than or equal to the weight threshold, calculating a weight value of the external video frame of the object according to the weight threshold and the identification confidence;
step S603, when the identification confidence is smaller than the weight threshold, setting a preset weight value as the weight value of the external video frame of the object; wherein the preset weight value is less than or equal to the weight value calculated in step S602.
Illustratively, a weight threshold th_4 is preset. Let s_logo_q denote the identification confidence of the object external video frame obtained in step S501, that is, the confidence that the identifier of the object exists in the q-th frame. The weight value of the object external video frame can then be calculated by the following formula (6):

w_q = value calculated from th_4 and s_logo_q, if s_logo_q ≥ th_4;  w_q = preset weight value, if s_logo_q < th_4    (6)
in step S503, a weighted first classification vector is obtained by performing a weighted calculation on the first classification vector of the external video frame of the object in the video frame set. As described in the above embodiment, the video frame q in the video frame set a with category i weights the first classification vector as: w is aq*Vq
In step S504, a comprehensive weighted first classification vector is first obtained from the weighted first classification vectors of all the video frames in the set A; exemplarily, it is obtained by the following formula (7):

Vl_result = f( Σ_{q∈A} w_q * V_q )    (7)
the definition of the parameters is the same as that in formula (5), and is not described herein again. Thus, the confidence of the second of the final l classes is: vl _ resultl. Then, the process of determining the classification result of the video to be classified refers to the description in step S105, which is not repeated herein.
In the above further embodiment, the video frames containing the object identifier are weighted when calculating the comprehensive first classification vector, so that their weights become larger; the greater the identification confidence of a frame, the larger the value of its weight w_q, and the larger the proportion of its first classification vector in the comprehensive weighted calculation. This enhances the robustness of the final classification result and makes the classification more accurate.
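A sketch of this logo-weighted variant follows. Because the exact expression of equation (6) is not reproduced here, the weight rule below (boost by the logo confidence above th_4, preset weight 1.0 otherwise) is an assumption that merely matches the described behavior: frames with higher identification confidence receive larger w_q. The th_4 default is also illustrative.

```python
import numpy as np

def frame_weight(s_logo, th4=0.6):
    """Assumed realization of equation (6): boost frames whose logo confidence
    reaches the weight threshold th_4; otherwise use a preset weight of 1.0."""
    return 1.0 + s_logo if s_logo >= th4 else 1.0

def weighted_category_confidence(vectors, logo_confs, frame_ids, j, th4=0.6):
    """Confidence of second category j via the weighted sum of formula (7)."""
    summed = np.sum([frame_weight(logo_confs[q], th4) * vectors[q]
                     for q in frame_ids], axis=0)
    e = np.exp(summed - summed.max())   # f() = Softmax
    return (e / e.sum())[j]             # Vl_result_l
```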
In the above embodiments, the classification process is related to the features of the video frames; for the video uploaded by the user, the video usually comprises a title, and the title often comprises characters or keywords closely related to the content, so that the classification result can be strengthened through the title of the video to be classified. As shown in fig. 7, further, the video classification method further includes:
step S701, acquiring the title of the video to be classified;
step S702, calculating a first coefficient according to the title, wherein the value of the first coefficient is related to the number of times the name of a second category appears in the title;
at this time, the step S302 further includes:
step S703, calculating a first confidence of the video to be classified as a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
step S704, calculating a confidence level of the video to be classified as the second category corresponding to the video frame set according to the first coefficient and the first confidence level of the second category.
In step S702, performing operations such as word segmentation and keyword extraction on the titles of the videos to be classified to obtain characters or keywords matching the names of the second category, where the number of the characters or keywords may be any number. And calculating a first coefficient of the second category corresponding to the video frame set according to the number of the characters or the keywords matched with the name of the second category. Optionally, the step S702 further includes: matching the title of the video to be classified with the second category to obtain the times of the name of each second category in the title; and calculating a first coefficient of each second category according to the times, wherein the first coefficient is in inverse relation with the times. Illustratively, the first coefficient is calculated according to the following equation (8):
γ = 2 / (e^t + 1)    (8)
where t represents the number of times a character or keyword matching the name of the second category appears in the title, t ≥ 0, and t is an integer. The larger the value of t, the smaller the value of γ.
In step S703, a first confidence of the second category is calculated; this may be the confidence of the second category calculated in step S302 or in step S504. That is, the first confidence of the second category in step S703 is the Vl_result_l calculated by formula (5) or formula (7).
In step S704, the confidence that the video to be classified is of the second category corresponding to the video frame set is calculated according to the first coefficient and the first confidence of the second category. Optionally, the confidence of the second category is calculated by raising the first confidence of the second category to the power of the first coefficient. Illustratively, the confidence of the l-th second category is calculated according to: S_result_l = (Vl_result_l)^γ. The process of determining the classification result of the video to be classified then follows the description in step S105, which is not repeated here. The value γ of the first coefficient becomes smaller as t becomes larger; since Vl_result_l is a value less than 1, (Vl_result_l)^γ becomes larger as γ becomes smaller, that is, as t becomes larger. In other words, the more times a character or keyword matching the second category appears in the title, the higher the confidence of the corresponding second category.
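Formula (8) and the exponent step can be sketched directly; substring counting stands in for the word-segmentation and keyword-matching described above and is a simplification.

```python
import math

def title_coefficient(title, category_name):
    """gamma = 2 / (e^t + 1), formula (8); t counts occurrences of the
    second-category name in the title (substring matching is a simplification)."""
    t = title.count(category_name)
    return 2.0 / (math.exp(t) + 1.0)   # t = 0 gives gamma = 1; gamma shrinks as t grows

def title_adjusted_confidence(first_confidence, gamma):
    """S_result_l = (Vl_result_l)^gamma; a confidence below 1 grows as gamma shrinks."""
    return first_confidence ** gamma
```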
In the above optional embodiment, the classification result is influenced by the header, so that the classification result can be strengthened by the information contained in the header, the robustness of the final classification result is strengthened, and the classification is more accurate.
Fig. 8 shows an application scenario of the video classification method of the above embodiments: car-series classification of videos, which classifies a video by the car appearance frames and car central control frames in the video. As shown in fig. 8, first, video information including a video and its title is acquired. The video undergoes frame extraction and preprocessing, while the title is segmented into words, the number of times each car-series name appears in the title is counted, and the first coefficient is calculated. The preprocessed video frames are classified into appearance frames and central control frames; the appearance frames are input into an appearance recognition model for feature extraction and classification to obtain their first classification vectors, and the central control frames are input into a central control recognition model for feature extraction and classification to obtain their first classification vectors. Meanwhile, the appearance frames are input into a car logo detection module, which outputs the car logo confidence for each appearance frame. When the video classification result is calculated, the car logo confidence, the first classification vectors, and the first coefficient are used as calculation factors to obtain the final classification result. In this application scenario, the appearance features and the central control features of the automobile in the video are combined, and the classification result is strengthened by the car logo features and the features in the title, which increases both the recall rate and the accuracy of video classification.
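To tie fig. 8 together, the following sketch composes the earlier sketches into one pipeline. The router argument stands for the frame classification of steps S102/S201 to S203 (see first_category above) and, like the model objects and car_series_names mapping, is an assumed interface rather than the patent's implementation.

```python
def classify_car_video(video_path, title, router, appearance_model,
                       control_model, logo_detector, car_series_names, th3=0.8):
    frames = extract_frames(video_path)                   # frame extraction + preprocessing
    # router(frame) returns (crop, "exterior" | "interior"), or None to discard the frame
    routed = [r for r in (router(f) for f in frames) if r is not None]
    vectors = classify_frames(routed, appearance_model, control_model)
    logo_confs = [logo_detector(crop) if cat == "exterior" else 0.0
                  for crop, cat in routed]                # car-logo confidence per frame
    results = {}
    for j, ids in group_by_second_category(vectors).items():
        conf = weighted_category_confidence(vectors, logo_confs, ids, j)
        gamma = title_coefficient(title, car_series_names[j])
        results[j] = conf ** gamma                        # title-strengthened confidence
    return [j for j, conf in results.items() if conf > th3]
```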
The embodiment of the disclosure discloses a video classification method, which comprises the following steps: acquiring a plurality of video frames of a video to be classified; classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame; identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame; identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame; and determining a classification result of the video to be classified according to the first classification vector. The method solves the technical problem of low recall rate of video classification by combining the object external video frame and the object internal video frame.
In the above, although the steps in the above method embodiments are described in the above sequence, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in the above sequence, and they may also be performed in other sequences such as reverse, parallel, and cross, and other sequences may also be added on the basis of the above steps, and these obvious modifications or equivalents should also be included in the protection scope of the present disclosure, and are not described herein again.
Fig. 9 is a schematic structural diagram of an embodiment of a video classification apparatus according to an embodiment of the present disclosure. As shown in fig. 9, the apparatus 900 includes: a video frame obtaining module 901, a first classification module 902, an outer first classification vector obtaining module 903, an inner first classification vector obtaining module 904, and a classification result determining module 905.
Wherein:
a video frame acquiring module 901, configured to acquire a plurality of video frames of a video to be classified;
a first classification module 902, configured to classify the plurality of video frames to obtain a first category of the plurality of video frames, where the first category includes an object external video frame and an object internal video frame;
an external first classification vector obtaining module 903, configured to identify the object according to the object external video frame to obtain a first classification vector of the object external video frame;
an internal first classification vector obtaining module 904, configured to identify the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
a classification result determining module 905, configured to determine a classification result of the video to be classified according to the first classification vector.
Further, the first classification module 902 is further configured to:
performing target detection on the video frame to obtain at least one target detection frame;
calculating the comprehensive confidence of the target frame according to the confidence of the target detection frame, the distance between the target detection frame and the center point of the video frame and the occupation ratio of the area of the target detection frame in the video frame;
and taking the category corresponding to the target detection box with the maximum comprehensive confidence as the first category of the video frame.
Furthermore, each first element in the first classification vector corresponds to a second class, and a value of the first element represents a confidence level that the video frame is of the second class corresponding to the first element.
Further, the classification result determining module 905 is further configured to:
classifying the video frame corresponding to the first classification vector into at least one second class according to the first classification vector; wherein each second category corresponds to a set of video frames;
calculating the confidence coefficient of the video to be classified into a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and determining the classification result of the video to be classified according to the confidence coefficient of the second category.
Further, the classification result determining module 905 is further configured to:
acquiring a clustering threshold value;
comparing the value of each first element in the first classification vector to the magnitude of the clustering threshold;
in response to the value of the first element being greater than the clustering threshold, classifying the video frame corresponding to the first classification vector into a second category corresponding to the first element.
Further, the video classification apparatus further includes:
the identification confidence determining module is used for identifying the identification of the object according to the external video frame of the object to obtain the identification confidence of the identification of the object in the external video frame of the object;
the weight calculation module is used for calculating the weight value of the external video frame of the object according to the identification confidence;
wherein, the classification result determining module 905 is further configured to:
according to the weight value of the object external video frame, carrying out weighted calculation on a first classification vector of the object external video frame in a video frame set to obtain a weighted first classification vector;
and calculating the confidence coefficient of the video to be classified as the second category corresponding to the video frame set according to the weighted first classification vector.
Further, the weight calculation module is further configured to:
acquiring a weight threshold;
when the identification confidence is greater than or equal to the weight threshold, calculating a weight value of the external video frame of the object according to the weight threshold and the identification confidence;
when the identification confidence is smaller than the weight threshold, setting a preset weight value as the weight value of the external video frame of the object; wherein the preset weight value is less than or equal to the calculated weight value.
Further, the video classification apparatus further includes:
the title acquisition module is used for acquiring the title of the video to be classified;
a first coefficient calculation module for calculating a first coefficient according to the title, a value of the first coefficient being related to a number of times the name of the second category appears in the title;
wherein, the classification result determining module 905 is further configured to:
calculating a first confidence coefficient of the video to be classified into a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and calculating the confidence coefficient of the video to be classified into the second category corresponding to the video frame set according to the first coefficient and the first confidence coefficient of the second category.
Further, the first coefficient calculation module is further configured to:
matching the title of the video to be classified with the second category to obtain the times of the name of each second category in the title;
and calculating a first coefficient of each second category according to the times, wherein the first coefficient is in inverse relation with the times.
Further, the classification result determining module 905 is further configured to:
obtaining a classification threshold value;
comparing the confidence level of the second class to the classification threshold;
and determining the second category with the confidence coefficient larger than the classification threshold value as the classification result of the video to be classified.
The apparatus shown in fig. 9 can perform the method of the embodiment shown in fig. 1-8, and the detailed description of this embodiment can refer to the related description of the embodiment shown in fig. 1-8. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 8, and are not described herein again.
Referring now to FIG. 10, a block diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the above-described video classification method is performed.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a video classification method including:
acquiring a plurality of video frames of a video to be classified;
classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame;
identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame;
identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
and determining a classification result of the video to be classified according to the first classification vector.
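By way of illustration only, the flow above can be sketched in Python. Everything in this sketch — the function names, the stub models, and the number of second categories — is an assumption standing in for the trained exterior/interior classifier and the view-specific identification models the embodiments describe; it is not the disclosed implementation.

```python
# Hypothetical sketch of the disclosed pipeline; the two stubs stand in
# for (1) the first-category (exterior/interior) classifier and (2) the
# view-specific object identifiers. Illustrative only.
import random
from typing import List, Sequence

NUM_SECOND_CATEGORIES = 5  # e.g. number of car-series labels (assumed)

def classify_first_category(frame) -> str:
    """Stub: label a video frame as object-exterior or object-interior."""
    return random.choice(["exterior", "interior"])

def identify(frame, view: str) -> List[float]:
    """Stub: return a first classification vector whose i-th element is
    the confidence that the frame shows second category i."""
    return [random.random() for _ in range(NUM_SECOND_CATEGORIES)]

def first_classification_vectors(frames: Sequence) -> List[List[float]]:
    """Run the view-appropriate identifier on each sampled frame."""
    vectors = []
    for frame in frames:
        view = classify_first_category(frame)
        vectors.append(identify(frame, view))
    return vectors
```

The later sketches in this summary consume the vectors produced here.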
Further, the classifying the plurality of video frames to obtain the first category of the plurality of video frames includes:
performing target detection on the video frame to obtain at least one target detection frame;
calculating a comprehensive confidence of the target detection frame according to the confidence of the target detection frame, the distance between the target detection frame and the center point of the video frame, and the ratio of the area of the target detection frame to the area of the video frame;
and taking the category corresponding to the target detection frame with the maximum comprehensive confidence as the first category of the video frame.
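Read together, these steps score each detection by how confident, central, and large it is. A minimal sketch follows; the equal weighting of the three signals is an assumption, since the embodiments do not fix the combination formula.

```python
# Assumed comprehensive confidence for one target detection frame:
# equal-weight average of detector score, centrality, and area ratio.
import math

def comprehensive_confidence(det_conf: float,
                             box_cx: float, box_cy: float,
                             box_area: float,
                             frame_w: float, frame_h: float) -> float:
    # closeness of the box center to the frame center, normalized to [0, 1]
    dist = math.hypot(box_cx - frame_w / 2, box_cy - frame_h / 2)
    max_dist = math.hypot(frame_w / 2, frame_h / 2)
    closeness = 1.0 - dist / max_dist
    # share of the video frame covered by the detection box
    area_ratio = box_area / (frame_w * frame_h)
    return (det_conf + closeness + area_ratio) / 3.0
```

The frame's first category is then taken from whichever detection frame maximizes this score.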
Further, each first element in the first classification vector corresponds to a second category, and the value of the first element represents a confidence that the video frame is of the second category to which the first element corresponds.
Further, the determining a classification result of the video to be classified according to the first classification vector includes:
classifying the video frame corresponding to the first classification vector into at least one second category according to the first classification vector; wherein each second category corresponds to a set of video frames;
calculating the confidence coefficient of the video to be classified as a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and determining the classification result of the video to be classified according to the confidence coefficient of the second category.
Further, the classifying the video frame corresponding to the first classification vector into at least one second category according to the first classification vector includes:
acquiring a clustering threshold value;
comparing the value of each first element in the first classification vector with the clustering threshold;
in response to the value of the first element being greater than the clustering threshold, classifying the video frame corresponding to the first classification vector into a second category corresponding to the first element.
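A sketch of this grouping step follows. Because every element above the clustering threshold triggers an assignment, a single frame may enter several second-category sets; the threshold value and the mean aggregation used to score a set afterwards are assumptions.

```python
# Sketch of second-category clustering and per-set confidence;
# the threshold value and mean aggregation are assumed.
from collections import defaultdict
from typing import Dict, List

def cluster_frames(vectors: List[List[float]],
                   clustering_threshold: float = 0.5) -> Dict[int, List[int]]:
    """Map each second-category index to the frames assigned to it."""
    frame_sets: Dict[int, List[int]] = defaultdict(list)
    for frame_idx, vector in enumerate(vectors):
        for cat_idx, confidence in enumerate(vector):
            if confidence > clustering_threshold:
                frame_sets[cat_idx].append(frame_idx)
    return frame_sets

def set_confidence(vectors: List[List[float]], cat_idx: int,
                   frame_indices: List[int]) -> float:
    """Score a second category from its frame set (mean is assumed)."""
    values = [vectors[i][cat_idx] for i in frame_indices]
    return sum(values) / len(values) if values else 0.0
```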
Further, the method further comprises:
identifying the identifier of the object according to the object external video frame to obtain an identification confidence of the identifier of the object in the object external video frame;
calculating a weight value of the object external video frame according to the identification confidence;
the calculating the confidence of the video to be classified into the second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set includes:
according to the weight value of the object external video frame, carrying out weighted calculation on a first classification vector of the object external video frame in a video frame set to obtain a weighted first classification vector;
and calculating the confidence coefficient of the video to be classified as the second category corresponding to the video frame set according to the weighted first classification vector.
Further, the calculating a weight value of the video frame outside the object according to the identification confidence includes:
acquiring a weight threshold;
when the identification confidence is greater than or equal to the weight threshold, calculating a first weight value of the object external video frame according to the weight threshold and the identification confidence;
when the identification confidence is smaller than the weight threshold, setting a preset weight value as the weight value of the object external video frame; wherein the preset weight value is less than or equal to the first weight value.
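One plausible reading of this weighting rule is sketched below: frames whose object identifier (for example, a car logo on an exterior shot) is recognized confidently count more. The boost formula for the above-threshold branch is an assumption; the text requires only that the preset fallback never exceed the first weight value.

```python
# Assumed exterior-frame weighting from the identification confidence;
# the preset weight (1.0) never exceeds the boosted first weight value.
from typing import List

def exterior_frame_weight(ident_conf: float,
                          weight_threshold: float = 0.6,
                          preset_weight: float = 1.0) -> float:
    if ident_conf >= weight_threshold:
        # first weight value: grows as the confidence clears the threshold
        return 1.0 + (ident_conf - weight_threshold)
    return preset_weight

def weighted_vector(vector: List[float], weight: float) -> List[float]:
    """Scale a frame's first classification vector by its weight."""
    return [weight * v for v in vector]
```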
Further, the method further comprises:
acquiring the title of the video to be classified;
calculating a first coefficient from the title, the value of the first coefficient being related to the number of times the name of the second category appears in the title;
the calculating the confidence of the video to be classified into the second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set includes:
calculating a first confidence coefficient of the video to be classified into a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and calculating the confidence coefficient of the video to be classified into the second category corresponding to the video frame set according to the first coefficient and the first confidence coefficient of the second category.
Further, the calculating the first coefficient according to the title includes:
matching the title of the video to be classified against the second categories to obtain the number of times the name of each second category appears in the title;
and calculating a first coefficient for each second category according to that number, wherein the first coefficient is inversely related to the number of occurrences.
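Any function that decreases as the category name recurs in the title satisfies this rule; 1 / (1 + count) is one such relation and is used below purely for illustration, as is the multiplicative combination with the first confidence.

```python
# Assumed instance of the inverse relation between the first coefficient
# and the number of title occurrences, plus an assumed combination rule.
def first_coefficient(title: str, category_name: str) -> float:
    count = title.count(category_name)
    return 1.0 / (1.0 + count)

def combined_confidence(first_conf: float, coefficient: float) -> float:
    """Combine the visual first confidence with the title coefficient;
    multiplication is an assumption, not the disclosed formula."""
    return coefficient * first_conf
```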
Further, the determining the classification result of the video to be classified according to the confidence of the second category includes:
obtaining a classification threshold value;
comparing the confidence of the second category with the classification threshold;
and determining the second category with the confidence coefficient larger than the classification threshold value as the classification result of the video to be classified.
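Putting the pieces together, the final decision reduces to a threshold over the per-category confidences. A sketch, with the classification threshold value assumed:

```python
# Sketch of the final thresholding step; 0.7 is an assumed value.
from typing import Dict, List

def classification_result(category_confidences: Dict[int, float],
                          classification_threshold: float = 0.7) -> List[int]:
    """Return every second category whose confidence clears the threshold."""
    return [cat for cat, conf in category_confidences.items()
            if conf > classification_threshold]
```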
According to one or more embodiments of the present disclosure, there is provided a video classification apparatus including:
the video frame acquisition module is used for acquiring a plurality of video frames of the video to be classified;
a first classification module, configured to classify the plurality of video frames to obtain a first class of the plurality of video frames, where the first class includes an object external video frame and an object internal video frame;
the external first classification vector acquisition module is used for identifying the object according to the external video frame of the object to obtain a first classification vector of the external video frame of the object;
the internal first classification vector acquisition module is used for identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
and the classification result determining module is used for determining the classification result of the video to be classified according to the first classification vector.
Further, the first classification module is further configured to:
performing target detection on the video frame to obtain at least one target detection frame;
calculating a comprehensive confidence of the target detection frame according to the confidence of the target detection frame, the distance between the target detection frame and the center point of the video frame, and the ratio of the area of the target detection frame to the area of the video frame;
and taking the category corresponding to the target detection frame with the maximum comprehensive confidence as the first category of the video frame.
Further, each first element in the first classification vector corresponds to a second category, and the value of the first element represents a confidence that the video frame is of the second category corresponding to the first element.
Further, the classification result determining module is further configured to:
classifying the video frame corresponding to the first classification vector into at least one second category according to the first classification vector; wherein each second category corresponds to a set of video frames;
calculating the confidence coefficient of the video to be classified as a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and determining the classification result of the video to be classified according to the confidence coefficient of the second category.
Further, the classification result determining module is further configured to:
acquiring a clustering threshold value;
comparing the value of each first element in the first classification vector with the clustering threshold;
in response to the value of the first element being greater than the clustering threshold, classifying the video frame corresponding to the first classification vector into a second category corresponding to the first element.
Further, the video classification apparatus further includes:
the identification confidence determining module is used for identifying the identifier of the object according to the object external video frame to obtain an identification confidence of the identifier of the object in the object external video frame;
the weight calculation module is used for calculating a weight value of the object external video frame according to the identification confidence;
wherein the classification result determining module is further configured to:
according to the weight value of the object external video frame, carrying out weighted calculation on a first classification vector of the object external video frame in a video frame set to obtain a weighted first classification vector;
and calculating the confidence coefficient of the video to be classified as the second category corresponding to the video frame set according to the weighted first classification vector.
Further, the weight calculation module is further configured to:
acquiring a weight threshold;
when the identification confidence is greater than or equal to the weight threshold, calculating a first weight value of the object external video frame according to the weight threshold and the identification confidence;
when the identification confidence is smaller than the weight threshold, setting a preset weight value as the weight value of the object external video frame; wherein the preset weight value is less than or equal to the first weight value.
Further, the video classification apparatus further includes:
the title acquisition module is used for acquiring the title of the video to be classified;
a first coefficient calculation module for calculating a first coefficient according to the title, a value of the first coefficient being related to a number of times the name of the second category appears in the title;
wherein the classification result determining module is further configured to:
calculating a first confidence coefficient of the video to be classified into a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and calculating the confidence coefficient of the video to be classified into the second category corresponding to the video frame set according to the first coefficient and the first confidence coefficient of the second category.
Further, the first coefficient calculation module is further configured to:
matching the title of the video to be classified against the second categories to obtain the number of times the name of each second category appears in the title;
and calculating a first coefficient for each second category according to that number, wherein the first coefficient is inversely related to the number of occurrences.
Further, the classification result determining module is further configured to:
obtaining a classification threshold value;
comparing the confidence of the second category with the classification threshold;
and determining the second category with the confidence coefficient larger than the classification threshold value as the classification result of the video to be classified.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of video classification of any of the preceding first aspects.
According to one or more embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the video classification method of any of the foregoing first aspects.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure — for example, technical solutions formed by interchanging the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (13)

1. A method for video classification, comprising:
acquiring a plurality of video frames of a video to be classified;
classifying the plurality of video frames to obtain a first category of the plurality of video frames, wherein the first category comprises an object external video frame and an object internal video frame;
identifying the object according to the object external video frame to obtain a first classification vector of the object external video frame;
identifying the object according to the video frame inside the object to obtain a first classification vector of the video frame inside the object;
and determining a classification result of the video to be classified according to the first classification vector.
2. The video classification method of claim 1, wherein said classifying the plurality of video frames to obtain a first category of the plurality of video frames comprises:
performing target detection on the video frame to obtain at least one target detection frame;
calculating a comprehensive confidence of the target detection frame according to the confidence of the target detection frame, the distance between the target detection frame and the center point of the video frame, and the ratio of the area of the target detection frame to the area of the video frame;
and taking the category corresponding to the target detection frame with the maximum comprehensive confidence as the first category of the video frame.
3. The method for video classification according to claim 1, wherein each first element in said first classification vector corresponds to a second category, and a value of said first element represents a confidence level that said video frame is of said second category to which said first element corresponds.
4. The video classification method according to claim 3, wherein said determining a classification result of the video to be classified according to the first classification vector comprises:
classifying the video frame corresponding to the first classification vector into at least one second category according to the first classification vector; wherein each second category corresponds to a set of video frames;
calculating the confidence coefficient of the video to be classified as a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and determining the classification result of the video to be classified according to the confidence coefficient of the second category.
5. The video classification method according to claim 4, wherein said classifying the video frame corresponding to the first classification vector into at least one second category according to the first classification vector comprises:
acquiring a clustering threshold value;
comparing the value of each first element in the first classification vector with the clustering threshold;
in response to the value of the first element being greater than the clustering threshold, classifying the video frame corresponding to the first classification vector into a second category corresponding to the first element.
6. The video classification method of claim 4, characterized in that the method further comprises:
identifying the identifier of the object according to the object external video frame to obtain an identification confidence of the identifier of the object in the object external video frame;
calculating a weight value of the object external video frame according to the identification confidence;
the calculating, according to the first classification vector of each video frame in the video frame set, the confidence that the video to be classified is of the second category corresponding to the video frame set includes:
according to the weight value of the object external video frame, carrying out weighted calculation on a first classification vector of the object external video frame in a video frame set to obtain a weighted first classification vector;
and calculating the confidence coefficient of the video to be classified as the second category corresponding to the video frame set according to the weighted first classification vector.
7. The video classification method of claim 6, wherein said calculating a weight value for the video frame outside the object according to the identification confidence comprises:
acquiring a weight threshold;
when the identification confidence is greater than or equal to the weight threshold, calculating a first weight value of the object external video frame according to the weight threshold and the identification confidence;
when the identification confidence is smaller than the weight threshold, setting a preset weight value as the weight value of the object external video frame; wherein the preset weight value is less than or equal to the first weight value.
8. The video classification method of claim 4, characterized in that the method further comprises:
acquiring the title of the video to be classified;
calculating a first coefficient from the title, the value of the first coefficient being related to the number of times the name of the second category appears in the title;
the calculating the confidence of the video to be classified into the second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set includes:
calculating a first confidence coefficient of the video to be classified into a second category corresponding to the video frame set according to the first classification vector of each video frame in the video frame set;
and calculating the confidence coefficient of the video to be classified into the second category corresponding to the video frame set according to the first coefficient and the first confidence coefficient of the second category.
9. The video classification method according to claim 8, wherein said calculating a first coefficient from said title comprises:
matching the title of the video to be classified against the second categories to obtain the number of times the name of each second category appears in the title;
and calculating a first coefficient for each second category according to that number, wherein the first coefficient is inversely related to the number of occurrences.
10. The video classification method according to any one of claims 4 to 9, wherein said determining a classification result of the video to be classified according to the confidence of the second class comprises:
obtaining a classification threshold value;
comparing the confidence of the second category with the classification threshold;
and determining the second category with the confidence coefficient larger than the classification threshold value as the classification result of the video to be classified.
11. A video classification apparatus, comprising:
the video frame acquisition module is used for acquiring a plurality of video frames of the video to be classified;
a first classification module, configured to classify the plurality of video frames to obtain a first class of the plurality of video frames, where the first class includes an object external video frame and an object internal video frame;
the external first classification vector acquisition module is used for identifying the object according to the external video frame of the object to obtain a first classification vector of the external video frame of the object;
the internal first classification vector acquisition module is used for identifying the object according to the object internal video frame to obtain a first classification vector of the object internal video frame;
and the classification result determining module is used for determining the classification result of the video to be classified according to the first classification vector.
12. An electronic device, comprising:
a memory for storing computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor, when executing the instructions, implements the method of any of claims 1-10.
13. A non-transitory computer readable storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-10.
CN202011509800.8A 2020-12-18 2020-12-18 Video classification method and device, electronic equipment and computer-readable storage medium Pending CN114648713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011509800.8A CN114648713A (en) 2020-12-18 2020-12-18 Video classification method and device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011509800.8A CN114648713A (en) 2020-12-18 2020-12-18 Video classification method and device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114648713A (en) 2022-06-21

Family

ID=81990548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011509800.8A Pending CN114648713A (en) 2020-12-18 2020-12-18 Video classification method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114648713A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953722A (en) * 2023-03-03 2023-04-11 阿里巴巴(中国)有限公司 Processing method and device for video classification task


Similar Documents

Publication Publication Date Title
CN111476309B (en) Image processing method, model training method, device, equipment and readable medium
KR102576344B1 (en) Method and apparatus for processing video, electronic device, medium and computer program
CN110674349B (en) Video POI (Point of interest) identification method and device and electronic equipment
CN111931859B (en) Multi-label image recognition method and device
CN112364829B (en) Face recognition method, device, equipment and storage medium
CN112232311B (en) Face tracking method and device and electronic equipment
CN110399847B (en) Key frame extraction method and device and electronic equipment
CN111738316B (en) Zero sample learning image classification method and device and electronic equipment
CN115861400B (en) Target object detection method, training device and electronic equipment
CN112712036A (en) Traffic sign recognition method and device, electronic equipment and computer storage medium
CN115393606A (en) Method and system for image recognition
CN114898154A (en) Incremental target detection method, device, equipment and medium
CN114648713A (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN114648712B (en) Video classification method, device, electronic equipment and computer readable storage medium
CN110781809A (en) Identification method and device based on registration feature update and electronic equipment
CN111832354A (en) Target object age identification method and device and electronic equipment
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN110704679B (en) Video classification method and device and electronic equipment
CN114428867A (en) Data mining method and device, storage medium and electronic equipment
CN113705643A (en) Target detection method and device and electronic equipment
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN112214639A (en) Video screening method, video screening device and terminal equipment
CN112561956A (en) Video target tracking method and device, electronic equipment and storage medium
CN113111692A (en) Target detection method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.