CN113569610A - Video content identification method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113569610A
Authority
CN
China
Prior art keywords
video
level
identification
content
parameter set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110176761.2A
Other languages
Chinese (zh)
Inventor
陈小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110176761.2A
Publication of CN113569610A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Abstract

The invention discloses a video content identification method and device, a storage medium and electronic equipment. The method comprises: fusing multi-dimensional features extracted from the video content of an object video to be identified to obtain a multi-modal video feature vector corresponding to the object video; obtaining a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag; obtaining a second level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag; and determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set. The invention solves the technical problem of low accuracy in video content identification.

Description

Video content identification method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, in particular to a video content identification method and device, a storage medium and electronic equipment.
Background
Nowadays, more and more users upload and publish self-made short videos or small videos to content distribution platforms through personal accounts. In the distribution and recommendation strategies provided in the related art, recommendation conditions are usually set according to the data traffic of each client; for example, an upper limit (i.e., a cap) on play traffic is set to control the exposure of the videos uploaded and published by each client.
However, the content quality of the videos published by these users is uneven. Some low-end videos of lower quality do not actually meet the recommendation conditions for being shared to other users through the content distribution platform. For example, videos with poor picture definition and low play completion are not suitable for being distributed and recommended repeatedly.
At present, such low-end videos on a content distribution platform can only be labeled manually, so videos that do not meet the recommendation conditions are easily missed, which leads to low video content identification accuracy.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a video content identification method and device, a storage medium and electronic equipment, and aims to at least solve the technical problem of low video content identification accuracy.
According to an aspect of an embodiment of the present invention, there is provided a video content identification method, including: fusing multi-dimensional features extracted from video content of an object video to be identified to obtain a multi-modal video feature vector corresponding to the object video; acquiring a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, wherein the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the first identification tag; acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, wherein the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second-level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag; and determining the target content quality level of the object video according to the first level identification parameter set and the second level identification parameter set.
According to another aspect of the embodiments of the present invention, there is also provided a video content recognition apparatus, including: the fusion unit is used for fusing multi-dimensional features extracted from video contents of an object video to be recognized to obtain a multi-modal video feature vector corresponding to the object video; a first obtaining unit, configured to obtain a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used to indicate a probability that the target video is classified into each content quality level according to the first identification tag; a second obtaining unit, configured to obtain a second level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second level identification parameter set is used to indicate a probability that the target video is classified into each content quality level according to the second identification tag; and a determining unit, configured to determine a target content quality level for the object video according to the first level identification parameter set and the second level identification parameter set.
According to a further aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned video content identification method when running.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the video content identification method through the computer program.
In the embodiment of the invention, multi-dimensional features extracted from the video content of an object video to be identified are fused to obtain a multi-modal video feature vector corresponding to the object video. A first level identification parameter set is obtained according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition and the first level identification parameter set indicates the probability of dividing the object video into each content quality level according to the first identification tag. A second level identification parameter set is obtained according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user play behavior coefficient and the second level identification parameter set indicates the probability of dividing the object video into each content quality level according to the second identification tag. The target content quality level matched with the object video is then determined according to the first level identification parameter set and the second level identification parameter set. This improves the low-end identification capability for videos, thereby improving the coverage and accuracy of low-end identification, reducing the cost of manual identification and labeling, improving the overall video quality of the platform and the user's impression of the platform's videos, and solving the technical problem of low accuracy in video content identification.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative video content recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application environment of an alternative video content recognition method according to an embodiment of the invention;
FIG. 3 is a flow diagram of an alternative video content identification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a video low-end level recognition architecture of an alternative video content recognition method according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a video-based multi-dimensional low-end content identification model of an alternative video content identification method according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of low-end identification based on multi-dimensional video content according to an alternative video content identification method in an embodiment of the present invention;
fig. 7 is a structural diagram of a low-end recognition model based on video distribution user behavior according to an alternative video content recognition method in an embodiment of the present invention;
fig. 8 is a flow chart of an alternative video content identification method for performing low-end video identification based on user behavior according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an alternative video content recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a video content identification method is provided. Optionally, as an optional implementation, the video content identification method may be applied, but is not limited, to the application environment shown in fig. 1. The application environment comprises a terminal device 102, a network 104 and a server 106. The user 108 can interact with the terminal device 102, in which a video content identification application client runs. The terminal device 102 includes a human-computer interaction screen 1022, a processor 1024 and a memory 1026. The human-computer interaction screen 1022 is used to present the video content of the object video to be identified. The processor 1024 is configured to fuse the multi-dimensional features extracted from the video content of the object video to be identified to obtain a multi-modal video feature vector corresponding to the object video. The memory 1026 is used to store the video content of the object video to be identified and the multi-modal video feature vector corresponding to the object video.
In addition, the server 106 includes a database 1062 and a processing engine 1064, where the database 1062 is used to store the multi-modal video feature vectors corresponding to the object videos, and is used to store a first level identification parameter set and a second level identification parameter set; and also for storing the target content quality rating for the object video match. The processing engine 1064 is configured to obtain a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on the first identification tag; acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag; and determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set.
The specific process comprises the following steps. Assuming that the video content identification application client is running in the terminal device 102 shown in fig. 1, the user 108 operates the human-computer interaction screen 1022 to manage the video content. In step S102, the multi-dimensional features extracted from the video content of the object video to be identified are fused to obtain a multi-modal video feature vector corresponding to the object video; then step S104 is executed to send the multi-modal video feature vector to the server 106 through the network 104. After receiving the request, the server 106 executes steps S106-S108: a first level identification parameter set is obtained according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition and the first level identification parameter set indicates the probability of dividing the object video into each content quality level according to the first identification tag; a second level identification parameter set is obtained according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user play behavior coefficient and the second level identification parameter set indicates the probability of dividing the object video into each content quality level according to the second identification tag; and the target content quality level matched with the object video is determined according to the first level identification parameter set and the second level identification parameter set. In step S112, the server notifies the terminal device 102 through the network 104, returning the target content quality level matched with the object video.
As another alternative, the video content identification method described above may be applied to the application environment shown in fig. 2. As shown in fig. 2, human-computer interaction may be performed between a user 202 and a user device 204. The user device 204 includes a memory 206 and a processor 208. The user device 204 in this embodiment may, but is not limited to, perform the operations performed by the terminal device 102 to determine the target content quality level matched with the object video.
Alternatively, the terminal device 102 and the user device 204 may be, but are not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, and the like, and the network 104 may include, but is not limited to, a wireless network or a wired network, where the wireless network includes WIFI and other networks that enable wireless communication, and the wired network may include, but is not limited to, wide area networks, metropolitan area networks and local area networks. The server 106 may include, but is not limited to, any hardware device capable of performing computation. The above is merely an example, and this embodiment is not limited thereto.
Low-end videos affect the overall content quality of a video platform, so a video platform generally needs to identify low-end videos to reduce their influence. The related art basically relies on text mining, image recognition and similar single-dimension approaches, and does not fully combine the multi-dimensional content of a video with the user behavior observed during video recommendation and distribution.
In order to solve the above technical problem, optionally, as an optional implementation manner, as shown in fig. 3, the video content identification method includes:
s302, fusing multi-dimensional features extracted from video contents of the object video to be recognized to obtain multi-modal video feature vectors corresponding to the object video;
s304, acquiring a first grade identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, wherein the first identification tag is a grade identification tag generated according to grade definition, and the first grade identification parameter set is used for indicating the probability of dividing the object video into each content quality grade according to the first identification tag;
s306, acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, wherein the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second-level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag;
s308, determining the quality grade of the target content matched with the object video according to the first grade identification parameter set and the second grade identification parameter set.
In step S302, in actual application, the object video to be identified may include, but is not limited to, a movie, a TV series and various long and short videos on any video platform, and the multi-dimensional features may include, but are not limited to, text features, image features or audio features of the video, which are not limited herein. The multi-modal video feature vector includes, but is not limited to: for the text features, a word vector sequence obtained by performing word segmentation and vector conversion on the text, and the result vector obtained by encoding the word vector sequence; for the image features, the features obtained by inputting each subject key frame into an image recognition model with time-series fusion capability; and for the audio features, the features obtained by inputting each audio frame into an audio recognition model with time-series fusion capability.
In step S304, in practical application, the first identification tag may include, but is not limited to, preset levels, and the first level identification parameter set is used to indicate the probability that the object video is divided into each content quality level according to the first identification tag. For example, the first identification tag may be divided into 5 levels, 1-5, whose corresponding low-end level probabilities are [0.12, 0.52, 0.36, 0.08, 0.19]: the level-1 identification tag corresponds to a low-end level probability of 0.12, the level-2 tag to 0.52, the level-3 tag to 0.36, the level-4 tag to 0.08, and the level-5 tag to 0.19.
In step S306, in practical application, the second identification tag may include, but is not limited to, a level identification tag generated according to the user play behavior coefficient, which is defined by counting the play rate (play count / exposure count) and the play completion rate (total play duration / duration viewed by the user) of the videos in the recommendation pool.
Here, c1 × play rate + c2 × play completion rate is defined as the video distribution behavior score, where c1 and c2 are weights and c1 + c2 = 1. The behavior score of a video is divided into K behavior low-end level intervals; for example, [0, 0.2] corresponds to low-end level K, i.e., the score range corresponding to low-end level K is 0 to 0.2, and [0.8, 1.0] corresponds to low-end level 1, i.e., the score range corresponding to low-end level 1 is 0.8 to 1. The second level identification parameter set comprises the set of probabilities, from low-end level 1 to low-end level K, according to the level identification tag generated from the user play behavior coefficient.
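As a minimal illustrative sketch of this scoring and bucketing step (the weights c1 = 0.4, c2 = 0.6, K = 5, the interval boundaries and the function names below are assumptions for illustration, not values fixed by the disclosure):

```python
def behavior_score(play_count, exposure_count, total_play_duration, viewed_duration,
                   c1=0.4, c2=0.6):
    """Distribution behavior score = c1 * play rate + c2 * play completion rate."""
    play_rate = play_count / exposure_count                 # plays / exposures
    completion_rate = total_play_duration / viewed_duration  # per the definition used in this text
    return c1 * play_rate + c2 * completion_rate             # assumes c1 + c2 == 1

def behavior_level(score, k=5):
    """Map a score in [0, 1] onto K equal-width intervals:
    low scores near [0, 0.2] -> level K (most low-end), high scores near [0.8, 1.0] -> level 1.
    Boundary handling at exactly 0.2, 0.4, ... is an assumption."""
    bucket = min(int(score * k), k - 1)   # 0 .. K-1, low score gives a low bucket index
    return k - bucket                     # low bucket -> high (more low-end) level number
```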
In step S308, in practical application, the target content quality level matched with the object video may be determined by, but is not limited to, the following method: video low-end level probability = x1 × low-end probability from the video multi-dimensional content recognition model + x2 × low-end probability from the user-behavior-based low-end recognition model, where x1 + x2 = 1, and the fused video low-end level probability is taken as the final low-end level of the video. Here, the low-end probabilities from the video multi-dimensional content recognition model may include, but are not limited to, the first level identification parameter set, and the low-end probabilities from the user-behavior-based low-end recognition model may include, but are not limited to, the second level identification parameter set.
In the embodiment of the invention, the multi-dimensional features extracted from the video content of the object video to be identified are fused into a multi-modal video feature vector corresponding to the object video; the first level identification parameter set is obtained according to this vector and the first weight set determined based on the first identification tag (a level identification tag generated according to a level definition), the second level identification parameter set is obtained according to this vector and the second weight set determined based on the second identification tag (a level identification tag generated according to a user play behavior coefficient), and the target content quality level matched with the object video is determined from the two parameter sets. This improves the low-end identification capability for videos, thereby improving the coverage and accuracy of low-end identification, reducing the cost of manual identification and labeling, improving the overall video quality of the platform and the user's impression of the platform's videos, and solving the technical problem of low accuracy in video content identification.
In one embodiment, step S304 includes: obtaining a weighted summation result between the multi-modal video feature vector and each weight value in the first weight set to obtain the first level identification parameter set. Here, the text feature, image feature and audio feature corresponding to the multi-modal video feature vector may each be weighted by its corresponding weight and summed, and the weighted summation result is used as a first level identification parameter. For example, suppose the feature value corresponding to the text feature of video A is 2 with weight 0.3, the feature value corresponding to the image feature is 5 with weight 0.4, and the feature value corresponding to the audio feature is 3 with weight 0.3; the first level identification parameter corresponding to video A is then 2 × 0.3 + 5 × 0.4 + 3 × 0.3 = 3.5. Suppose the feature value corresponding to the text feature of video B is 3 with weight 0.3, the feature value corresponding to the image feature is 4 with weight 0.4, and the feature value corresponding to the audio feature is 2 with weight 0.3; the first level identification parameter corresponding to video B is then 3 × 0.3 + 4 × 0.4 + 2 × 0.3 = 3.1. The first level identification parameter set may then be [3.5, 3.1]; furthermore, the parameter set may be normalized, giving a first level identification parameter set of [0.35, 0.31]. The process of acquiring the first level identification parameter set here is merely an example and is not limited herein.
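A small sketch of this weighted-summation example, reusing the toy feature values and weights from the text; the final rescaling step is one possible reading of the normalization mentioned above, not a prescribed formula:

```python
def level_parameter(feature_values, weights):
    """Weighted sum of per-modality feature values (text, image, audio)."""
    return sum(v * w for v, w in zip(feature_values, weights))

# toy numbers from the example above
param_a = level_parameter([2, 5, 3], [0.3, 0.4, 0.3])     # 3.5 for video A
param_b = level_parameter([3, 4, 2], [0.3, 0.4, 0.3])     # 3.1 for video B
first_level_parameters = [param_a, param_b]                # [3.5, 3.1]

# one possible rescaling, consistent with the [0.35, 0.31] quoted in the text
normalized = [p / 10 for p in first_level_parameters]      # [0.35, 0.31]
```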
Step S306 includes: obtaining a weighted summation result between the multi-modal video feature vector and each weight value in the second weight set to obtain the second level identification parameter set. For example, this may include, but is not limited to, counting the play rate (play count / exposure count) and play completion rate (total play duration / duration viewed by the user) of the videos in the recommendation pool. Here, c1 × play rate + c2 × play completion rate is defined as a second level identification parameter, where c1 and c2 are weights and c1 + c2 = 1.
For example, the play count of video A in the video recommendation pool is 3000, the exposure count is 5000, the total play duration is 6000 hours and the duration viewed by users is 8000 hours, so the play rate of video A is 0.6 and the play completion rate is 0.75; with c1 = 0.4 and c2 = 0.6, the second level identification parameter corresponding to video A may be 0.4 × 0.6 + 0.6 × 0.75 = 0.69. The play count of video B in the video recommendation pool is 2000, the exposure count is 4000, the total play duration is 3000 hours and the duration viewed by users is 4000 hours, so the play rate of video B is 0.5 and the play completion rate is 0.75; with c1 = 0.4 and c2 = 0.6, the second level identification parameter corresponding to video B may be 0.4 × 0.5 + 0.6 × 0.75 = 0.65. The second level identification parameter set may then be [0.69, 0.65]. The process of acquiring the second level identification parameter set here is merely an example and is not limited herein.
In an embodiment, before fusing the multi-dimensional features extracted from the video content of the object video to be identified to obtain the multi-modal video feature vector corresponding to the object video, the method further includes: acquiring a first sample video set; configuring a first identification tag for each first sample video in the first sample video set according to the level definition; inputting the first sample video set and the corresponding first identification tags into an initialized content level recognition model for training to obtain a training output result, where in each training pass of the content level recognition model, the first sample content quality level corresponding to a first sample video is determined according to the multi-dimensional features extracted from the video content of that first sample video; and, when the training output result indicates that a first convergence condition is reached, determining that a target content level recognition model for acquiring the first level identification parameter set has been obtained, where the first convergence condition indicates that the degree of difference between the determined first sample content quality level and the content quality level indicated by the first identification tag is less than or equal to a first threshold.
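A schematic training loop consistent with this description is sketched below; the model, optimizer, loss function, gap measure and threshold are placeholders chosen for illustration, not the patented training procedure:

```python
import torch
import torch.nn as nn

def train_content_level_model(model, loader, num_levels, threshold=0.05,
                              epochs=50, lr=1e-4):
    """Train until the gap between predicted and labeled quality levels is small enough."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total_gap, n = 0.0, 0
        for features, labels in loader:          # labels: first identification tags, 0 .. K-1
            logits = model(features)             # K-way level scores
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            pred = logits.argmax(dim=1)
            # normalized level gap as one possible "degree of difference"
            total_gap += (pred - labels).abs().float().sum().item() / (num_levels - 1)
            n += labels.numel()
        if total_gap / n <= threshold:            # first convergence condition reached
            break
    return model
```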
Here, the first sample video set may be different types of video files extracted from a video recommendation pool in the video platform. Configuring a first identification tag for each first sample video in the first sample video set according to the level definition, inputting the first sample video set and the corresponding first identification tag into the initialized content level identification model for training, and obtaining a training output result.
The text features of the corresponding model training set can be obtained by jointly using the video title, subtitles and dialogue text, where the subtitles can be extracted from the video through an Optical Character Recognition (OCR) model such as Google Tesseract, and the dialogue text, i.e., the words spoken in the dialogue segments of the video, can be recognized through an Automatic Speech Recognition (ASR) model. The video title, dialogue and subtitles are spliced to form the text of the video; the text is then segmented into words, the word vector of each word is looked up, the word vector sequence is input into an ALBERT Encoder model, and the output of the model is used as the text representation of the video. The first vector output by ALBERT can be used to represent the whole input text; the per-word output vectors can also be processed further, for example to compute the distance or similarity between two texts using the ALBERT Encoder.
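The text branch might look roughly as follows; the pytesseract OCR call, the albert-base-v2 checkpoint and the assumption that an ASR transcript is already available are illustrative choices, not details from the disclosure:

```python
import pytesseract
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")   # stand-in checkpoint
encoder = AlbertModel.from_pretrained("albert-base-v2")

def video_text_feature(title, key_frames, asr_transcript):
    """Splice title + dialogue + subtitles and encode them with ALBERT."""
    # OCR each key frame to recover on-screen subtitles; ASR output is assumed given
    subtitles = " ".join(pytesseract.image_to_string(f) for f in key_frames)
    text = " ".join([title, asr_transcript, subtitles])
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]   # first output vector as the text representation
```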
Here, the processing of the video images, i.e., the image features of the video, consists of extracting the key frames related to the video topic and characterizing the video with these key frames. Video key frame extraction is a sequence labeling model: each frame in the video is labeled 0 or 1, where 1 indicates that the frame is a key frame. A video set is labeled manually, each frame is marked 0 or 1 to construct a training data set, and a video frame sequence labeler is trained on this data set; the model takes a video as input and outputs the sequence of key frames. Each key frame of the video is input into a pre-trained EfficientNet model, and the output of the last hidden layer before the classification layer, e.g. a 1024-dimensional floating-point vector, is used as the representation of that frame. After the key frame representations are obtained, each key frame is input in turn into a model layer with time-series fusion capability, such as a NetXVlad model, to construct the video image-side representation. The classic VLAD algorithm underlying NetXVlad can be divided into the following steps: 1. extract the SIFT descriptors of an image; 2. train a codebook with the SIFT descriptors extracted from the training images, for example with K-means; 3. assign all SIFT descriptors of an image to the codebook according to the nearest-neighbor principle (i.e., assign them to the K cluster centers); 4. compute the residual sum for each cluster center (i.e., subtract the cluster center from all SIFT descriptors assigned to it and sum the results); 5. L2-normalize the residuals and concatenate them into a long vector of length K × 128, 128 being the length of a single SIFT descriptor.
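A hedged sketch of the per-frame embedding step, assuming a recent torchvision; mean pooling is used as a simple stand-in for the NetXVlad fusion described above, and the embedding dimension depends on the chosen EfficientNet variant:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()      # keep the last hidden representation per frame
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def video_image_feature(key_frames):
    """key_frames: list of PIL images selected by the key-frame sequence labeler."""
    batch = torch.stack([preprocess(f) for f in key_frames])
    frame_vecs = backbone(batch)                # one embedding per key frame
    return frame_vecs.mean(dim=0)               # simple temporal pooling stand-in
```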
The audio feature representation of the video is obtained similarly to the image modality: first, a VGGish model is used to model each audio frame to obtain the representation of the audio frame, and then the representations of the audio frames are fused over time through NetXVlad to obtain the video audio-side representation.
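Since NetXVlad-style temporal fusion is used on both the image and audio sides, a simplified NetVLAD-like pooling layer is sketched below as a generic reconstruction; it is not the exact module used in the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNetVLAD(nn.Module):
    """Simplified NetVLAD-style pooling for fusing a sequence of frame embeddings."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, frames):                   # frames: (T, dim), e.g. VGGish audio embeddings
        soft = F.softmax(self.assign(frames), dim=1)          # (T, K) soft cluster assignments
        residual = frames.unsqueeze(1) - self.centroids       # (T, K, dim) residuals
        vlad = (soft.unsqueeze(2) * residual).sum(dim=0)      # (K, dim) weighted residual sums
        vlad = F.normalize(vlad, dim=1)                       # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)             # (K*dim,) video-level descriptor
```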
In each training process of the content grade identification model, determining a first sample content quality grade corresponding to a first sample video according to multi-dimensional features extracted from the video content of the first sample video; for example, different content quality levels, classified as 1-K levels, may be obtained.
Under the condition that the training output result indicates that the first convergence condition is reached, determining to obtain a target content grade recognition model for acquiring a first grade recognition parameter set; the first convergence condition is used for indicating that the difference degree between the determined first sample content quality level and the content quality level indicated by the first identification label is less than or equal to a first threshold value. That is, when the degree of difference between the content quality level obtained by the model training and the content quality level indicated by the first label is within the preset range, that is, the convergence condition is reached, the training process of the content level recognition model is stopped.
In an embodiment, before fusing the multi-dimensional features extracted from the video content of the target video to be recognized to obtain the multi-modal video feature vector corresponding to the target video, the method further includes: acquiring a second sample video set; configuring a second identification label for each second sample video in the second sample video set according to the user playing behavior coefficient; inputting a second sample video set and a corresponding second identification label into the initialized behavior level identification model for training to obtain a training output result, wherein in each training process of the behavior level identification model, the quality level of the content of a second sample corresponding to the second sample video is determined according to the multi-dimensional features extracted from the video content of the second sample video and the user playing behavior coefficient corresponding to the second sample video; and under the condition that the training output result indicates that a second convergence condition is reached, determining to obtain a target behavior grade identification model for obtaining a second grade identification parameter set, wherein the second convergence condition is used for indicating that the difference degree between the determined second sample content quality grade and the content quality grade indicated by the second identification label is less than or equal to a second threshold value.
Here, for example, the video low-end recognition model may be trained with behavior low-end levels as follows: by counting the play rate (play count / exposure count) and play completion rate (total play duration / duration viewed by the user) of the videos in the recommendation pool, c1 × play rate + c2 × play completion rate is defined as the distribution behavior score of the video, where c1 and c2 are weights and c1 + c2 = 1. The behavior score of the video is divided into K behavior low-end level intervals, e.g. [0, 0.2] corresponds to low-end level K and [0.8, 1.0] to low-end level 1. A video low-end recognition model is then trained on the videos in the recommendation pool and their corresponding behavior low-end levels; the input features of the model are the multi-dimensional features of the video (which may include the text, image-frame and audio-frame features of the video, among others), the classification output target is the distribution-behavior low-end level of the video, and the model outputs the video low-end level predicted based on user behavior together with the corresponding level probabilities.
And under the condition that the training output result indicates that a second convergence condition is reached, determining to obtain a target behavior grade identification model for obtaining a second grade identification parameter set, wherein the second convergence condition is used for indicating that the difference degree between the determined second sample content quality grade and the content quality grade indicated by the second identification label is less than or equal to a second threshold value. That is, when the degree of difference between the content quality level obtained by the model training and the content quality level indicated by the second label is within the preset range, that is, the convergence condition is reached, the training process of the content level recognition model is stopped.
In an embodiment, the configuring, for each second sample video in the second sample video set, the second identification tag according to a user play behavior coefficient includes: sequentially taking each second sample video in the second sample video set as a current sample video, and executing the following operations: counting the playing rate and the playing completion rate of the current sample video, wherein the playing rate is used for indicating the ratio of the playing times and the exposure times of the current sample video actually played at a playing client, and the playing completion rate is used for indicating the ratio of the time length of the current sample video actually played at the playing client to the total playing time length of the current sample video; determining a current user playing behavior coefficient matched with the current sample video according to the playing rate and the playing completion rate; determining the current content quality grade corresponding to the current user playing behavior coefficient according to the grade interval divided for the user playing behavior coefficient; configuring the second identification tag corresponding to the current content quality level for the current sample video.
For example, the play count of video A among the current sample videos is 3000, the exposure count is 5000, the total play duration is 6000 hours and the duration viewed by users is 8000 hours, so the play rate of video A is 0.6 and the play completion rate is 0.75; with c1 = 0.4 and c2 = 0.6, the user play behavior coefficient corresponding to video A may be 0.4 × 0.6 + 0.6 × 0.75 = 0.69. The play count of video B in the video recommendation pool is 2000, the exposure count is 4000, the total play duration is 3000 hours and the duration viewed by users is 4000 hours, so the play rate of video B is 0.5 and the play completion rate is 0.75; with c1 = 0.4 and c2 = 0.6, the user play behavior coefficient corresponding to video B may be 0.4 × 0.5 + 0.6 × 0.75 = 0.65. The coefficient set is then [0.69, 0.65]. Suppose the level interval containing 0.69 corresponds to level 1 and the interval containing 0.65 corresponds to level 2; then the level identification tag generated for video A according to the user play behavior coefficient is level 1, and the level identification tag generated for video B according to the user play behavior coefficient is level 2. The process of acquiring the second identification tag here is merely an example and is not limited herein.
In an embodiment, step S308 comprises: traversing each content quality level, taking each content quality level in turn as the current content quality level, and performing the following operations: acquiring the first level identification parameter corresponding to the current content quality level from the first level identification parameter set, and acquiring the second level identification parameter corresponding to the current content quality level from the second level identification parameter set; performing a weighted summation of the first level identification parameter and the second level identification parameter to obtain the current level identification parameter corresponding to the current content quality level; and, once the level identification parameters corresponding to all content quality levels have been obtained, determining the maximum level identification parameter value and determining the content quality level corresponding to the maximum value as the target content quality level. Here, the video multi-dimensional content recognition model and the user-behavior-based low-end recognition model may be used in combination: video low-end level probability = x1 × low-end probability from the video multi-dimensional content recognition model + x2 × low-end probability from the user-behavior-based low-end recognition model, where x1 + x2 = 1, and the fused video low-end level probability determines the target content quality level. When a certain low-end level is exceeded, the video may not be distributed.
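A minimal sketch of this per-level fusion and argmax decision; the weights x1 and x2 are illustrative assumptions (the disclosure only requires x1 + x2 = 1), and the example probabilities are the ones quoted later in this description:

```python
def fuse_level_parameters(content_probs, behavior_probs, x1=0.6, x2=0.4):
    """Fuse the two level identification parameter sets per content quality level
    and pick the level with the largest fused value (assumes x1 + x2 == 1)."""
    fused = [x1 * p1 + x2 * p2 for p1, p2 in zip(content_probs, behavior_probs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best + 1, fused[best]     # 1-based target content quality level and its parameter

# example: the two 5-level probability vectors used in the content-based and
# behavior-based examples below
level, prob = fuse_level_parameters(
    [0.05432093, 0.53563935, 0.18928528, 0.13303354, 0.0877209],
    [0.06821878, 0.08823278, 0.09661109, 0.19159416, 0.55534319])
# with x1 = 0.6, x2 = 0.4 this yields level 2 with a fused value of about 0.357
```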
In one embodiment, step S302 includes: extracting at least one of the following features from the video content of the subject video: text features, image features, audio features; under the condition that each text message contained in the object video is identified, splicing each text message to obtain an object text to be processed corresponding to the object video; performing word segmentation and vector conversion processing on the object text to obtain a word vector sequence; coding the word vector sequence to obtain the text characteristics; under the condition that each topic key frame contained in the object video is identified, inputting each topic key frame into an image identification model with time sequence fusion capability to obtain the image characteristics; and under the condition that each audio frame contained in the object video is identified, inputting each audio frame into an audio identification model with time series fusion capability to obtain the audio characteristics.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Based on the above embodiment, in an application embodiment, the video content identification method further includes:
the overall architecture of the video low-end level recognition method based on video multi-modal content and user behavior is shown in fig. 4: step S402, obtaining a video to be identified, and going to step S404, performing low-end identification based on video content from the video to be identified, then performing low-end estimation of user behavior in step S406, then putting the processed video into a video recommendation pool in step S408, performing model learning in step S410 through video distribution exposure, clicking and the like, and in step S412, obtaining a low-end identification model based on user behavior in step S414; step S416, estimating the low end of the user behavior, generating a low end video library after marking the generated low end video, then entering step S418 to learn the model, and step S420 to obtain a low end identification model based on the video content.
Based on the above embodiments, in an application embodiment, the process of performing low-end level identification based on multi-dimensional content of a video includes the following steps:
the video platform predefines K low-end levels, wherein K is the number of low-end categories, the low-end levels from 1 to K are gradually increased, low-end category labeling is carried out on a large number of videos in a video library to construct a low-end data set, a low-end recognition model based on multi-mode contents of the videos is trained on the data set, the model structure is shown in fig. 5, and after a user uploads the videos, the video platform recognizes the low-end levels of the videos based on a multi-mode deep learning model. The video-based multi-modal content low-end recognition model is as follows:
the video text side representation jointly uses a video title, a subtitle and a dialog text, wherein the subtitle can be extracted from the video through a common OCR model such as Google Tesseract, and the dialog text can be recognized through a common ASR recognition model. The method comprises the steps of splicing a video title, a dialogue and a subtitle to form a text of the video, segmenting a video text, inquiring a word vector of each word, inputting a word vector sequence into an ALBERT Encoder model, and enabling the output of the model to be used as the text representation of the video.
The video image side characterizes the video by extracting key frames related to the video topic. Video key frame extraction is a sequence labeling model: each frame in the video is labeled 0 or 1, where 1 indicates that the frame is a key frame. A video set is labeled manually, each frame is marked 0 or 1 to construct a training data set, and a video frame sequence labeler is trained on this data set; the model takes a video as input and outputs the sequence of key frames. Each key frame of the video is input into a pre-trained EfficientNet model, and the output of the last hidden layer before the classification layer, e.g. a 1024-dimensional floating-point vector, is used as the representation of that frame. After the key frame representations are obtained, each key frame is input in turn into a model layer with time-series fusion capability, such as NetXVlad, to construct the video image-side representation. The video audio-side representation is obtained similarly to the image modality: a VGGish model is first used to model the audio frames to obtain their representations, and these audio frame representations are then fused over time through NetXVlad to obtain the video audio-side representation. A low-end level classification and recognition model is constructed through multi-dimensional video feature fusion: the constructed text, image and audio features of the video are spliced, a fully connected network produces the multi-dimensional fused representation, a low-end classification output layer built on this fused representation classifies the low-end level of the video, and the model is trained on the pre-labeled content-based low-end training set so that it acquires the ability to output the low-end level of the video.
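A possible shape for the fusion head described above, sketched with placeholder dimensions; this is an illustrative reconstruction, not the model disclosed in fig. 5:

```python
import torch
import torch.nn as nn

class LowEndLevelClassifier(nn.Module):
    """Concatenate text/image/audio representations, pass them through a fully
    connected network, and classify into K low-end levels."""
    def __init__(self, text_dim, image_dim, audio_dim, hidden_dim, num_levels):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, num_levels)

    def forward(self, text_vec, image_vec, audio_vec):
        fused = self.fusion(torch.cat([text_vec, image_vec, audio_vec], dim=-1))
        return torch.softmax(self.classifier(fused), dim=-1)   # per-level probabilities
```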
When performing low-end level recognition on a video uploaded by a user, as shown in fig. 6, the method includes the following steps: step S602, the video to be identified is acquired; step S604, content-based low-end recognition is performed on the video to be identified; step S606, the recognized content is labeled and confirmed to obtain a low-end video library; step S608, model learning is performed; step S610, a low-end recognition model based on video content is obtained. In this embodiment, the text, image-frame and audio-frame features are extracted according to the above scheme, and the model then outputs the low-end level of the video and the corresponding level probability. For example, with K = 5 low-end levels, the probabilities the model outputs for low-end levels 1 to K of the current video are [0.05432093, 0.53563935, 0.18928528, 0.13303354, 0.0877209]; judging from the content-based low-end recognition model alone, the low-end level of the video is 2 with a corresponding probability of 0.53563935.
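Reading the single-model decision off that probability vector is a simple argmax, for example:

```python
probs = [0.05432093, 0.53563935, 0.18928528, 0.13303354, 0.0877209]
level = max(range(len(probs)), key=probs.__getitem__) + 1   # -> 2 (1-based low-end level)
confidence = probs[level - 1]                               # -> 0.53563935
```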
Based on the above embodiment, in an application embodiment, the video content identification method further includes: the low end level identification process based on video distribution user behavior may include the following:
if the video content boundary is fuzzy only based on the video content classification, the model and the manual work are difficult to accurately identify the low-end level, if the low-end level of the video is judged to meet the recommended distribution threshold based on the content in the early stage, after the video is recommended and distributed for multiple times, if the video has higher play rate, play completion degree and the like, the video does not belong to the low-end video, and otherwise, the video may have potential low-end risk.
By counting the play rate (play count / exposure count) and play completion rate (total play duration / duration viewed by the user) of the videos in the recommendation pool, c1 × play rate + c2 × play completion rate is defined as the distribution behavior score of the video, where c1 and c2 are weights and c1 + c2 = 1. The behavior score of the video is divided into K behavior low-end level intervals, e.g. [0, 0.2] corresponds to low-end level K and [0.8, 1.0] to low-end level 1. A video low-end recognition model is then trained on the videos in the recommendation pool and their corresponding behavior low-end levels: the input features of the model are the multi-dimensional features of the video, the classification output target is the distribution-behavior low-end level of the video, the corpus fed into the model is each video together with the level corresponding to its score, and the output of the model is the probability of each low-end classification of the video.
In this embodiment, as shown in fig. 7, the probability of the low-end video classification may be obtained as follows: step S702, video data is input; step S704, the multi-dimensional features of the video are extracted; step S708, the multi-modal video vector is obtained; step S706, the video is divided into K behavior low-end levels through a fully connected layer; step S710, the low-end classification based on distribution user behavior is obtained for the video.
When performing low-end level recognition on a video uploaded by a user, multi-dimensional features such as text, image frames and audio frames can be extracted, and the model then outputs the low-end level of the video predicted from user behavior and the corresponding level probabilities. For example, with K = 5 low-end levels, the probabilities the model outputs for low-end levels 1 to K of the current video are [0.06821878, 0.08823278, 0.09661109, 0.19159416, 0.55534319]; judging from the user-behavior-based low-end recognition model alone, the low-end level of the video is 5 with a corresponding probability of 0.55534319. In one embodiment, when performing low-end level recognition on the video uploaded by the user, as shown in fig. 8, the following steps may be performed: step S802, videos are acquired from the video recommendation pool; step S804, the videos are distributed, exposed, clicked and so on; step S806, model learning is performed; step S808, a low-end recognition model based on user behavior is obtained; step S810, the video to be identified is acquired; step S812, user-behavior-based low-end estimation is performed on the video to be identified using this low-end recognition model; step S814, the estimated videos are labeled and confirmed to obtain a low-end video library.
Based on the above embodiment, in an application embodiment, the video content identification method further includes:
the low-end level identification of the video by integrating the multi-dimensional content of the video and the user behavior comprises the following steps: in order to improve the coverage rate and the accuracy rate of low-end recognition, a video multi-dimensional content recognition model and a user behavior-based low-end recognition model can be used in combination, wherein the video low-end level probability is x1, the video multi-dimensional content recognition model low-end probability + x2 is the user behavior-based low-end recognition model low-end probability, wherein x1+ x2 is 1, and the fused video low-end level probability is taken as the final low-end level of the video. When a certain low end level is exceeded, the video may not be distributed.
The embodiment of the invention constructs a low-end level recognition model based on multi-dimensional content recognition of the video in combination with the video distribution situation, improves the low-end recognition capability for videos through user behavior prediction analysis, reduces the cost of manual recognition and labeling, and further improves the overall video quality of the video platform.
According to another aspect of the embodiment of the present invention, there is also provided a video content identification apparatus for implementing the above video content identification method. As shown in fig. 9, the apparatus includes:
a fusion unit 902, configured to fuse multi-dimensional features extracted from video content of an object video to be identified to obtain a multi-modal video feature vector corresponding to the object video;
a first obtaining unit 904, configured to obtain a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used to indicate a probability that the object video is classified into each content quality level according to the first identification tag;
a second obtaining unit 906, configured to obtain a second level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second level identification parameter set is used to indicate a probability that the object video is divided into content quality levels according to the second identification tag;
a determining unit 908, configured to determine a target content quality level of the object video according to the first level identification parameter set and the second level identification parameter set.
In the embodiment of the present invention, the object video to be identified may include, but is not limited to, movies, TV series and various long and short videos on any video platform, and the multi-dimensional features may include, but are not limited to, text features, image features or audio features of the video, which are not limited herein. The multi-modal video feature vector includes, but is not limited to: for the text features, a word vector sequence obtained by performing word segmentation and vector conversion on the text contained in the video, and the result vector obtained by encoding the word vector sequence; for the image features, the features obtained by inputting each subject key frame into an image recognition model with time-series fusion capability; and for the audio features, the features obtained by inputting each audio frame into an audio recognition model with time-series fusion capability.
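Purely as an illustration of the fusion described here (the helper name, the fallback dimension of 256, and concatenation as the fusion operation are assumptions, not part of the disclosure), the multi-modal video feature vector could be formed from the three modality vectors as follows:

import numpy as np

def fuse_multidimensional_features(text_vec, image_vec, audio_vec, dim=256):
    # each argument is the fixed-length vector produced by the corresponding encoder
    # (text encoder, image model with time-series fusion, audio model with time-series fusion);
    # a missing modality is replaced by a zero vector of the assumed dimension
    parts = [np.zeros(dim) if v is None else np.asarray(v, dtype=float)
             for v in (text_vec, image_vec, audio_vec)]
    return np.concatenate(parts)   # the multi-modal video feature vector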
In the embodiment of the present invention, the first identification tag may include, but is not limited to, preset tags for different levels, and the first level identification parameter set is used to indicate the probability that the object video is classified into each content quality level according to the first identification tag. For example, the first identification tag may cover levels 1 to 5, i.e. 5 levels in total, and the 5 levels may correspond to low-end level probabilities of [0.12, 0.52, 0.36, 0.08, 0.19]; that is, the level-1 identification tag corresponds to a low-end level probability of 0.12, the level-2 identification tag to 0.52, the level-3 identification tag to 0.36, the level-4 identification tag to 0.08, and the level-5 identification tag to 0.19.
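As a hedged sketch of how such a per-level probability set could be obtained from the multi-modal vector and a per-level weight set (the single weighted-sum layer followed by softmax normalization is an assumption based on the fully connected layer mentioned for fig. 7):

import numpy as np

def level_identification_parameters(video_vec, weight_set, bias=None):
    # weight_set holds one weight vector per content quality level; each row is
    # weighted-summed with the multi-modal video feature vector, then softmax-normalized
    logits = np.asarray(weight_set, dtype=float) @ np.asarray(video_vec, dtype=float)
    if bias is not None:
        logits = logits + np.asarray(bias, dtype=float)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()     # one probability per content quality level (e.g. 5 values)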
In the embodiment of the present invention, the second identification tag may include, but is not limited to, a level identification tag generated according to the user playing behavior coefficient by counting the play rate (play count / exposure count) and the play completion rate (duration watched by the user / total play duration) of the videos in the recommendation pool. Here, c1 × play rate + c2 × play completion rate is defined as the video distribution behavior score, where c1 and c2 are weights and c1 + c2 = 1. The behavior scores of the videos are divided into K behavior low-end level intervals; for example, [0, 0.2] corresponds to low-end level K, that is, the score range corresponding to low-end level K is 0 to 0.2, and [0.8, 1.0] corresponds to low-end level 1, that is, the score range corresponding to low-end level 1 is 0.8 to 1. The second level identification parameter set comprises the set of probabilities, from low-end level 1 to low-end level K, given by the level identification tag generated according to the user playing behavior coefficient.
In the embodiment of the present invention, the target content quality level matched with the object video can be determined by, but is not limited to, the following method: video low-end level probability = x1 × low-end probability based on the video multi-dimensional content recognition model + x2 × low-end probability based on the user-behavior low-end recognition model, where x1 + x2 = 1, and the level with the largest fused video low-end level probability is taken as the final low-end level of the video. Here, the low-end probability based on the video multi-dimensional content recognition model may include, but is not limited to, the first level identification parameter set, and the low-end probability based on the user-behavior low-end recognition model may include, but is not limited to, the second level identification parameter set.
In the embodiment of the invention, multi-dimensional features extracted from the video content of an object video to be identified are fused to obtain a multi-modal video feature vector corresponding to the object video; a first level identification parameter set is acquired according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the first identification tag; a second level identification parameter set is acquired according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag; and the target content quality level matched with the object video is determined according to the first level identification parameter set and the second level identification parameter set. In this way, the low-end identification capability for videos is improved, so that the coverage and accuracy of low-end identification are increased, the cost of manual identification and labeling is reduced, the overall video quality of the platform is improved, the user's impression of the platform videos is improved, and the technical problem of low accuracy of video content identification is solved.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the video content identification method, where the electronic device may be a terminal device or a server shown in fig. 1. The present embodiment is described by taking the electronic device as an example. As shown in fig. 10, the electronic device comprises a memory 1002 and a processor 1004, the memory 1002 having stored therein a computer program, the processor 1004 being arranged to execute the steps of any of the method embodiments described above by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, fusing multi-dimensional features extracted from the video content of the object video to be recognized to obtain multi-modal video feature vectors corresponding to the object video;
S2, acquiring a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, wherein the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used for indicating the probability that the object video is divided into each content quality level according to the first identification tag;
S3, acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, wherein the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second-level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag;
and S4, determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set.
Alternatively, as can be understood by those skilled in the art, the structure shown in fig. 10 is only illustrative, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 10 does not limit the structure of the above electronic device. For example, the electronic device may also include more or fewer components (e.g., a network interface) than shown in fig. 10, or have a different configuration from that shown in fig. 10.
The memory 1002 may be used to store software programs and modules, such as program instructions/modules corresponding to the video content identification method and apparatus in the embodiments of the present invention, and the processor 1004 executes various functional applications and data processing by running the software programs and modules stored in the memory 1002, that is, implementing the video content identification method described above. The memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1002 may further include memory located remotely from the processor 1004, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1002 may be specifically, but not limited to, used for storing information such as a multi-modal video feature vector corresponding to the object video. As an example, as shown in fig. 10, the memory 1002 may include, but is not limited to, the fusion unit 902, the first obtaining unit 904, the second obtaining unit 906, and the determination unit 908 in the video content recognition apparatus. In addition, the video content recognition apparatus may further include, but is not limited to, other module units in the video content recognition apparatus, which is not described in detail in this example.
Optionally, the above-mentioned transmission device 1006 is used for receiving or sending data via a network. Specific examples of the network may include a wired network and a wireless network. In one example, the transmission device 1006 includes a Network Interface Controller (NIC) that can be connected to other network devices and a router via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 1006 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: a display 1008 for displaying the multi-modal video feature vector information; and a connection bus 1010 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through a network communication. Nodes can form a Peer-To-Peer (P2P, Peer To Peer) network, and any type of computing device, such as a server, a terminal, and other electronic devices, can become a node in the blockchain system by joining the Peer-To-Peer network.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the video content identification method described above. The computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, fusing multi-dimensional features extracted from the video content of the object video to be recognized to obtain multi-modal video feature vectors corresponding to the object video;
S2, acquiring a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, wherein the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used for indicating the probability that the object video is divided into each content quality level according to the first identification tag;
S3, acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, wherein the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second-level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag;
and S4, determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (13)

1. A method for identifying video content, comprising:
fusing multi-dimensional features extracted from video content of an object video to be identified to obtain a multi-modal video feature vector corresponding to the object video;
acquiring a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, wherein the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the first identification tag;
acquiring a second-level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, wherein the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second-level identification parameter set is used for indicating the probability of dividing the object video into each content quality level according to the second identification tag;
and determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set.
2. The method of claim 1,
the obtaining a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag includes: obtaining a weighted sum result between the multi-modal video feature vector and each weight value in the first weight set to obtain the first level identification parameter set;
the obtaining a second level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag includes: and obtaining a weighted summation result between the multi-modal video feature vector and each weight value in the second weight set to obtain the second level identification parameter set.
3. The method according to claim 2, wherein before the fusing the multi-dimensional features extracted from the video content of the object video to be identified to obtain the multi-modal video feature vector corresponding to the object video, the method further comprises:
acquiring a first sample video set;
configuring the first identification tag for each first sample video in the first sample video set according to a hierarchy definition;
inputting the first sample video set and the corresponding first identification label into an initialized content grade identification model for training to obtain a training output result, wherein in each training process of the content grade identification model, a first sample content quality grade corresponding to the first sample video is determined according to multi-dimensional features extracted from video content of the first sample video;
and under the condition that the training output result indicates that a first convergence condition is reached, determining to obtain a target content level identification model for acquiring the first level identification parameter set, wherein the first convergence condition is used for indicating that the difference degree between the determined first sample content quality level and the content quality level indicated by the first identification label is less than or equal to a first threshold value.
4. The method according to claim 2, wherein before the fusing the multi-dimensional features extracted from the video content of the object video to be identified to obtain the multi-modal video feature vector corresponding to the object video, the method further comprises:
acquiring a second sample video set;
configuring the second identification tag for each second sample video in the second sample video set according to a user playing behavior coefficient;
inputting the second sample video set and the corresponding second identification label into an initialized behavior grade identification model for training to obtain a training output result, wherein in each training process of the behavior grade identification model, the quality grade of the content of a second sample corresponding to the second sample video is determined according to the multidimensional characteristics extracted from the video content of the second sample video and the user playing behavior coefficient corresponding to the second sample video;
and under the condition that the training output result indicates that a second convergence condition is reached, determining to obtain a target behavior level identification model for obtaining the second level identification parameter set, wherein the second convergence condition is used for indicating that the difference degree between the determined second sample content quality level and the content quality level indicated by the second identification label is less than or equal to a second threshold value.
5. The method of claim 4, wherein configuring the second identification tag for each second sample video in the second sample video set according to a user play behavior coefficient comprises:
sequentially taking each second sample video in the second sample video set as a current sample video, and executing the following operations:
counting the playing rate and the playing completion rate of the current sample video, wherein the playing rate is used for indicating the ratio of the playing times and the exposure times of the current sample video actually played at a playing client, and the playing completion rate is used for indicating the ratio of the time length of the current sample video actually played at the playing client to the total playing time length of the current sample video;
determining a current user playing behavior coefficient matched with the current sample video according to the playing rate and the playing completion rate;
determining the current content quality grade corresponding to the current user playing behavior coefficient according to the grade interval divided for the user playing behavior coefficient;
configuring the second identification tag corresponding to the current content quality level for the current sample video.
6. The method of any of claims 1-5, wherein determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set comprises:
traversing each content quality grade, taking each content quality grade as the current content quality grade in sequence, and executing the following operations in sequence: acquiring a first-level identification parameter corresponding to the current content quality level from the first-level identification parameter set, and acquiring a second-level identification parameter corresponding to the current content quality level from the second-level identification parameter set; carrying out weighted summation on the first grade identification parameter and the second grade identification parameter to obtain a current grade identification parameter corresponding to the current content quality grade;
and under the condition that the grade identification parameters corresponding to the content quality grades are obtained, determining the maximum grade identification parameter value, and determining the content quality grade corresponding to the maximum grade identification parameter value as the target content quality grade.
7. The method according to any one of claims 1 to 5, wherein the fusing the multi-dimensional features extracted from the video content of the object video to be identified to obtain the multi-modal video feature vector corresponding to the object video comprises:
extracting at least one of the following features from the video content of the subject video: text features, image features, audio features;
under the condition that each text message contained in the object video is identified, splicing each text message to obtain an object text to be processed corresponding to the object video; performing word segmentation and vector conversion processing on the object text to obtain a word vector sequence; coding the word vector sequence to obtain the text characteristics;
under the condition that each topic key frame contained in the object video is identified, inputting each topic key frame into an image identification model with time sequence fusion capability to obtain the image characteristics;
and under the condition that each audio frame contained in the object video is identified, inputting each audio frame into an audio identification model with time series fusion capability to obtain the audio characteristics.
8. A video content recognition apparatus, comprising:
the fusion unit is used for fusing multi-dimensional features extracted from video content of an object video to be recognized to obtain a multi-modal video feature vector corresponding to the object video;
a first obtaining unit, configured to obtain a first level identification parameter set according to the multi-modal video feature vector and a first weight set determined based on a first identification tag, where the first identification tag is a level identification tag generated according to a level definition, and the first level identification parameter set is used to indicate a probability that the object video is classified into each content quality level according to the first identification tag;
a second obtaining unit, configured to obtain a second level identification parameter set according to the multi-modal video feature vector and a second weight set determined based on a second identification tag, where the second identification tag is a level identification tag generated according to a user playing behavior coefficient, and the second level identification parameter set is used to indicate a probability that the object video is classified into each content quality level according to the second identification tag;
and the determining unit is used for determining the target content quality level matched with the object video according to the first level identification parameter set and the second level identification parameter set.
9. The apparatus of claim 8,
the first acquisition unit includes: the first obtaining module is used for obtaining a weighted summation result between the multi-modal video feature vector and each weight value in the first weight set to obtain the first level identification parameter set;
the second acquisition unit includes: and the second obtaining module is used for obtaining a weighted summation result between the multi-modal video feature vector and each weight value in the second weight set to obtain the second level identification parameter set.
10. The apparatus of claim 9, further comprising:
a third obtaining unit, configured to obtain a first sample video set;
a first configuration unit, configured to configure the first identification tag for each first sample video in the first sample video set according to a hierarchy definition;
a first training unit, configured to input the first sample video set and the corresponding first identification tag into an initialized content level identification model for training, so as to obtain a training output result, where in each training process of the content level identification model, a first sample content quality level corresponding to the first sample video is determined according to a multi-dimensional feature extracted from video content of the first sample video;
a first determining unit, configured to determine that a target content quality level identification model for acquiring the first level identification parameter set is obtained when the training output result indicates that a first convergence condition is reached, where the first convergence condition is used to indicate that a degree of difference between the determined first sample content quality level and the content quality level indicated by the first identification tag is less than or equal to a first threshold.
11. The apparatus according to claim 9, wherein before the fusing the multi-dimensional features extracted from the video content of the object video to be recognized to obtain the multi-modal video feature vector corresponding to the object video, the apparatus further comprises:
a fourth obtaining unit, configured to obtain a second sample video set;
a second configuration unit, configured to configure the second identification tag for each second sample video in the second sample video set according to a user playing behavior coefficient;
a second training unit, configured to input the second sample video set and the corresponding second identification tag into an initialized behavior level identification model for training, so as to obtain a training output result, where in each training process of the behavior level identification model, a second sample content quality level corresponding to the second sample video is determined according to a multidimensional feature extracted from video content of the second sample video and a user playing behavior coefficient corresponding to the second sample video;
and a second determining unit, configured to determine, when the training output result indicates that a second convergence condition is reached, to obtain a target behavior level identification model for obtaining the second level identification parameter set, where the second convergence condition is used to indicate that a degree of difference between the determined second sample content quality level and the content quality level indicated by the second identification label is less than or equal to a second threshold.
12. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 7.
13. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202110176761.2A 2021-02-09 2021-02-09 Video content identification method and device, storage medium and electronic equipment Pending CN113569610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176761.2A CN113569610A (en) 2021-02-09 2021-02-09 Video content identification method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176761.2A CN113569610A (en) 2021-02-09 2021-02-09 Video content identification method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113569610A true CN113569610A (en) 2021-10-29

Family

ID=78161162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176761.2A Pending CN113569610A (en) 2021-02-09 2021-02-09 Video content identification method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113569610A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399718A (en) * 2022-03-21 2022-04-26 北京网之晴科技有限公司 Image content identification method and device in video playing process
CN114399718B (en) * 2022-03-21 2022-08-16 北京网之晴科技有限公司 Image content identification method and device in video playing process
CN115100725A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Object recognition method, object recognition apparatus, and computer storage medium
CN115100725B (en) * 2022-08-23 2022-11-22 浙江大华技术股份有限公司 Object recognition method, object recognition apparatus, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination