CN112188306B - Label generation method, device, equipment and storage medium - Google Patents

Label generation method, device, equipment and storage medium

Info

Publication number
CN112188306B
Authority
CN
China
Prior art keywords
video
label
spectrogram
frame image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011014223.5A
Other languages
Chinese (zh)
Other versions
CN112188306A (en)
Inventor
杨田雨
姜文浩
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011014223.5A priority Critical patent/CN112188306B/en
Publication of CN112188306A publication Critical patent/CN112188306A/en
Application granted granted Critical
Publication of CN112188306B publication Critical patent/CN112188306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/835: Generation of protective data, e.g. certificates
    • H04N21/8352: Generation of protective data, e.g. certificates involving content or source identification data, e.g. Unique Material Identifier [UMID]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45: Clustering; Classification

Abstract

The embodiment of the application provides a label generation method, device, equipment and storage medium. The label generation method includes the following steps: a label generation device acquires a target video and extracts a video frame image and an audio frame from the target video, where the video frame image and the audio frame have a corresponding relation; converts the audio frame into a spectrogram and calls a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video; and, if the video type of the target video is the target video type, calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video. In this way, the label generation device generates the label for a video from the content of the video itself, which improves the accuracy of the generated video label.

Description

Label generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for generating a tag.
Background
With the development of the internet and the rise of self-media, videos on the internet are growing rapidly and come in a wide variety of types. Against this background, video tags play an important role in video recommendation, distribution and search, and the technology of how to generate video tags has become one of the current research hotspots.
At present, video tag generation technology mainly analyzes existing text information (such as titles and video descriptions) to generate tags, but it is difficult to generate reliable tags when a video carries little text information (for example, no title or description) or when the text information is inaccurate. Therefore, how to generate video tags efficiently and reliably is an urgent problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a label generation method, a label generation device, label generation equipment and a storage medium, so that the accuracy of a generated video label is improved.
An aspect of the present embodiment provides a tag generation method, including:
acquiring a target video, and extracting a video frame image and an audio frame from the target video, wherein the video frame image and the audio frame have a corresponding relation;
converting the audio frame into a spectrogram, and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video;
if the video type of the target video is the target video type, calling a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video.
An aspect of an embodiment of the present application provides a tag generation apparatus, including:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a target video and extracting a video frame image and an audio frame from the target video, and the video frame image and the audio frame have a corresponding relation;
the processing unit is used for converting the audio frame into a spectrogram and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video;
the processing unit is further configured to, if the video type of the target video is the target video type, call a video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain a video content tag of the target video.
An aspect of an embodiment of the present application provides a tag generation device, where the tag generation device includes:
a memory for storing a computer program;
a processor for running the computer program to implement the above label generation method.
An aspect of the embodiments of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program includes program instructions, which, when executed by a processor, cause the processor to execute the above-mentioned tag generation method.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions stored in a computer-readable storage medium; when the computer instructions are executed by a processor of a computer device, the processor performs the methods in the embodiments described above.
According to the embodiment of the application, the label generation equipment acquires a target video and extracts a video frame image and an audio frame from the target video, where the video frame image and the audio frame have a corresponding relation; converts the audio frame into a spectrogram and calls a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video; and, if the video type of the target video is the target video type, calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video. In this way, the label generation equipment generates the label for a video from the content of the video itself, so the accuracy of the generated video label is improved; moreover, the video label can be generated without additional text data accompanying the video, which reduces the data requirements on the video and broadens the range of application.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic system architecture diagram of a tag generation method according to an embodiment of the present application;
fig. 2 is a general framework diagram of video processing by a tag generation device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a processing flow of an arbiter provided in an embodiment of the present application;
fig. 4 is a schematic flowchart of a tag generation method provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of another tag generation method provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a label generation apparatus provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a label generation device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The tag generation method provided by the embodiment of the application relates to the following technologies:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
The application relates to computer vision technology and machine learning, both of which belong to artificial intelligence technology. Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as identification, tracking and measurement on a target, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric technologies such as face recognition and fingerprint recognition. Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and formula learning.
With the embodiment of the application, the target video is first processed using computer vision technology: the target video is acquired, a video frame image and an audio frame are extracted from it, and the audio frame is converted into a spectrogram. The spectrogram and the video frame image are then processed in combination with machine learning: a video type discrimination model is called to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video, and, if the video type of the target video is the target video type, a video classification model is called to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video. Through the embodiment of the application, the label generation device generates the label for a video from the content of the video itself, so the accuracy of the generated video label is improved; the video label can be generated without additional text data accompanying the video, which reduces the data requirements on the video and broadens the range of application.
Referring to fig. 1, fig. 1 is a schematic system architecture diagram of a tag generation method according to an embodiment of the present application. The system architecture may include a plurality of clients 101, a tag generation device 102 and a server 103. The client 101 and the server 103 can be communicatively connected to each other, and the tag generation device 102 and the server 103 can also be communicatively connected to each other.
The client 101 mainly displays an input interface to a user in a visual interface mode, and the interface of the client 101 can display input text information of the user; the tag generation device 102 (which may also be a server) is mainly used for classifying videos and generating tags for the videos in the present application; the server 103 is mainly used for storing the model trained by the tag generation device 102, storing the video uploaded by the client 101, and the like.
In a possible implementation manner, after detecting input text information of a user, the client 101 sends the input text information to the server 103, the server 103 matches the received input text information with a plurality of video content tags stored in the server 103, where the video content tags are determined by the tag generation device 102 by invoking a trained model, if the server 103 finds a video content tag matching the input text information among the plurality of video content tags, a video corresponding to the video content tag is sent to the client 101, and the client 101 presents the video to the user in a visual form.
In this embodiment of the present application, the tag generation device 102 is mainly used for discriminating videos and generating video tags. The overall flow of the tag generation device 102, shown in fig. 2, includes discriminating the video (a video type discrimination model) and classifying it (a video classification model): the video type discrimination model outputs a binary result, i.e., whether or not the video belongs to the target type, while the video classification model outputs a multi-class result, i.e., one of a plurality of video content tags. Specifically, the images and audio of the video are first input into the discriminator (i.e., the video type discrimination model of this application) to judge the type; this judgment is mainly a vertical-domain judgment, that is, judging whether the video belongs to a certain type, such as whether it is a game video or whether it is a movie. If the discriminator outputs yes, the images and audio of the video are input into the classifier (i.e., the video classification model of this application), and the classifier identifies the video content tag of the video, as summarized in the sketch below.
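As an illustrative sketch only, assuming a Python implementation, the two-stage flow of fig. 2 can be summarized as follows; the function and parameter names are hypothetical and are not defined by this application.

```python
def generate_video_tag(frame_images, spectrograms, discriminator, classifier):
    """Two-stage flow of fig. 2. `discriminator` and `classifier` are callables
    standing in for the video type discrimination model and the video
    classification model, whose internals are sketched later in this text."""
    if not discriminator(frame_images, spectrograms):  # stage 1: binary vertical-domain check
        return None                                    # not the target video type, so no tag is generated
    return classifier(frame_images, spectrograms)      # stage 2: multi-class video content tag
```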
There is also a specific processing flow for the classifier and the discriminator. Taking the classifier as an example, the flow may be as shown in fig. 3: the input consists of K image segments and K audio segments, where an image convolutional neural network performs convolution processing on the image segments and an audio convolutional neural network performs convolution processing on the audio segments. Both the image segments and the audio segments require pre-processing before being input into the convolutional neural networks. The features output by the image convolutional neural network and by the audio convolutional neural network are then fed into a fully connected network layer, where prediction fusion is performed to finally obtain the video content tag of the video. If the classification is for games, a specific game name is output; if the classification is for dramas, a specific drama name is output. The processing flow of the discriminator is analogous, differing mainly in the type of output, and is not described in detail here.
As used herein, a "client" includes, but is not limited to, a user device, a handheld device with wireless communication capabilities, an in-vehicle device, a wearable device, a computing device, and the like. Illustratively, the user terminal may be a mobile phone, a tablet computer or a computer with wireless transceiving functions. The client may also be a Virtual Reality (VR) terminal, an Augmented Reality (AR) terminal, a wireless terminal in industrial control, a wireless terminal in unmanned driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in a smart city, a wireless terminal in a smart home, and so on. In the embodiment of the present application, the device for implementing the function of the client may be a terminal; it may also be an apparatus, such as a system-on-chip, capable of supporting the terminal device in implementing this function, and such an apparatus may be installed in the terminal device. The technical solution provided in the embodiment of the present application is described by taking a terminal device as the device that implements the function of the client as an example.
Referring to fig. 4, fig. 4 is a schematic flowchart of a tag generation method provided in an embodiment of the present application, and the embodiment of the present application mainly describes how to distinguish a video category and how to generate a video content tag, which may include the following steps:
s401, the label generating device obtains a target video, and extracts a video frame image and an audio frame from the target video, wherein the video frame image and the audio frame have a corresponding relation.
The target video may refer to a video acquired from a server (or a set of such videos). After the tag generation device acquires the target video, the target video needs to be sampled so that video lengths are made uniform. In this application, a segmentation-based sampling method may be adopted, specifically: the target video is divided into K equal segments, and M video frame images and M audio frames are randomly extracted from each segment. When M is equal to 1, a single video frame image and a single audio frame are extracted from each segment; when M is greater than 1, a video frame image sequence composed of a plurality of video frame images and an audio frame sequence composed of a plurality of audio frames are obtained. The video frame images and the audio frames always have a one-to-one correspondence, for example, the i-th video frame image corresponds to the i-th audio frame. The resulting sequence can be represented as {I_{k,m}, H_{k,m}}, where k ∈ {1, 2, …, K}, m ∈ {1, 2, …, M}, I_{k,m} represents a video frame image and H_{k,m} represents an audio frame. The values of K and M are set according to the GPU memory that can be tolerated. In this application, K may be 10 and M may be 3; for a video whose number of frames is smaller than K × M, the video may be resampled and then sampled again.
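A minimal sketch of this segmentation sampling, assuming a Python implementation and that the paired frame lists contain at least K × M elements (otherwise the video would first be resampled as described above):

```python
import random

def sample_segments(video_frames, audio_frames, K=10, M=3):
    """Split the paired frame lists into K equal segments and draw M aligned
    (image, audio) pairs at random from each segment, preserving the
    one-to-one correspondence between I_{k,m} and H_{k,m}."""
    assert len(video_frames) == len(audio_frames)
    n = len(video_frames)
    seg_len = n // K
    images, audios = [], []
    for k in range(K):
        start = k * seg_len
        end = (k + 1) * seg_len if k < K - 1 else n
        idx = sorted(random.sample(range(start, end), M))  # same indices keep the pairing intact
        images.extend(video_frames[i] for i in idx)
        audios.extend(audio_frames[i] for i in idx)
    return images, audios  # K * M paired samples
```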
S402, the label generating device converts the audio frame into a spectrogram, and calls a video type distinguishing model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video.
The input to the model used for video type discrimination of the target video in the embodiment of this application is a picture, so after an audio frame is obtained, the audio frame needs to be converted into a spectrogram. After extracting the audio frame, the tag generation device applies windowing to the audio frame signal and performs a Fourier transform on the windowed audio frame to obtain a frequency-domain audio signal. The frequency-domain audio signal is resampled, and a Mel filter bank is then called to filter the sampled audio signal to obtain a Mel spectrogram. The sampling frequency may be set to 16 kHz and the number of Mel filter bands to 64. After the spectrogram is obtained, its size is scaled and fixed to 64 × 128, and every audio frame is processed in the same way, mainly to ensure that all input spectrograms have the same size and thus avoid large errors. If no audio is present in the target video, the audio-frame portion of the processing is zero-filled.
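The following sketch illustrates converting an audio frame into a fixed-size log-Mel spectrogram under the parameters given above (16 kHz, 64 Mel bands, 64 × 128 output); the use of librosa and OpenCV is an assumption for illustration, not part of the application.

```python
import numpy as np
import librosa
import cv2

def audio_frame_to_spectrogram(waveform, sr=16000, n_mels=64):
    """Windowed STFT -> Mel filter bank -> log-Mel spectrogram, resized to 64 x 128."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 0.01)  # the 0.01 offset avoids taking the logarithm of zero
    # cv2.resize takes (width, height); every spectrogram is fixed to the same 64 x 128 size
    return cv2.resize(log_mel.astype(np.float32), (128, 64), interpolation=cv2.INTER_LINEAR)
```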
In a possible implementation manner, the video type discrimination model includes a first image convolutional neural network, a first audio convolutional neural network and a first fully connected network; the terms "first" and "second" merely distinguish these networks of the video type discrimination model from the second image convolutional neural network, second audio convolutional neural network and second fully connected network of the video classification model. After obtaining the spectrogram, the tag generation device calls the video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video. This is mainly implemented as follows:
the label generation device calls a first image convolution neural network to perform convolution processing on the video frame image to obtain first video frame image characteristics of the video frame image, wherein the first image convolution neural network can be Inceptionv3, Inceptionv4, ResNet50, ResNet101 and the like, and different dimensions can be set for different convolution neural networks. If the neural network employed is Inceptiov 3, the dimension may be set to 2048. If a frame of video frame image is input, a video frame image characteristic is output. If the input is a video frame image sequence, the corresponding output is also a video frame image feature sequence (the first image convolution neural network determines the feature of each video frame image in the video frame image feature sequence in the same way). If input is processed Ik,mThe corresponding output is also a sequence, which can be used
Figure BDA0002696939260000071
Where N represents the dimension of the video frame image feature.
Meanwhile, the tag generation device calls the first audio convolutional neural network to perform convolution processing on the spectrogram to obtain a first spectrogram feature, and splices the first video frame image feature and the first spectrogram feature into a first video feature of the target video. The first audio convolutional neural network may be VGGish; if VGGish is adopted, its feature dimension may be set to 128, so that every obtained spectrogram feature has dimension 128. VGGish specifically computes log(Mel spectrogram + 0.01) to obtain a stable log-Mel spectrum, where the offset of 0.01 is added to avoid taking the logarithm of 0; the features are then framed into non-overlapping examples of 0.96 s, each containing 64 Mel bands and consisting of 10 ms frames (i.e., 96 frames per example). For a sequence input, i.e., a sequence of K × M audio frames, the first audio convolutional neural network is called to convolve the spectrogram sequence to obtain a spectrogram feature sequence, which can be written as {A_{k,m} ∈ R^P}, where P denotes the dimension of the spectrogram feature.
Further, the tag generation device combines the first video frame image feature and the first spectrogram feature into a first video feature of the target video, specifically: the tag generation device fuses the first video frame image feature and the first spectrogram feature by concatenating the feature vectors in series, obtaining the first video feature of the target video. If there is a single first video frame image feature and a single first spectrogram feature, a single concatenated first video feature is obtained. If there are a first video frame image feature sequence {V_{k,m} ∈ R^N} and a first spectrogram feature sequence {A_{k,m} ∈ R^P}, the fused first video feature is correspondingly also a sequence, which can be written as {F_{k,m} = [V_{k,m}, A_{k,m}] ∈ R^{N+P}}, where N + P is the dimension of the video feature.
Still further, the tag generation device calls the first fully connected network to determine the type matching probability between the first video feature and the target video type. For example, if the target video type is games, the probability that the video matches the game type is obtained; if the target video type is movies, the probability that the video matches the movie type is obtained. If the obtained type matching probability is greater than a first probability threshold, the video type of the target video is determined to be the target video type. Let S_b denote the type matching probability for the target video type and T_b denote the first probability threshold; if S_b > T_b, the video type of the target video is determined to be the target video type. Since the video type discrimination model outputs a binary (two-class) result, T_b may generally be set to 0.5. A sketch of this discrimination step follows.
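A simplified sketch of the discriminator described above, assuming a PyTorch implementation; the backbone networks are passed in (e.g. an Inception v3 image feature extractor with N = 2048 and a VGGish-style audio network with P = 128), and this is not the exact implementation of the application.

```python
import torch
import torch.nn as nn

class TypeDiscriminator(nn.Module):
    """Two-branch video type discrimination model: image features V (N-dim),
    spectrogram features A (P-dim), series concatenation, and a fully
    connected head that outputs the type matching probability S_b."""
    def __init__(self, image_backbone, audio_backbone, n_dim=2048, p_dim=128):
        super().__init__()
        self.image_backbone = image_backbone   # first image convolutional neural network
        self.audio_backbone = audio_backbone   # first audio convolutional neural network
        self.fc = nn.Linear(n_dim + p_dim, 1)  # first fully connected network (binary output)

    def forward(self, frame_images, spectrograms):
        v = self.image_backbone(frame_images)            # [batch, N]
        a = self.audio_backbone(spectrograms)            # [batch, P]
        fused = torch.cat([v, a], dim=1)                 # [batch, N + P], series fusion
        return torch.sigmoid(self.fc(fused)).squeeze(1)  # S_b in (0, 1)

# Decision rule: the video is of the target type when S_b > T_b, e.g. T_b = 0.5.
```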
In a possible implementation manner, after the tag generation device extracts the video frame images of the target video, it performs appropriate data augmentation on the video frame images in order to prevent over-fitting. The augmentation mainly includes scaling the video frame image, random color processing, denoising, Gaussian blur and the like, and finally fixing the size of the video frame image. In this embodiment, the size of the video frame image is fixed to 299. Each video frame image must be processed in the same way to ensure consistent inputs to the convolutional neural network and avoid introducing larger errors.
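A sketch of such an augmentation pipeline, assuming torchvision; the exact parameters (jitter strength, blur kernel, intermediate resize) are illustrative assumptions, and the denoising step mentioned above is omitted here.

```python
from torchvision import transforms

# Resize, random colour jitter and Gaussian blur, then a fixed 299-pixel crop
# so that every frame image fed to the convolutional network has the same size.
frame_augmentation = transforms.Compose([
    transforms.Resize(330),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomCrop(299),
    transforms.ToTensor(),
])
```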
And S403, if the video type of the target video is the target video type, the label generation equipment calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video.
The video classification model comprises a second image convolution neural network, a second audio convolution neural network and a second full-connection network.
In a possible implementation manner, the tag generation device calls a video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain a video content tag of the target video, and the specific implementation manner is as follows:
the label generating device calls a second audio convolution neural network to perform convolution processing on the spectrogram to obtain a second spectrogram feature, then calls a second image convolution neural network to perform convolution processing on the video frame image to obtain a second video frame image feature, and splices the second spectrogram feature and the second video frame image feature into a second video feature, wherein the second image convolution neural network can also be Inceptionv3, Inceptionv4, ResNet50, ResNet101 and the like. Further, the label generating device calls a second full-connection network to determine a label matching probability set of the second video feature and the plurality of video content labels, and determines the video content labels of the target video according to the label matching probability set.
In one possible implementation, the input to the video classification model is a spectrogram sequence and a video frame image sequence, and the result of the concatenation is a video feature sequence, which can be denoted F_{k,m} = [V_{k,m}, A_{k,m}]. In this case, the tag generation device calls the second fully connected network of the video classification model to perform prediction processing on F_{k,m}, obtaining a prediction sequence that can be written as {L_{k,m} ∈ R^C}, where C denotes the number of video content tag categories. Since there are K × M inputs, there are correspondingly K × M outputs, and each output gives a set of tag matching probabilities over the plurality of video content tag categories. An averaging method may be adopted here to obtain the target tag matching probability set, and the target tag corresponding to the maximum target tag matching probability is used as the video content tag of the target video. In the embodiment of the present application, this can be expressed as L_final = Σ_{k,m} L_{k,m} / (K · M), i.e., the K × M predictions are averaged. Further, it is also necessary to judge whether the maximum target tag matching probability is greater than a second probability threshold; if so, the video content tag corresponding to it is used as the video content tag of the target video. This threshold judgment ensures the stability and reliability of the video content tag prediction.
For example, suppose the target video corresponds to 3 video features. The matching probability between video feature a and video content tag A is 0.1, between video feature a and video content tag B is 0.2, and between video feature a and video content tag C is 0.7; the matching probability between video feature b and video content tag A is 0.2, between video feature b and video content tag B is 0.3, and between video feature b and video content tag C is 0.5; the matching probability between video feature c and video content tag A is 0.1, between video feature c and video content tag B is 0.1, and between video feature c and video content tag C is 0.8. By the averaging method, the average tag matching probability between the target video and video content tag A is (0.1 + 0.2 + 0.1)/3 = 0.133; between the target video and video content tag B it is (0.2 + 0.3 + 0.1)/3 = 0.2; and between the target video and video content tag C it is (0.7 + 0.5 + 0.8)/3 = 0.667. The target tag matching probability set is therefore {0.133, 0.2, 0.667}; since the maximum target tag matching probability in this set is 0.667, video content tag C corresponding to 0.667 is determined to be the video content tag of the target video.
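The averaging rule of this example can be reproduced with the short sketch below (NumPy is an assumed choice; the second probability threshold of 0.5 is also an assumption for illustration).

```python
import numpy as np

# Tag matching probabilities of the three video features for tags [A, B, C], as in the example.
per_feature_probs = np.array([
    [0.1, 0.2, 0.7],   # video feature a
    [0.2, 0.3, 0.5],   # video feature b
    [0.1, 0.1, 0.8],   # video feature c
])

avg = per_feature_probs.mean(axis=0)  # target tag matching probability set: [0.133, 0.2, 0.667]
best = int(avg.argmax())              # index of the maximum target tag matching probability
second_threshold = 0.5                # assumed value of the second probability threshold
final_tag = ["A", "B", "C"][best] if avg[best] > second_threshold else None
print(avg.round(3), final_tag)        # [0.133 0.2   0.667] C
```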
In addition to determining the video content tag of the target video by averaging, the video content tag of the target video can also be determined in the following way:
in a possible implementation manner, assuming that there are N tag matching probability sets, the N tag matching probability sets are averaged to obtain a target tag matching probability set, where the set includes a plurality of target tag matching probability sets, and a maximum target tag matching probability is extracted from the target tag matching probability sets. And meanwhile, judging a threshold value of the maximum target label matching probability, and if the maximum target label matching probability is greater than a second threshold value, determining the video content label corresponding to the maximum target label matching probability as the video content label of the target video. The matching probability of one target label can represent a non-target type, so that the category of the video can be further judged to ensure the reliability of the generated label. And (3) the matching probability of the target labels in the target label matching probability set is assumed, and then the video content label with the maximum matching probability of the target labels in the 3 sets is selected as the video content label of the target video.
For example, suppose the target video corresponds to 3 video features. The matching probability between video feature a and video content tag A is 0.1, between video feature a and video content tag B is 0.6, and between video feature a and video content tag C is 0.3; the matching probability between video feature b and video content tag A is 0.2, between video feature b and video content tag B is 0.3, and between video feature b and video content tag C is 0.5; the matching probability between video feature c and video content tag A is 0.1, between video feature c and video content tag B is 0.1, and between video feature c and video content tag C is 0.8. The tag with the maximum matching probability for video feature a is video content tag B, for video feature b it is video content tag C, and for video feature c it is also video content tag C. Video content tag B therefore receives 1 vote and video content tag C receives 2 votes, so video content tag C is taken as the video content tag of the target video.
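The voting rule of this example can likewise be sketched as follows (again only an illustration, not the application's implementation):

```python
from collections import Counter
import numpy as np

# Tag matching probabilities of the three video features for tags [A, B, C] in the voting example.
per_feature_probs = np.array([
    [0.1, 0.6, 0.3],   # video feature a -> maximum is tag B
    [0.2, 0.3, 0.5],   # video feature b -> maximum is tag C
    [0.1, 0.1, 0.8],   # video feature c -> maximum is tag C
])
votes = Counter("ABC"[i] for i in per_feature_probs.argmax(axis=1))
print(votes.most_common(1)[0][0])  # 'C' wins two votes to one
```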
In one possible implementation, the input video classification model is a spectrogram and a video frame image. In this case, the label generation device invokes the second full-connection network of the video classification model to perform prediction processing on the video features obtained by the series connection, so as to obtain label matching probabilities between the target video and the plurality of video content labels in the video classification model, and the video content label corresponding to the maximum label matching probability is used as the video content label of the target video.
For example, suppose the matching probability between the video feature and video content tag A is 0.1, between the video feature and video content tag B is 0.2, and between the video feature and video content tag C is 0.7. Since 0.7 > 0.2 > 0.1, video content tag C corresponding to 0.7 is the video content tag of the target video.
In the embodiment of the application, the label generation equipment acquires a target video and extracts a video frame image and an audio frame from the target video, where the video frame image and the audio frame have a corresponding relation; converts the audio frame into a spectrogram and calls a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video; and, if the video type of the target video is the target video type, calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video. In this way, the label generation equipment generates the label for a video from the content of the video itself, so the accuracy of the generated video label is improved; the video label can be generated without additional text data accompanying the video, which reduces the data requirements on the video and broadens the range of application.
Referring to fig. 5, a schematic flow chart of another label generation method provided in the embodiment of the present application is shown, where the embodiment of the present application mainly describes training of a model and a specific application scenario, and the method may include the following steps:
s501, the label generation equipment obtains a sample video, and the sample video carries a video description text.
The sample video is obtained to train the initialization model, and the sample video may include a plurality of videos. Further, the sample video may be any type of video in games, movies, sports, etc., and the type of video is not limited herein, but all sample videos participating in model training belong to the same type.
S502, the label generating device obtains a plurality of video content labels, and searches for the video content label to be determined which is matched with the video description text in the plurality of video content labels.
In a possible implementation manner, the tag generation device acquires a plurality of video content tags, such as specific game names corresponding to games, specific movie names corresponding to movies, and the like, from the internet or cloud storage by using technical means such as crawling; further, the video content labels to be determined, which are matched with the video description texts of the sample videos, are searched in the plurality of video content labels through a method of word segmentation and synonym mapping.
For example, suppose the plurality of video content tags are tag a, tag b, tag c and tag d. If the tag generation device finds, through word segmentation and synonym mapping, that the video description text of the sample video contains a word that is the same as or synonymous with tag a, it determines tag a to be the to-be-determined video content tag of the sample video.
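A minimal sketch of this weak-labelling step, assuming Python with the jieba word segmentation library; the library choice and the synonym table are illustrative assumptions.

```python
import jieba  # Chinese word segmentation; an assumed tool choice, the application names none

SYNONYM_MAP = {"足球": "soccer"}  # hypothetical synonym -> canonical tag mapping

def match_tag(description, candidate_tags):
    """Segment the video description, normalize synonyms and look for a known content tag."""
    words = {SYNONYM_MAP.get(w, w) for w in jieba.lcut(description)}
    for tag in candidate_tags:
        if tag in words:
            return tag  # used as the to-be-determined (sample) content tag
    return None         # no weak label found for this sample video
```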
S503, the label generating device takes the video content label to be determined as a sample content label of the sample video.
S504, the label generating device extracts a sample video frame image and a sample audio frame from the sample video, and the sample video frame image and the sample audio frame have a corresponding relation.
The processing flow of this step is consistent with step S401, and is not described here again.
And S505, the label generation equipment converts the sample audio frame into a sample spectrogram, calls an initialization model to process the sample spectrogram and the sample video frame image, and determines a predicted content label of the sample video.
In a possible implementation manner, the tag generation device calls the initialization model to process the sample spectrogram and the sample video frame image, and similarly, the sample spectrogram and the sample video frame image need to be processed to obtain sample spectrogram features and sample video frame image features, and then the sample spectrogram features and the sample video frame image features are processed by calling the initialization model to obtain a predicted content tag of the sample video, where the predicted content tag may be the same as or very different from the sample video content tag of the sample video.
It should be noted that the initial model called in this application is a classification model that has already been trained in other fields, that is, when training starts, the model parameters of the initial model are not random numbers, so the application only needs to fine-tune (finetune) the model parameters of the initial model on the sample videos, which accelerates the convergence of the model. Furthermore, the model of the present application uses a 2D CNN structure, which is faster to compute than I3D (the inflated 3D convolutional network) and can therefore recognize video tag content faster.
S506, the label generating equipment trains an initialization model according to the predicted content labels and the sample content labels to obtain a video classification model.
Further, after obtaining the predicted content label, the tag generation device trains the initialization model according to the predicted content label and the sample content label: it compares the predicted content label with the real sample content label, feeds the evaluation result back to the model, and adjusts and optimizes the parameters of the initialization model; when the parameters become stable, the video classification model is obtained.
The training process of the video type discrimination model is as follows: the tag generation device acquires reference videos, which can be videos of various types, such as game, movie and sports videos, each carrying a corresponding video description text. Meanwhile, the tag generation device acquires 2 video types, namely the target video type and the non-target video type; for any reference video, the video type matching its video description text (called the to-be-determined video type) is searched for among these 2 video types, and the to-be-determined video type is then taken as the reference video type of that reference video. After the reference videos and reference video types are determined, the training process of the video type discrimination model is the same as steps S504 to S506, that is, the tag generation device trains an initialization model based on the reference videos and reference video types to obtain the video type discrimination model.
And S507, the label generating equipment acquires the target video, and extracts a video frame image and an audio frame from the target video, wherein the video frame image and the audio frame have a corresponding relation.
And S508, converting the audio frame into a spectrogram by the label generation equipment, and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video.
And S509, if the video type of the target video is the target video type, the label generation equipment calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video.
Step S507 to step S509 are the same as step S401 to step S403, and are not described in detail here.
S510, the label generating device obtains an input text of a user, obtains a plurality of videos to be matched and video content labels of the videos to be matched, searches matched video content labels matched with the input text from the video content labels of the videos to be matched, and outputs the videos to be matched corresponding to the matched video content labels.
Through the steps, if N videos exist, N video content labels can be obtained, and the obtained video content labels are stored in the server.
In a possible implementation manner, after the videos have been processed by the video type discrimination model and the video classification model, their tags are obtained. The tag generation device acquires the input text of a user, where the input text is entered by the user through the client interface, either by typing or by voice input. At the same time, the tag generation device acquires, from the server, the stored videos to be matched and the video content tag of each video to be matched, where the target video is one of the videos to be matched. The tag generation device searches the video content tags of the videos to be matched for a matching video content tag that matches the input text, and if a matching video content tag is found, outputs the video to be matched corresponding to that matching video content tag.
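A sketch of the server-side lookup implied by this step, assuming the stored tags are kept in a simple mapping from video content tag to video identifiers (this data structure is an assumption):

```python
def search_by_tag(input_text, tag_to_videos):
    """Return the videos whose content tag matches the user's input text."""
    matches = []
    for tag, video_ids in tag_to_videos.items():
        if tag in input_text or input_text in tag:  # simple containment match
            matches.extend(video_ids)
    return matches  # videos to be presented to the client
```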
In one possible implementation, the target video is tested; that is, the text information input through the client is a video content tag that was generated for the target video, and the tag generation device directly checks this video content tag. The corresponding video is found through the video content tag and compared with the target video to see whether it is indeed the target video; if so, the generated video content tag of the target video is determined to be reliable.
The whole network can be trained using stochastic gradient descent with momentum (momentum SGD), where the learning rate is set to 0.002 and is decayed to 0.1 times its previous value every 30 epochs; 90 epochs are trained in total, after which the training of the model is stopped.
Alternatively, whether to stop training the model can be decided from the evaluation of the prediction results against the real results. For example, if the error between the predicted video content tag and the actual video content tag is smaller than a set threshold, the training of the model is stopped.
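A sketch of this training schedule, assuming a PyTorch implementation; the momentum value of 0.9 and the cross-entropy loss are assumptions, while the learning rate of 0.002, the decay of 0.1 every 30 epochs and the total of 90 epochs follow the description above.

```python
import torch
from torch import nn, optim

def train_model(model, train_loader, epochs=90, device="cuda"):
    """Fine-tune the initialization model with momentum SGD and a step learning-rate decay."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    for _ in range(epochs):
        for frame_images, spectrograms, labels in train_loader:
            logits = model(frame_images.to(device), spectrograms.to(device))
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # decay the learning rate to 0.1x every 30 epochs
    return model
```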
In the embodiment of the application, in addition to the steps illustrated in fig. 4, the training process of the video type discrimination model and the video classification model by the tag generation device is also illustrated. After the video content tags are generated, a scenario test is performed: the tag generation device acquires the input text of a user, acquires a plurality of videos to be matched and the video content tag of each video to be matched, searches the video content tags of the videos to be matched for a matching video content tag that matches the input text, and outputs the video to be matched corresponding to the matching video content tag. In this way, the video type discrimination model and the video classification model can be trained so that their performance is guaranteed, and scenario tests can be performed, which ensures the accuracy and reliability of the video content tags. Videos can thus be processed efficiently without relying on text information carried by the videos; moreover, because a video type discrimination model and a video classification model are used, the result is free from human subjective influence and therefore more objective, which improves the accuracy of the user's video search and improves the user experience.
Referring to fig. 6, fig. 6 is a schematic diagram of a tag generation apparatus according to an embodiment of the present application, where the tag generation apparatus 60 includes: the acquiring unit 601, the processing unit 602, and the determining unit 603 are mainly configured to perform:
an obtaining unit 601, configured to obtain a target video, and extract a video frame image and an audio frame from the target video, where the video frame image and the audio frame have a corresponding relationship;
a processing unit 602, configured to convert the audio frame into a spectrogram, and call a video type discrimination model to perform type identification processing on the spectrogram and the video frame image, so as to obtain a video type of the target video;
the processing unit 602 is further configured to, if the video type of the target video is a target video type, invoke a video classification model to perform content tag identification processing on the spectrogram and the video frame image, so as to obtain a video content tag of the target video.
In one possible implementation, the video type discrimination model includes a first image convolution neural network, a first audio convolution neural network, and a first fully-connected network;
the processing unit 602, invoking a video type discrimination model to perform type identification processing on the spectrogram and the video frame image, to obtain a video type of the target video, includes (for):
calling the first image convolution neural network to carry out convolution processing on the video frame image to obtain first video frame image characteristics of the video frame image;
calling the first audio convolution neural network to carry out convolution processing on the frequency spectrogram to obtain a first frequency spectrogram characteristic, and splicing the first video frame image characteristic and the first frequency spectrogram characteristic into a first video characteristic of the target video;
calling the first full-connection network to determine the type matching probability of the first video characteristic and a target video type;
and if the type matching probability is larger than a first probability threshold, determining the video type of the target video as the target video type.
In one possible implementation, the video classification model includes a second audio convolutional neural network, a second image convolutional neural network, and a second fully connected network;
the processing unit 602, invoking a video classification model to perform content tag identification processing on the spectrogram and the video frame image, to obtain a video content tag of the target video, includes (is used for):
calling the second audio convolution neural network to carry out convolution processing on the spectrogram to obtain a second spectrogram characteristic;
calling the second image convolution neural network to carry out convolution processing on the video frame image to obtain a second video frame image characteristic, and splicing the second frequency spectrum characteristic and the second video frame image characteristic into a second video characteristic;
invoking the second fully-connected network to determine a set of tag matching probabilities for the second video feature and a plurality of video content tags;
and determining the video content label of the target video according to the label matching probability set.
In one possible implementation, the number of the tag matching probability sets is N, the N tag matching probability sets are determined by N video frame images and N audio frames, and N is an integer greater than 1;
a determining unit 603, configured to determine the video content tag of the target video according to the tag matching probability set, and specifically configured to:
averagely processing the N label matching probability sets to obtain a target label matching probability set;
extracting the maximum target label matching probability from the target label matching probability set;
and if the maximum target label matching probability is greater than a second probability threshold, taking the video content label corresponding to the maximum target label matching probability as the video content label of the target video.
In a possible implementation manner, the obtaining unit 601 is further configured to obtain a sample video and a sample content tag of the sample video, and extract a sample video frame image and a sample audio frame from the sample video, where the sample video frame image and the sample audio frame have a corresponding relationship;
the processing unit 602 is further configured to convert the sample audio frame into a sample spectrogram, call an initialization model to process the sample spectrogram and the sample video frame image, and determine a predicted content tag of the sample video;
the determining unit 603 is further configured to train the initialization model according to the predicted content label and the sample content label, so as to obtain the video classification model.
In a possible implementation manner, the obtaining unit 601 is configured to obtain a sample video and a sample content tag of the sample video, and includes:
obtaining the sample video, wherein the sample video carries a video description text;
acquiring the plurality of video content tags, and searching for a video content tag to be determined matched with the video description text in the plurality of video content tags;
and taking the video content label to be determined as a sample content label of the sample video.
In one possible implementation, the audio frame belongs to a time-domain audio signal;
the processing unit 602 is configured to convert the audio frame into a spectrogram, and includes:
windowing the audio frame, and performing Fourier transform on the audio frame subjected to windowing to obtain a frequency domain audio signal;
and calling a filter to convert the frequency domain audio signal to obtain the spectrogram.
In a possible implementation manner, the obtaining unit 601 is further configured to obtain an input text of a user, and obtain a plurality of videos to be matched and a video content tag of each video to be matched, where the target video belongs to the plurality of videos to be matched;
the processing unit 602 is further configured to search a matching video content tag matched with the input text from the video content tags of the multiple videos to be matched, and output the video to be matched corresponding to the matching video content tag.
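The text-to-label search described above could be sketched as follows; exact substring matching and the function name are illustrative assumptions.

def search_videos_by_text(input_text, videos_to_match):
    # videos_to_match: iterable of (video, video_content_label) pairs.
    return [video for video, label in videos_to_match
            if label in input_text or input_text in label]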
In the embodiment of the application, the obtaining unit 601 obtains a target video, and extracts a video frame image and an audio frame from the target video, where the video frame image and the audio frame have a corresponding relationship; the processing unit 602 converts the audio frame into a spectrogram, and invokes a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video; and if the video type of the target video is the target video type, the processing unit 602 calls a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video. In this way, the label generation device generates a label for the video according to the content of the video itself, which improves the accuracy of the generated video label.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a label generation device provided in an embodiment of the present application, where the label generation device 70 includes at least a processor 701 and a memory 702. The processor 701 and the memory 702 may be connected by a bus or in other manners. The memory 702 may comprise a computer-readable storage medium, and the memory 702 is used for storing a computer program that comprises computer instructions; the processor 701 is used for executing the computer instructions stored in the memory 702. The processor 701 (or CPU, Central Processing Unit) is the computing core and control core of the label generation device 70, and is adapted to implement one or more computer instructions, and specifically adapted to load and execute the one or more computer instructions so as to implement the corresponding method flow or corresponding function.
An embodiment of the present application also provides a computer-readable storage medium (memory), which is a memory device in the label generation device 70 and is used for storing programs and data. It is understood that the memory 702 herein may include both a built-in storage medium of the label generation device 70 and, of course, an extended storage medium supported by the label generation device 70. The computer-readable storage medium provides a storage space that stores the operating system of the label generation device 70. One or more computer instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are suitable for being loaded and executed by the processor 701. The memory 702 here may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one computer-readable storage medium located away from the aforementioned processor 701.
In one implementation, the label generation device 70 may be the label generation device 102 in the label generation system shown in fig. 1; the memory 702 stores first computer instructions; the first computer instructions stored in the memory 702 are loaded and executed by the processor 701 to implement the corresponding steps in the method embodiments shown in fig. 4 and fig. 5; in a specific implementation, the first computer instructions in the memory 702 are loaded by the processor 701 to perform the following steps:
acquiring a target video, and extracting a video frame image and an audio frame from the target video, wherein the video frame image and the audio frame have a corresponding relation;
converting the audio frame into a spectrogram, and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video;
and if the video type of the target video is the target video type, calling a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video.
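Purely as an illustration of the control flow of these steps, the following Python sketch strings the two models together; the threshold values are assumptions, and the two models are assumed to be callables that return a scalar type matching probability and a sequence of label matching probabilities, respectively.

def generate_video_content_label(spectrogram, frame_image, type_model, classification_model,
                                 labels, first_threshold=0.5, second_threshold=0.5):
    # Step 1: type identification processing with the video type discrimination model.
    type_probability = float(type_model(spectrogram, frame_image))
    if type_probability <= first_threshold:
        return None                              # not the target video type, so no content label is generated
    # Step 2: content label identification processing with the video classification model.
    probability_set = classification_model(spectrogram, frame_image)
    best = max(range(len(labels)), key=lambda i: float(probability_set[i]))
    if float(probability_set[best]) > second_threshold:
        return labels[best]
    return None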
In one possible implementation, the video type discrimination model includes a first image convolution neural network, a first audio convolution neural network, and a first fully-connected network;
when calling the video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video, the processor 701 specifically performs the following steps:
calling the first image convolution neural network to carry out convolution processing on the video frame image to obtain first video frame image characteristics of the video frame image;
calling the first audio convolution neural network to carry out convolution processing on the spectrogram to obtain a first spectrogram characteristic, and splicing the first video frame image characteristic and the first spectrogram characteristic into a first video characteristic of the target video;
calling the first full-connection network to determine the type matching probability of the first video characteristic and a target video type;
and if the type matching probability is larger than a first probability threshold, determining the video type of the target video as the target video type.
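A hypothetical PyTorch sketch of the video type discrimination model described above follows; the layer sizes and names are assumptions, as the embodiment only prescribes the first image convolutional neural network, the first audio convolutional neural network, the spliced first video feature and the first fully connected network.

import torch
import torch.nn as nn

class VideoTypeDiscriminationModel(nn.Module):
    # Outputs the type matching probability between the video and the target video type.
    def __init__(self):
        super().__init__()
        # First image convolutional neural network: video frame image -> first video frame image feature.
        self.image_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # First audio convolutional neural network: spectrogram -> first spectrogram feature.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # First fully connected network: spliced first video feature -> type matching probability.
        self.fc = nn.Sequential(nn.Linear(32 + 16, 1), nn.Sigmoid())

    def forward(self, spectrogram, frame_image):
        first_video_feature = torch.cat(
            [self.image_cnn(frame_image), self.audio_cnn(spectrogram)], dim=1)
        return self.fc(first_video_feature)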
In one possible implementation, the video classification model includes a second audio convolutional neural network, a second image convolutional neural network, and a second fully connected network;
when calling the video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain the video content tag of the target video, the processor 701 specifically performs the following steps:
calling the second audio convolution neural network to carry out convolution processing on the spectrogram to obtain a second spectrogram characteristic;
calling the second image convolution neural network to carry out convolution processing on the video frame image to obtain a second video frame image characteristic, and splicing the second spectrogram characteristic and the second video frame image characteristic into a second video characteristic;
invoking the second fully connected network to determine a set of tag matching probabilities for the second video feature and a plurality of video content tags;
and determining the video content label of the target video according to the label matching probability set.
In one possible implementation, the number of the tag matching probability sets is N, the N tag matching probability sets are determined by N video frame images and N audio frames, and N is an integer greater than 1;
the processor 701 determines the video content tag of the target video according to the tag matching probability set, including:
averaging the N label matching probability sets to obtain a target label matching probability set;
extracting the maximum target label matching probability from the target label matching probability set;
and if the maximum target label matching probability is greater than a second probability threshold, taking the video content label corresponding to the maximum target label matching probability as the video content label of the target video.
In a possible implementation manner, the processor 701 is further configured to:
obtaining a sample video and a sample content label of the sample video, and extracting a sample video frame image and a sample audio frame from the sample video, wherein the sample video frame image and the sample audio frame have a corresponding relation;
converting the sample audio frame into a sample spectrogram, calling an initialization model to process the sample spectrogram and the sample video frame image, and determining a predicted content label of the sample video;
and training the initialization model according to the predicted content label and the sample content label to obtain the video classification model.
In a possible implementation manner, when obtaining the sample video and the sample content tag of the sample video, the processor 701 specifically performs the following steps:
obtaining the sample video, wherein the sample video carries a video description text;
acquiring the plurality of video content tags, and searching for a video content tag to be determined matched with the video description text in the plurality of video content tags;
and taking the video content label to be determined as a sample content label of the sample video.
In one possible implementation, the audio frame belongs to a time-domain audio signal;
the processor 701 converts the audio frame into a spectrogram, including:
windowing the audio frame, and performing Fourier transform on the windowed audio frame to obtain a frequency domain audio signal;
and calling a filter to convert the frequency domain audio signal to obtain the spectrogram.
In a possible implementation manner, the processor 701 is further configured to:
acquiring an input text of a user, and acquiring a plurality of videos to be matched and a video content label of each video to be matched, wherein the target video belongs to the plurality of videos to be matched;
and searching a matched video content label matched with the input text from the video content labels of the videos to be matched, and outputting the videos to be matched corresponding to the matched video content label.
According to an aspect of the application, a computer program product or a computer program is provided, which comprises computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method in the embodiments corresponding to the flowcharts of fig. 4 and fig. 5; the details are not repeated here.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of action combinations, but those skilled in the art will recognize that the present application is not limited by the order of the actions described, as some steps may be performed in other orders or concurrently according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method of tag generation, the method comprising:
acquiring a target video, and extracting a video frame image and an audio frame from the target video, wherein the video frame image and the audio frame have a corresponding relation;
converting the audio frame into a spectrogram, and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video;
if the video type of the target video is the target video type, calling a video classification model to perform content label identification processing on the spectrogram and the video frame image to obtain a video content label of the target video;
wherein the extracting of the video frame image and the audio frame from the target video comprises: evenly dividing the target video into K video segments, extracting M frames of video frame images from each of the K video segments, and extracting M frames of audio frames matched with the M frames of video frame images; K and M are positive integers greater than or equal to 1;
wherein the video classification model comprises a second audio convolutional neural network, a second image convolutional neural network and a second fully connected network; the calling of the video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain a video content tag of the target video includes:
calling the second audio convolution neural network to carry out convolution processing on the spectrogram sequence to obtain a second spectrogram feature sequence; the spectrogram sequence is formed by the spectrograms respectively corresponding to the extracted K x M frames of audio frames, and each spectrogram feature included in the second spectrogram feature sequence respectively corresponds to one spectrogram in the spectrogram sequence;
calling the second image convolution neural network to carry out convolution processing on the video frame image sequence to obtain a second video frame image feature sequence; the video frame image sequence is formed by the extracted K x M frames of video frame images, and each video frame image feature included in the second video frame image feature sequence corresponds to one video frame image in the video frame image sequence;
splicing the second spectrogram feature sequence and the second video frame image feature sequence into a second video feature sequence; each video feature included in the second video feature sequence is obtained by splicing a pair of spectral features and video frame image features which have a corresponding relationship in the second spectrogram feature sequence and the second video frame image feature sequence;
calling the second fully-connected network to determine a label matching probability set of each video feature in the second video feature sequence and a plurality of video content labels; the number of the label matching probability sets is K x M, and one video feature corresponds to one label matching probability set;
averaging the K x M label matching probability sets to obtain a target label matching probability set; extracting the maximum target label matching probability from the target label matching probability set; and if the maximum target label matching probability is greater than a second probability threshold, taking the video content label corresponding to the maximum target label matching probability as the video content label of the target video.
2. The method of claim 1, wherein the video type discrimination model comprises a first image convolutional neural network, a first audio convolutional neural network, and a first fully connected network;
the calling of the video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video comprises:
calling the first image convolution neural network to carry out convolution processing on the video frame image to obtain first video frame image characteristics of the video frame image;
calling the first audio convolution neural network to carry out convolution processing on the spectrogram to obtain a first spectrogram characteristic, and splicing the first video frame image characteristic and the first spectrogram characteristic into a first video characteristic of the target video;
calling the first full-connection network to determine the type matching probability of the first video characteristic and a target video type;
and if the type matching probability is larger than a first probability threshold, determining the video type of the target video as the target video type.
3. The method of claim 1, further comprising:
obtaining a sample video and a sample content label of the sample video, and extracting a sample video frame image and a sample audio frame from the sample video, wherein the sample video frame image and the sample audio frame have a corresponding relation;
converting the sample audio frame into a sample spectrogram, calling an initialization model to process the sample spectrogram and the sample video frame image, and determining a predicted content label of the sample video;
and training the initialization model according to the predicted content label and the sample content label to obtain the video classification model.
4. The method of claim 3, wherein obtaining the sample video and the sample content tag of the sample video comprises:
obtaining the sample video, wherein the sample video carries a video description text;
acquiring the plurality of video content tags, and searching for a video content tag to be determined matched with the video description text in the plurality of video content tags;
and taking the video content label to be determined as a sample content label of the sample video.
5. The method of claim 1, wherein the audio frame belongs to a time-domain audio signal;
the converting the audio frame into a spectrogram comprises:
windowing the audio frame, and performing Fourier transform on the audio frame subjected to windowing to obtain a frequency domain audio signal;
and calling a filter to convert the frequency domain audio signal to obtain the spectrogram.
6. The method of claim 1, further comprising:
acquiring an input text of a user, and acquiring a plurality of videos to be matched and a video content label of each video to be matched, wherein the target video belongs to the plurality of videos to be matched;
and searching a matched video content label matched with the input text from the video content labels of the videos to be matched, and outputting the videos to be matched corresponding to the matched video content label.
7. A label generation apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring a target video and extracting a video frame image and an audio frame from the target video, and the video frame image and the audio frame have a corresponding relation;
the processing unit is used for converting the audio frame into a spectrogram and calling a video type discrimination model to perform type identification processing on the spectrogram and the video frame image to obtain the video type of the target video;
the processing unit is further configured to, if the video type of the target video is a target video type, call a video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain a video content tag of the target video;
wherein the extracting of the video frame image and the audio frame from the target video comprises: evenly dividing the target video into K video segments, extracting M frames of video frame images from each of the K video segments, and extracting M frames of audio frames matched with the M frames of video frame images; K and M are positive integers greater than or equal to 1;
wherein the video classification model comprises a second audio convolutional neural network, a second image convolutional neural network and a second fully connected network; the calling of the video classification model to perform content tag identification processing on the spectrogram and the video frame image to obtain a video content tag of the target video includes:
calling the second audio convolution neural network to carry out convolution processing on the spectrogram sequence to obtain a second spectrogram feature sequence; the spectrogram sequence is formed by the spectrograms respectively corresponding to the extracted K x M frames of audio frames, and each spectrogram feature included in the second spectrogram feature sequence respectively corresponds to one spectrogram in the spectrogram sequence;
calling the second image convolution neural network to carry out convolution processing on the video frame image sequence to obtain a second video frame image feature sequence; the video frame image sequence is formed by the extracted K x M frames of video frame images, and each video frame image feature included in the second video frame image feature sequence corresponds to one video frame image in the video frame image sequence;
splicing the second spectrogram feature sequence and the second video frame image feature sequence into a second video feature sequence; each video feature included in the second video feature sequence is obtained by splicing a pair of spectral features and video frame image features which have a corresponding relationship in the second spectrogram feature sequence and the second video frame image feature sequence;
calling the second fully-connected network to determine a label matching probability set of each video feature in the second video feature sequence and a plurality of video content labels; the number of the label matching probability sets is K x M, and one video feature corresponds to one label matching probability set;
averaging the K x M label matching probability sets to obtain a target label matching probability set; extracting the maximum target label matching probability from the target label matching probability set; and if the maximum target label matching probability is greater than a second probability threshold, taking the video content label corresponding to the maximum target label matching probability as the video content label of the target video.
8. A label generation device, characterized in that the label generation device comprises:
a memory for storing a computer program;
a processor, configured to run the computer program to implement the label generation method of any one of claims 1 to 6.
CN202011014223.5A 2020-09-23 2020-09-23 Label generation method, device, equipment and storage medium Active CN112188306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011014223.5A CN112188306B (en) 2020-09-23 2020-09-23 Label generation method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112188306A CN112188306A (en) 2021-01-05
CN112188306B true CN112188306B (en) 2022-06-21

Family

ID=73956924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011014223.5A Active CN112188306B (en) 2020-09-23 2020-09-23 Label generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112188306B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11842737B2 (en) * 2021-03-24 2023-12-12 Google Llc Automated assistant interaction prediction using fusion of visual and audio input
CN113891121A (en) * 2021-09-29 2022-01-04 北京百度网讯科技有限公司 Video subtitle processing method, video subtitle display method and device
CN115935008B (en) * 2023-02-16 2023-05-30 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN116257622B (en) * 2023-05-16 2023-07-11 之江实验室 Label rendering method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117777A (en) * 2018-08-03 2019-01-01 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
US10381022B1 (en) * 2015-12-23 2019-08-13 Google Llc Audio classifier
CN111626202A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and device for identifying video
CN111653290A (en) * 2020-05-29 2020-09-11 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server


Also Published As

Publication number Publication date
CN112188306A (en) 2021-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant