CN114297439B - Short video tag determining method, system, device and storage medium - Google Patents


Info

Publication number: CN114297439B
Application number: CN202111560398.0A
Authority: CN (China)
Prior art keywords: information, video, tag, audio, semantic
Legal status: Active (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN114297439A
Inventor: 袁征
Assignee (original and current): iMusic Culture and Technology Co Ltd
Application filed by iMusic Culture and Technology Co Ltd; priority to CN202111560398.0A; published as CN114297439A, granted and published as CN114297439B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short video tag determining method, system, device and storage medium. The method comprises the following steps: acquiring audio information of a first short video and performing video-audio analysis on the first short video to obtain a first audio tag; acquiring key frame information of the first short video and performing video content analysis to obtain a first scene tag, a first object tag and a first character tag; acquiring title information, video description information and subtitle information of the first short video and performing video semantic analysis to obtain a first semantic tag; and performing weight decision analysis on the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag. The invention improves the efficiency of short video tag generation as well as the accuracy, comprehensiveness and reliability of the resulting tags, and can be widely applied in the technical field of video processing.

Description

Short video tag determining method, system, device and storage medium
Technical Field
The invention relates to the technical field of video processing, in particular to a method, a system, a device and a storage medium for determining a short video label.
Background
Existing video tagging mainly classifies and tags long videos (longer than 60 seconds), generally by analyzing the video content. The main approaches are: 1) manual editing, in which editors review the entire video content and assign classification tags based on subjective judgment and understanding; 2) AI recognition, in which faces, scenes and objects are recognized in the video frames and tags of the corresponding classes are extracted, such as stars, foods and places.
The existing video tag labeling methods have the following defects:
1) Manually edited classification tags are labor-intensive, requiring editors with strong aesthetic judgment and patience, and suffer from low efficiency, low speed, highly subjective tag quality, and low video frame coverage.
2) AI content recognition places high demands on the video content itself: the video frames must be relatively simple, without excessive interference such as crowded streets or lighting effects, for AI recognition to reach usable classification accuracy, so its application range is limited.
3) AI content recognition can only extract information that is visible in the video content. When the content alone is insufficient to represent the key information of the video, other key information such as the music type of an MV is easily ignored; for an MV by a non-star lead singer, for example, only generic classification tags such as "singing", "performing" or "singer" can be output. The resulting tags are therefore incomplete and insufficiently effective, and cannot provide more meaningful video tags for actual service needs.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art to a certain extent.
Therefore, an object of the embodiments of the present invention is to provide a short video tag determining method that obtains multi-dimensional tag information by performing video-audio analysis, video content analysis and video semantic analysis on a short video, and then generates a short video tag through weight decision analysis. This overcomes the prior-art problems of low manual labeling efficiency, incomplete AI content recognition tags and limited application range, improves the efficiency of short video tag generation, and also improves the accuracy, comprehensiveness and reliability of the short video tag.
It is another object of an embodiment of the present invention to provide a short video tag determination system.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides a method for determining a short video tag, including the following steps:
acquiring audio information of a first short video, and performing video-audio analysis on the first short video according to the audio information to obtain a first audio tag;
acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag;
acquiring title information, video description information and subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and carrying out weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag.
Further, in an embodiment of the present invention, the step of obtaining the audio information of the first short video, and performing video-audio analysis on the first short video according to the audio information to obtain a first audio tag specifically includes:
determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
determining an audio fingerprint according to the audio information, and matching a first audio with similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
and inputting the first audio to a pre-constructed audio knowledge graph for matching to obtain the first audio tag.
Further, in an embodiment of the present invention, the step of obtaining key frame information of the first short video, where the key frame information includes scene information, object information, and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag, and a first character tag specifically includes:
Determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical file;
extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the variability larger than a preset second threshold value;
and respectively inputting the second key frames into a pre-trained scene recognition model, object recognition model and character recognition model, and determining the first scene tag, the first object tag and the first character tag according to the recognition results.
Further, in an embodiment of the present invention, the step of extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with a difference greater than a preset second threshold value specifically includes:
performing fragment key frame extraction on the video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
Performing binarization processing on the first key frame to obtain a pixel feature matrix of the first key frame;
and performing dimension reduction and duplication removal on the first key frame matrix through a hierarchical clustering method according to the pixel feature matrix to obtain a plurality of second key frames with pixel feature differences larger than a preset second threshold.
Further, in an embodiment of the present invention, the step of obtaining the title information, the video description information and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag specifically includes:
determining source address information, title information and video description information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
And inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic tag.
Further, in one embodiment of the present invention, the step of performing NLP semantic analysis on the title information, the video description information, the subtitle information, and the text information to obtain first semantic information specifically includes:
determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and further determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
Further, in one embodiment of the present invention, the step of performing weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag, and the first semantic tag to generate a first short video tag specifically includes:
Determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
determining semantic feature information of the first short video, wherein the semantic feature information comprises text special symbol quantity information, text length information and OCR recognition result proportion information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
and determining weights of the first scene tag, the first object tag and the first character tag according to the first content quality, determining weights of the first semantic tag and the first derivative tag according to the first semantic quality, determining weights of the first audio tag according to the first audio quality, and further screening and sorting all tags according to weight values to obtain a first short video tag.
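As an illustrative sketch of the weight decision above, the following assumes the modality quality scores have already been produced (in the patent, by the random forest classifiers) and shows only the weighting, screening and sorting step; all tag names, scores and the 0.3 cut-off are hypothetical:

```python
def rank_tags(tag_groups, quality, cutoff=0.3):
    """Weight each tag by its modality's quality score, screen out
    low-weight tags, and sort the remainder by weight, descending."""
    weighted = []
    for modality, tags in tag_groups.items():
        w = quality[modality]                 # modality quality in [0, 1]
        for tag, score in tags:
            weighted.append((tag, score * w))
    weighted = [t for t in weighted if t[1] >= cutoff]
    return sorted(weighted, key=lambda t: t[1], reverse=True)

# Hypothetical tags from the three analysis branches
tag_groups = {
    "content":  [("singing", 0.9), ("stage", 0.6)],   # scene/object/character tags
    "semantic": [("MV", 0.8)],                        # semantic/derivative tags
    "audio":    [("pop-music", 0.7)],                 # audio tags
}
quality = {"content": 0.8, "semantic": 0.9, "audio": 0.5}
print(rank_tags(tag_groups, quality))
```

The ranked, screened list corresponds to the first short video tag set produced by the decision analysis module.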
In a second aspect, an embodiment of the present invention provides a short video tag determining system, including:
the video and audio analysis module is used for acquiring audio information of a first short video, and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio tag;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and carrying out video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and the decision analysis module is used for carrying out weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag.
In a third aspect, an embodiment of the present invention provides a short video tag determining apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a short video tag determination method as described above.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium in which a processor-executable program is stored, which when executed by a processor is configured to perform a short video tag determination method as described above.
The advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
According to the embodiment of the invention, the audio information, the scene information, the object information, the character information, the title information, the video description information and the subtitle information of the first short video are acquired, the video audio analysis is carried out according to the audio information to obtain a first audio tag, the video content analysis is carried out according to the scene information, the object information and the character information to obtain a first scene tag, a first object tag and a first character tag, the video semantic analysis is carried out according to the title information, the video description information and the subtitle information to obtain a first semantic tag, and the weight decision analysis is carried out according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate the first short video tag. According to the embodiment of the invention, the multi-dimensional label information is obtained by respectively carrying out video and audio analysis, video content analysis and video semantic analysis on the short video, and then the short video label is generated by weight decision analysis, so that the problems of low manual labeling efficiency, incomplete AI content identification labeling label, small application range and the like in the prior art are solved, the efficiency of generating the short video label is improved, and the accuracy, the comprehensiveness and the reliability of the short video label are also improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will refer to the drawings that are needed in the embodiments of the present invention, and it should be understood that the drawings in the following description are only for convenience and clarity to describe some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without any inventive effort for those skilled in the art.
FIG. 1 is a flowchart illustrating steps of a method for determining a short video tag according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for determining a short video tag according to an embodiment of the present invention;
FIG. 3 is a block diagram of a short video tag determination system according to an embodiment of the present invention;
fig. 4 is a block diagram of a short video tag determining apparatus according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, the plurality means two or more, and if the description is made to the first and second for the purpose of distinguishing technical features, it should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features or implicitly indicating the precedence of the indicated technical features. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
With the rapid development of short video services, the number of short videos has exploded. However, given the following characteristics of short videos, the traditional method of labeling long videos with classification tags is not suitable for them:
1) Short video content is more concise, generally around 60 seconds, and does not contain the large amount of heterogeneous information a long video does. There is therefore no need to analyze a large number of, or even all, video frames as traditional long video methods do, leaving a large optimization space for classification and prediction time.
2) The descriptive information of short videos is more standardized and clear, and has high mining value. As a latecomer format in the current era of data analysis, short videos commonly carry clear, standardized descriptive information such as "{singer}_{song name}_{MV/cover}". Traditional long video classification methods ignore these effective features, so there is room to optimize tag accuracy.
Referring to fig. 1, an embodiment of the present invention provides a method for determining a short video tag, which specifically includes the following steps:
s101, acquiring audio information of a first short video, and performing video-audio analysis on the first short video according to the audio information to obtain a first audio tag.
Specifically, the embodiment of the invention obtains the audio information of the short video, matches it in a preset audio library to obtain similar audio, and then classifies that audio through the audio knowledge graph to obtain a first audio tag. The step S101 specifically includes the following steps:
s1011, determining source address information of a first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
s1012, determining an audio fingerprint according to the audio information, and matching a first audio with similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
s1013, inputting the first audio to a pre-constructed audio knowledge graph for matching to obtain a first audio tag.
Specifically, as shown in fig. 2, which is a specific flowchart of the short video tag determining method provided by the embodiment of the present invention, the source address information of the first short video is first determined, the video physical file is obtained according to the source address information, and the audio in the video physical file is extracted; a corresponding audio fingerprint is generated for the audio and matched in the audio library to find a first audio with similarity exceeding 80%; the first audio is then input to the audio knowledge graph for matching to obtain a first audio tag, which is used for decision calculation in the subsequent decision analysis step.
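The fingerprint matching above can be illustrated with a deliberately simplified stand-in: here a "fingerprint" is just a bit string recording whether each short frame's energy rises or falls (production systems hash spectral peaks instead), and the 80% similarity threshold follows the text. All function names and data are hypothetical:

```python
def fingerprint(samples, frame=4):
    """Toy audio fingerprint: 1 where frame energy rises, 0 where it falls."""
    energies = [sum(x * x for x in samples[i:i + frame])
                for i in range(0, len(samples) - frame + 1, frame)]
    return [1 if b > a else 0 for a, b in zip(energies, energies[1:])]

def similarity(fp_a, fp_b):
    """Fraction of matching bits over the overlapping length."""
    n = min(len(fp_a), len(fp_b))
    return sum(a == b for a, b in zip(fp_a, fp_b)) / n

def match(query_fp, library, threshold=0.8):   # 80% similarity, as in the text
    best = max(library.items(), key=lambda kv: similarity(query_fp, kv[1]))
    return best[0] if similarity(query_fp, best[1]) >= threshold else None

library = {"rising": fingerprint(list(range(16))),
           "falling": fingerprint(list(range(16, 0, -1)))}
print(match(fingerprint(list(range(16))), library))   # matches the rising-energy track
```

The matched audio would then be looked up in the audio knowledge graph to produce the first audio tag.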
S102, acquiring key frame information of a first short video, wherein the key frame information comprises scene information, object information and character information, and analyzing video content of the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag.
Specifically, the embodiment of the invention determines key frame information representing scene information, object information and character information through key frame extraction, and classifies them through a scene recognition model, an object recognition model and a character recognition model to obtain a first scene tag, a first object tag and a first character tag. The step S102 specifically includes the following steps:
s1021, determining source address information of a first short video, acquiring a video physical file of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical file;
s1022, extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the variability larger than a preset second threshold value;
s1023, respectively inputting the second key frame into the pre-trained scene recognition model, the object recognition model and the character recognition model, and determining a first scene label, a first object label and a first character label according to the recognition result.
Specifically, as shown in fig. 2, the source address information of the first short video is first determined, the video physical file is obtained through the source address information, and video frames are cut through the CV2 library to obtain the video frame files; preliminary first key frames are extracted using a segment key frame extraction method, then dimension reduction and de-duplication are performed on the first key frames through a hierarchical clustering algorithm based on content features, and several groups of second key frames with the largest differences are selected; the second key frames are respectively input into the scene recognition model, the object recognition model/target detection model and the character recognition model for recognition, and the first scene tag, the first object tag and the first character tag are determined according to the recognition results, for use in decision calculation in the subsequent decision analysis step.
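The segment key frame selection can be sketched as index arithmetic alone (the actual frame decoding via the CV2 library is omitted here); the function name and parameter defaults mirror the segments = 60, one-frame-per-segment setting described later in the text:

```python
def segment_keyframe_indices(total_frames, segments=60, per_segment=1):
    """Split the video evenly into `segments` slices and take the first
    `per_segment` frame indices of each slice. When the video has fewer
    frames than segments, each frame becomes its own slice and extraction
    stops at the end of the video."""
    step = max(total_frames // segments, 1)
    indices = []
    for s in range(segments):
        start = s * step
        if start >= total_frames:
            break
        indices.extend(range(start, min(start + per_segment, total_frames)))
    return indices

# e.g. a 600-frame video yields one candidate key frame per 10-frame slice
print(segment_keyframe_indices(600)[:5])
```

With cv2, the selected indices would be read via `cv2.VideoCapture.set(cv2.CAP_PROP_POS_FRAMES, i)` followed by `read()`.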
Further as an optional implementation manner, a step S1022 of extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with differences greater than a preset second threshold value specifically includes:
s10221, carrying out fragment key frame extraction on a video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
S10222, performing binarization processing on the first key frame to obtain a pixel feature matrix of the first key frame;
s10223, performing dimension reduction and duplication removal on the first key frame matrix through a hierarchical clustering method according to the pixel feature matrix to obtain a plurality of second key frames with pixel feature differences larger than a preset second threshold.
Specifically, the video frame file first undergoes segment key frame extraction. Let the total number of video frames be Frame_total, set the number of segments to segments = 60 and the number of frames taken per segment to PerSegmentFrame = 1; the first key frame matrix is then the 60 selected frames

K = [k_1, k_2, …, k_60],

where k_j is the first frame of the j-th segment.
The first key frames are then grayed, with all picture pixels represented by values 0-255, and divided into foreground pixels (Foreground) and background pixels (Background). The foreground ratio is

F = Foreground / (Foreground + Background),

and the background ratio is

B = Background / (Foreground + Background).

With the foreground mean and variance denoted FA and FV, and the background mean and variance denoted BA and BV, the intra-class difference is ID = F×FV² + B×BV² and the inter-class difference is OD = F×B×(FA−BA)². The threshold achieving Min(ID) is taken as the pixel threshold and compared with each pixel point: pixels greater than or equal to the threshold are set to 1 and pixels below it to 0, yielding the first pixel feature matrix.
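The threshold selection above is an Otsu-style criterion. A minimal sketch follows, implementing the intra-class difference exactly as the text states it (ID = F×FV² + B×BV²; note that classic Otsu minimizes the unsquared weighted variances, which behaves similarly on well-separated data); pixel values here are a toy illustration:

```python
def binarize(pixels):
    """Pick the threshold minimizing ID = F*FV^2 + B*BV^2 over candidate
    thresholds, then map each pixel to 1 (>= threshold) or 0 (< threshold)."""
    def variance(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    best_t, best_id = None, float("inf")
    for t in sorted(set(pixels))[1:]:          # candidates keep both classes non-empty
        fg = [p for p in pixels if p >= t]     # foreground pixels
        bg = [p for p in pixels if p < t]      # background pixels
        f, b = len(fg) / len(pixels), len(bg) / len(pixels)
        intra = f * variance(fg) ** 2 + b * variance(bg) ** 2
        if intra < best_id:
            best_id, best_t = intra, t
    return [1 if p >= best_t else 0 for p in pixels], best_t

mask, threshold = binarize([10, 12, 11, 200, 198, 201])
print(mask, threshold)
```

The resulting 0/1 mask per key frame is the pixel feature matrix fed to the clustering step.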
The 60-dimensional first pixel feature matrix is then reduced by hierarchical clustering to find the 10 most distinct pictures, with the number of clusters specified as 10, as follows:
1) Take the 60-dimensional first pixel feature matrix as the initial sample, each element forming its own content feature class: G1(0), …, G60(0), where "0" denotes the initial state. Compute the single-link distance between the classes to obtain a 60×60 distance matrix.
2) Given the distance matrix D(n) (where n is the number of clustering merges performed so far), find the smallest element in D(n) and merge the two corresponding classes into one, producing a new classification: G1(n+1), G2(n+1), ….
3) Compute the distances between the merged new categories to obtain D(n+1).
4) Jump to step 2) and repeat the calculation and merging.
5) Stop once the classes have been reduced to G10, and take the first picture in each cluster as that cluster's key frame after dimension reduction, yielding the 10 second key frames.
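The merging loop in steps 1)-5) can be sketched as a minimal single-link agglomerative clustering; the toy 2-D "frames" and the target of 3 clusters (rather than 10 from 60) are illustrative only:

```python
def single_link_cluster(vectors, target_clusters):
    """Agglomerative clustering with single-link (minimum) inter-cluster
    distance: repeatedly merge the closest pair of clusters until
    `target_clusters` remain, then keep each cluster's first member
    as its representative (the retained key frame)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[i] for i in range(len(vectors))]      # one class per frame
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(vectors[a], vectors[b])   # single-link distance
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                 # merge the closest pair
    return [c[0] for c in clusters]                    # representative indices

print(single_link_cluster([(0, 0), (0, 1), (10, 10), (10, 11), (20, 20)], 3))
```

Near-duplicate frames collapse into one cluster, so only the most distinct frames survive as second key frames.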
It can be appreciated that the embodiment of the invention extracts preliminary key frames using the segment key frame extraction method while retaining as much of the video's key information as possible, then applies the content-feature-based hierarchical clustering algorithm for key frame dimension reduction, achieving the effect of predicting video content with the fewest key frames and thereby improving the efficiency of short video tag generation.
And S103, acquiring the title information, the video description information and the subtitle information of the first short video, and carrying out video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag.
Specifically, the embodiment of the invention acquires the title information, video description information and subtitle information of the short video, performs video semantic analysis according to them, and then obtains a first semantic tag by matching the semantic analysis result in the semantic knowledge graph. Step S103 specifically includes the following steps:
s1031, determining source address information, title information and video description information of a first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
s1032, inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
s1033, performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
S1034, inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic tag.
Specifically, as shown in fig. 2, the source address information of the first short video and structured information such as the title information and video description information are first determined; the video physical file is then acquired via the source address information, and the audio information and subtitle information are extracted. The video knowledge graph is matched against the structured information (title information, video description information and the like) to obtain a first derivative label. Speech recognition is performed on the audio information to generate corresponding text information, and NLP semantic analysis is performed on the subtitle information, title information, video description information and text information to obtain first semantic information. Finally, the first semantic information is combined with the semantic knowledge graph and the text importance to obtain the first semantic tag, which is used for decision calculation in the subsequent decision-analysis step.
Further as an optional implementation manner, the step S1033 of performing NLP semantic analysis on the title information, the video description information, the subtitle information, and the text information to obtain the first semantic information specifically includes:
S10331, determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
S10332, determining part-of-speech labels of all words in the first information matrix through the GRU neural network, and determining a key entity matrix according to the part-of-speech labels and the first information matrix;
S10333, inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and further determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
Specifically, since a short video carries well-structured information, its title information and video description information can be obtained directly, and the subtitle information and recognized text information can be obtained in addition; these texts are cut on special characters such as '_' or '|' to obtain an information set {T_name1, ..., T_namej}, which forms the first information matrix.
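The cutting of the structured texts into the information set can be sketched as follows; the delimiter set and the helper name `build_information_matrix` are illustrative assumptions:

```python
import re

# Hypothetical sketch: build the "first information matrix" by cutting
# title / description / subtitle / recognized text on special characters.
SPECIAL = r"[_|#\-\s]+"   # assumed separator characters

def build_information_matrix(*texts):
    """Split each text on the special characters and collect the
    non-empty segments in order."""
    segments = []
    for t in texts:
        segments.extend(s for s in re.split(SPECIAL, t) if s)
    return segments

title = "sunset_beach_travel"
description = "drone footage|coastline"
matrix = build_information_matrix(title, description)
# matrix -> ['sunset', 'beach', 'travel', 'drone', 'footage', 'coastline']
```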
For the first information matrix, features are learned with a GRU neural network and fed into a CRF decoding layer to complete sequence labeling, outputting the word boundary and part of speech of each word in the first information matrix together with the relations among entity categories. The output labels comprise 24 part-of-speech tags (lowercase letters) and 4 entity-category tags for person names, place names, organization names and time, where the uppercase labels (PER/LOC/ORG/TIME) mark high-confidence entities and the corresponding lowercase labels (nr/ns/nt/t) mark low-confidence ones. After deleting non-key entities such as adjectives, pronouns, prepositions and adverbs, the key entity matrix finally output is:
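The deletion of non-key parts of speech to form the key-entity list can be sketched as follows; the (word, POS) pairs stand in for the GRU+CRF tagger's output, and the tag abbreviations (a=adjective, r=pronoun, p=preposition, d=adverb, n=noun) follow common Chinese POS conventions, which is an assumption here:

```python
# Hypothetical sketch: keep only key entities by dropping non-key parts
# of speech and de-duplicating, preserving first-occurrence order.
NON_KEY_POS = {"a", "r", "p", "d"}   # adjective, pronoun, preposition, adverb

def key_entities(tagged):
    """tagged: list of (word, pos) pairs from the sequence labeler."""
    seen, out = set(), []
    for word, pos in tagged:
        if pos not in NON_KEY_POS and word not in seen:
            seen.add(word)
            out.append(word)
    return out

tagged = [("beautiful", "a"), ("beach", "n"), ("very", "d"),
          ("beach", "n"), ("sunset", "n")]
ents = key_entities(tagged)
# ents -> ['beach', 'sunset']  (adjective/adverb dropped, duplicate merged)
```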
KeyEntity = [ [Entity_1, IMP_1], [Entity_2, IMP_2], ..., [Entity_i, IMP_i] ]

wherein Entity_i denotes the i-th de-duplicated key entity obtained from the first information matrix, and IMP_i denotes its corresponding importance, computed in TF-IDF fashion:

IMP_i = f_{Entity_i} × log( j / (df_{Entity_i} + 1) )

wherein the word frequency f_{Entity_i} denotes the number of occurrences of Entity_i in the first information matrix, j is the number of text segments T_name, and the document frequency df_{Entity_i} denotes the number of text segments T_name that contain Entity_i, with plus-one smoothing in the denominator.
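A minimal sketch of the TF-IDF-style importance computation, assuming a plus-one-smoothed idf factor; the patent's exact normalization is given only as an image and is not machine-readable, so the constants below are assumptions:

```python
import math

# Hypothetical TF-IDF-style importance IMP_i for a de-duplicated key
# entity: term frequency across all text segments, scaled by a smoothed
# inverse-document-frequency factor.

def importance(entity, segments):
    """segments: list of token lists cut from the structured texts."""
    total = sum(len(s) for s in segments)
    tf = sum(s.count(entity) for s in segments) / total
    df = sum(1 for s in segments if entity in s)
    # +1 inside the log keeps the factor positive when df+1 == len(segments)
    return tf * math.log(len(segments) / (df + 1) + 1)

segments = [
    ["sunset", "beach"],
    ["beach", "travel"],
    ["drone", "coastline"],
]
# "beach" appears twice across segments, so it outranks "drone"
```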
The key entity matrix is input into the semantic prediction model, which outputs the corresponding tag and confidence matrix:

Pred = [ [Tag_i1, CL_i1], [Tag_i2, CL_i2], ..., [Tag_iz, CL_iz] ]

wherein Tag_iz denotes a tag obtained from Entity_i and CL_iz denotes the corresponding confidence. Rows under the same Tag are then compressed, finally generating a prediction label matrix that carries both importance and accuracy:

PredTag = [ [Tag_1, Score_1], [Tag_2, Score_2], ... ]
The prediction label matrix PredTag is sorted from high to low, and the Top-K predicted semantic labels are returned as the first semantic tag.
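The per-tag compression and Top-K selection can be sketched as follows; the aggregation rule (importance times confidence, summed over duplicate tags) is an assumption, since the patent's compression formula is only an image:

```python
# Hypothetical sketch: compress (tag, importance, confidence) rows into
# one score per tag, then return the Top-K predicted semantic labels.

def top_k_labels(predictions, k):
    """predictions: list of (tag, importance, confidence) rows."""
    scores = {}
    for tag, imp, conf in predictions:
        scores[tag] = scores.get(tag, 0.0) + imp * conf
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]

preds = [
    ("travel", 0.6, 0.9),
    ("beach",  0.3, 0.8),
    ("travel", 0.2, 0.7),   # duplicate tag is merged into one score
    ("music",  0.1, 0.5),
]
labels = top_k_labels(preds, k=2)
# labels -> ['travel', 'beach']
```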
It can be appreciated that the semantic tag prediction technology provided by the embodiment of the invention can still generate classification tags with high availability even when the video's audio and visual content is of low quality or carries insufficient effective information.
S104, performing weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag.
Specifically, the multiple types of tags obtained in the preceding steps are weighted by computing the video content quality, video semantic quality and video audio quality; the Top-K tag information is then calculated and assembled into a JSON body for output. Step S104 specifically includes the following steps:
S1041, determining content characteristic information of a first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
S1042, determining semantic feature information of a first short video, wherein the semantic feature information comprises text special symbol quantity information, text length information and OCR recognition result duty ratio information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
S1043, determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
S1044, determining weights of a first scene tag, a first object tag and a first character tag according to the first content quality, determining weights of the first semantic tag and a first derivative tag according to the first semantic quality, determining weights of the first audio tag according to the first audio quality, and further screening and sorting all tags according to the weight values to obtain a first short video tag.
Specifically, the duration, total frame count and resolution of the short video are taken as video-content features, the video content quality is classified with an RF (random forest) algorithm, and one of the three labels CQ = {High, Medium, Low} is output. The number of special symbols in the title, the text length and the OCR-recognition ratio (OCR results / frame count) of the short video are taken as features, the video text quality is classified with an RF algorithm, and one of the three labels TQ = {High, Medium, Low} is output. The audio length and audio spectrum of the short video are taken as video-audio features, the video-audio quality is classified with an RF algorithm, and one of the three labels AQ = {High, Medium, Low} is output. The weight of each corresponding tag type (CT for content tags, AT for audio tags, TT for text tags) is then selected according to the quality function matrix; for example, High corresponds to weight 1, Medium to weight 0.5 and Low to weight 0. The tags are finally screened by weight, sorted from high to low, and the Top-K tag information is output.
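The quality-to-weight decision described above can be sketched as follows, using the example mapping High = 1, Medium = 0.5, Low = 0 from the text; the grouping of tags by modality is an illustrative structure, not the patent's data model:

```python
# Hypothetical sketch of the weight-decision step: each modality's
# quality label (from the random-forest classifiers) maps to a weight,
# every candidate tag is scored by its modality weight, and the Top-K
# non-zero-weight tags are emitted.
QUALITY_WEIGHT = {"High": 1.0, "Medium": 0.5, "Low": 0.0}

def decide_tags(tag_groups, qualities, k):
    """tag_groups: {modality: [tag, ...]}; qualities: {modality: label}."""
    scored = []
    for modality, tags in tag_groups.items():
        w = QUALITY_WEIGHT[qualities[modality]]
        scored.extend((tag, w) for tag in tags)
    scored.sort(key=lambda tw: tw[1], reverse=True)   # stable sort
    return [tag for tag, w in scored[:k] if w > 0]

groups = {
    "content":  ["beach", "person"],   # CT: scene/object/person tags
    "semantic": ["travel"],            # TT: semantic/derived tags
    "audio":    ["pop-music"],         # AT: audio tags
}
quals = {"content": "High", "semantic": "Medium", "audio": "Low"}
result = decide_tags(groups, quals, k=3)
# result -> ['beach', 'person', 'travel']  (audio tag dropped: weight 0)
```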
The method steps of the embodiments of the present invention are described above. It can be understood that the embodiment of the invention obtains multi-dimensional tag information by respectively carrying out video and audio analysis, video content analysis and video semantic analysis on the short video, and then generates the short video tag by weight decision analysis, thereby overcoming the problems of low manual labeling efficiency, incomplete AI content identification labeling tag, small application range and the like in the prior art, improving the efficiency of short video tag generation and improving the accuracy, comprehensiveness and reliability of the short video tag. Compared with the prior art, the embodiment of the invention has the following advantages:
1) A short-video label prediction flow is designed on the basis of semantic, content, audio and other media information, providing a multimodal-fusion capability for short-video classification-label prediction.
2) A classification-label prediction method based on video text information is provided, which still delivers good classification-label predictions when the video content is not rich enough.
3) A video key-frame extraction technique combining slice-based key frames with a content-feature-based hierarchical clustering algorithm is realized, greatly reducing the prediction time of the video-content prediction model.
4) A multi-type label decision method based on semantic, content and audio features is provided, generating classification labels with a comprehensive confidence index (covering both importance and accuracy) under multimodal fusion.
In addition, the embodiment of the invention also has the following functions:
1) Short-video management function based on video classification labels: the application layer provides a video classification-label prediction entrance through which a user can upload short videos and edit video text information; the prediction result can be manually corrected and finally stored in a database, so that the next time the short video is opened its label information can be consulted in real time, and the number and operation status of each classification label in the current music library can be counted and displayed. The flow for newly added videos is as follows:
A1, uploading and editing the short video: the user can upload a short video on the application-layer page, either locally or via URL, and can edit text information such as the short video's title, notes and singer.
A2, calling the classification-label prediction interface and displaying the prediction result.
A3, manually correcting the prediction result: negative feedback is given for inaccurate labels, and unsuitable labels are removed manually.
2) Short-video search function based on classification labels: the existing short-video resources of the song library are processed offline in batches to generate a corresponding short-video classification-tag library, and search hit results are returned via an added tag-matching strategy. The flow for synchronizing the search index is as follows:
B1, offline batch processing of the song library's short-video resources: new videos are processed offline on a T+1 schedule each day, and the generated tag information is stored in the database.
B2, synchronizing to the search index base: video tag information is incrementally synchronized to the search index base on a daily schedule.
3) Short-video recommendation function based on classification labels: the existing short-video resources of the song library are processed offline in batches to generate a corresponding short-video classification-tag library, providing rich tag information for short-video recommendation. The flow for constructing the user preference model and the video similarity model is as follows:
C1, processing the song library's short-video resources offline in batches.
C2, constructing a user preference model: the tag information of videos is associated with the user's past behavior to generate a corresponding user preference model.
C3, constructing a video similarity model: a video similarity matrix is constructed according to the tag relevance and similarity between videos, providing more same-label video sources for video recommendation.
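The video similarity model of step C3 can be sketched with tag-overlap similarity; the choice of Jaccard similarity is an assumption, since the patent does not fix a specific metric:

```python
# Hypothetical sketch: pairwise video similarity from classification-tag
# overlap (Jaccard), assembled into a similarity matrix usable for
# same-tag recommendation.

def jaccard(tags_a, tags_b):
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(video_tags):
    """video_tags: {video_id: [tag, ...]} -> {(id_a, id_b): similarity}."""
    ids = list(video_tags)
    return {
        (i, j): jaccard(video_tags[i], video_tags[j])
        for i in ids for j in ids if i != j
    }

catalog = {
    "v1": ["beach", "travel", "drone"],
    "v2": ["beach", "travel"],
    "v3": ["concert", "pop-music"],
}
sim = similarity_matrix(catalog)
# sim[("v1", "v2")] -> 2/3 (two shared tags out of three distinct)
# sim[("v1", "v3")] -> 0.0 (no shared tags)
```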
Referring to fig. 3, an embodiment of the present invention provides a short video tag determining system, including:
the video and audio analysis module is used for acquiring the audio information of the first short video, and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio tag;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and carrying out video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
And the decision analysis module is used for carrying out weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
Referring to fig. 4, an embodiment of the present invention provides a short video tag determining apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a short video tag determination method as described above.
The content in the method embodiment is applicable to the embodiment of the device, and the functions specifically realized by the embodiment of the device are the same as those of the method embodiment, and the obtained beneficial effects are the same as those of the method embodiment.
The embodiment of the present invention also provides a computer-readable storage medium in which a processor-executable program is stored; when executed by a processor, the program performs the short video tag determination method described above.
The computer readable storage medium of the embodiment of the invention can execute the short video label determining method provided by the embodiment of the method of the invention, and can execute the steps of any combination implementation of the embodiment of the method, thereby having the corresponding functions and beneficial effects of the method.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the present invention has been described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features described above may be integrated in a single physical device and/or software module or one or more of the functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
In the foregoing description of this specification, reference to the terms "one embodiment/example," "another embodiment/example," "certain embodiments/examples," and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (7)

1. A short video tag determination method, comprising the steps of:
acquiring audio information of a first short video, and performing video-audio analysis on the first short video according to the audio information to obtain a first audio tag;
acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag;
acquiring title information, video description information and subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
performing weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag;
the step of obtaining the title information, the video description information and the caption information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the caption information to obtain a first semantic tag specifically comprises the following steps:
determining source address information, title information and video description information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic tag;
the step of generating a first short video tag by performing weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag specifically includes:
determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
determining semantic feature information of the first short video, wherein the semantic feature information comprises text special symbol quantity information, text length information and OCR recognition result duty ratio information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
determining weights of the first scene tag, the first object tag and the first character tag according to the first content quality, determining weights of the first semantic tag and the first derivative tag according to the first semantic quality, determining weights of the first audio tag according to the first audio quality, and screening and sorting all tags according to weight values to obtain a first short video tag;
the step of performing NLP semantic analysis on the title information, the video description information, the subtitle information, and the text information to obtain first semantic information specifically includes:
determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and further determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
2. The method for determining a short video tag according to claim 1, wherein the step of obtaining the audio information of the first short video, and performing video-audio analysis on the first short video according to the audio information to obtain the first audio tag specifically comprises:
determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
determining an audio fingerprint according to the audio information, and matching a first audio with similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
and inputting the first audio to a pre-constructed audio knowledge graph for matching to obtain the first audio tag.
3. The method for determining a short video tag according to claim 1, wherein the step of obtaining key frame information of the first short video, where the key frame information includes scene information, object information, and person information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag, and a first person tag specifically includes:
determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical file;
extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the variability larger than a preset second threshold value;
and respectively inputting the second key frame into a pre-trained scene recognition model, an object recognition model and a character recognition model, and determining the first scene tag, the first object tag and the first character tag according to recognition results.
4. The method for determining a short video tag according to claim 3, wherein the steps of extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and duplication removal on the first key frames by a hierarchical clustering algorithm, and selecting a plurality of second key frames with differences greater than a preset second threshold value specifically include:
performing fragment key frame extraction on the video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
performing binarization processing on the first key frame to obtain a pixel feature matrix of the first key frame;
and performing dimension reduction and duplication removal on the first key frame matrix through a hierarchical clustering method according to the pixel feature matrix to obtain a plurality of second key frames with pixel feature differences larger than a preset second threshold.
5. A short video tag determination system, comprising:
the video and audio analysis module is used for acquiring audio information of a first short video, and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio tag;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag and a first character tag;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and carrying out video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
The decision analysis module is used for carrying out weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag and the first semantic tag to generate a first short video tag;
the video semantic analysis module is specifically used for:
determining source address information, title information and video description information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic tag;
wherein the performing of NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain the first semantic information specifically comprises:
determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
inputting the key entity matrix into a pre-trained semantic prediction model to output a semantic prediction result and a confidence matrix, and further determining the first semantic information according to the semantic prediction result and the confidence matrix;
the decision analysis module is specifically configured to:
determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
determining semantic feature information of the first short video, wherein the semantic feature information comprises text special-symbol count information, text length information and OCR recognition result proportion information, and classifying the semantic feature information through a random forest algorithm to obtain a first semantic quality;
determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain a first audio quality;
and determining weights of the first scene tag, the first object tag and the first character tag according to the first content quality, determining weights of the first semantic tag and the first derivative tag according to the first semantic quality, determining weights of the first audio tag according to the first audio quality, and further screening and sorting all tags according to weight values to obtain a first short video tag.
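The weight decision step can be illustrated with a small stand-alone sketch. The three quality scores are assumed to come from the random-forest classifiers described above; the tag names, the modality-to-quality mapping, the score values, and the 0.3 screening threshold are invented for illustration and are not recited in the patent.

```python
def rank_tags(content_quality, semantic_quality, audio_quality, tags, min_weight=0.3):
    # Map each tag's source modality to the quality score of that modality:
    # scene/object/character tags take the content quality, semantic/derived
    # tags take the semantic quality, audio tags take the audio quality.
    modality_weight = {
        "scene": content_quality, "object": content_quality, "character": content_quality,
        "semantic": semantic_quality, "derived": semantic_quality,
        "audio": audio_quality,
    }
    weighted = [(name, modality_weight[kind]) for name, kind in tags]
    # Screen out tags below the weight threshold, then sort by weight descending
    return sorted((t for t in weighted if t[1] >= min_weight), key=lambda t: -t[1])

# Hypothetical candidate tags and quality scores (e.g. classifier outputs in [0, 1])
candidate_tags = [("beach", "scene"), ("guitar", "object"),
                  ("pop music", "audio"), ("travel vlog", "semantic")]
print(rank_tags(0.9, 0.6, 0.2, candidate_tags))
# -> [('beach', 0.9), ('guitar', 0.9), ('travel vlog', 0.6)]; 'pop music' is screened out
```

With a low audio quality (0.2), the audio-derived tag falls below the screening threshold, so the final first short video tag list is dominated by the higher-confidence content and semantic modalities.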
6. A short video tag determination apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the short video tag determination method as claimed in any one of claims 1 to 4.
7. A computer-readable storage medium storing a processor-executable program, wherein the processor-executable program, when executed by a processor, performs the short video tag determination method according to any one of claims 1 to 4.
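For context on the GRU part-of-speech step recited in claim 5, a minimal forward pass is sketched below. The weights are random and untrained, and the embedding size, hidden size, and three-tag set are purely illustrative; a real system would train these parameters and use a learned vocabulary.

```python
import numpy as np

def gru_step(x, h, p):
    # Standard GRU cell: update gate z, reset gate r, candidate state h_cand
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(p["Wz"] @ x + p["Uz"] @ h)
    r = sig(p["Wr"] @ x + p["Ur"] @ h)
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1 - z) * h + z * h_cand

def tag_sequence(embeddings, p, w_out, tag_names):
    # Run the GRU over the token embeddings; project each hidden state to a tag
    h = np.zeros(p["Uz"].shape[0])
    out = []
    for x in embeddings:
        h = gru_step(x, h, p)
        out.append(tag_names[int(np.argmax(w_out @ h))])
    return out

# Illustrative dimensions and random, untrained weights
rng = np.random.default_rng(42)
d_in, d_h = 4, 3
p = {name: rng.normal(size=(d_h, d_in if name[0] == "W" else d_h))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
tag_names = ["NOUN", "VERB", "OTHER"]
w_out = rng.normal(size=(len(tag_names), d_h))
sentence = rng.normal(size=(5, d_in))  # 5 hypothetical token embeddings
print(tag_sequence(sentence, p, w_out, tag_names))
```

In the claimed system, tokens tagged as key parts of speech would then be gathered into the key entity matrix that feeds the semantic prediction model.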
CN202111560398.0A 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium Active CN114297439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111560398.0A CN114297439B (en) 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium


Publications (2)

Publication Number Publication Date
CN114297439A CN114297439A (en) 2022-04-08
CN114297439B true CN114297439B (en) 2023-05-23

Family

ID=80967089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111560398.0A Active CN114297439B (en) 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114297439B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium
CN115150661B (en) * 2022-06-23 2024-04-09 深圳市闪剪智能科技有限公司 Method and related device for packaging video key fragments
CN115878849B (en) * 2023-02-27 2023-05-26 北京奇树有鱼文化传媒有限公司 Video tag association method and device and electronic equipment
CN115858854B (en) * 2023-02-28 2023-05-26 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN116628265A (en) * 2023-07-25 2023-08-22 北京天平地成信息技术服务有限公司 VR content management method, management platform, management device, and computer storage medium
CN117793352B (en) * 2024-02-22 2024-05-10 北京铁力山科技股份有限公司 Video compression method, device, equipment and storage medium based on semantic understanding

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111314732A (en) * 2020-03-19 2020-06-19 青岛聚看云科技有限公司 Method for determining video label, server and storage medium
CN113537215A (en) * 2021-07-19 2021-10-22 山东福来克思智能科技有限公司 Method and device for labeling video label

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US9304657B2 (en) * 2013-12-31 2016-04-05 Abbyy Development Llc Audio tagging
CN110769279B (en) * 2018-07-27 2023-04-07 北京京东尚科信息技术有限公司 Video processing method and device
US11030479B2 (en) * 2019-04-30 2021-06-08 Sony Interactive Entertainment Inc. Mapping visual tags to sound tags using text similarity
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
CN110866184B (en) * 2019-11-11 2022-12-02 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium


Non-Patent Citations (1)

Title
Video tag detection and recognition; Ye Lihua; Manufacturing Automation (Issue 06); full text *


Similar Documents

Publication Publication Date Title
CN114297439B (en) Short video tag determining method, system, device and storage medium
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US20180160200A1 (en) Methods and systems for identifying, incorporating, streamlining viewer intent when consuming media
CN109684513B (en) Low-quality video identification method and device
JP5518301B2 (en) Information processing device
CN110888990A (en) Text recommendation method, device, equipment and medium
CN112860943A (en) Teaching video auditing method, device, equipment and medium
CN109033060B (en) Information alignment method, device, equipment and readable storage medium
KR20070121810A (en) Synthesis of composite news stories
Shah et al. TRACE: linguistic-based approach for automatic lecture video segmentation leveraging Wikipedia texts
Dumont et al. Automatic story segmentation for tv news video using multiple modalities
US10402436B2 (en) Automated video categorization, value determination and promotion/demotion via multi-attribute feature computation
KR20150091053A (en) Method and apparatus for video retrieval
CN111314732A (en) Method for determining video label, server and storage medium
US10595098B2 (en) Derivative media content systems and methods
CN111414735A (en) Text data generation method and device
CN116738250A (en) Prompt text expansion method, device, electronic equipment and storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
Pereira et al. SAPTE: A multimedia information system to support the discourse analysis and information retrieval of television programs
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
Nandzik et al. CONTENTUS—technologies for next generation multimedia libraries: Automatic multimedia processing for semantic search
CN110888896B (en) Data searching method and data searching system thereof
US10499121B2 (en) Derivative media content systems and methods
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video
Messina et al. Ants: A complete system for automatic news programme annotation based on multimodal analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant