CN114297439A - Method, system, device and storage medium for determining short video label - Google Patents

Method, system, device and storage medium for determining short video label

Info

Publication number
CN114297439A
CN114297439A
Authority
CN
China
Prior art keywords
information
video
label
audio
short video
Prior art date
Legal status
Granted
Application number
CN202111560398.0A
Other languages
Chinese (zh)
Other versions
CN114297439B (en)
Inventor
袁征
Current Assignee
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111560398.0A priority Critical patent/CN114297439B/en
Publication of CN114297439A publication Critical patent/CN114297439A/en
Application granted granted Critical
Publication of CN114297439B publication Critical patent/CN114297439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method, a system, a device and a storage medium for determining a short video label, wherein the method comprises the following steps: acquiring audio information of a first short video, and performing video audio analysis on the first short video to obtain a first audio label; acquiring key frame information of the first short video, and performing video content analysis on the first short video to obtain a first scene label, a first object label and a first character label; acquiring title information, video description information and subtitle information of the first short video, and performing video semantic analysis on the first short video to obtain a first semantic label; and performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label. The invention improves the efficiency of generating short video labels and also improves their accuracy, comprehensiveness and reliability. The invention can be widely applied in the technical field of video processing.

Description

Method, system, device and storage medium for determining short video label
Technical Field
The invention relates to the technical field of video processing, in particular to a method, a system and a device for determining a short video label and a storage medium.
Background
Current video tagging mainly classifies and labels long videos (duration exceeding 60 seconds), and generally completes tagging by analyzing video content in one of two ways: 1) manual editing, in which an editor reviews the whole video content and assigns classification labels based on subjective judgment and understanding; 2) AI recognition, in which faces, scenes and objects appearing in video frames are recognized and labels of corresponding classes, such as stars, food and places, are extracted.
The existing video label labeling methods have the following defects:
1) Manual classification labeling is labor-intensive work that requires editors with strong aesthetic judgment and patience, and suffers from low efficiency, slow speed, highly subjective label quality and low video frame coverage.
2) AI content recognition places high requirements on the video content itself: the video frames must be relatively simple and free of interfering pictures, such as a crowded street or bizarre special effects, before AI recognition can reach usable classification accuracy, so its application range is limited.
3) AI content recognition can only extract effective information from the picture content of the video. Where the picture content is insufficient to represent the key information of the video, other key information is easily ignored; for a music MV with a non-star lead vocalist, for example, only classification labels such as singing, performance and singer can be output. The resulting labels are therefore incomplete and insufficiently effective, and cannot provide more meaningful video labels for real business requirements.
Disclosure of Invention
The present invention aims to solve at least to some extent one of the technical problems existing in the prior art.
Therefore, an object of an embodiment of the present invention is to provide a method for determining a short video label, which obtains multi-dimensional label information by performing video audio analysis, video content analysis and video semantic analysis on a short video, and then generates a short video label through weight decision analysis, so as to overcome the prior-art problems of low manual labeling efficiency, incomplete labels from AI content recognition and a narrow application range, improving the efficiency of generating short video labels as well as their accuracy, comprehensiveness and reliability.
It is another object of embodiments of the present invention to provide a short video tag determination system.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides a method for determining a short video tag, including the following steps:
acquiring audio information of a first short video, and performing video and audio analysis on the first short video according to the audio information to obtain a first audio tag;
acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label;
acquiring title information, video description information and subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label.
Further, in an embodiment of the present invention, the step of obtaining audio information of the first short video, and performing video and audio analysis on the first short video according to the audio information to obtain the first audio tag specifically includes:
determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
determining an audio fingerprint according to the audio information, and matching a first audio with the similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
and inputting the first audio into a pre-constructed audio knowledge graph for matching to obtain the first audio label.
Further, in an embodiment of the present invention, the step of acquiring key frame information of the first short video, where the key frame information includes scene information, object information, and person information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene tag, a first object tag, and a first person tag specifically includes:
determining source address information of the first short video, acquiring video physical files of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical files;
extracting a plurality of first key frames from the video frame files, performing dimensionality reduction and de-duplication on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the difference larger than a preset second threshold value;
and respectively inputting the second key frame into a pre-trained scene recognition model, an object recognition model and a character recognition model, and determining the first scene label, the first object label and the first character label according to a recognition result.
Further, in an embodiment of the present invention, the step of extracting a plurality of first key frames from the plurality of video frame files, performing dimension reduction and de-duplication on the first key frames by using a hierarchical clustering algorithm, and selecting a plurality of second key frames whose differences are greater than a preset second threshold specifically includes:
extracting the segmented key frames of the video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
performing binarization processing on the first key frame to obtain a pixel characteristic matrix of the first key frame;
and performing dimension reduction and duplicate removal on the first key frame matrix through a hierarchical clustering method according to the pixel feature matrix to obtain a plurality of second key frames with pixel feature difference larger than a preset second threshold value.
Further, in an embodiment of the present invention, the step of obtaining the title information, the video description information, and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information, and the subtitle information to obtain a first semantic tag specifically includes:
determining source address information, title information and video description information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
and inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic label.
Further, in an embodiment of the present invention, the step of performing NLP semantic analysis on the title information, the video description information, the subtitle information, and the text information to obtain first semantic information specifically includes:
determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
and inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
Further, in an embodiment of the present invention, the step of performing weight decision analysis according to the first audio tag, the first scene tag, the first object tag, the first character tag, and the first semantic tag to generate a first short video tag specifically includes:
determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
determining semantic feature information of the first short video, wherein the semantic feature information comprises text special symbol number information, text length information and OCR recognition result proportion information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
determining the weights of the first scene label, the first object label and the first character label according to the first content quality, determining the weights of the first semantic label and the first derivative label according to the first semantic quality, determining the weight of the first audio label according to the first audio quality, and screening and sorting the labels according to the weight to obtain a first short video label.
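The weight decision above can be sketched in Python. Everything here is illustrative: the function name, the 0.3 screening threshold and the top-5 cut-off are assumptions not stated in the patent, and the per-group quality scores stand in for the outputs of the random forest classifiers on content, semantic and audio features.

```python
# Hypothetical sketch of the weight decision step: quality scores (as would be
# produced by the random forest classifiers) set the weight of each label
# group, and labels are screened and sorted by weight.

def decide_short_video_labels(labels_by_group, quality_by_group,
                              top_k=5, min_weight=0.3):
    """labels_by_group: {"content": [...], "semantic": [...], "audio": [...]}
    quality_by_group: {"content": 0.8, ...} -- classifier outputs in [0, 1]."""
    weighted = []
    for group, labels in labels_by_group.items():
        weight = quality_by_group.get(group, 0.0)
        for label in labels:
            weighted.append((label, weight))
    # Screen out low-weight labels, then sort the rest by descending weight.
    kept = [(lbl, w) for lbl, w in weighted if w >= min_weight]
    kept.sort(key=lambda item: -item[1])
    return [lbl for lbl, _ in kept[:top_k]]
```

For example, with a high semantic quality and a low audio quality, the semantic labels rank first and the audio labels are screened out entirely.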
In a second aspect, an embodiment of the present invention provides a short video tag determination system, including:
the video and audio analysis module is used for acquiring audio information of a first short video and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio label;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and the decision analysis module is used for performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label.
In a third aspect, an embodiment of the present invention provides a short video tag determination apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a short video tag determination method as described above.
In a fourth aspect, the present invention further provides a computer-readable storage medium in which a processor-executable program is stored; when executed by the processor, the program performs the short video tag determination method described above.
Advantages and benefits of the present invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention:
the method and the device for generating the first short video label comprise the steps of obtaining audio information, scene information, object information, character information, title information, video description information and subtitle information of a first short video, carrying out video and audio analysis according to the audio information to obtain a first audio label, carrying out video content analysis according to the scene information, the object information and the character information to obtain a first scene label, a first object label and a first character label, carrying out video semantic analysis according to the title information, the video description information and the subtitle information to obtain a first semantic label, and further carrying out weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate the first short video label. The embodiment of the invention respectively performs video audio analysis, video content analysis and video semantic analysis on the short video to obtain multi-dimensional label information, and then generates the short video label through weight decision analysis, thereby overcoming the problems of low manual labeling efficiency, incomplete AI content identification labeling label, small application range and the like in the prior art, improving the generation efficiency of the short video label, and also improving the accuracy, the comprehensiveness and the reliability of the short video label.
Drawings
In order to more clearly illustrate the technical solution in the embodiment of the present invention, the following description is made on the drawings required to be used in the embodiment of the present invention, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solution of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of a short video tag determination method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for determining a short video tag according to an embodiment of the present invention;
fig. 3 is a block diagram of a short video tag determination system according to an embodiment of the present invention;
fig. 4 is a block diagram of a short video tag determination apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, "a plurality" means two or more. Where "first" and "second" are used to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
With the rapid development of short video services, the number of short videos has exploded. However, given the following characteristics of short videos, the traditional classification labeling methods for long videos are not applicable to them:
1) Short video content is more concise: the duration is generally about 60 seconds, and a short video does not contain the large amount of varied information a long video does. There is therefore no need to analyze a large number of frames, let alone complete frame sets, as traditional long-video methods do, leaving a large optimization space in classification prediction time.
2) Short video information is more standardized and clear, with high mining value. Short videos, as a product of the era of prevailing data analysis, generally carry clear and normative descriptive information such as {singer}_{song name}_{MV/singing}. Traditional classification labeling methods for long videos ignore these effective features, leaving optimization space in labeling accuracy.
Referring to fig. 1, an embodiment of the present invention provides a method for determining a short video tag, which specifically includes the following steps:
s101, obtaining audio information of the first short video, and performing video and audio analysis on the first short video according to the audio information to obtain a first audio label.
Specifically, the embodiment of the invention obtains the audio information of the short video, obtains the similar audio by matching the audio information in a preset audio library, and then obtains the first audio label by classifying through the audio knowledge graph. Step S101 specifically includes the following steps:
s1011, determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
s1012, determining an audio fingerprint according to the audio information, and matching a first audio with the similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
s1013, inputting the first audio into a pre-constructed audio knowledge graph for matching to obtain a first audio tag.
Specifically, as shown in fig. 2, which is a specific flowchart of the short video tag determination method provided by the embodiment of the present invention, first, the source address information of the first short video is determined, the video physical file is obtained according to the source address information, and the audio in the video physical file is extracted; a corresponding audio fingerprint is generated for the audio and matched in the audio library to find a first audio with similarity exceeding 80%; the first audio is then input into the audio knowledge graph and matched to obtain a first audio label, which is used for decision calculation in the subsequent decision analysis step.
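The matching flow above can be illustrated with a toy sketch. This is not the patent's actual fingerprint algorithm (which is unspecified); it assumes a simple fingerprint built from the sign of successive frame-energy differences, matched against a library by bit-level similarity with the 80% threshold named in the text. All function names are illustrative.

```python
# Toy audio fingerprint: 1 where the frame energy rises, 0 otherwise.
def fingerprint(energies):
    return [1 if b > a else 0 for a, b in zip(energies, energies[1:])]

# Fraction of matching fingerprint bits over the overlapping length.
def similarity(fp_a, fp_b):
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    return sum(1 for i in range(n) if fp_a[i] == fp_b[i]) / n

# Return ids of library audios whose similarity exceeds the threshold.
def match_audio(query_energies, audio_library, threshold=0.8):
    """audio_library: {audio_id: frame-energy list}."""
    q = fingerprint(query_energies)
    return [aid for aid, e in audio_library.items()
            if similarity(q, fingerprint(e)) > threshold]
```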
S102, obtaining key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label.
Specifically, the embodiment of the invention determines the key frame information used for representing the scene information, the object information and the character information through key frame extraction, and then obtains the first scene label, the first object label and the first character label through classification of the scene identification model, the object identification model and the character identification model. Step S102 specifically includes the following steps:
s1021, determining source address information of the first short video, acquiring video physical files of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical files;
s1022, extracting a plurality of first key frames from the plurality of video frame files, performing dimension reduction and duplication removal on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the difference larger than a preset second threshold value;
and S1023, respectively inputting the second key frame into a scene recognition model, an object recognition model and a character recognition model which are trained in advance, and determining a first scene label, a first object label and a first character label according to the recognition result.
Specifically, as shown in fig. 2, first, the source address information of the first short video is determined, the video physical file is obtained through the source address information, and video frames are cut through the CV2 library to obtain the video frame files; preliminary first key frames are extracted by a segmented key frame extraction method, dimension reduction and de-duplication are then performed on the first key frames through a hierarchical clustering algorithm based on content features, and the second key frames with the largest differences are selected; the second key frames are respectively input into the scene recognition model, the object recognition model/target detection model and the character recognition model for recognition, and the first scene label, the first object label and the first character label are determined according to the recognition results and used for decision calculation in the subsequent decision analysis steps.
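The segmented key frame selection can be sketched as a small index computation; in the flow above these indices would then be read from the video file with the CV2 (OpenCV) library. The even-spacing rule and the function name are assumptions for illustration.

```python
def segment_keyframe_indices(total_frames, segments=60, per_segment=1):
    """Split the video into `segments` equal slices and take `per_segment`
    evenly spaced frame indices from each slice (segmented key frame
    extraction with segments=60 and perSegmentFrame=1 as in the text)."""
    step = total_frames / segments
    indices = []
    for s in range(segments):
        for k in range(per_segment):
            indices.append(int(s * step + k * step / per_segment))
    return indices
```

With OpenCV, each index would be seeked via `cap.set(cv2.CAP_PROP_POS_FRAMES, idx)` followed by `cap.read()` on a `cv2.VideoCapture` object.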
As a further optional implementation manner, the step S1022 of extracting a plurality of first key frames from a plurality of video frame files, performing dimension reduction and deduplication on the first key frames by using a hierarchical clustering algorithm, and selecting a plurality of second key frames whose differences are greater than a preset second threshold value includes:
s10221, extracting the segmented key frames of the video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
s10222, performing binarization processing on the first key frame to obtain a pixel feature matrix of the first key frame;
s10223, according to the pixel feature matrix, performing dimension reduction and duplicate removal on the first key frame matrix by a hierarchical clustering method to obtain a plurality of second key frames with pixel feature difference larger than a preset second threshold.
Specifically, the video frame file is first subjected to segmented key frame extraction. Assuming the total frame count of the video is frame_total, with the segment number segments set to 60 and the per-segment frame count perSegmentFrame set to 1, a 60-frame first key frame matrix is obtained (the matrix expression appears only as an image in the source).
The first key frames are then grayed, with every picture pixel represented on 0-255, and the foreground pixels Foreground and background pixels Background are determined. The foreground color ratio is F = Foreground / (Foreground + Background) and the background color ratio is B = Background / (Foreground + Background). With the foreground color mean and variance denoted FA and FV, and the background color mean and variance denoted BA and BV, the intra-class difference is ID = F × FV² + B × BV² and the inter-class difference is OD = F × B × (FA − BA)². The threshold achieving Min(ID) is taken as the pixel threshold and compared with each pixel point: pixel points greater than or equal to the threshold are set to 1 and those smaller are set to 0, giving the first pixel feature matrix.
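A runnable sketch of the thresholding just described, operating on a flat list of grayscale pixels for simplicity. It reads FV² and BV² as the class variances computed directly, scans all candidate thresholds, keeps the one minimising ID, and binarises with it; the function names are illustrative assumptions.

```python
def class_stats(vals):
    """Mean and variance of one pixel class (empty class -> zeros)."""
    if not vals:
        return 0.0, 0.0
    mean = sum(vals) / len(vals)
    var = sum((p - mean) ** 2 for p in vals) / len(vals)
    return mean, var

def binarize(gray_pixels):
    """Pick the threshold minimising ID = F*FV^2 + B*BV^2, then binarise."""
    n = len(gray_pixels)
    best_t, best_id = 0, float("inf")
    for t in range(1, 256):
        fg = [p for p in gray_pixels if p >= t]   # foreground class
        bg = [p for p in gray_pixels if p < t]    # background class
        f_ratio, b_ratio = len(fg) / n, len(bg) / n
        _, f_var = class_stats(fg)
        _, b_var = class_stats(bg)
        intra = f_ratio * f_var + b_ratio * b_var  # intra-class difference ID
        if intra < best_id:
            best_id, best_t = intra, t
    return [1 if p >= best_t else 0 for p in gray_pixels]
```

Minimising the weighted within-class variance in this way is the classical Otsu criterion.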
The 60-dimensional first pixel feature matrix is then reduced by hierarchical clustering to find the 10 most different pictures, with the number of clusters specified as 10, as follows:
1) Each of the 60 rows of the first pixel feature matrix is taken as an initial sample forming its own class, giving 60 content feature classes G1(0), ..., G60(0); single-link distances between classes are calculated, yielding a 60 × 60 distance matrix, where "0" represents the initial state.
2) Assuming the distance matrix D(n) has been obtained (n being the number of successive cluster merges performed), the minimum element in D(n) is found and the two corresponding classes are merged into one, establishing a new classification: G1(n+1), G2(n+1), ....
3) The distances between the new classes after merging are calculated to obtain D(n+1).
4) Jump back to step 2) and repeat the calculation and merging.
5) Traversal stops once the classes have been reduced to G10; the first picture in each cluster is taken as the key frame after dimension reduction, giving 10 second key frames.
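Steps 1)-5) can be condensed into a small single-link agglomerative clustering sketch over binary pixel-feature vectors. Hamming distance stands in for the unspecified inter-frame distance, and the function names are assumptions for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two binary feature vectors."""
    return sum(x != y for x, y in zip(a, b))

def cluster_keyframes(features, n_clusters=10):
    """Single-link agglomerative clustering; the first (lowest-index) frame
    of each final cluster is kept as a deduplicated key frame, as in step 5)."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: minimum pairwise distance between members.
                d = min(hamming(features[a], features[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(min(c) for c in clusters)
```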
It can be appreciated that the embodiment of the invention, while retaining as much key video information as possible, first extracts preliminary key frames by the segmented key frame extraction method and then reduces their dimension through a content-feature-based hierarchical clustering algorithm, shortening video content prediction time with the fewest key frames and improving the efficiency of generating short video labels.
S103, acquiring the title information, the video description information and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic label.
Specifically, the embodiment of the invention obtains the title information, the video description information and the subtitle information of the short video, performs video semantic analysis according to the title information, the video description information and the subtitle information, and then matches the result of the semantic analysis in a semantic knowledge graph to obtain a first semantic label. Step S103 specifically includes the following steps:
s1031, determining source address information, title information and video description information of the first short video, obtaining a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
s1032, inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
s1033, performing voice recognition on the audio information to obtain text information, and further performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
s1034, inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic label.
Specifically, as shown in fig. 2, the source address information, title information, video description information and other structured information of the first short video are first determined; the video physical file is then obtained through the source address information, and the audio information and subtitle information are extracted. The structured information such as the title and video description is matched against the video knowledge graph to obtain a first derivative label. Voice recognition is performed on the audio information to generate corresponding text information, and NLP semantic analysis is performed on the subtitle information, title information, video description information and text information to obtain first semantic information. Finally, the first semantic information is combined with the semantic knowledge graph and the text importance to obtain the first semantic label, which is used for decision calculation in the subsequent decision analysis step.
As a further optional implementation manner, the step S1033 of performing NLP semantic analysis on the header information, the video description information, the subtitle information, and the text information to obtain first semantic information specifically includes:
s10331, determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
s10332, determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
s10333, inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
Specifically, since a short video carries good structured information, its title information and video description information can be obtained directly, and the subtitle information and text information can be further obtained; this information is cut on special characters such as _, "" or |, yielding an information set {T_name1, ..., T_namej} that forms the first information matrix.
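The splitting step can be sketched as follows; the delimiter set and the field names are assumptions based on the special characters mentioned above:

```python
import re

def build_information_matrix(title, description, subtitle, asr_text):
    """Cut the four text fields on special characters (_, quotes, |)
    and collect the non-empty pieces as the first information matrix."""
    pieces = []
    for field in (title, description, subtitle, asr_text):
        pieces += [p.strip() for p in re.split(r'[_"|“”]+', field) if p.strip()]
    return pieces

matrix = build_information_matrix(
    "City Walk_Food Tour", 'A day in "Guangzhou"', "subtitle text", "hello everyone")
print(matrix)
```

Each entry of the returned list plays the role of one T_name element of the information set.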
Features are learned from the first information matrix with a GRU neural network and passed to a CRF decoding layer to complete sequence labeling, outputting the word boundary and part of speech of every word in the first information matrix together with the entity-category relations. The output labels comprise 24 part-of-speech tags (lower case letters), 4 professional category tags (upper case letters), and person names, place names, organization names and times labeled in upper and lower case (PER/LOC/ORG/TIME and nr/ns/nt/t), where the lower-case forms mark low-confidence entities such as person names. After deleting non-key entities such as adjectives, quantifiers, pronouns, prepositions and adverbs, the key entity matrix is finally output as:
KeyEntity = [(Entity_1, IMP_1), (Entity_2, IMP_2), ..., (Entity_i, IMP_i)]

where Entity_i denotes the i-th deduplicated key entity obtained from the first information matrix and IMP_i its corresponding importance, a TF-IDF-style score:

IMP_i = f_i × log( j / (df_i + 1) )

where the word frequency f_i is the number of occurrences of Entity_i in the first information matrix, and df_i is the number of entries T_name that contain Entity_i.
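The importance score reads as a TF-IDF-style weight; a minimal sketch under that reading (the exact normalization used by the embodiment may differ) is:

```python
import math

def entity_importance(entities, fields):
    """TF-IDF-style importance: occurrences of the entity across the
    information matrix, scaled by log(#fields / (#fields containing it + 1))."""
    j = len(fields)
    scores = {}
    for entity in dict.fromkeys(entities):  # deduplicate, keep order
        tf = sum(field.count(entity) for field in fields)
        df = sum(1 for field in fields if entity in field)
        scores[entity] = tf * math.log(j / (df + 1))
    return scores

fields = ["cat video cat", "cat compilation", "bird song"]
print(entity_importance(["cat", "bird"], fields))
```

Entities appearing in most fields score near zero, while entities concentrated in few fields keep a higher weight, which matches the stated goal of keeping only key entities.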
The key entity matrix is then input to the semantic prediction model, which outputs the corresponding label and confidence matrix:

PredTag = [(Tag_i1, CL_i1), ..., (Tag_iz, CL_iz)]

where Tag_iz denotes the z-th label predicted from Entity_i and CL_iz the corresponding confidence. Rows carrying the same Tag are compressed into one, finally generating a deduplicated prediction tag matrix PredTag.
The prediction tag matrix PredTag is sorted by confidence from high to low, and the Top-K predicted semantic tags are returned as the first semantic tag.
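The compress-and-rank step can be sketched as follows; aggregating duplicate tags by their best confidence is an assumption (the embodiment might average instead):

```python
def top_k_tags(predictions, k):
    """Merge (tag, confidence) pairs sharing a tag, then return the
    K tags with the highest aggregated confidence."""
    merged = {}
    for tag, confidence in predictions:
        merged[tag] = max(merged.get(tag, 0.0), confidence)  # matrix compression
    ranked = sorted(merged, key=merged.get, reverse=True)
    return ranked[:k]

preds = [("music", 0.91), ("dance", 0.85), ("music", 0.40), ("travel", 0.30)]
print(top_k_tags(preds, 2))  # ['music', 'dance']
```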
It can be appreciated that the semantic tag prediction technology provided by the embodiment of the present invention can still generate highly usable classification tags when the video and audio content is of low quality or the effective information is insufficient.
S104, performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label.
Specifically, the multiple types of labels obtained in the preceding steps are weighted by computing the video content quality, the video semantic quality and the video audio quality; the Top-K label information is then computed and assembled into a JSON body for output. Step S104 specifically includes the following steps:
s1041, determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
s1042, determining semantic feature information of the first short video, wherein the semantic feature information comprises text special symbol number information, text length information and OCR recognition result proportion information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
s1043, determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
s1044 is that the weights of the first scene tag, the first object tag and the first character tag are determined according to the first content quality, the weights of the first semantic tag and the first derivative tag are determined according to the first semantic quality, the weight of the first audio tag is determined according to the first audio quality, and then the tags are screened and sequenced according to the weight, so that the first short video tag is obtained.
Specifically, the duration, total frame count and resolution of the short video are taken as video content features, and an RF (random forest) algorithm classifies the video content quality, outputting one of three tags CQ ∈ {High, Medium, Low}; the number and length of special symbols in the title and the OCR recognition proportion (OCR results / frame count) are taken as video semantic features, and a random forest classifies the video text quality, outputting TQ ∈ {High, Medium, Low}; the audio length and the audio spectrum are taken as video-audio features, and a random forest classifies the video-audio quality, outputting AQ ∈ {High, Medium, Low}. The corresponding label weights (CT for content labels, AT for audio labels, TT for text labels) are then selected according to the quality function matrix (for example, High corresponds to weight 1, Medium to 0.5 and Low to 0), and the labels are finally screened by weight, sorted from high to low, and the Top-K label information is output.
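A sketch of the weight-and-assemble step; the weight mapping follows the example values above (High = 1, Medium = 0.5, Low = 0), while the per-tag confidence scores are hypothetical inputs:

```python
import json

QUALITY_WEIGHT = {"High": 1.0, "Medium": 0.5, "Low": 0.0}

def fuse_labels(content_tags, text_tags, audio_tags, cq, tq, aq, k=3):
    """Weight each tag group by its quality class, drop zero-weight tags,
    sort descending, and emit the Top-K as a JSON body."""
    weighted = []
    for tags, quality in ((content_tags, cq), (text_tags, tq), (audio_tags, aq)):
        w = QUALITY_WEIGHT[quality]
        weighted += [(name, score * w) for name, score in tags]
    weighted = [t for t in weighted if t[1] > 0]
    weighted.sort(key=lambda t: t[1], reverse=True)
    return json.dumps([{"tag": n, "weight": round(s, 3)} for n, s in weighted[:k]])

out = fuse_labels(
    content_tags=[("street scene", 0.9), ("person", 0.8)],
    text_tags=[("travel vlog", 0.95)],
    audio_tags=[("pop music", 0.7)],
    cq="High", tq="Medium", aq="Low")
print(out)
```

With low audio quality the audio tag is discarded entirely, and the medium-quality text tag is halved before ranking, mirroring the screening-by-weight described above.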
The method steps of the embodiments of the present invention are described above. It can be understood that the embodiment of the invention performs video audio analysis, video content analysis and video semantic analysis on the short video to obtain multi-dimensional label information, and then generates the short video label through weight decision analysis, thereby overcoming the problems in the prior art of low manual labeling efficiency, incomplete AI content-recognition labels and a narrow application range; this improves the efficiency of generating the short video label as well as the accuracy, comprehensiveness and reliability of the label. Compared with the prior art, the embodiment of the invention also has the following advantages:
1) A short video label prediction flow based on media information such as semantics, content and audio is realized, providing a multimodal-fusion short video classification label prediction capability.
2) A classification label prediction method based on video text information is provided, which still yields good classification labels when the video content is not rich enough.
3) A video key frame extraction technique combining segmented key frames with a content-feature hierarchical clustering algorithm is realized, greatly reducing the prediction time of the video content prediction model.
4) A multi-type label decision method based on semantic, content and audio features is provided, generating classification labels with a comprehensive confidence index (covering importance and accuracy) under multimodal fusion.
In addition, the embodiment of the invention also has the following functions:
1) Short video management function based on video classification labels: the application layer provides a video classification label prediction entry where a user can upload a short video and edit its text information; the prediction result can be corrected manually and is finally stored in a database, so that the label information of the video can be consulted in real time the next time it is opened, and the number and operation status of each classification label in the current song library are counted and displayed. The flow for adding a new video is as follows:
A1, uploading and editing the short video: the user can upload the short video locally or via a URL selected on the application layer page, and editing of text information such as the short video title, notes and singer is supported.
A2, calling the classification label prediction interface and displaying the prediction result.
A3, manually correcting the prediction result: inaccurate labels receive negative feedback, and improper labels are removed manually.
2) Short video search function based on classification tags: the existing short video resources of the song library are batch-processed offline to generate a corresponding short video classification label library, and a label matching strategy is added so that search hits are returned. The search index synchronization flow is as follows:
B1, batch-processing the song library's short video resources offline: new videos are processed offline on a T+1 schedule every day, and the generated label information is stored in the database.
B2, synchronizing to the search index library: the video label information is incrementally synchronized to the search index library at a fixed time each day.
3) Short video recommendation function based on classification tags: the existing short video resources of the music library are batch-processed offline to generate a corresponding short video classification label library, providing rich label information for short video recommendation. The flow for building the user preference model and the video similarity model is as follows:
C1, batch-processing the music library's short video resources offline.
C2, constructing the user preference model: the label information of videos is associated according to the user's past behavior to generate a corresponding user preference model.
C3, constructing the video similarity model: a video similarity matrix is built from the label relevance and similarity between videos, providing more same-label video sources for recommendation.
Referring to fig. 3, an embodiment of the present invention provides a short video tag determination system, including:
the video and audio analysis module is used for acquiring audio information of the first short video and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio label;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and the decision analysis module is used for performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
Referring to fig. 4, an embodiment of the present invention provides a short video tag determination apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a method for short video tag determination as described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, in which a program executable by a processor is stored, and the program executable by the processor is used for executing the above-mentioned short video tag determination method when executed by the processor.
The computer-readable storage medium of the embodiment of the invention can execute the short video tag determination method provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the above-described functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer readable medium could even be paper or another suitable medium upon which the above described program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for short video tag determination, comprising the steps of:
acquiring audio information of a first short video, and performing video and audio analysis on the first short video according to the audio information to obtain a first audio tag;
acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label;
acquiring title information, video description information and subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and performing weight decision analysis according to the first audio label, the first scene label, the first object label and the first semantic label to generate a first short video label.
2. The method for determining a short video tag according to claim 1, wherein the step of obtaining audio information of a first short video, performing video and audio analysis on the first short video according to the audio information to obtain a first audio tag specifically comprises:
determining source address information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information in the video physical file;
determining an audio fingerprint according to the audio information, and matching a first audio with the similarity higher than a preset first threshold value in a preset audio library according to the audio fingerprint;
and inputting the first audio into a pre-constructed audio knowledge graph for matching to obtain the first audio label.
3. The method according to claim 1, wherein the step of obtaining the key frame information of the first short video, the key frame information including scene information, object information, and person information, and analyzing the video content of the first short video according to the key frame information to obtain a first scene tag, a first object tag, and a first person tag specifically comprises:
determining source address information of the first short video, acquiring video physical files of the first short video according to the source address information, and determining a plurality of video frame files according to the video physical files;
extracting a plurality of first key frames from the video frame files, performing dimensionality reduction and de-duplication on the first key frames through a hierarchical clustering algorithm, and selecting a plurality of second key frames with the difference larger than a preset second threshold value;
and respectively inputting the second key frame into a pre-trained scene recognition model, an object recognition model and a character recognition model, and determining the first scene label, the first object label and the first character label according to a recognition result.
4. The method according to claim 3, wherein the step of extracting a plurality of first key frames from the plurality of video frame files, performing dimension reduction and de-duplication on the first key frames by using a hierarchical clustering algorithm, and selecting a plurality of second key frames having differences larger than a preset second threshold specifically comprises:
extracting the segmented key frames of the video frame file to obtain a plurality of first key frames, and determining a first key frame matrix;
performing binarization processing on the first key frame to obtain a pixel characteristic matrix of the first key frame;
and performing dimension reduction and duplicate removal on the first key frame matrix through a hierarchical clustering method according to the pixel feature matrix to obtain a plurality of second key frames with pixel feature difference larger than a preset second threshold value.
5. The method according to claim 1, wherein the step of obtaining the title information, the video description information, and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information, and the subtitle information to obtain the first semantic tag specifically comprises:
determining source address information, title information and video description information of the first short video, acquiring a video physical file of the first short video according to the source address information, and extracting audio information and subtitle information in the video physical file;
inputting the title information and the video description information into a pre-constructed video knowledge graph for matching to obtain a first derivative label;
performing voice recognition on the audio information to obtain text information, and performing NLP semantic analysis on the title information, the video description information, the subtitle information and the text information to obtain first semantic information;
and inputting the first semantic information into a pre-constructed semantic knowledge graph for matching to obtain a first semantic label.
6. The method according to claim 5, wherein the step of performing NLP semantic analysis on the header information, the video description information, the subtitle information, and the text information to obtain first semantic information specifically comprises:
determining a first information matrix according to the title information, the video description information, the subtitle information and the text information;
determining part-of-speech tags of each word in the first information matrix through a GRU neural network, and determining a key entity matrix according to the part-of-speech tags and the first information matrix;
and inputting the key entity matrix into a pre-trained semantic prediction model, outputting to obtain a semantic prediction result and a confidence coefficient matrix, and determining first semantic information according to the semantic prediction result and the confidence coefficient matrix.
7. The method according to claim 5, wherein the step of performing weight decision analysis on the first audio tag, the first scene tag, the first object tag, the first person tag, and the first semantic tag to generate the first short video tag specifically comprises:
determining content characteristic information of the first short video, wherein the content characteristic information comprises video duration information, video frame number information and resolution information, and classifying the content characteristic information through a random forest algorithm to obtain first content quality;
determining semantic feature information of the first short video, wherein the semantic feature information comprises text special symbol number information, text length information and OCR recognition result proportion information, and classifying the semantic feature information through a random forest algorithm to obtain first semantic quality;
determining audio characteristic information of the first short video, wherein the audio characteristic information comprises audio length information and audio spectrum information, and classifying the audio characteristic information through a random forest algorithm to obtain first audio quality;
determining the weights of the first scene label, the first object label and the first character label according to the first content quality, determining the weights of the first semantic label and the first derivative label according to the first semantic quality, determining the weight of the first audio label according to the first audio quality, and screening and sorting the labels according to the weight to obtain a first short video label.
8. A short video tag determination system, comprising:
the video and audio analysis module is used for acquiring audio information of a first short video and carrying out video and audio analysis on the first short video according to the audio information to obtain a first audio label;
the video content analysis module is used for acquiring key frame information of the first short video, wherein the key frame information comprises scene information, object information and character information, and performing video content analysis on the first short video according to the key frame information to obtain a first scene label, a first object label and a first character label;
the video semantic analysis module is used for acquiring the title information, the video description information and the subtitle information of the first short video, and performing video semantic analysis on the first short video according to the title information, the video description information and the subtitle information to obtain a first semantic tag;
and the decision analysis module is used for performing weight decision analysis according to the first audio label, the first scene label, the first object label, the first character label and the first semantic label to generate a first short video label.
9. A short video tag determination apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a short video tag determination method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a processor-executable program, wherein the program, when executed by a processor, performs a short video tag determination method as claimed in any one of claims 1 to 7.
CN202111560398.0A 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium Active CN114297439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111560398.0A CN114297439B (en) 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium


Publications (2)

Publication Number Publication Date
CN114297439A true CN114297439A (en) 2022-04-08
CN114297439B CN114297439B (en) 2023-05-23

Family

ID=80967089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111560398.0A Active CN114297439B (en) 2021-12-20 2021-12-20 Short video tag determining method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114297439B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150186395A1 (en) * 2013-12-31 2015-07-02 Abbyy Development Llc Method and System for Offline File Management
US20210314675A1 (en) * 2018-07-27 2021-10-07 Beijing Jingdong Shangke Information Technology Co., Ltd. Video processing method and apparatus
US20200349387A1 (en) * 2019-04-30 2020-11-05 Sony Interactive Entertainment Inc. Mapping visual tags to sound tags using text similarity
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
CN110866184A (en) * 2019-11-11 2020-03-06 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium
CN111314732A (en) * 2020-03-19 2020-06-19 青岛聚看云科技有限公司 Method for determining video label, server and storage medium
CN113537215A (en) * 2021-07-19 2021-10-22 山东福来克思智能科技有限公司 Method and device for labeling video label

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YE Lihua: "Video label detection and recognition", Manufacturing Automation (制造业自动化) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150661A (en) * 2022-06-23 2022-10-04 深圳市大头兄弟科技有限公司 Method and related device for packaging video key fragments
CN115150661B (en) * 2022-06-23 2024-04-09 深圳市闪剪智能科技有限公司 Method and related device for packaging video key fragments
CN115878849A (en) * 2023-02-27 2023-03-31 北京奇树有鱼文化传媒有限公司 Video tag association method and device and electronic equipment
CN115858854A (en) * 2023-02-28 2023-03-28 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN116628265A (en) * 2023-07-25 2023-08-22 北京天平地成信息技术服务有限公司 VR content management method, management platform, management device, and computer storage medium

Also Published As

Publication number Publication date
CN114297439B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US10528821B2 (en) Video segmentation techniques
CN108509465B (en) Video data recommendation method and device and server
CN114297439B (en) Short video tag determining method, system, device and storage medium
CN108829893B (en) Method and device for determining video label, storage medium and terminal equipment
CN110442747B (en) Video abstract generation method based on keywords
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
JP5518301B2 (en) Information processing device
US10572528B2 (en) System and method for automatic detection and clustering of articles using multimedia information
CN109446376B (en) Method and system for classifying voice through word segmentation
US20170329769A1 (en) Automated video categorization, value determination and promotion/demotion via multi-attribute feature computation
CN111314732A (en) Method for determining video label, server and storage medium
Dumont et al. Automatic story segmentation for tv news video using multiple modalities
CN111984824A (en) Multi-mode-based video recommendation method
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
US10595098B2 (en) Derivative media content systems and methods
CN114896305A (en) Smart internet security platform based on big data technology
CN112199932A (en) PPT generation method, device, computer-readable storage medium and processor
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
US10499121B2 (en) Derivative media content systems and methods
Mahapatra et al. Automatic hierarchical table of contents generation for educational videos
CN113407775B (en) Video searching method and device and electronic equipment
Nixon et al. Multimodal video annotation for retrieval and discovery of newsworthy video in a news verification scenario
CN110888896A (en) Data searching method and data searching system thereof
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant