CN110704680B - Label generation method, electronic device and storage medium - Google Patents

Label generation method, electronic device and storage medium Download PDF

Info

Publication number
CN110704680B
Authority
CN
China
Prior art keywords
user
visual content
target audio
target
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910767754.2A
Other languages
Chinese (zh)
Other versions
CN110704680A (en)
Inventor
张进
莫东松
钟宜峰
张健
赵璐
马丹
马晓琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MIGU Culture Technology Co Ltd
Original Assignee
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MIGU Culture Technology Co Ltd filed Critical MIGU Culture Technology Co Ltd
Priority to CN201910767754.2A priority Critical patent/CN110704680B/en
Publication of CN110704680A publication Critical patent/CN110704680A/en
Application granted granted Critical
Publication of CN110704680B publication Critical patent/CN110704680B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The embodiments of the present invention relate to the technical field of artificial intelligence and disclose a label generation method, an electronic device, and a storage medium. In the invention, target audio-visual content is identified; personalized information of a user receiving the target audio-visual content is acquired, the personalized information including real-time state information of the user while receiving the target audio-visual content; and a label of the target audio-visual content is generated according to the personalized information of the user and the identification result of the target audio-visual content. The label includes a user emotion label generated according to the real-time state information of the user while receiving the target audio-visual content and the identification result of the target audio-visual content. The accuracy with which the label describes the user experience can thereby be improved.

Description

Label generation method, electronic device and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a label generation method, electronic equipment and a storage medium.
Background
Currently, the labels of video programs are mainly conventional labels such as video genre, director, leading actors, and theme. Refined video tags define and describe a video within a certain vocabulary, using nouns (e.g., the people and objects in the video), verbs (what is happening, what someone is doing, etc.), and adjectives (cheerful, sad, etc.), mainly based on video segments or the content of individual frames. Refined labels can describe the video content in more detail and better support work such as editing, retrieval, and refined operation of videos. In the related art, deep-learning-based video understanding is already the main tool for producing refined video tags, and it can accurately identify information with obvious features, such as objects and people in video images.
The inventors have found at least the following problem in the prior art: refined labels are produced by identifying the image frames of a video with a machine learning model, so the labels are fixed. However, the users watching the video differ, and different users may feel differently about the same video; the content of existing refined labels therefore may not match a user's feeling and cannot accurately describe the experience of the user watching the video content.
Disclosure of Invention
An object of embodiments of the present invention is to provide a tag generation method, an electronic device, and a storage medium, which can improve accuracy of describing user experience by a tag.
In order to solve the above technical problem, an embodiment of the present invention provides a tag generation method, including the following steps: identifying the target audio-visual content; acquiring personalized information of a user receiving the target audio-visual content, wherein the personalized information comprises real-time state information when the user receives the target audio-visual content; generating a label of the target audio-visual content according to the personalized information of the user and the identification result of the target audio-visual content; the tag comprises a user emotion tag generated according to real-time state information when the user receives the target audio-visual content and the identification result of the target audio-visual content.
An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the tag generation method.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-described tag generation method.
Compared with the prior art, the embodiments of the present invention identify the target audio-visual content and automatically generate a label according to the personalized information of the user and the identification result of the target audio-visual content, so that the generated label is associated with the individual user. The personalized information of the user includes real-time state information of the user while receiving the target audio-visual content, and the generated label includes a user emotion label generated from this real-time state information and the basic identification information of the audio-visual content. Because the user emotion label is generated from the user's real-time state together with the identification result of the audio-visual content, the refined label of the audio-visual content is related to the state the user actually shows. This prevents the label from being influenced by the subjective judgment of operators and makes the label's description of the audio-visual content more consistent with the audience's viewing experience, so that the label values better match the real experience of users when viewing the content, improving the accuracy with which the label describes the user experience.
In addition, the real-time state information of the user while receiving the target audio-visual content is obtained by analyzing a real-time monitoring video of the user. Analyzing the monitoring video makes it possible to capture the user's real-time state changes accurately and thus obtain the real-time state information. Because this approach requires nothing from the user, it does not hinder the viewing experience and is user-friendly.
In addition, the real-time monitoring video of the user is analyzed as follows: selecting a plurality of key frames; analyzing the key frames to obtain a feature value for each key frame, where the feature values of the key frames are used to characterize the user's real-time state information. Extracting key frames from the video instead of analyzing every frame keeps the analysis of the user's real-time state accurate while reducing the amount of computation as much as possible and saving resources.
In addition, the feature value of each key frame is obtained as follows: cropping an image block from the key frame, where the image block is the part of the key frame that reflects the user's real-time state information; extracting the feature value of the image block and using it as the feature value of the key frame. When extracting a key frame's feature value, only the image block reflecting the user's real-time state is processed. On one hand, this reduces the influence of image information irrelevant to the user's state on the resulting feature value; on the other hand, because the image block is only a region of the image frame, analyzing it requires less computation than analyzing the entire frame.
In addition, the personalized information of the user also comprises user portrait information of the user; the tag also includes a user portrait tag generated based on user portrait information of the user and a recognition result of the target audiovisual content. The tags related to the user portrait are marked on the target audio-visual content, so that different classifications of the target audio-visual content can be made according to different types of people.
In addition, the identifying the target audio-visual content specifically includes: classifying the target audiovisual content using at least one classifier; and identifying the target audio-visual content according to the output result of each classifier. And identifying the target audio-visual content by adopting at least one classifier, and determining the final identification result by integrating the classification results of all the classifiers so that the final identification result is more accurate.
In addition, the classification result of a classifier includes probability scores of the target audio-visual content belonging to each type. Identifying the target audio-visual content according to the output results of the classifiers specifically includes: calculating the probability score sum of each type, where the probability score sum of a type is the sum, over all classifiers, of the product of that type's probability score in the classifier's classification result and the classifier's weight; and taking the type with the largest probability score sum as the identification result of the target audio-visual content. This provides a specific method for identifying the target audio-visual content from the classifier outputs, and taking the type with the largest probability score sum as the final result keeps the identification as reliable as possible.
In addition, the label of the target audio-visual content further includes a base label obtained from the identification result of the target audio-visual content. The various labels of the target audio-visual content are used to classify it, and the earlier the target audio-visual content was generated, the smaller the weight of the base label during classification. The earlier the content was generated, the more the audience's viewing experience may have changed, so giving the base label a smaller weight when classifying older audio-visual content describes the current user's viewing experience more accurately.
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
Fig. 1 is a flowchart of a tag generation method provided according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a method for analyzing real-time monitoring video of a user according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a specific method for identifying targeted audiovisual content provided in accordance with a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments.
A first embodiment of the present invention relates to a label generation method. In the present embodiment, the target audiovisual content is identified; acquiring personalized information of a user receiving the target audio-visual content, wherein the personalized information comprises real-time state information when the user receives the target audio-visual content; generating a label of the target audio-visual content according to the personalized information of the user and the identification result of the target audio-visual content; the tag comprises a user emotion tag generated according to real-time state information when the user receives the target audio-visual content and the identification result of the target audio-visual content, and the accuracy of the tag in describing the feeling of the user watching the target audio-visual content can be improved. A flowchart of a tag generation method in this embodiment is shown in fig. 1, and details of implementation of the tag generation method in this embodiment are specifically described below, and the following details are provided only for convenience of understanding and are not necessary for implementing this embodiment.
Step 101, identifying a target audiovisual content.
Specifically, while the user is watching a video or listening to audio-visual content such as a broadcast, the electronic device may identify the audio-visual content with a trained basic identification model to obtain some basic information about it. For example, if the target audio-visual content is a video, the identified basic information may include the upload time and theme of the video; if the target audio-visual content is a movie, it may also include information about the movie's director and leading actors.
In practical implementation, identifying the target audiovisual content yields some basic information about it, and this basic information may further be used to generate base tags of the target audiovisual content. The base tags may include tags of the basic type of the content; for example, if the target audiovisual content is a video that is identified as a comedy movie from 1995, the content may be tagged with the basic type "comedy". Such basic information may also be used as feature data for further analysis of the target audiovisual content.
In one example, the target audiovisual content is a video. Because video adds a time dimension on top of two-dimensional images, video content is often recognized with a 3D deep convolutional neural network model. Before the video is processed by the 3D model to obtain a recognition result, it is preprocessed: for example, the video can be converted into an image sequence with tools such as OpenCV and FFmpeg, and data augmentation can be applied to increase diversity and prevent overfitting. The augmentation mainly includes temporal augmentation (randomly cropping the image sequence along the time dimension) and spatial augmentation (randomly flipping images horizontally and randomly cropping them at multiple positions). After preprocessing, the video is identified by the recognition model (such as a 3D deep convolutional neural network), yielding the basic identification information of the video.
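The following Python sketch illustrates the kind of preprocessing described above. It is only a minimal illustration under assumptions not stated in the patent: the frame size, clip length, and crop size are arbitrary choices, OpenCV is used only as one possible decoding tool, and the temporal and spatial augmentation follow the random-crop and random-flip description in general terms.

```python
import random
import cv2
import numpy as np

def video_to_frames(path, size=(171, 128)):
    """Decode a video file into a list of resized RGB frames (sizes are illustrative)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

def temporal_crop(frames, clip_len=16):
    """Temporal augmentation: randomly crop a contiguous clip along the time axis."""
    if len(frames) <= clip_len:
        return frames
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]

def spatial_augment(frames, crop=112):
    """Spatial augmentation: one random crop position and a random horizontal flip,
    applied consistently to every frame of the clip."""
    h, w, _ = frames[0].shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    flip = random.random() < 0.5
    out = []
    for f in frames:
        patch = f[top:top + crop, left:left + crop]
        if flip:
            patch = cv2.flip(patch, 1)  # horizontal flip
        out.append(patch)
    return np.stack(out)  # shape (clip_len, crop, crop, 3), ready for a 3D CNN
```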
Step 102, obtaining the personalized information of the user receiving the target audio-visual content, wherein the personalized information comprises the real-time state information when the user receives the target audio-visual content.
Specifically, the electronic device may acquire the personalized information of the user through various devices capable of collecting such information, through the user's login information, through the user's browsing history, and so on. The personalized information includes at least real-time status information of the user while receiving the target audiovisual content; this real-time status information directly reflects the user's experience of the target audiovisual content, so acquiring the personalized information of the user involves, first of all, acquiring this real-time status information.
In practical implementation, the electronic device may receive the monitoring video captured by a video capture device; the real-time status information of the user while receiving the target audio-visual content is obtained by analyzing this real-time monitoring video. When analyzing the real-time monitoring video, image features can be extracted from each video frame to obtain per-frame feature values, which are used to characterize the user's real-time status. In practice, to improve the accuracy of feature extraction as much as possible, several different extraction methods may be applied to the video frames, each yielding its own feature values. For example, one method obtains image features with local feature descriptors such as GIST and HOG, while another extracts features with a CNN model (VGG19) trained on ImageNet; when the user's real-time status information is needed, the feature values obtained by both methods can be used together as a reference.
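As a hedged illustration of the two extraction routes mentioned above, the sketch below computes a HOG descriptor with OpenCV and deep features from a torchvision VGG19 pretrained on ImageNet. The chosen layer (the 4096-dimensional output before the final classification layer), the image sizes, and the decision to concatenate the two vectors are assumptions made only for illustration.

```python
import numpy as np
import cv2
import torch
from torchvision import models, transforms

# Route 1: hand-crafted local descriptor (HOG) on a grayscale frame.
hog = cv2.HOGDescriptor()  # default 64x128 detection window
def hog_features(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (64, 128))
    return hog.compute(gray).ravel()

# Route 2: deep features from VGG19 pretrained on ImageNet, classification head removed.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d output
vgg.eval()
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vgg_features(frame_bgr):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        x = preprocess(rgb).unsqueeze(0)
        return vgg(x).squeeze(0).numpy()

def frame_features(frame_bgr):
    # Both feature vectors are kept, since the description suggests using them together.
    return np.concatenate([hog_features(frame_bgr), vgg_features(frame_bgr)])
```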
In one example, several users receive the target audiovisual content together, for example a family watching a movie together. When the real-time status of the users is monitored through the video capture device, the real-time status information of all of these users needs to be monitored; in this case, the people in the image can be analyzed one by one using image processing techniques to obtain the real-time status information of each person.
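One possible way to analyze the people in a monitoring frame one by one is sketched below. The patent does not name a specific detector; the use of OpenCV's bundled Haar-cascade face detector and its parameters are assumptions for illustration.

```python
import cv2

# Locate each viewer in a monitoring frame with OpenCV's bundled Haar cascade
# face detector (an assumed choice, not specified in the patent).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def per_person_regions(frame_bgr):
    """Return one cropped image per detected person, to be analyzed individually."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame_bgr[y:y + h, x:x + w] for (x, y, w, h) in faces]
```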
And 103, generating a label of the target audio-visual content according to the personalized information of the user and the identification result of the target audio-visual content.
Specifically, the personalized information of the user includes the user's real-time status information, and the generated label includes at least a user emotion label generated according to the real-time status information of the user while receiving the target audiovisual content and the identification result of the target audiovisual content. When generating the user emotion label, the basic information obtained by identifying the target audiovisual content in step 101 is used as feature data of the identification result, and, together with the feature values obtained by analyzing the user's real-time monitoring video in step 102, is used as the input of a pre-trained classifier; the output of the classifier is the user emotion label. The user emotion label describes the user's general feeling while watching the target audiovisual content, and its values may include: happy, calm, angry, crying, and so on. In this embodiment, the identification result of the target audiovisual content is taken into account when generating the user emotion label mainly because the user's real-time status does not necessarily reflect the feeling about the content itself and may be disturbed by the surrounding environment (for example, the user may be amused by a child nearby while watching the video, which has nothing to do with the video content).
In one example, the personalized information of the user may also include user portrait information, such as basic information like gender and age, and preference information obtained by analyzing the user's history. Accordingly, the label of the target audiovisual content may include a user portrait label generated according to the user portrait information of the user and the identification result of the target audiovisual content. Tagging audiovisual content with user portrait labels helps the electronic device classify videos differently for different groups of users (for example, if statistics show that eighty percent of female users respond with laughter to a certain video while only twenty percent of children do, the video is classified as a comedy only for female users). The user portrait information can be obtained by analyzing the user's login information, or through speech recognition and image recognition techniques.
In one example, the labels of the target audiovisual content may include several kinds, such as a base label, a user emotion label, and a user portrait label, and these labels may be used to classify the target audiovisual content; the earlier the target audiovisual content was generated, the smaller the weight of the base label during classification. The earlier the content was generated, the more the audience's viewing experience may have changed, so for older audiovisual content, giving the base label a smaller weight during classification lets the classification describe the current user's experience more accurately. For example, a movie that was a blockbuster in the 1990s may seem ordinary to modern viewers, so a base label identifying the movie as a blockbuster is no longer very accurate; reducing the weight of the base label makes the final classification of the movie more accurate.
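The patent does not give a formula for how the base label's weight shrinks with content age. The sketch below assumes a simple exponential decay purely for illustration; the tag kinds, the half_life parameter, and the voting scheme are hypothetical.

```python
from collections import defaultdict

def classify_with_aged_base_tag(tag_votes, age_years, half_life=10.0):
    """Combine votes from the different tag kinds; the base tag's weight decays
    with the age of the audiovisual content (the exponential form is an assumed
    illustration, not taken from the patent).

    tag_votes example: {"base": {"comedy": 0.9}, "user_emotion": {"happy": 0.7}}
    """
    base_weight = 0.5 ** (age_years / half_life)   # older content -> smaller weight
    weights = {"base": base_weight, "user_emotion": 1.0, "user_portrait": 1.0}
    scores = defaultdict(float)
    for kind, votes in tag_votes.items():
        for category, score in votes.items():
            scores[category] += weights.get(kind, 1.0) * score
    return max(scores, key=scores.get)
```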
In actual implementation, the target audio-visual content may also be a live audio-visual performance watched by the user, such as a drama or a concert. By analyzing the user's real-time reaction in the monitoring video, the audience's response to the performance can be obtained, which helps the performers adjust the progress of the performance according to the audience's feedback.
Compared with the prior art, this embodiment identifies the target audio-visual content and automatically generates a label according to the personalized information of the user and the identification result of the target audio-visual content, so that the generated label is associated with the individual user. The personalized information of the user includes real-time state information of the user while receiving the target audio-visual content, and the generated label includes a user emotion label generated from this real-time state information and the basic identification information of the audio-visual content. Because the user emotion label is generated from the user's real-time state together with the identification result of the audio-visual content, the refined label of the audio-visual content is related to the state the user actually shows. This prevents the label from being influenced by the subjective judgment of operators and makes the label's description of the audio-visual content more consistent with the audience's viewing experience, so that the label values better match the real experience of users when viewing the content, improving the accuracy with which the label describes the user experience.
A second embodiment of the present invention relates to a label generation method. The second embodiment is substantially the same as the first embodiment, and mainly differs therefrom in that: in the first embodiment, feature extraction is performed for each frame of video, whereas in the second embodiment of the present invention, image feature extraction is performed only for key frames of video. A flowchart of a method for analyzing a real-time monitoring video of a user according to this embodiment is shown in fig. 2, and is specifically described below.
Step 201, selecting a plurality of key frames.
Specifically, a video contains many image frames per second and adjacent frames differ little, yet processing every frame still costs computation. Therefore, after obtaining the user's real-time monitoring video, the electronic device does not process every image frame but extracts key frames from the video for processing, which reduces the amount of computation while keeping the analysis result accurate.
In practical implementation, when extracting key frames, the color features of the current image frame are compared with those of the previous key frame; if more than a preset number of color features have changed, i.e., the difference between the two image frames is large, the current frame is taken as a new key frame.
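A minimal sketch of this key-frame rule follows, assuming the "color features" are HSV color histograms and that a chi-square histogram distance with an arbitrary threshold stands in for counting how many color features changed.

```python
import cv2

def color_hist(frame_bgr, bins=(8, 8, 8)):
    """HSV color histogram used as the frame's color feature (an assumed choice)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_key_frames(frames, threshold=0.4):
    """Keep the first frame, then promote a frame to key frame whenever its color
    features differ enough from the previous key frame."""
    key_frames = [frames[0]]
    last_hist = color_hist(frames[0])
    for frame in frames[1:]:
        hist = color_hist(frame)
        # Chi-square distance between histograms; a large value means a big color change.
        if cv2.compareHist(last_hist, hist, cv2.HISTCMP_CHISQR) > threshold:
            key_frames.append(frame)
            last_hist = hist
    return key_frames
```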
Step 202, analyzing the plurality of key frames to obtain a feature value of each key frame.
Specifically, the feature value of each key frame can be obtained as follows: crop an image block from the key frame, where the cropped block is the part of the key frame that reflects the user's state; perform feature analysis on the cropped block, extract its feature values, and use them as the feature values of the key frame. In this embodiment, when extracting the feature value of a key frame, only the image block reflecting the user's real-time status is processed. On one hand, this reduces the influence of image information irrelevant to the user's status on the feature values; on the other hand, because the image block is only a region of the image frame, analyzing the block requires less computation than analyzing the entire frame.
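Building on the hypothetical helpers from the earlier sketches (per_person_regions for locating the user and frame_features for HOG/VGG19 features), the per-key-frame step could look like the following; treating the face region as the block that reflects the user's state is an assumption.

```python
def key_frame_feature(key_frame_bgr):
    """Crop the block that reflects the user's real-time state (assumed here to be
    the face region) and use its features as the key frame's feature value."""
    regions = per_person_regions(key_frame_bgr)   # hypothetical helper from an earlier sketch
    if not regions:
        return None                               # no user visible in this key frame
    block = regions[0]                            # for a single viewer, the first region
    return frame_features(block)                  # HOG + VGG19 features of the block only
```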
After the feature values of the respective key frames are obtained, the user emotion tags can be obtained by combining the feature values of the respective key frames with feature data obtained by recognizing the target audiovisual content in the same manner as in the first embodiment.
Compared with the prior art, this embodiment extracts key frames from the video instead of analyzing every frame, so the user's real-time status information can still be analyzed accurately while the amount of computation is reduced as much as possible and resources are saved. In addition, when each key frame is analyzed, only the image block related to the user's state is analyzed, which reduces the influence of image information irrelevant to the user's real-time status on the feature values and further reduces the amount of computation.
A third embodiment of the present invention relates to a tag generation method. The third embodiment is substantially the same as the first and second embodiments; the main difference is that the third embodiment provides a specific method of identifying the target audiovisual content. A flowchart of this method is shown in fig. 3 and described below.
Step 301, at least one classifier is used to classify the target audiovisual content.
Specifically, in this embodiment a 3D model is used to process the target audiovisual content. During model training, one type of training data may be used to train one classifier, or data with several different frame lengths may be used as samples to train different feature extraction models, i.e., different classifiers. When classifying the target audiovisual content, one or more of the trained classifiers may be used.
Step 302, according to the output result of each classifier, identifying the target audio-visual content.
Specifically, the classification result of a classifier includes probability scores of the target audio-visual content belonging to each type. When there is only one classifier, the type with the highest probability score in its output can be taken as the identified type of the target audio-visual content. When there are several classifiers, the probability score sum of each type is calculated, where the probability score sum of a type is the sum, over all classifiers, of the product of that type's probability score in the classifier's result and the classifier's weight; the type with the largest probability score sum is taken as the identification result of the target audio-visual content.
In one example, the target audiovisual content is a video. During model training, an 8-frame-long image sequence and a 16-frame-long image sequence are used to train two feature extraction models, M8 and M16, which serve as two classifiers. The output of M8 is m = [m_1, m_2, ..., m_k]; the output of M16 is n = [n_1, n_2, ..., n_k]; the k values represent the probabilities that the video belongs to each of k different types. The video classification formula is:
y = argmax_{i ∈ [1, 2, ..., k]} (λ·m_i + (1 − λ)·n_i)
where y denotes the category of the target audio-visual content and λ denotes the weight of the feature extraction model M8. For example, suppose the classifiers output four types: "comedy", "action", "love", and "science fiction". When the two classifier models M8 and M16 are used to classify the target audiovisual content, the output of M8 is m = [0.1, 0.4, 0.2, 0.3], the output of M16 is n = [0.2, 0.2, 0.3, 0.3], and λ is 0.4. The probability score sum for the comedy type is 0.4 × 0.1 + 0.6 × 0.2 = 0.16; for the action type, 0.4 × 0.4 + 0.6 × 0.2 = 0.28; for the love type, 0.4 × 0.2 + 0.6 × 0.3 = 0.26; and for the science fiction type, 0.4 × 0.3 + 0.6 × 0.3 = 0.30. Comparing these, the type with the largest probability score sum is science fiction, i.e., y corresponds to the science fiction type. After the category of the target audio-visual content is calculated, the corresponding basic type label can be attached to it.
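A minimal sketch of this fusion rule follows, reproducing the worked example above; the type names and λ value come from the example, and numpy is used for the weighted argmax.

```python
import numpy as np

def fuse_classifiers(m, n, lam=0.4):
    """Weighted fusion of two classifiers' probability scores:
    y = argmax_i (lam * m_i + (1 - lam) * n_i)."""
    combined = lam * np.asarray(m) + (1 - lam) * np.asarray(n)
    return int(np.argmax(combined)), combined

# Worked example from the description.
types = ["comedy", "action", "love", "science fiction"]
idx, scores = fuse_classifiers([0.1, 0.4, 0.2, 0.3], [0.2, 0.2, 0.3, 0.3], lam=0.4)
print(types[idx])   # science fiction, scores approximately [0.16, 0.28, 0.26, 0.30]
```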
It should be noted that the output of each classifier can also be used as feature data: when generating a label, the classifier outputs are combined with the user's personalized information, and the user emotion label can be obtained in the same way as in step 103 of the first embodiment. For example, when image features are extracted from the monitoring video to characterize each user's real-time status, the two methods mentioned in the first embodiment are used: one obtains features of the video image with local feature descriptors such as GIST and HOG, and the other extracts features of the video image with a CNN model (VGG19) trained on ImageNet. The feature values obtained by the two methods are denoted F1 and F2, where F1 includes several specific image feature values F11, F12, F13, ..., and F2 includes F21, F22, F23, .... The video feature values F1 and F2 are then combined with the feature values F8 and F16 obtained by extracting video features with M8 and M16 (F8 and F16 also each contain several specific feature values), and all of these feature values are used as the input of a pre-trained user emotion label SVM classifier, in the form F = [F11, F12, F13, ..., F21, F22, F23, ..., F81, F82, F83, ..., F161, F162, F163, ...]; the output of the classifier is the user emotion label.
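A sketch of this step is given below, assuming the concatenated feature vector is fed to a scikit-learn SVC that was trained beforehand with integer class labels; the label set and helper names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical emotion label set; the patent mentions values such as happy, calm,
# angry, and crying.
EMOTIONS = ["happy", "calm", "angry", "crying"]

def emotion_tag(f1, f2, f8, f16, svm: SVC) -> str:
    """Concatenate the monitoring-video features (F1, F2) with the content features
    from the M8/M16 models (F8, F16) and classify with a pre-trained SVM.
    Assumes the SVM was trained with integer labels indexing EMOTIONS."""
    f = np.concatenate([f1, f2, f8, f16]).reshape(1, -1)
    return EMOTIONS[int(svm.predict(f)[0])]

# Training would happen beforehand on labelled examples, e.g. (names hypothetical):
# svm = SVC(kernel="rbf").fit(training_feature_matrix, training_emotion_indices)
```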
In addition, when generating the user portrait label, the output of each classifier can also be used as feature data, and the user portrait label can be obtained by combining the user portrait information with the classifier outputs. For example, the feature data F3 of the user's personalized information is combined with the feature values F8 and F16 obtained from the M8 and M16 models, and all of the feature data are used as the input of a pre-trained SVM classifier, in the form F = [F31, F32, ..., F81, F82, ..., F161, F162, ...]; the output of the classifier is the user portrait label.
Compared with the prior art, this embodiment uses at least one classifier to identify the target audio-visual content and combines the classification results of all classifiers to determine the final identification result, which makes the final identification result more accurate.
The steps of the above methods are divided only for clarity of description. In implementation, they may be combined into one step, or a step may be split into several steps; as long as the same logical relationship is preserved, such variants fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes without altering its core design, also falls within the scope of the patent.
A fourth embodiment of the present invention relates to an electronic apparatus, as shown in fig. 4, including: at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401 to be executed by the at least one processor 401 to enable the at least one processor 401 to perform the above tag generation method.
Where the memory 402 and the processor 401 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 401 and the memory 402 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 401 may be transmitted over a wireless medium via an antenna, which may receive the data and transmit the data to the processor 401.
The processor 401 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 402 may be used to store data used by processor 401 in performing operations.
It should be noted that the electronic device in the present embodiment may be a server, or may be a terminal device such as a mobile phone or a computer.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as those skilled in the art can understand, all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A tag generation method, comprising:
identifying the target audio-visual content;
acquiring personalized information of a user receiving the target audio-visual content, wherein the personalized information comprises real-time state information when the user receives the target audio-visual content;
generating a label of the target audio-visual content according to the personalized information of the user and the identification result of the target audio-visual content; the tag comprises a user emotion tag generated according to real-time state information when the user receives the target audio-visual content and the identification result of the target audio-visual content.
2. The label generation method according to claim 1,
and the real-time state information of the user when receiving the target audio-visual content is obtained by analyzing the real-time monitoring video of the user.
3. The tag generation method according to claim 2, wherein the real-time monitoring video of the user is analyzed by:
selecting a plurality of key frames;
analyzing the plurality of key frames to obtain a characteristic value of each key frame;
wherein the feature values of the key frames are used for characterizing the real-time status information of the user.
4. The label generating method according to claim 3, wherein the feature value of each key frame is obtained by:
intercepting image blocks of the key frame, wherein the image blocks are image blocks which reflect real-time state information of a user in the key frame;
and extracting the characteristic value of the image block, and taking the characteristic value of the image block as the characteristic value of the key frame.
5. The label generation method according to claim 1,
the personalized information of the user also comprises user portrait information of the user;
the label also comprises a user portrait label generated according to the user portrait information of the user and the identification result of the target audio-visual content.
6. The tag generation method according to claim 1, wherein the identifying of the target audiovisual content specifically comprises:
classifying the target audiovisual content using at least one classifier;
and identifying the target audio-visual content according to the output result of each classifier.
7. The label generation method according to claim 6,
the classification result of the classifier comprises probability scores of the target audio-visual content belonging to various types;
the identifying the target audio-visual content according to the output result of each classifier specifically includes:
calculating the probability score sum of each type, wherein the probability score sum of each type is the sum of the products of the probability scores of the types and the corresponding classifier weights in the classification results of the classifiers;
and taking the probability score and the maximum type as the identification result of the target audio-visual content.
8. The tag generation method according to claim 1, wherein the tag of the target audiovisual content further includes a base tag, and the base tag is obtained according to the recognition result of the target audiovisual content;
the various tags of the target audiovisual content are used for classifying the target audiovisual content, and when the target audiovisual content is classified, the earlier the target audiovisual content is generated, the smaller the weight of the base tag is.
9. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the tag generation method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the label generation method of any one of claims 1 to 7.
CN201910767754.2A 2019-08-20 2019-08-20 Label generation method, electronic device and storage medium Active CN110704680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910767754.2A CN110704680B (en) 2019-08-20 2019-08-20 Label generation method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910767754.2A CN110704680B (en) 2019-08-20 2019-08-20 Label generation method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN110704680A CN110704680A (en) 2020-01-17
CN110704680B (en) 2022-10-04

Family

ID=69193754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910767754.2A Active CN110704680B (en) 2019-08-20 2019-08-20 Label generation method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN110704680B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729785A (en) * 2014-01-26 2014-04-16 合一信息技术(北京)有限公司 Video user gender classification method and device for method
CN104410911A (en) * 2014-12-31 2015-03-11 合一网络技术(北京)有限公司 Video emotion tagging-based method for assisting identification of facial expression
CN108959323A (en) * 2017-05-25 2018-12-07 腾讯科技(深圳)有限公司 Video classification methods and device
CN109117777A (en) * 2018-08-03 2019-01-01 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109785045A (en) * 2018-12-14 2019-05-21 深圳壹账通智能科技有限公司 A kind of method for pushing and device based on user behavior data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090150786A1 (en) * 2007-12-10 2009-06-11 Brown Stephen J Media content tagging on a social network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729785A (en) * 2014-01-26 2014-04-16 合一信息技术(北京)有限公司 Video user gender classification method and device for method
CN104410911A (en) * 2014-12-31 2015-03-11 合一网络技术(北京)有限公司 Video emotion tagging-based method for assisting identification of facial expression
CN108959323A (en) * 2017-05-25 2018-12-07 腾讯科技(深圳)有限公司 Video classification methods and device
CN109117777A (en) * 2018-08-03 2019-01-01 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109785045A (en) * 2018-12-14 2019-05-21 深圳壹账通智能科技有限公司 A kind of method for pushing and device based on user behavior data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Perceptual tagging of video files in P2P networks; Alper Koz et al.; 2010 IEEE International Conference on Image Processing; 2010-10-03; 193-196 *
Classification of online videos based on social information; Zhu Yiming; Wanfang Data; 2011-11-30; 1-58 *
Research on tag clustering combining annotated content and user attributes; Gu Xiaoxue et al.; New Technology of Library and Information Service; 2015-10-25 (No. 10); 30-39 *

Also Published As

Publication number Publication date
CN110704680A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN105635824B (en) Personalized channel recommendation method and system
US10685460B2 (en) Method and apparatus for generating photo-story based on visual context analysis of digital content
CN110519617B (en) Video comment processing method and device, computer equipment and storage medium
US20190289359A1 (en) Intelligent video interaction method
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN104994426B (en) Program video identification method and system
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
US8442389B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
US7953254B2 (en) Method and apparatus for generating meta data of content
CN110557659B (en) Video recommendation method and device, server and storage medium
KR20190069920A (en) Apparatus and method for recognizing character in video contents
CN108198130B (en) Image processing method, image processing device, storage medium and electronic equipment
US10257569B2 (en) Display apparatus and method for providing service thereof
CN110866563B (en) Similar video detection and recommendation method, electronic device and storage medium
CN107247919A (en) The acquisition methods and system of a kind of video feeling content
EP3285222A1 (en) Facilitating television based interaction with social networking tools
CN110633669A (en) Mobile terminal face attribute identification method based on deep learning in home environment
CN112102157A (en) Video face changing method, electronic device and computer readable storage medium
CN110418148B (en) Video generation method, video generation device and readable storage medium
Husa et al. HOST-ATS: automatic thumbnail selection with dashboard-controlled ML pipeline and dynamic user survey
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113992973A (en) Video abstract generation method and device, electronic equipment and storage medium
GB2574431A (en) Systems and method for automated boxing data collection and analytics platform
CN112383824A (en) Video advertisement filtering method, device and storage medium
Miniakhmetova et al. An approach to personalized video summarization based on user preferences analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant