CN112948635B - Video analysis method and apparatus, electronic device and readable storage medium - Google Patents


Info

Publication number
CN112948635B
CN112948635B (application CN202110219554.0A)
Authority
CN
China
Prior art keywords: target, image, video, dimension, tags
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110219554.0A
Other languages
Chinese (zh)
Other versions
CN112948635A (en)
Inventor
焦阳
杨羿
王璐
刘祥
黄晨
李�一
陈晓冬
刘林
刘波
韩帅
未来
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110219554.0A
Publication of CN112948635A
Application granted
Publication of CN112948635B
Legal status: Active (anticipated expiration not listed)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval using metadata automatically derived from the content
    • G06F16/7837 - Retrieval using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/7847 - Retrieval using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F16/7867 - Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The disclosure provides a video analysis method and apparatus, an electronic device, and a readable storage medium, relates to the technical field of image processing, and is applied to the technical fields of artificial intelligence and machine learning. The specific implementation scheme is as follows: determine image tags of target image frames in a preset target dimension; verify the validity of the image tags of the target dimension based on the number of times the image tags appear in the target image frames, the total number of target image frames, and a preset validity threshold, obtaining a validity verification result for the image tags of the target dimension; determine video tags of the target video according to the image tags of the target dimension and their validity verification results; and generate a video analysis result based on the delivery effect of the target video and its video tags. In this method, after image tags of multiple dimensions are obtained, the video tags are derived by verifying the validity of the image tags, which improves the accuracy and comprehensiveness of the video analysis result.

Description

Video analysis method and apparatus, electronic device and readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video analysis method, an apparatus, an electronic device, and a readable storage medium.
Background
With the development of network and terminal technologies, video services keep emerging, and watching videos has become a main way for people to acquire information and to relax and be entertained. Media platforms provide a wide range of video services, and the video resources available to users are extremely rich. However, only videos that meet audience needs win users' favor.
Disclosure of Invention
The disclosure provides a video analysis method, a video analysis device, an electronic device and a readable storage medium.
According to a first aspect of the present disclosure, there is provided a video analysis method, including:
determining an image tag of a target image frame in a preset target dimension, wherein the target image frame is an image acquired based on a target video;
verifying the validity of the image tag of the target dimension based on the number of times the image tag of the target dimension appears in the target image frame, the total number of the target image frames and a preset validity threshold value to obtain a validity verification result of the image tag of the target dimension;
determining a video tag of the target video according to the image tag of the target dimension and the validity verification result of the image tag of the target dimension;
and generating a video analysis result based on the delivery effect of the target video and the video label of the target video.
According to a second aspect of the present disclosure, there is provided a video analysis apparatus comprising:
an image tag acquisition module, used for determining an image tag of a target image frame in a preset target dimension, wherein the target image frame is an image acquired based on a target video;
the verification module is used for verifying the validity of the image tag of the target dimension based on the number of times that the image tag of the target dimension appears in the target image frame, the total number of the target image frames and a preset validity threshold value, so as to obtain a validity verification result of the image tag of the target dimension;
the video tag acquisition module is used for determining the video tag of the target video according to the image tag of the target dimension and the validity verification result of the image tag of the target dimension;
and the analysis module is used for generating a video analysis result based on the delivery effect of the target video and the video label of the target video.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the video analysis methods described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform any of the video analysis methods described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements any of the video analysis methods described above.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a video analysis method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of verifying image tag validity in an embodiment of the present disclosure;
fig. 3 is a schematic block diagram of a video analysis apparatus provided in an embodiment of the present disclosure;
FIG. 4 is a functional block diagram of a verification module provided by embodiments of the present disclosure;
fig. 5 is a block diagram of an electronic device for implementing a video analysis method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Video can present rich content to users directly; it is simple to produce, has a low distribution threshold, and spreads through diverse channels, so watching videos has become one of the main ways people relax, entertain themselves, and obtain information. For users, the available video resources are extremely rich, but only videos that meet users' needs achieve a good delivery effect. It is therefore necessary to analyze videos so that the analysis results can guide video production.
At present, video analysis mainly consists of labeling a video with tags and analyzing the video through those tags. Mainstream labeling approaches include manual labeling, title extraction, and content extraction. Manual labeling requires substantial human resources, and modes such as crowdsourced testing and outsourcing cannot guarantee the security of video information; title extraction depends too heavily on title quality, and the extracted tags are overly generic and cannot reflect the video content specifically; content extraction cannot capture aspects of the video such as scenes and character appearance, and is unsuitable for videos without speech or text.
In view of the above problems in video analysis, the present disclosure provides a video analysis method that can relatively safely obtain image tags of a video's target image frames in multiple dimensions, determine video tags based on the validity of the image tags, and analyze the video according to the video tags, thereby obtaining a more comprehensive and accurate video analysis result.
It should be noted that the user is informed of all information involved in the embodiments of the present disclosure, and data is collected with the user's authorization and without violating the user's autonomous will, under the principles of minimum collection frequency, minimum collection scope, and maximum business relevance, in compliance with relevant laws, regulations, and standards.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a flowchart of a video analysis method according to an embodiment of the present disclosure. As shown in fig. 1, the video analysis method includes:
step S101, determining an image label of a target image frame in a preset target dimension.
The target image frame is an image obtained based on a target video, and the target video is a video to be analyzed. In some embodiments, to improve the comprehensiveness and accuracy of the video analysis result, a user may select a plurality of target videos as videos to be analyzed.
The target dimension is the angle from which the target image frame is analyzed. In some embodiments, the target dimensions include a hue dimension and a character dimension, where the hue dimension analyzes the target image frame in terms of hue, lightness, color temperature, and purity, and the character dimension analyzes the characters in the target image frame in terms of expression, age, gender, and the like.
The image tag is obtained by analyzing the target image frame in the target dimension and represents an attribute feature of the target image frame in that dimension. In some embodiments, where the target dimensions include a hue dimension and a person dimension, the corresponding image tags include hue tags and person tags: a hue tag reflects the attribute features of the target image frame in terms of hue, lightness, color temperature, and purity, while the person tags include an expression tag, an age group tag, and a gender tag, respectively reflecting the expression, age group, and gender of the person.
It should be noted that the above examples of the target dimension and the image label are only illustrative, and other non-illustrated target dimensions and image labels are also within the scope of the disclosure, and those skilled in the art may set other target dimensions and image labels according to actual requirements.
In practical applications, directly analyzing the entire target video is not very operable; therefore, target image frames are acquired based on the target video, and the analysis result of the target video is obtained by analyzing those frames. In some embodiments, frames are extracted from the target video at a preset frame-extraction frequency to obtain the target image frames. Generally, one target video corresponds to multiple target image frames, and the frame-extraction frequency can be set flexibly as needed, which the present disclosure does not limit.
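As a minimal sketch of this frame-extraction step (not part of the disclosure), the following Python function samples frames at a preset frequency; the function name and the OpenCV dependency are assumptions for illustration only:

    import cv2  # OpenCV, an assumed dependency for this sketch

    def extract_frames(video_path, fps_extract=1.0):
        """Sample target image frames from a target video at a preset
        frame-extraction frequency (frames per second of video time)."""
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the rate is unknown
        step = max(int(round(video_fps / fps_extract)), 1)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:  # keep every step-th decoded frame
                frames.append(frame)
            index += 1
        cap.release()
        return frames  # total count M is approximately duration T x fps_extract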
In some embodiments, the hue label is obtained by analyzing the RGB (Red Green Blue) colors of the pixels in the target image frame. For example, the RGB color of each pixel in the target image frame is acquired, a histogram of the RGB colors of the pixels is determined, and the hue label of the target image frame in the hue dimension is determined according to that histogram.
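As an illustrative sketch only (the disclosure does not fix a mapping from histogram to hue label), a coarse warm/cool decision over per-channel RGB histograms might look as follows; the bin split and the decision rule are assumptions:

    import numpy as np

    def hue_label(frame):
        """Derive a coarse hue label for one target image frame (an H x W x 3
        RGB array) from the histogram of its pixels' RGB colors. The
        warm/cool rule below is an assumed heuristic."""
        r_hist, _ = np.histogram(frame[..., 0], bins=16, range=(0, 256))
        b_hist, _ = np.histogram(frame[..., 2], bins=16, range=(0, 256))
        # Compare the mass in the bright half of the red and blue histograms.
        r_mass, b_mass = r_hist[8:].sum(), b_hist[8:].sum()
        return "warm red tone" if r_mass >= b_mass else "cool blue tone"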
It should be noted that the above manner for acquiring the hue label is merely an example, and those skilled in the art may also acquire the hue label in other manners, which is not limited in the present disclosure.
Analyzing the tone of the target image frame to obtain the hue label yields the attribute features of the target image frame in the hue dimension; combined with information such as the delivery effect of the target video, it can be learned which tones make a video more popular with audiences.
In some embodiments, obtaining the person tag for the target image frame comprises: firstly, acquiring a designated area and a corresponding confidence coefficient in a target image frame, wherein the designated area is a suspected target area, the target area is an area to be analyzed, and the confidence coefficient is used for representing the credibility of the designated area as the target area; secondly, determining a target area based on the confidence coefficient of the designated area and a preset confidence coefficient threshold value, wherein the confidence coefficient threshold value is a numerical value or a value interval preset by a user; finally, the person labels of the target image frame in the person dimension are determined based on the target area. The target area is an area of key analysis in the target image frame. In some embodiments, the target area includes a face area and an item area. In practical applications, the target area may be set according to user requirements, which is not limited by the present disclosure.
In general, in a target image frame obtained based on a target video, most of the image belongs to the background area and only a small part belongs to the target area. The background area is large but carries little information, while the target area is small but information-rich, so image tags that represent the attribute features of the target image frame can be obtained by analyzing its target area.
In some embodiments, determining the target area based on the confidence of the designated area and a preset confidence threshold comprises: determining a designated area to be the target area when its confidence is greater than or equal to the confidence threshold. Determining only designated areas at or above the confidence threshold as target areas excludes non-target areas among the designated areas, which improves the accuracy of the target area and prevents the video analysis result from being made inaccurate by non-target areas misjudged as target areas.
In some embodiments, when the target area is a face area, acquiring the person tags of the target image frame includes:
Each target image frame is input into a preset face detection model, which determines the designated areas of the frame and their confidences; designated areas whose confidence is greater than or equal to the confidence threshold are determined to be face areas.
The face-area images are then respectively input into a preset expression recognition model, a preset gender recognition model, and a preset age assessment model to obtain the corresponding outputs, and the expression tag, age group tag, and gender tag of each target image frame are obtained according to the serial number of the frame to which each face-area image belongs.
The face detection model, the expression recognition model, the gender recognition model, and the age assessment model can be convolutional neural network models. The preset expressions in the expression recognition model include at least one of happy, angry, serious, sad, surprised, and neutral, and the corresponding expression tags include at least one of a happy tag, an angry tag, a serious tag, a sad tag, a surprised tag, and a neutral tag; the preset age groups in the age assessment model include at least one of infant, child, adult, and middle-aged or elderly person, and the corresponding age group tags include at least one of an infant tag, a child tag, an adult tag, and a middle-aged-or-elderly tag.
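The pipeline above can be sketched as follows; the four model objects, their detect/predict interfaces, and the threshold value are assumptions, since the disclosure only requires convolutional neural network models and a user-preset confidence threshold:

    CONFIDENCE_THRESHOLD = 0.8  # assumed value; preset by the user in practice

    def person_tags(frames, face_detector, expr_model, gender_model, age_model):
        """For each target image frame, keep designated areas whose confidence
        reaches the threshold as face areas, then run the three recognition
        models on each cropped face area."""
        tags = []  # (frame serial number, expression tag, gender tag, age group tag)
        for serial, frame in enumerate(frames):
            # Designated areas and confidences from the face detection model.
            for (x, y, w, h), conf in face_detector.detect(frame):
                if conf < CONFIDENCE_THRESHOLD:
                    continue  # exclude suspected non-target areas
                face = frame[y:y + h, x:x + w]  # crop and keep the target area
                tags.append((serial,
                             expr_model.predict(face),    # e.g. "happy"
                             gender_model.predict(face),  # e.g. "female"
                             age_model.predict(face)))    # e.g. "child"
        return tags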
It should be noted that the above detection method for the face region and the acquisition method for the image tag are only examples, and other non-described detection methods for the face region and acquisition methods for the image tag are also within the scope of the present disclosure, and those skilled in the art may detect the face region and acquire the image tag in other ways according to actual needs.
By analyzing character dimensions of the target image frame such as expression, gender, and age, multiple attribute features of the target image frame in the character dimension can be obtained; combined with information such as the delivery effect of the target video, it can be learned which kinds of characters make a video more likely to win audience favor.
In some embodiments, after determining the target region, the method further comprises: and cutting and storing the target area from the target image frame, and recording the serial number of the target image frame where the target area is located. The serial number of the target image frame is an identifier for associating the target area with the target image frame, and after the image tag is acquired based on the target area, the association relationship between the image tag acquired based on the target area and the target image frame to which the target area belongs can be established according to the serial number of the target image frame, so that the image tag is prevented from being wrongly marked to other target image frames.
Step S102, verifying the validity of the image tags with the target dimensionality based on the times of the image tags with the target dimensionality appearing in the target image frames, the total number of the target image frames and a preset validity threshold value, and obtaining a validity verification result of the image tags with the target dimensionality.
The number of times an image tag of the target dimension appears in the target image frames is the number of target image frames labeled with that tag, not the number of occurrences of the tag itself. For example, suppose the target video includes a first target image frame labeled with two child tags and a second target image frame with no child tag; because both child tags are labeled in the same target image frame, the child tag counts as appearing once, not twice.
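A short sketch of this frame-level counting rule (illustrative only; the helper name and input format are assumptions):

    from collections import Counter

    def tag_frame_counts(frame_tags):
        """Count, for each image tag, the number of target image frames in
        which it appears. `frame_tags` holds one list of tags per frame;
        duplicates inside a single frame count once."""
        counts = Counter()
        for tags_in_frame in frame_tags:
            counts.update(set(tags_in_frame))  # set(): once per frame
        return counts

    # e.g. tag_frame_counts([["child", "child"], []])["child"] == 1,
    # matching the two-frame example above.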
Wherein the validity threshold may be set by a user as desired. In some embodiments, the user may set the validity threshold based on empirical or statistical data. It should be noted that other validity threshold setting manners are also within the protection scope of the present disclosure, and the present disclosure does not limit this.
The validity of an image tag reflects how comprehensively and accurately the tag expresses the attribute features of the target video: the higher the validity, the more comprehensively and accurately the tag expresses those features; conversely, a tag with low validity cannot.
A single target image frame cannot comprehensively and accurately express the content of the target video, so an image tag obtained from a single frame may not comprehensively and accurately express the video's attribute features. The validity of the image tags therefore needs to be verified to determine whether each tag can comprehensively and accurately reflect those features; tags that cannot are eliminated according to the validity verification result, thereby improving tag accuracy.
In some embodiments, first, the number of target image frames corresponding to the image tag of the target dimension in the target video is determined based on the number of times the image tag of the target dimension appears in the target image frames; secondly, acquiring the total number of target image frames in the target video, and determining an effective value of an image label of a target dimensionality based on the number of the target image frames corresponding to the image label of the target dimensionality in the target video and the total number of the target image frames in the target video; and thirdly, determining a validity verification result of the image label of the target dimension based on the valid value of the image label of the target dimension and a preset valid threshold value.
In some embodiments, determining the validity verification result of an image tag of the target dimension based on its effective value and the preset validity threshold includes: if the effective value is greater than or equal to the validity threshold, the image tag of the target dimension is determined to be valid, and the corresponding validity verification result is passed; if the effective value is smaller than the validity threshold, the tag is determined to be invalid, and the result is failed.
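Combining the counting and decision steps, a minimal sketch of the validity verification (with assumed names; the ratio N/M used as the effective value is detailed further below):

    def validate_tags(tag_frame_counts, total_frames, validity_threshold):
        """Return, per image tag, whether its validity verification passes:
        effective value EV = N / M compared against the preset threshold."""
        results = {}
        for tag, n_frames in tag_frame_counts.items():
            effective_value = n_frames / total_frames  # EV = N / M
            results[tag] = effective_value >= validity_threshold
        return results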
It should be noted that, besides obtaining the validity verification result of the image tag based on the effective value of the image tag of the target dimension and the preset effective threshold, other validity verification methods are also within the scope of the present disclosure, and the present disclosure does not limit this.
Step S103, determining the video label of the target video according to the image label of the target dimension and the validity verification result of the image label of the target dimension.
The validity verification result of an image tag of the target dimension is one of two cases: passed or failed. An image tag that passes the validity verification can comprehensively and accurately reflect the attribute features of the target video, and such an image tag is determined to be a video tag.
In some embodiments, image tags that pass the validity verification are determined as video tags of the target video, with duplicates merged so that only one copy of each tag is kept as a video tag; image tags whose validity verification result is failed are determined as invalid tags and discarded.
It should be noted that, besides this first category of video tags screened from the image tags according to their validity verification results, there is also a second category of video tags. Second-category video tags are obtained based on multiple target image frames of the target video and can effectively represent the attribute features of the target video, so their validity does not need to be verified.
For example, the person-count tag belongs to the second category of video tags. In some embodiments, obtaining the person-count tag includes: first, acquiring the target-area image (namely the face-area image) of each target image frame of the target video; second, extracting an image feature vector from each target-area image and calculating the similarities between the feature vectors; third, aggregating the target-area images according to those similarities and a preset similarity threshold to obtain a clustering result; and finally, determining the person-count tag of the target video from the clustering result. The person-count tag thus depends on multiple target image frames: target areas corresponding to the same person are aggregated according to the similarity of their image feature vectors and the similarity threshold, and the number of people appearing in the target video is determined from the clustering result, yielding the person-count tag of the target video.
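A sketch of this aggregation (illustrative only): the feature extractor `embed`, the cosine-similarity measure, and the greedy single-pass clustering are assumptions, since the disclosure specifies only similarity-based aggregation against a preset threshold:

    import numpy as np

    def count_people(face_images, embed, similarity_threshold=0.75):
        """Aggregate face-area images whose feature vectors are similar
        enough, and return the number of clusters, i.e. the value of the
        person-count tag."""
        centroids = []  # one representative feature vector per person
        for img in face_images:
            v = embed(img)
            v = v / np.linalg.norm(v)
            sims = [float(v @ c) for c in centroids]  # cosine similarities
            if sims and max(sims) >= similarity_threshold:
                continue  # same person as an existing cluster
            centroids.append(v)
        return len(centroids)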
Analyzing the number of persons in the target image frames to obtain the person-count tag yields the attribute features of the target video in the person-count dimension; combined with information such as the delivery effect of the target video, it can be learned how many people a video should feature to be more popular.
And step S104, generating a video analysis result based on the delivery effect of the target video and the video label of the target video.
The delivery effect reflects the audience's acceptance of and fondness for the target video. In some embodiments, the delivery effect may be determined according to metrics such as the display amount, click-through rate, and conversion rate of the target video.
In some embodiments, there are multiple target videos; after the video tags of each target video are obtained, a number of matched target videos are selected from them according to display amount, click-through rate, and conversion rate, and the video analysis result is generated from the video tags of the matched target videos.
For example, taking videos for the education industry, a number of matched target videos are selected according to display amount, click-through rate, and conversion rate. Assume the video tags of the matched target videos include a warm red tone tag, a happy tag, a child tag, a female tag, and a multi-person tag; the video analysis result is then that videos shot in warm tones such as red, featuring multi-person scenes that include women and children, with an overall happy mood, more easily win users' favor. A video producer can select appropriate material from existing footage with reference to this result, or shoot a new video accordingly.
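A sketch of this selection-and-pooling step (illustrative only): the metric field names and the equal-weight score are assumptions, since the disclosure names display amount, click rate, and conversion rate without fixing how they are combined:

    def matched_video_tags(videos, top_k=3):
        """Select matched target videos by delivery effect and pool their
        video tags. Each video is a dict with 'impressions', 'ctr', 'cvr',
        and 'tags' keys (assumed field names)."""
        def score(v):
            # Assumed combination: impressions x CTR x CVR ~ conversions.
            return v["impressions"] * v["ctr"] * v["cvr"]
        matched = sorted(videos, key=score, reverse=True)[:top_k]
        pooled = set()
        for v in matched:
            pooled.update(v["tags"])
        return pooled  # basis for the video analysis result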
It should be noted that, in this embodiment, a trend-style video analysis result may also be generated from the time order of the video tags. For example, if the hue tags include a cool blue tone and a warm red tone, and the cool blue tone tag appears in the video before the warm red tone tag, the video analysis result is that the video's tone transitions from cool (e.g., blue) to warm (e.g., red). Trend-style results based on the time order of other tag types are generated similarly and are not described again here.
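A sketch of the trend-style result (illustrative only; the (timestamp, tag) input format is an assumption):

    def tone_trend(tone_events):
        """Order tone tags by the time at which each first appears in the
        video and report the transition, e.g. cool blue tone -> warm red
        tone. `tone_events` is a list of (timestamp_seconds, tone_tag)."""
        first_seen = {}
        for t, tag in sorted(tone_events):
            first_seen.setdefault(tag, t)
        ordered = sorted(first_seen, key=first_seen.get)
        if len(ordered) >= 2:
            return "tone transitions from '%s' to '%s'" % (ordered[0], ordered[-1])
        return "tone stays '%s'" % ordered[0] if ordered else "no tone tags"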
The video analysis method provided by the embodiment of the disclosure determines image tags of target image frames in a preset target dimension; verifies the validity of those tags based on the number of times each appears in the target image frames, the total number of target image frames, and a preset validity threshold, obtaining a validity verification result; determines video tags of the target video according to the image tags of the target dimension and their validity verification results; and generates a video analysis result based on the delivery effect and video tags of the target video. By obtaining image tags of the target image frames in multiple dimensions and verifying their validity, video tags that comprehensively and accurately reflect the attribute features of the target video are obtained, and analyzing the video according to these tags effectively improves the accuracy and comprehensiveness of the video analysis result.
In some embodiments, as shown in fig. 2, in step S102, verifying the validity of the image tag of the target dimension based on the number of times that the image tag of the target dimension appears in the target image frame, the total number of the target image frames, and a preset validity threshold, and obtaining a validity verification result of the image tag of the target dimension, including:
step S201, acquiring the number of target image frames corresponding to the image tag of the target dimension in the target video based on the number of times that the image tag of the target dimension appears in the target image frame.
The number of times an image tag of the target dimension appears in the target image frames is the number of target image frames labeled with that tag, not the number of occurrences of the tag itself. For example, suppose the target video includes a first target image frame labeled with two child tags and a second target image frame with no child tag; because both child tags are labeled in the same target image frame, the child tag counts as appearing once, not twice.
In some embodiments, image tags of a target dimension in a target video are determined first, and then the number of times each image tag appears in a target image frame is counted, so as to determine the number of target image frames corresponding to the image tags.
Step S202, acquiring the total number of target image frames in the target video.
In some embodiments, the target image frame is an image frame obtained by decimating a target video. Therefore, the total number of the target image frames in the target video can be obtained according to the time length and the frame extraction frequency of the target video. For example, if the duration of the target video is T and the frame extraction frequency is f, the total number of target image frames M = T × f.
Step S203, determining an effective value of the image tag of the target dimension based on the number of the target image frames corresponding to the image tag of the target dimension in the target video and the total number of the target image frames in the target video.
The more times an image tag appears, the more frequently or prominently the attribute feature corresponding to that tag is expressed in the target video, i.e., the more valid the tag. Conversely, if an image tag appears only once in the target video, the attribute feature corresponding to it is rarely expressed there, so the tag's validity for that video is low.
In some embodiments, the effective value of the image tag of the target dimension is a ratio of the number of target image frames corresponding to the image tag of the target dimension in the target video to the total number of target image frames in the target video. For example, if the effective value of the image tag of the target dimension is EV, the number of target image frames corresponding to the image tag of the target dimension in the target video is N, and the total number of target image frames is M, then EV = N/M.
Step S204, based on the effective value of the image label of the target dimension and a preset effective threshold value, determining the validity verification result of the image label of the target dimension.
In some embodiments, an image tag of the target dimension whose effective value is greater than or equal to the preset validity threshold has a validity verification result of passed, and an image tag whose effective value is smaller than the preset validity threshold has a validity verification result of failed.
For example, the total number of target image frames of the target video is 20, and the corresponding image tags include a warm red tone tag, a female tag, a child tag, a happy tag, and an angry tag, where the number of target image frames corresponding to the warm red tone tag is 15, the number of target image frames corresponding to the female tag is 12, the number of target image frames corresponding to the child tag is 10, the number of target image frames corresponding to the happy tag is 10, and the number of target image frames corresponding to the angry tag is 2. Therefore, the effective value EV1=15/20 for the warm red tone label, EV2=12/20 for the female label, EV3=10/20 for the child label, EV4=10/20 for the happy label, and EV5=2/20 for the angry label.
The validity threshold can be set by the user as needed. Assuming a preset validity threshold of 5/20: since EV1, EV2, EV3, and EV4 are greater than 5/20 while EV5 is less than 5/20, the warm red tone tag, the female tag, the child tag, and the happy tag pass the validity verification and are valid image tags, whereas the angry tag fails the verification and is invalid. Thus, for this target video, the video tags include the warm red tone tag, the female tag, the child tag, and the happy tag.
In a second aspect, an embodiment of the present disclosure further provides a video analysis apparatus, which can relatively safely obtain image tags of a video's target image frames in multiple dimensions, determine video tags based on the validity of the image tags, and analyze the video according to the video tags, so as to obtain a more comprehensive and accurate video analysis result.
Fig. 3 is a schematic block diagram of a video analysis apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the video analysis apparatus 300 includes:
the image tag obtaining module 301 is configured to determine an image tag of a target image frame in a preset target dimension.
The target image frame is an image obtained based on a target video, and the target video is a video to be analyzed. In some embodiments, to improve the comprehensiveness and accuracy of the video analysis result, a user may select a plurality of target videos as videos to be analyzed.
The target dimension is the angle from which the target image frame is analyzed. In some embodiments, the target dimensions include a hue dimension and a character dimension, where the hue dimension analyzes the target image frame in terms of hue, lightness, color temperature, and purity, and the character dimension analyzes the characters in the target image frame in terms of expression, age, gender, and the like.
The image tag is obtained by analyzing the target image frame in the target dimension and represents an attribute feature of the target image frame in that dimension. In some embodiments, where the target dimensions include a hue dimension and a person dimension, the corresponding image tags include hue tags and person tags: a hue tag reflects the attribute features of the target image frame in terms of hue, lightness, color temperature, and purity, while the person tags include an expression tag, an age group tag, and a gender tag, respectively reflecting the expression, age group, and gender of the person.
It should be noted that the above examples of the target dimension and the image label are only illustrative, and other non-illustrated target dimensions and image labels are also within the scope of the disclosure, and those skilled in the art may set other target dimensions and image labels according to actual requirements.
In some embodiments, the hue label is obtained by analyzing the RGB colors of the pixels in the target image frame. For example, the RGB color of each pixel in the target image frame is acquired, a histogram of the RGB colors of the pixels is determined, and the hue label of the target image frame in the hue dimension is determined according to that histogram.
It should be noted that the above manner for acquiring the hue label is merely an example, and those skilled in the art may also acquire the hue label in other manners, which is not limited in the present disclosure.
Analyzing the tone of the target image frame to obtain the hue label yields the attribute features of the target image frame in the hue dimension; combined with information such as the delivery effect of the target video, it can be learned which tones make a video more popular with audiences.
In some embodiments, obtaining the person tag for the target image frame comprises: firstly, acquiring a designated area and a corresponding confidence coefficient in a target image frame, wherein the designated area is a suspected target area, the target area is an area to be analyzed, and the confidence coefficient is used for representing the credibility of the designated area as the target area; secondly, determining a target area based on the confidence coefficient of the designated area and a preset confidence coefficient threshold value, wherein the confidence coefficient threshold value is a numerical value or a value interval preset by a user; finally, a person label of the target image frame in the person dimension is determined based on the target area. The target area is an area of key analysis in the target image frame. In some embodiments, the target area includes a face area and an item area. In practical applications, the target area may be set according to user requirements, which is not limited by the present disclosure.
In general, in a target image frame obtained based on a target video, most of the image belongs to the background area and only a small part belongs to the target area. The background area is large but carries little information, while the target area is small but information-rich, so image tags that represent the attribute features of the target image frame can be obtained by analyzing its target area.
In some embodiments, determining the target area based on the confidence of the designated area and a preset confidence threshold comprises: determining a designated area to be the target area when its confidence is greater than or equal to the confidence threshold. Determining only designated areas at or above the confidence threshold as target areas excludes non-target areas among the designated areas, which improves the accuracy of the target area and prevents the video analysis result from being made inaccurate by non-target areas misjudged as target areas.
The verification module 302 is configured to verify the validity of the image tag of the target dimension based on the number of times that the image tag of the target dimension appears in the target image frame, the total number of the target image frames, and a preset validity threshold, and obtain a validity verification result of the image tag of the target dimension.
The number of times an image tag of the target dimension appears in the target image frames is the number of target image frames labeled with that tag, not the number of occurrences of the tag itself. For example, suppose the target video includes a first target image frame labeled with two child tags and a second target image frame with no child tag; because both child tags are labeled in the same target image frame, the child tag counts as appearing once, not twice.
Wherein the validity threshold value can be set by a user as required. In some embodiments, the user may set the validity threshold based on empirical or statistical data. It should be noted that other validity threshold setting manners are also within the protection scope of the present disclosure, and the present disclosure does not limit this.
The validity of an image tag reflects how comprehensively and accurately the tag expresses the attribute features of the target video: the higher the validity, the more comprehensively and accurately the tag expresses those features; conversely, a tag with low validity cannot.
A single target image frame cannot comprehensively and accurately express the content of the target video, so an image tag obtained from a single frame may not comprehensively and accurately express the video's attribute features. The validity of the image tags therefore needs to be verified to determine whether each tag can comprehensively and accurately reflect those features; tags that cannot are eliminated according to the validity verification result, thereby improving tag accuracy.
In some embodiments, first, the number of target image frames corresponding to the image tag of the target dimension in the target video is determined based on the number of times the image tag of the target dimension appears in the target image frames; secondly, acquiring the total number of target image frames in the target video, and determining the effective value of the image label of the target dimensionality based on the number of the target image frames corresponding to the image label of the target dimensionality in the target video and the total number of the target image frames in the target video; and thirdly, determining the validity verification result of the image label of the target dimension based on the valid value of the image label of the target dimension and a preset valid threshold value.
In some embodiments, determining the validity verification result of an image tag of the target dimension based on its effective value and the preset validity threshold includes: if the effective value is greater than or equal to the validity threshold, the image tag of the target dimension is determined to be valid, and the corresponding validity verification result is passed; if the effective value is smaller than the validity threshold, the tag is determined to be invalid, and the result is failed.
It should be noted that, besides obtaining the validity verification result of the image tag based on the effective value of the image tag of the target dimension and the preset effective threshold, other validity verification manners are also within the protection scope of the present disclosure, and the present disclosure does not limit this.
The video tag obtaining module 303 is configured to determine a video tag of the target video according to the target dimension image tag and the validity verification result of the target dimension image tag.
The validity verification result of an image tag of the target dimension is one of two cases: passed or failed. An image tag that passes the validity verification can comprehensively and accurately reflect the attribute features of the target video, and such an image tag is determined to be a video tag.
In some embodiments, image tags that pass the validity verification are determined as video tags of the target video according to the verification result, with duplicates merged so that only one copy of each tag is kept as a video tag; image tags whose validity verification result is failed are determined as invalid tags and discarded.
It should be noted that, besides this first category of video tags screened from the image tags according to their validity verification results, there is also a second category of video tags. Second-category video tags are obtained based on multiple target image frames of the target video and can effectively represent the attribute features of the target video, so their validity does not need to be verified.
For example, the person-count tag belongs to the second category of video tags. In some embodiments, obtaining the person-count tag includes: first, acquiring the target-area image (namely the face-area image) of each target image frame of the target video; second, extracting an image feature vector from each target-area image and calculating the similarities between the feature vectors; third, aggregating the target-area images according to those similarities and a preset similarity threshold to obtain a clustering result; and finally, determining the person-count tag of the target video from the clustering result. The person-count tag thus depends on multiple target image frames: target areas corresponding to the same person are aggregated according to the similarity of their image feature vectors and the similarity threshold, and the number of people appearing in the target video is determined from the clustering result, yielding the person-count tag of the target video.
And the analysis module 304 is configured to generate a video analysis result based on the delivery effect of the target video and the video tag of the target video.
The delivery effect reflects the audience's acceptance of and fondness for the target video. In some embodiments, the delivery effect may be determined according to metrics such as the display amount, click-through rate, and conversion rate of the target video.
In some embodiments, there are multiple target videos; after the video tags of each target video are obtained, a number of matched target videos are selected from them according to display amount, click-through rate, and conversion rate, and the video analysis result is generated from the video tags of the matched target videos.
The video analysis apparatus provided by the embodiment of the disclosure determines image tags of target image frames in a preset target dimension; verifies the validity of those tags based on the number of times each appears in the target image frames, the total number of target image frames, and a preset validity threshold, obtaining a validity verification result; determines video tags of the target video according to the image tags of the target dimension and their validity verification results; and generates a video analysis result based on the delivery effect and video tags of the target video. By obtaining image tags of the target image frames in multiple dimensions and verifying their validity, video tags that comprehensively and accurately reflect the attribute features of the target video are obtained, and analyzing the video according to these tags effectively improves the accuracy and comprehensiveness of the video analysis result.
In some embodiments, as shown in FIG. 4, the verification module 400 includes:
a first obtaining unit 401, configured to obtain, based on the number of times that an image tag of a target dimension appears in a target image frame, the number of target image frames corresponding to the image tag of the target dimension in a target video.
The number of times an image tag of the target dimension appears in the target image frames is the number of target image frames labeled with that tag, not the number of occurrences of the tag itself.
In some embodiments, image tags of a target dimension in the target video are determined first, and then the number of times each image tag appears in the target image frame is counted, so as to determine the number of target image frames corresponding to the image tags.
A second obtaining unit 402, configured to obtain the total number of target image frames in the target video.
In some embodiments, the target image frame is an image frame obtained by decimating a target video. Therefore, the total number of the target image frames in the target video can be obtained according to the time length and the frame extraction frequency of the target video. For example, if the duration of the target video is T and the frame extraction frequency is f, the total number of target image frames M = T × f.
A first determining unit 403, configured to determine an effective value of an image tag of a target dimension based on the number of target image frames corresponding to the image tag of the target dimension in the target video and the total number of target image frames in the target video.
The more times an image tag appears, the more frequently or prominently the attribute feature corresponding to that tag is expressed in the target video, i.e., the more valid the tag. Conversely, if an image tag appears only once in the target video, the attribute feature corresponding to it is rarely expressed there, so the tag's validity for that video is low.
In some embodiments, the effective value of the image tag of the target dimension is a ratio of the number of target image frames corresponding to the image tag of the target dimension in the target video to the total number of target image frames in the target video. For example, if the effective value of the image tag of the target dimension is EV, the number of target image frames corresponding to the image tag of the target dimension in the target video is N, and the total number of target image frames is M, then EV = N/M.
A second determining unit 404, configured to determine a validity verification result of the image tag of the target dimension based on the valid value of the image tag of the target dimension and a preset valid threshold.
In some embodiments, an image tag of the target dimension whose effective value is greater than or equal to the preset validity threshold passes the validity verification, while an image tag whose effective value is less than the preset validity threshold fails the validity verification.
For example, suppose the total number of target image frames of the target video is 20, and the corresponding image tags include a warm red tone tag, a female tag, a child tag, a happy tag, and an angry tag, where the number of target image frames corresponding to the warm red tone tag is 15, to the female tag is 12, to the child tag is 10, to the happy tag is 10, and to the angry tag is 2. The effective values are therefore EV1 = 15/20 for the warm red tone tag, EV2 = 12/20 for the female tag, EV3 = 10/20 for the child tag, EV4 = 10/20 for the happy tag, and EV5 = 2/20 for the angry tag.
The validity threshold can be set by the user as needed. Assuming the preset validity threshold is 5/20, EV1, EV2, EV3, and EV4 are greater than 5/20 while EV5 is less than 5/20. Therefore, the warm red tone tag, the female tag, the child tag, and the happy tag pass the validity verification and are valid image tags, while the angry tag fails the validity verification and is not a valid image tag. Accordingly, the video tags of the target video include the warm red tone tag, the female tag, the child tag, and the happy tag.
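A minimal sketch of the validity verification in Python, reproducing the worked example above; the dictionary simply restates the frame counts from the example, and the greater-or-equal comparison mirrors the rule stated for the second determining unit 404:

```python
# Frame counts per image tag from the worked example: 20 target frames total.
tag_frame_counts = {
    "warm red tone": 15,
    "female": 12,
    "child": 10,
    "happy": 10,
    "angry": 2,
}
total_frames = 20            # M
validity_threshold = 5 / 20  # user-configurable preset threshold

# Effective value EV = N / M; tags with EV >= threshold pass verification
# and become the video tags of the target video.
video_tags = [tag for tag, n in tag_frame_counts.items()
              if n / total_frames >= validity_threshold]
print(video_tags)  # ['warm red tone', 'female', 'child', 'happy']
```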
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 performs the methods and processes described above, such as the video analysis method. For example, in some embodiments, the video analysis method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the video analysis method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the video analysis method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements any one of the above video analysis methods.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (12)

1. A video analytics method, comprising:
determining an image tag of a target image frame in a preset target dimension, wherein the target image frame is an image obtained based on a target video, and the target dimension is an angle for analyzing the target image frame;
verifying the validity of the image tags of the target dimension based on the number of times that the image tags of the target dimension appear in the target image frames, the total number of the target image frames, and a preset validity threshold, to obtain a validity verification result of the image tags of the target dimension, wherein the number of times that an image tag appears in the target image frames refers to the number of the target image frames labeled with that image tag;
determining a video label of the target video according to the image label of the target dimension and the validity verification result of the image label of the target dimension;
and generating a video analysis result based on the release effect of the target video and the video tag of the target video, wherein the release effect represents the degree to which the target video is accepted and liked by the audience.
2. The method of claim 1, wherein the target dimension is a hue dimension, and the image tags of the hue dimension are hue tags;
the image tag for determining the target image frame in the preset target dimension comprises:
acquiring RGB colors of pixels in the target image frame;
determining the hue tag of the target image frame in the hue dimension based on the RGB colors of the pixels.
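Purely as an illustration of claim 2 (not part of the claims), a hue tag can be derived from pixel RGB values as follows; the naive hue averaging and the warm/cool cutoffs are assumptions of this sketch, not choices fixed by the claim:

```python
import colorsys

def hue_tag(pixels):
    """Illustrative sketch of claim 2: derive a hue tag for one target image
    frame from the RGB colors of its pixels. The averaging scheme and the
    warm/cool cutoffs below are assumptions, not mandated by the claim."""
    # colorsys expects RGB channels in [0, 1]; H is returned on a 0-1 wheel.
    hues = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[0]
            for r, g, b in pixels]
    mean_hue = sum(hues) / len(hues)  # naive mean; hue is circular in general
    # Hues near red/orange/yellow read as warm; the rest as cool.
    return "warm tone tag" if mean_hue < 0.17 or mean_hue > 0.92 else "cool tone tag"

print(hue_tag([(220, 60, 40), (200, 80, 50)]))  # warm tone tag
```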
3. The method of claim 1, wherein the target dimension is a people dimension, and the image tags of the people dimension are people tags;
the determining of the image label of the target image frame in the preset target dimension includes:
acquiring a designated area and a corresponding confidence in the target image frame;
determining a target area based on the confidence of the designated area and a preset confidence threshold;
determining the people tag of the target image frame in the people dimension based on the target region.
4. The method of claim 3, wherein the determining a target region based on the confidence of the designated region and a preset confidence threshold comprises:
determining the designated region as the target region if the confidence of the designated region is greater than or equal to the confidence threshold.
5. The method of claim 3, wherein after determining the target region based on the confidence level of the designated region and a preset confidence level threshold, further comprising:
cropping the target area from the target image frame and storing the cropped target area;
and recording the serial number of the target image frame where the target area is located.
6. The method of claim 3, wherein the video tags of the target video further comprise a people number tag;
after the target region is determined based on the confidence level of the designated region and a preset confidence level threshold, the method further includes:
extracting image feature vectors of the target area images from the target image frames of the target video;
calculating the similarity between the image feature vectors;
aggregating the target area images according to the similarity between the image feature vectors and a preset similarity threshold value to obtain a clustering result;
and determining the people number tag of the target video according to the clustering result.
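Purely as an illustration of claims 4-6 (not part of the claims), the following sketch filters detected person regions by confidence, clusters their feature vectors by cosine similarity, and counts the clusters to form a people number tag; the greedy clustering scheme and both thresholds are assumptions:

```python
import numpy as np

def people_number(regions, conf_threshold=0.5, sim_threshold=0.8):
    """Illustrative sketch of claims 4-6. `regions` is a list of
    (confidence, feature_vector) pairs for designated person regions; the
    greedy clustering and both threshold values are assumptions."""
    # Claim 4: keep only regions whose confidence meets the preset threshold.
    feats = [np.asarray(f, dtype=float) for c, f in regions if c >= conf_threshold]
    centroids = []  # one representative feature vector per cluster (person)
    for f in feats:
        f = f / np.linalg.norm(f)  # normalize so dot product = cosine similarity
        # Claim 6: aggregate regions whose similarity meets the threshold;
        # otherwise start a new cluster for a distinct person.
        if not centroids or max(float(f @ c) for c in centroids) < sim_threshold:
            centroids.append(f)
    return len(centroids)

# Two near-duplicate regions, one low-confidence region, one distinct region -> 2 people
regions = [(0.9, [1.0, 0.0]), (0.8, [0.98, 0.05]),
           (0.3, [0.5, 0.5]), (0.95, [0.0, 1.0])]
print(people_number(regions))  # 2
```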
7. The method of any of claims 3-6, wherein the people dimension comprises at least one of an expression, an age group, and a gender; and the people tag comprises at least one of an expression tag, an age tag, and a gender tag.
8. The method of claim 7, wherein the expression comprises at least one of happy, angry, serious, sad, surprised, and neutral; and the expression tags include at least one of a happy tag, an angry tag, a serious tag, a sad tag, a surprised tag, and a neutral tag;
the age group comprises at least one of an infant, a child, an adult, and an elderly person; and the age tag includes at least one of an infant tag, a child tag, an adult tag, and an elderly tag.
9. The method according to claim 1, wherein the verifying the validity of the image tag of the target dimension based on the number of times the image tag of the target dimension appears in the target image frame, the total number of the target image frames and a preset validity threshold to obtain a validity verification result of the image tag of the target dimension, comprises:
acquiring the number of the target image frames corresponding to the image tags of the target dimension in the target video based on the number of times that the image tags of the target dimension appear in the target image frames;
acquiring the total number of the target image frames in the target video;
determining an effective value of an image tag of a target dimension based on the number of the target image frames corresponding to the image tag of the target dimension in the target video and the total number of the target image frames in the target video;
and determining a validity verification result of the image tag of the target dimension based on the effective value of the image tag of the target dimension and a preset validity threshold.
10. A video analysis apparatus comprising:
the image tag acquisition module is used for determining an image tag of a target image frame in a preset target dimension, wherein the target image frame is an image acquired based on a target video, and the target dimension is an angle for analyzing the target image frame;
the verification module is used for verifying the validity of the image tags of the target dimension based on the number of times that the image tags of the target dimension appear in the target image frames, the total number of the target image frames and a preset validity threshold to obtain a validity verification result of the image tags of the target dimension, wherein the number of times that the image tags appear in the target image frames refers to the number of the target image frames labeled with the image tags;
the video tag acquisition module is used for determining the video tag of the target video according to the image tag of the target dimension and the validity verification result of the image tag of the target dimension;
and the analysis module is used for generating a video analysis result based on the release effect of the target video and the video tag of the target video, wherein the release effect represents the degree to which the target video is accepted and liked by the audience.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202110219554.0A 2021-02-26 2021-02-26 Video analysis method and device, electronic equipment and readable storage medium Active CN112948635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110219554.0A CN112948635B (en) 2021-02-26 2021-02-26 Video analysis method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110219554.0A CN112948635B (en) 2021-02-26 2021-02-26 Video analysis method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112948635A CN112948635A (en) 2021-06-11
CN112948635B true CN112948635B (en) 2022-11-08

Family

ID=76246556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110219554.0A Active CN112948635B (en) 2021-02-26 2021-02-26 Video analysis method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112948635B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289926B2 (en) * 2017-05-03 2019-05-14 Accenture Global Solutions Limited Target object color analysis and tagging
CN108491817B (en) * 2018-03-30 2021-02-26 国信优易数据股份有限公司 Event detection model training method and device and event detection method
CN108875619B (en) * 2018-06-08 2021-09-07 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN108777815B (en) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112948635A (en) 2021-06-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant