CN108777815B - Video processing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN108777815B
Authority
CN
China
Prior art keywords
image
scene
label
video
confidence
Prior art date
Legal status
Active
Application number
CN201810585928.9A
Other languages
Chinese (zh)
Other versions
CN108777815A (en)
Inventor
陈岩
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201810585928.9A
Publication of CN108777815A
Priority to PCT/CN2019/087557 (published as WO2019233263A1)
Application granted
Publication of CN108777815B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The application relates to a video processing method and device, an electronic device, and a computer-readable storage medium. The method comprises: extracting one frame of image from a video at intervals of a preset number of frames, performing scene recognition on the extracted image to obtain a scene label of the image and a corresponding confidence, establishing a label frequency histogram according to the scene labels and their confidences, and determining a video label of the video according to the label frequency histogram. Because the video label is determined by building a label frequency histogram from the scene labels of the images in the video, the accuracy of the video label can be improved.

Description

Video processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of internet technology, video has become one of the important forms of entertainment in people's daily life. People can browse different videos according to the video tags on an electronic device. When uploading videos to a video website, people need to classify the videos and add video tags; an electronic device can obtain the video tags of a video after identifying it. However, the conventional technology suffers from inaccurate acquisition of video tags.
Disclosure of Invention
The embodiment of the application provides a video processing method and device, electronic equipment and a computer readable storage medium, which can improve the accuracy of video tags.
A video processing method, comprising:
extracting one frame of image from a video at intervals of a preset number of frames, and performing scene recognition on the extracted image to obtain a scene label of the image and a corresponding confidence;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence;
and determining the video label of the video according to the label frequency histogram.
A video processing apparatus comprising:
the scene recognition module is used for extracting one frame of image from a video at intervals of a preset number of frames, and performing scene recognition on the extracted image to obtain a scene label of the image and a corresponding confidence;
the histogram establishing module is used for establishing a label frequency histogram according to the scene labels of the image and the corresponding confidences;
and the video label determining module is used for determining the video label of the video according to the label frequency histogram.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of:
extracting one frame of image from a video at intervals of a preset number of frames, and performing scene recognition on the extracted image to obtain a scene label of the image and a corresponding confidence;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence;
and determining the video label of the video according to the label frequency histogram.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
extracting one frame of image from a video at intervals of a preset number of frames, and performing scene recognition on the extracted image to obtain a scene label of the image and a corresponding confidence;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence;
and determining the video label of the video according to the label frequency histogram.
The video processing method and device, the electronic device, and the computer-readable storage medium extract one frame of image from the video at intervals of a preset number of frames, perform scene recognition on the extracted image to obtain the scene label of the image and the corresponding confidence, and establish a label frequency histogram according to the scene label and confidence of each image in the video to determine the video label of the video. Because the video label is determined by building a label frequency histogram from the scene labels of the images in the video, the accuracy of the video label can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram showing an internal structure of an electronic apparatus according to an embodiment;
FIG. 2 is a flow diagram of a video processing method in one embodiment;
FIG. 3 is a schematic diagram of an embodiment of a neural network;
FIG. 4 is a flow diagram of establishing a frequency histogram of tags in one embodiment;
FIG. 5 is a flow chart of creating a frequency histogram of tags in another embodiment;
FIG. 6 is a flow diagram of adjusting confidence in one embodiment;
FIG. 7 is a flow diagram of a video processing method in one embodiment;
FIG. 8 is a block diagram showing the structure of a video processing apparatus according to one embodiment;
FIG. 9 is a schematic diagram of an image processing circuit in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a schematic diagram of the internal structure of an electronic device in one embodiment. As shown in fig. 1, the electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capability and supports the operation of the whole electronic device. The memory is used for storing data, programs, and the like; it stores at least one computer program that can be executed by the processor to implement the video processing method provided in the embodiments of the present application. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the video processing method provided in the following embodiments. The internal memory provides a cached execution environment for the operating system and computer programs in the non-volatile storage medium. The network interface may be an Ethernet card, a wireless network card, or the like, for communicating with an external electronic device. The electronic device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
FIG. 2 is a flow diagram of a video processing method in one embodiment. The video processing method in this embodiment is described by taking the electronic device in fig. 1 as an example. As shown in fig. 2, the video processing method includes steps 202 to 206.
Step 202, extracting one frame of image from the video at intervals of a preset number of frames, and performing scene recognition on the extracted image to obtain a scene tag of the image and a corresponding confidence.
Video refers to any video on the electronic device. Specifically, the video may be captured by the electronic device through a camera, stored locally on the electronic device, or downloaded by the electronic device from a network. A video is a continuous picture composed of multiple frames of still images. The preset number of frames can be determined according to actual application requirements. Specifically, it may be determined according to the video frame rate, the video duration, or a combination of the two. For example, the preset number may be 0, in which case the electronic device extracts every frame of the video and performs scene recognition on each extracted image.
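As an illustration, the frame extraction step might look like the following minimal sketch. The use of OpenCV and the names extract_frames and preset_frames are assumptions for illustration; the patent does not prescribe a library.

```python
import cv2  # assumed library choice; the patent does not name one

def extract_frames(video_path, preset_frames):
    """Yield one frame, then skip `preset_frames` frames, repeatedly.

    With preset_frames == 0 every frame is extracted, matching the
    example in the description above.
    """
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % (preset_frames + 1) == 0:
            yield frame
        index += 1
    cap.release()
```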
The electronic device may train a scene recognition model using deep learning algorithms such as VGG (Visual Geometry Group), CNN (Convolutional Neural Network), SSD (Single Shot MultiBox Detector), and Decision Tree, and perform scene recognition on the image with this model. In particular, the electronic device may train a neural network that can output a plurality of scene labels. During training, a training image containing a plurality of training labels, or a plurality of training images containing different training labels, may be input into the neural network. The neural network performs feature extraction on the training image, detects the extracted image features to obtain a prediction confidence for each feature in the image, derives a loss function from the prediction confidence and the true confidence of the features, and adjusts the parameters of the neural network according to the loss function. The trained neural network can then simultaneously identify the scene labels corresponding to a plurality of features of an image, yielding a neural network that outputs a plurality of scene labels. Confidence indicates the degree to which the measured value of a parameter can be trusted. The true confidence represents the confidence that a pre-labeled feature in the training image belongs to the specified scene class. The scene of the image may be landscape, beach, blue sky, green grass, snow scene, fireworks, spotlights, text, portrait, baby, cat, dog, food, etc.
The electronic device detects an image with a neural network capable of outputting multiple labels. Specifically, the input layer of the neural network receives the input image, and the features of the image are extracted through a base network (such as a VGG network). The extracted image features are input to a detection network layer for scene detection; the detection network layer may use an SSD network, a MobileNet network, or the like. The confidence and corresponding position of the category to which each feature belongs are output through a softmax classifier at the output layer, and for each feature the target category with the highest confidence that also exceeds a confidence threshold is selected as the scene label of that feature in the image. In this way the scene label and corresponding confidence of each feature in the image are output.
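A minimal sketch of the label-selection rule described above: for each detected feature, keep the highest-confidence category if it exceeds a confidence threshold. The threshold value and all names are illustrative assumptions.

```python
def select_scene_labels(per_feature_scores, threshold=0.3):
    # per_feature_scores: one {label: confidence} dict per detected
    # feature, e.g. the softmax output of the detection head.
    labels = []
    for scores in per_feature_scores:
        if not scores:
            continue
        label, conf = max(scores.items(), key=lambda kv: kv[1])
        if conf > threshold:  # keep only the confident winner
            labels.append((label, conf))
    return labels
```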
And step 204, establishing a label frequency histogram according to the scene labels of the image and the corresponding confidence degrees.
The label frequency histogram is a histogram built from the frequency of each scene label in the video. The frequency of a scene tag is determined by the number of images that contain that scene tag and by the confidence of the scene tag in those images. Specifically, the frequency of a scene tag may be the ratio of the number of images in the video containing that scene tag to the number of all extracted images; alternatively, a weight value of the scene tag in each image may be determined from the confidence of the scene tag in the image or from the size of the position region corresponding to the scene tag, and the weighted sum or weighted average over the images containing that scene tag may be used as its frequency. In one embodiment, the electronic device builds the tag frequency histogram with the scene tags on the abscissa and their frequencies on the ordinate, so that the frequency of each scene tag in the video can be read from the histogram.
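The weighted-average variant of the histogram could be sketched as follows; this is one of the several frequency definitions listed above, not the only one the patent covers, and all names are assumptions.

```python
from collections import defaultdict

def build_label_frequency_histogram(image_labels, num_images):
    # image_labels: per extracted image, a list of (scene_label, weight)
    # pairs, where the weight may be the label's confidence.
    totals = defaultdict(float)
    for labels in image_labels:
        for label, weight in labels:
            totals[label] += weight
    # Weighted average over all extracted images: labels that appear in
    # many frames with high confidence receive a high frequency.
    return {label: total / num_images for label, total in totals.items()}
```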
And step 206, determining the video label of the video according to the label frequency histogram.
The video tag marks the video according to the scenes appearing in it, and people can get a general idea of the video's main content from the video tag. The number of video tags may be 1 or several, such as 2, 3, 4, etc., but is not limited thereto. Specifically, the electronic device may obtain the frequency of each scene tag in the video from the tag frequency histogram, sort the scene tags by frequency according to a preset rule, and take the most frequent scene tags as the video tags of the video. The electronic device may also preset a tag threshold and take the scene tags whose frequency exceeds that threshold as the video tags of the video. The electronic device may also read the scene tags and corresponding frequencies from the frequency histogram in a loop, thereby obtaining the scene tag with the maximum frequency as the video tag of the video. The electronic device may also determine the video tags from the tag frequency histogram by a combination of the above ways or in other ways, which are not limited herein.
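The three selection strategies described above (frequency threshold, top-k by frequency, single maximum) might be combined as in this sketch; the parameter names are assumptions.

```python
def determine_video_tags(histogram, top_k=None, tag_threshold=None):
    # histogram: {scene_label: frequency}, as built from the extracted images.
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    if tag_threshold is not None:
        # Threshold variant: every label whose frequency exceeds the threshold.
        return [label for label, freq in ranked if freq > tag_threshold]
    if top_k is not None:
        # Top-k variant: the k most frequent labels.
        return [label for label, _ in ranked[:top_k]]
    # Default variant: the single most frequent label.
    return [ranked[0][0]]
```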
According to the video processing method provided by the embodiment of the application, one frame of image is extracted from the video at intervals of a preset number of frames, scene recognition is performed on the extracted image to obtain the scene label of the image and the corresponding confidence, and a label frequency histogram is established according to the scene label and confidence of each image in the video to determine the video label of the video, which can improve the accuracy of the video label.
In an embodiment, the process of performing scene recognition on the extracted image to obtain the scene tag of the image and the corresponding confidence may further include: performing scene recognition on the image to obtain a classification label of the image, performing target detection on the image to obtain a target label of the image, and using both the classification label and the target label as scene labels of the image.
In particular, the electronic device may employ image classification techniques for scene recognition of images. The electronic device can pre-store image feature information corresponding to a plurality of classification labels, match the image feature information in the image needing scene recognition with the pre-stored image feature information, and acquire the classification label corresponding to the successfully matched image feature information as the classification label of the image. Similarly, the electronic device performs target detection on the image, and can match the image feature information in the image with feature information corresponding to a pre-stored target label, and acquire the target label corresponding to the feature information successfully matched as the target label of the image. The classification labels pre-stored in the electronic device may include: landscape, beach, blue sky, green grass, snow scene, night scene, darkness, backlighting, sunset, fireworks, spotlights, indoors, microspurs, text, portrait, baby, cat, dog, food, etc.; the target tag may include: portrait, baby, cat, dog, gourmet, text, blue sky, green grass, beach, firework, etc. The electronic device can use both the classification tag and the target tag as scene tags of the image, and sequentially output the scene tags of the image and the corresponding confidence degrees according to the confidence degrees.
In an embodiment, the process of performing scene recognition on the extracted image to obtain the scene tag of the image and the corresponding confidence may further include: performing scene classification and target detection on the image to obtain a classification label and a target label of the image, and using both the classification label and the target label as scene labels of the image.
In particular, the electronic device may train a neural network that can perform both scene classification and target detection. During training, a training image containing at least one background training target and one foreground training target may be input into the neural network. The neural network performs feature extraction on the background training target and the foreground training target; it detects the background training target to obtain a first prediction confidence and derives a first loss function from the first prediction confidence and the first true confidence; it detects the foreground training target to obtain a second prediction confidence and derives a second loss function from the second prediction confidence and the second true confidence; a target loss function is then obtained from the first and second loss functions, and the parameters of the neural network are adjusted accordingly. The trained neural network can subsequently perform scene classification and target classification at the same time, yielding a neural network that can detect the foreground and background regions of an image simultaneously. The first true confidence represents the confidence of the specified image class to which the background image pre-labeled in the training image belongs. The second true confidence represents the confidence of the specified target class to which the foreground target pre-labeled in the training image belongs.
In one embodiment, the neural network comprises at least one input layer, a base network layer, a classification network layer, a target detection network layer, and two output layers; the two output layers comprise a first output layer cascaded with the classification network layer and a second output layer cascaded with the target detection network layer. In the training stage, the input layer receives the training image, and the first output layer outputs the first prediction confidence of the specified scene category to which the background image detected by the classification network layer belongs. The second output layer outputs, for each preselected default bounding box detected by the target detection network layer, the offset parameters relative to the real bounding box of the specified target and the second prediction confidence of the class of the specified target. FIG. 3 is a diagram of a neural network in one embodiment. As shown in fig. 3, the input layer of the neural network receives a training image with an image category label and performs feature extraction through a base network (such as a VGG network); the extracted image features are output to a feature layer. Category detection on the image by the feature layer yields a first loss function; target detection on the foreground target according to the image features yields a second loss function; position detection of the foreground target yields a position loss function; and a weighted summation of the first loss function, the second loss function, and the position loss function yields the target loss function. The neural network comprises a data input layer, a base network layer, a scene classification network layer, a target detection network layer, and two output layers. The data input layer receives raw image data. The base network layer preprocesses the image input by the input layer and extracts its features. Preprocessing may include de-averaging, normalization, dimensionality reduction, and whitening. De-averaging means centering each dimension of the input data at 0, pulling the center of the sample back to the origin of the coordinate system. Normalization scales the amplitudes to the same range. Whitening normalizes the amplitude on each feature axis of the data. For feature extraction, for example, the first five convolutional layers of VGG16 may be applied to the original image, and the extracted features are input to the classification network layer and the target detection network layer.
In the classification network layer, the features may be detected using the depthwise convolutions and pointwise convolutions of a MobileNet network, then input to the output layer to obtain a first prediction confidence of the specified image category to which the image scene classification belongs; a first loss function is then obtained from the difference between the first prediction confidence and the first true confidence. The target detection network layer may use, for example, an SSD network, cascading convolutional feature layers after the first five convolutional layers of VGG16; in the convolutional feature layers, a set of convolution filters is used to predict the offset parameters of the preselected default bounding box corresponding to the specified target class relative to the real bounding box, and the second prediction confidence corresponding to the specified target class. The region of interest is the region of a preselected default bounding box. A position loss function is constructed from the offset parameters, and a second loss function is obtained from the difference between the second prediction confidence and the second true confidence. The first loss function, the second loss function, and the position loss function are weighted and summed to obtain the target loss function, and the parameters of the neural network are adjusted with a back-propagation algorithm according to the target loss function to train the neural network.
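In formula form, the target loss described above is a weighted sum of the scene classification loss, the target detection loss, and the position loss. The weighting symbols below are assumptions for illustration, since the description only states that a weighted summation is used:

```latex
L_{\mathrm{target}} = \alpha \, L_{1} + \beta \, L_{2} + \gamma \, L_{\mathrm{loc}}
```

Here L_1 is obtained from the difference between the first prediction confidence and the first true confidence, L_2 from the difference between the second prediction confidence and the second true confidence, and L_loc from the bounding-box offset parameters.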
When the trained neural network is used to identify an image, the neural network input layer receives the input image and the features of the image are extracted. The features are input to the classification network layer for image scene recognition, the confidence of each specified scene category to which the background image belongs is output by the softmax classifier at the first output layer, and the image scene with the highest confidence that also exceeds the confidence threshold is selected as the classification label of the background of the image. The extracted features are also input to the target detection network layer for foreground target detection; the confidence and corresponding position of the specified target category to which the foreground target belongs are output through the softmax classifier at the second output layer, each target label of the foreground target is output along with its position, and the resulting classification label and target labels are used as the scene labels of the image.
In an embodiment, a process of creating a tag frequency histogram according to scene tags and corresponding confidence levels of an image in a video processing method is provided, as shown in fig. 4, including:
step 402, using the confidence corresponding to the scene label of the image as the weight value of the scene label in the image.
The weight value of a scene tag in an image reflects the importance of that scene tag for the video tag. With the scene labels and weight values of the other images in the video held constant, the higher the weight value of a scene label in the image, the higher the frequency of that scene label in the video; the lower the weight value of a scene label in the image, the lower the frequency of that scene label in the video.
Step 404, a tag frequency histogram is established according to the scene tags of the image and the corresponding weight values.
Specifically, the electronic device may compute a weighted sum for each scene tag from the scene tags of the images in the video and their corresponding weight values, and establish the tag frequency histogram from the scene tags and their weighted sums. The electronic device may instead compute a weighted average for each scene tag and establish the tag frequency histogram from the scene tags and their weighted averages. For example, suppose the scene tags and corresponding confidences output for the images in a video are, for image A: baby 0.9, grass 0.8, blue sky 0.5; for image B: gourmet 0.8, baby 0.6; and for image C: blue sky 0.7, baby 0.3. In the tag frequency histogram established from the scene tags and their weighted averages, the frequency of baby is 0.6, the frequency of blue sky is 0.4, and the frequencies of grass and gourmet are both about 0.27. The electronic device may then use baby as the video tag of the video, or use both baby and blue sky as video tags, etc.
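The frequencies in this example can be checked directly; a short self-contained sketch under the weighted-average definition:

```python
from collections import defaultdict

# Reproducing the worked example: three extracted images A, B, C.
images = [
    [("baby", 0.9), ("grass", 0.8), ("blue sky", 0.5)],  # image A
    [("gourmet", 0.8), ("baby", 0.6)],                   # image B
    [("blue sky", 0.7), ("baby", 0.3)],                  # image C
]
totals = defaultdict(float)
for labels in images:
    for label, conf in labels:
        totals[label] += conf
freq = {label: round(total / len(images), 2) for label, total in totals.items()}
# {'baby': 0.6, 'grass': 0.27, 'blue sky': 0.4, 'gourmet': 0.27}
```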
Using the confidence of each scene label of the image as that label's weight value, and establishing the label frequency histogram from the scene labels in the video and their weight values to determine the video label, can improve the accuracy of the video label.
As shown in fig. 5, in an embodiment, the process of creating a tag frequency histogram according to scene tags and corresponding confidence levels of an image in the provided video processing method may further include:
Step 502, determining the weight value of each scene tag according to the rank of its confidence among the confidences of all the scene tags of the image.
Specifically, the electronic device may sort the scene tags in the image by confidence to obtain a sequence-number tag for each scene tag: the scene tag with the highest confidence becomes the first tag, the next becomes the second tag, and so on. For example, if in one frame of the video the confidence of beach is 0.6, the confidence of blue sky is 0.9, and the confidence of portrait is 0.8, then in that frame blue sky is the first tag, portrait is the second tag, and beach is the third tag. The electronic device can pre-store the weight values corresponding to the different sequence-number tags and determine the weight value of each scene tag from its sequence-number tag.
In one embodiment, when the confidence of a scene tag is the highest among the confidences of all scene tags of the image, the weight value of that scene tag is the highest among the weight values of the scene tags in the image. Specifically, of the weight values the electronic device pre-stores for the sequence-number tags, the first tag has the highest weight value, the second tag the next highest, and so on. For example, the electronic device may pre-store a weight value of 0.8 for the first tag, 0.5 for the second tag, and 0.2 for the third tag; in the example above, the weight value of the first tag, blue sky, is then 0.8, the weight value of the second tag, portrait, is 0.5, and the weight value of the third tag, beach, is 0.2.
Step 504, a label frequency histogram is established according to the scene label of the image and the corresponding weight value.
Specifically, the electronic device may obtain a weighted sum of the scene tags according to the scene tags of the images in the video and the corresponding weight values, and establish a tag frequency histogram according to the scene tags and the corresponding weighted sum. The electronic equipment can also obtain a weighted average value of the scene label according to the scene label of the image in the video and the corresponding weight value, and establish a label frequency histogram according to the scene label and the corresponding weighted average value.
Determining the weight value of each scene label in the image from the scene label and its confidence, and establishing the label frequency histogram from the scene labels in the video and their weight values to determine the video label, can improve the accuracy of the video label.
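A sketch of the sequence-number weighting, using the pre-stored table from the example above (0.8 / 0.5 / 0.2). Assigning weight 0 to ranks beyond the table is an assumption for illustration:

```python
RANK_WEIGHTS = [0.8, 0.5, 0.2]  # first, second, third tag (from the example)

def rank_based_weights(labels_with_conf):
    # labels_with_conf: [(scene_label, confidence), ...] for one image.
    ranked = sorted(labels_with_conf, key=lambda kv: kv[1], reverse=True)
    return [
        (label, RANK_WEIGHTS[i] if i < len(RANK_WEIGHTS) else 0.0)
        for i, (label, _) in enumerate(ranked)
    ]

# Example from the text: blue sky 0.9, portrait 0.8, beach 0.6
# -> [('blue sky', 0.8), ('portrait', 0.5), ('beach', 0.2)]
```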
In one embodiment, the process of creating the tag frequency histogram according to the scene tags of the image and the corresponding confidences in the provided video processing method further includes: establishing the tag frequency histogram of the video according to the scene tags whose confidences are greater than a threshold value and their corresponding confidences.
By establishing the label frequency histogram only from scene labels whose confidence is greater than the threshold value, the electronic device filters out the scene labels in the image whose confidence falls below the threshold. The threshold value may be determined according to actual requirements; specifically, it may be 0.1, 0.15, 0.2, 0.3, or the like, but is not limited thereto. The electronic device obtains the scene tags whose confidences exceed the threshold and their corresponding confidences, determines the corresponding weight values from those confidences, and establishes the tag frequency histogram according to the scene tags of the image and the corresponding weight values to determine the video tags of the video. This reduces the influence of low-confidence scene tags in the image on the video tags and improves the accuracy of the scene tags. For example, if in one frame of a video the scene tags and confidences are dog 0.8, cat 0.2, grass 0.7, and gourmet 0.1, and the threshold is 0.3, then the two scene tags cat 0.2 and gourmet 0.1 are discarded, and the tag frequency histogram is established from the two scene tags dog 0.8 and grass 0.7.
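The filtering step reduces to a one-line predicate; a sketch reproducing the dog/cat/grass/gourmet example above (names are assumptions):

```python
def filter_labels(labels_with_conf, threshold=0.3):
    # Keep only scene labels whose confidence exceeds the threshold.
    return [(l, c) for l, c in labels_with_conf if c > threshold]

print(filter_labels([("dog", 0.8), ("cat", 0.2), ("grass", 0.7), ("gourmet", 0.1)]))
# [('dog', 0.8), ('grass', 0.7)]
```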
As shown in fig. 6, in an embodiment, the provided video processing method may further include a process of adjusting the confidence of the scene tags, with the following specific steps:
step 602, position information is obtained when the video is shot.
When the electronic device captures a video, the address information at the time of capture can be acquired through a Global Positioning System (GPS), and the position information at the time of capture can be derived from that address information. For example, when the GPS detects that the address of the video capture is east longitude 109.408984 and north latitude 18.294898, the electronic device may determine from this address information that the corresponding location is a beach in Sanya, Hainan.
And step 604, adjusting the confidence corresponding to the scene label in the image according to the position information.
The electronic device can pre-store the scene tags corresponding to different position information and the weight of each scene tag, and adjust the confidence of the scene tags in the image according to those weights. Specifically, the weight of a scene tag may be obtained by statistical analysis of a large number of image materials, with the scene tags and their weights matched to the different position information according to the results. For example, statistical analysis of a large number of image materials may show that when the position information is "beach", the weight of the "beach" label is 9, the weight of "blue sky" is 8, the weight of "landscape" is 7, the weight of "snow scene" is -8, and the weight of "green grass" is -7, with weights ranging over [-10, 10]. A larger weight indicates a larger probability of the scene appearing in the image, and a smaller weight indicates a smaller probability. For every unit the weight rises above 0, the confidence of the corresponding scene increases by 1%; likewise, for every unit it falls below 0, the confidence decreases by 1%.
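A sketch of the location-based adjustment under one reading of the description, in which each unit of weight shifts the label's confidence by 1%. The table contents come from the "beach" example; the multiplicative interpretation and the clamping to [0, 1] are assumptions.

```python
# Pre-stored per-location weights in [-10, 10], from the "beach" example.
LOCATION_WEIGHTS = {
    "beach": {"beach": 9, "blue sky": 8, "landscape": 7,
              "snow scene": -8, "green grass": -7},
}

def adjust_confidences(labels_with_conf, location):
    weights = LOCATION_WEIGHTS.get(location, {})
    adjusted = []
    for label, conf in labels_with_conf:
        w = weights.get(label, 0)      # 0: leave the confidence unchanged
        conf = conf * (1 + w / 100.0)  # +/- 1% of confidence per unit of weight
        adjusted.append((label, min(1.0, max(0.0, conf))))
    return adjusted
```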
The position information is obtained from the address information of the video capture, the weight of each scene label under that position information is obtained, and the confidence of the image's scene labels is adjusted accordingly. This makes the confidences of the scene labels more accurate and improves the accuracy of the video label.
As shown in fig. 7, in an embodiment, a process of determining a video tag of a video according to a tag frequency histogram in the provided video processing method includes:
step 702, obtaining the frequency corresponding to the scene label according to the label frequency histogram.
Specifically, the electronic device may obtain frequencies corresponding to scene tags in the video according to the tag frequency histogram, and sort the scene tags according to a preset rule according to the frequencies.
Step 704, taking a preset number of the most frequent scene tags as the video tags of the video.
The preset number can be determined according to the actual application scenario. Specifically, when the electronic device displays videos classified by video tags, the preset number may be 1; when the electronic device uploads a video to a video website, the preset number is determined by the website's limit on the number of video tags. The preset number may be 1 or several, such as 2, 3, 4, etc., but is not limited thereto. For example, when the electronic device uploads videos to a video website that allows 3 video tags, the preset number may be 3. The electronic device can sort the scene tags by frequency in descending order and take the preset number of most frequent scene tags, in order, as the video tags of the video.
In one embodiment, a video processing method is provided, and the specific steps for implementing the method are as follows:
firstly, the electronic equipment extracts one frame of image from every preset frame in the video, and carries out scene recognition on the extracted image to obtain a scene label and a corresponding confidence coefficient of the image. Video is a continuous picture composed of multiple frames of still images. The preset frame can be determined according to the actual application requirement. Specifically, the preset frame may be determined according to a video frame rate, may also be determined according to a video duration, and may also be determined according to a combination of the frame rate and the duration. The electronic equipment can train a scene recognition model according to deep learning algorithms such as VGG, CNN, SSD, decision trees and the like, and perform scene recognition on the image according to the scene recognition model. Specifically, the electronic device detects the image by using a neural network capable of outputting multiple labels, so as to output scene labels and corresponding confidence degrees of each feature in the image.
Optionally, the electronic device performs scene recognition on the image to obtain a classification tag of the image, performs target detection on the image to obtain a target tag of the image, and uses the classification tag and the target tag as a scene tag of the image. The electronic device can pre-store image feature information corresponding to a plurality of classification labels, match the image feature information in the image needing scene recognition with the pre-stored image feature information, and acquire the classification label corresponding to the successfully matched image feature information as the classification label of the image. The electronic equipment performs target detection on the image, can match the image characteristic information in the image with the characteristic information corresponding to the pre-stored target label, and obtains the target label corresponding to the successfully matched characteristic information as the target label of the image. The electronic device can use both the classification tag and the target tag as scene tags of the image, and obtain confidence degrees corresponding to the scene tags.
Optionally, the electronic device performs scene classification and target detection on the image to obtain a classification tag and a target tag of the image, and the classification tag and the target tag are used as scene tags of the image. The electronic equipment can train a neural network capable of realizing scene classification and target detection at the same time, feature extraction is carried out on an image by utilizing a basic network layer of the neural network, the extracted image features are input into a classification network and a target detection network layer, scene detection is carried out through the classification network to output the confidence coefficient of the appointed image class of the background region of the image, target detection is carried out through the target detection network layer to obtain the confidence coefficient of the appointed target class of the foreground region, and therefore classification labels and corresponding confidence coefficients of the image, the target labels and the corresponding confidence coefficients and the position of the target are output.
Optionally, the electronic device obtains position information when the video is shot, and adjusts a confidence corresponding to the scene tag in the image according to the position information. When the electronic equipment shoots the video, the address information during video shooting can be acquired through the GPS, and the position information during video shooting can be acquired according to the address information. The electronic device can pre-store scene tags corresponding to different position information and weights corresponding to the scene tags, and the confidence corresponding to the scene tags in the image is adjusted according to the weights of the scene tags.
And then, the electronic equipment establishes a label frequency histogram according to the scene labels of the image and the corresponding confidence coefficients. The label frequency histogram refers to a histogram established according to the frequency of each scene label in the video. The frequency of scene tags is determined based on the number of images that contain a scene tag and the confidence level of that scene tag in the images. Specifically, the frequency of the scene tag may be a ratio of the number of images including the scene tag in the video to the number of all extracted images, or may also be a weight value of the scene tag in the image determined according to a confidence of the scene tag in the image or a size of a position region corresponding to the scene tag, and a weighted sum or a weighted average of the images including the scene tag in the video is used as the frequency of the scene tag.
Optionally, the electronic device uses the confidence corresponding to the scene tag of the image as a weight value of the scene tag in the image. And establishing a label frequency histogram according to the scene label of the image and the corresponding weight value. The weight value of the scene tag in the image refers to the importance degree of the scene tag in the image in the video tag. The scene labels of the images and the corresponding confidence coefficients are used as the weight values of the scene labels in the images, and the label frequency histogram is established according to the scene labels in the videos and the corresponding weight values, so that the video labels of the videos are determined, and the accuracy of the video labels can be improved.
Optionally, the electronic device determines the weight values of the scene tags according to the confidence degrees of the scene tags of the image at the confidence degrees of all the scene tags of the image, and establishes a tag frequency histogram according to the scene tags of the image and the corresponding weight values. The electronic device can sequence the scene tags in the image according to the confidence degree of the scene tags to obtain the sequence number tags corresponding to the scene tags, namely, the scene tag with the maximum confidence degree is used as a first tag, the next scene tag is used as a second tag, and the like. The electronic equipment can prestore weighted values corresponding to different sequence number labels, determine the weighted value of the scene label according to the sequence number label of the scene label, and establish a label frequency histogram according to the scene label of the image and the corresponding weighted value.
Optionally, when the confidence of a scene tag of the image is the highest among the confidences of all the scene tags of the image, the weight value of that scene tag is the highest among the weight values in the image. Specifically, of the weight values the electronic device pre-stores for the sequence-number tags, the first tag has the highest weight value, the second tag the next highest, and so on.
Optionally, the electronic device establishes a label frequency histogram of the video according to the scene label with the confidence level greater than the threshold value and the corresponding confidence level. The electronic equipment acquires the scene tags with the confidence degrees larger than the threshold value and the corresponding confidence degrees, determines the corresponding weight values according to the confidence degrees of the scene tags, establishes a tag frequency histogram according to the scene tags of the image and the corresponding weight values to determine the video tags of the video, can reduce the influence of the scene tags with low confidence degrees in the image on the video tags, and improves the accuracy of the scene tags.
Next, the electronic device determines the video tag of the video according to the tag frequency histogram. The video tag marks the video according to the scenes appearing in it, and people can get a general idea of the video's main content from the video tag. The electronic device can obtain the frequency of each scene tag in the video from the tag frequency histogram, sort the scene tags by frequency according to a preset rule, and take the most frequent scene tags as the video tags of the video.
Optionally, the electronic device obtains the frequencies of the scene tags from the tag frequency histogram and takes a preset number of the most frequent scene tags as the video tags of the video. The preset number can be determined according to the actual application scenario. Specifically, when the electronic device displays videos classified by video tags, the preset number may be 1; when the electronic device uploads a video to a video website, the preset number is determined by the website's limit on the number of video tags. The electronic device can sort the scene tags by frequency in descending order and take the preset number of most frequent scene tags, in order, as the video tags of the video.
It should be understood that although the steps in the flowcharts of fig. 2 and figs. 4-7 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and figs. 4-7 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and the order of their execution is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
Fig. 8 is a block diagram of a video processing device according to an embodiment. As shown in fig. 8, a video processing apparatus includes a scene recognition module 802, a histogram creation module 804, and a video tag determination module 806. Wherein:
the scene recognition module 802 is configured to extract a frame of image from the video at every preset frame, perform scene recognition on the extracted image, and obtain a scene tag and a corresponding confidence of the image.
A histogram establishing module 804, configured to establish a label frequency histogram according to the scene label of the image and the corresponding confidence level.
And a video tag determination module 806, configured to determine a video tag of the video according to the tag frequency histogram.
In an embodiment, the histogram establishing module 804 may be further configured to use the confidence degree corresponding to the scene tag of the image as a weight value of the scene tag in the image, and establish a tag frequency histogram according to the scene tag of the image and the corresponding weight value.
In an embodiment, the histogram establishing module 804 may be further configured to determine the weight value of a scene tag according to the rank of its confidence among the confidences of all the scene tags of the image, and establish a tag frequency histogram according to the scene tags of the image and the corresponding weight values.
In one embodiment, the histogram establishing module 804 may be further configured to determine the weight value of a scene tag according to the rank of its confidence among the confidences of all the scene tags of the image, where when the confidence of a scene tag is the highest among all the scene tags of the image, the weight value of that scene tag is the highest in the image.
In one embodiment, the histogram establishing module 804 may be further configured to establish a label frequency histogram of the video according to the scene label with the confidence level greater than the threshold value and the corresponding confidence level.
In an embodiment, the provided video processing apparatus may further include a confidence level adjustment module 808, where the confidence level adjustment module 808 is configured to obtain location information when the video is captured, and adjust confidence levels corresponding to scene tags in the image according to the location information.
In an embodiment, the video tag determining module 806 may be further configured to obtain frequencies corresponding to scene tags according to the tag frequency histogram, and use a preset number of scene tags with a high frequency as video tags of the video.
The division of the modules in the video processing apparatus is only for illustration, and in other embodiments, the video processing apparatus may be divided into different modules as needed to complete all or part of the functions of the video processing apparatus.
For specific limitations of the video processing apparatus, reference may be made to the above limitations of the video processing method, which is not described herein again. The various modules in the video processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
The implementation of each module in the video processing apparatus provided in the embodiment of the present application may be in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored on the memory of the terminal or the server. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the video processing method.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform a video processing method.
The embodiment of the application also provides an electronic device. The electronic device includes an image processing circuit, which may be implemented using hardware and/or software components and may include various processing units defining an ISP (Image Signal Processing) pipeline. FIG. 9 is a schematic diagram of an image processing circuit in one embodiment. As shown in fig. 9, for convenience of explanation, only the aspects of the image processing technique related to the embodiments of the present application are shown.
As shown in fig. 9, the image processing circuit includes an ISP processor 940 and control logic 950. The image data captured by the imaging device 910 is first processed by the ISP processor 940, which analyzes the image data to capture image statistics that may be used to determine and/or control one or more parameters of the imaging device 910. The imaging device 910 may include a camera having one or more lenses 912 and an image sensor 914. The image sensor 914 may include an array of color filters (e.g., Bayer filters), may acquire the light intensity and wavelength information captured by each of its imaging pixels, and may provide a set of raw image data that can be processed by the ISP processor 940. The sensor 920 (e.g., a gyroscope) may provide image processing parameters (e.g., anti-shake parameters) to the ISP processor 940 based on the interface type of the sensor 920. The sensor 920 interface may be an SMIA (Standard Mobile Imaging Architecture) interface, another serial or parallel camera interface, or a combination of the above.
In addition, image sensor 914 may also send raw image data to sensor 920, sensor 920 may provide raw image data to ISP processor 940 based on the type of interface of sensor 920, or sensor 920 may store raw image data in image memory 930.
The ISP processor 940 processes the raw image data pixel by pixel in a variety of formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 940 may perform one or more image processing operations on the raw image data and collect statistical information about the image data. The image processing operations may be performed with the same or different bit-depth precision.
ISP processor 940 may also receive image data from image memory 930. For example, the sensor 920 interface sends raw image data to the image memory 930, and the raw image data in the image memory 930 is then provided to the ISP processor 940 for processing. The image Memory 930 may be a part of a Memory device, a storage device, or a separate dedicated Memory within an electronic device, and may include a DMA (Direct Memory Access) feature.
Upon receiving raw image data from image sensor 914 interface or from sensor 920 interface or from image memory 930, ISP processor 940 may perform one or more image processing operations, such as temporal filtering. The processed image data may be sent to image memory 930 for additional processing before being displayed. ISP processor 940 receives processed data from image memory 930 and performs image data processing on the processed data in the raw domain and in the RGB and YCbCr color spaces. The image data processed by ISP processor 940 may be output to display 970 for viewing by a user and/or further processed by a Graphics Processing Unit (GPU). Further, the output of ISP processor 940 may also be sent to image memory 930 and display 970 may read image data from image memory 930. In one embodiment, image memory 930 may be configured to implement one or more frame buffers. In addition, the output of the ISP processor 940 may be transmitted to an encoder/decoder 960 for encoding/decoding the image data. The encoded image data may be saved and decompressed before being displayed on a display 970 device. The encoder/decoder 960 may be implemented by a CPU or GPU or coprocessor.
The statistical data determined by the ISP processor 940 may be sent to the control logic 950. For example, the statistical data may include image sensor 914 statistics such as auto-exposure, auto-white-balance, auto-focus, flicker detection, black level compensation, and lens 912 shading correction. The control logic 950 may include a processor and/or microcontroller that executes one or more routines (e.g., firmware) which determine control parameters of the imaging device 910 and control parameters of the ISP processor 940 based on the received statistical data. For example, the control parameters of the imaging device 910 may include sensor 920 control parameters (e.g., gain, integration time for exposure control, anti-shake parameters), camera flash control parameters, lens 912 control parameters (e.g., focal length for focusing or zooming), or a combination of these parameters. The ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (e.g., during RGB processing), as well as lens 912 shading correction parameters.
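The feedback loop from statistics to control parameters can be illustrated with a toy proportional auto-exposure step. The routine name, gain constant, and limits below are assumptions, since the actual firmware routines in the control logic 950 are not disclosed.

```python
def update_exposure(mean_luma: float, integration_time_us: float,
                    target_luma: float = 0.45,
                    min_us: float = 50.0, max_us: float = 33000.0) -> float:
    """One step of a proportional auto-exposure loop: nudge the sensor's
    integration time toward a target mean luminance reported by the ISP
    statistics. A real 3A algorithm would also adjust analog gain, apply
    hysteresis, and respect flicker-avoidance constraints."""
    error = target_luma - mean_luma
    new_time = integration_time_us * (1.0 + 0.8 * error)
    return max(min_us, min(max_us, new_time))

t = 10000.0  # integration time in microseconds
for mean_luma in (0.20, 0.30, 0.42, 0.45):  # statistics from successive frames
    t = update_exposure(mean_luma, t)
    print(round(t, 1))
```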
The electronic device may implement the video processing method described in the embodiments of the present application using the image processing technology described above.
Any reference to memory, storage, a database, or another medium used herein may include non-volatile and/or volatile memory. Suitable non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A video processing method, comprising:
extracting one frame of image every preset number of frames in a video, extracting features of the image by using a trained neural network, performing image scene recognition on the extracted features, outputting a confidence that a background image of the image belongs to each specified scene category, and selecting, as the classification label to which the background image of the image belongs, the image scene whose confidence is the highest and exceeds a confidence threshold; performing foreground target detection on the extracted features, outputting a confidence of the specified target class to which a foreground target belongs, and outputting each target label of the foreground target; and taking the classification label and the target labels as the scene labels of the image and acquiring the confidence corresponding to each scene label;
acquiring position information recorded when the video was shot, and adjusting the confidence corresponding to each scene label in the image according to the position information;
taking the confidence corresponding to a scene label of the image as the weight value of that scene label in the image, and establishing a label frequency histogram according to the scene labels of the image and the corresponding weight values; or determining the weight value of a scene label according to the proportion of its confidence among the confidences of all the scene labels of the image, and establishing a label frequency histogram according to the scene labels of the image and the corresponding weight values; or determining the weight value of a scene label in the image according to the size of the position area corresponding to the scene label, taking a weighted sum or weighted average of the weight values over the images in the video that contain the scene label as the frequency of the scene label, and establishing a label frequency histogram with the scene labels as the abscissa and the frequencies of the scene labels as the ordinate; wherein the weight value of a scene label of an image refers to the degree of importance of that scene label to the video label;
and determining the video label of the video according to the label frequency histogram.
2. The method of claim 1, further comprising:
wherein when the confidence of a scene label of the image is the highest among the confidences of all the scene labels of the image, the weight value corresponding to that scene label is the highest in the image.
3. The method according to claim 1 or 2, further comprising:
establishing the label frequency histogram of the video according to the scene labels whose confidences are greater than a threshold and the corresponding confidences.
4. The method of claim 1, further comprising:
obtaining the frequency corresponding to each scene label according to the label frequency histogram;
and taking a preset number of the scene labels with the highest frequencies as the video labels of the video.
5. A video processing apparatus, comprising:
a scene recognition module, configured to extract one frame of image every preset number of frames in a video, extract features of the image by using a trained neural network, perform image scene recognition on the extracted features, output a confidence that a background image of the image belongs to each specified scene category, and select, as the classification label to which the background image of the image belongs, the image scene whose confidence is the highest and exceeds a confidence threshold; perform foreground target detection on the extracted features, output a confidence of the specified target class to which a foreground target belongs, and output each target label of the foreground target; and take the classification label and the target labels as the scene labels of the image and acquire the confidence corresponding to each scene label;
a confidence adjusting module, configured to acquire position information recorded when the video was shot, and adjust the confidence corresponding to each scene label in the image according to the position information;
a histogram establishing module, configured to take the confidence corresponding to a scene label of the image as the weight value of that scene label in the image and establish a label frequency histogram according to the scene labels of the image and the corresponding weight values; or determine the weight value of a scene label according to the proportion of its confidence among the confidences of all the scene labels of the image and establish a label frequency histogram according to the scene labels of the image and the corresponding weight values; or determine the weight value of a scene label in the image according to the size of the position area corresponding to the scene label, take a weighted sum or weighted average of the weight values over the images in the video that contain the scene label as the frequency of the scene label, and establish a label frequency histogram with the scene labels as the abscissa and the frequencies of the scene labels as the ordinate; wherein the weight value of a scene label of an image refers to the degree of importance of that scene label to the video label;
and a video label determining module, configured to determine the video label of the video according to the label frequency histogram.
6. An electronic device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of the video processing method according to any of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
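For orientation, the following Python sketch walks through the core of method claims 1 and 4: sample one frame every N frames, attach scene labels with confidences, accumulate a confidence-weighted label frequency histogram, and keep the scene labels with the highest frequencies as the video labels. The stub recognizer and all names are illustrative assumptions standing in for the trained neural network; this is not the patented implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Frame = object  # stand-in for a decoded video frame
# A recognizer maps a frame to (scene label, confidence) pairs; in the patent
# this is a trained neural network that emits a background classification
# label plus foreground target labels, each with a confidence.
Recognizer = Callable[[Frame], Sequence[Tuple[str, float]]]

def video_labels(frames: Sequence[Frame], recognize: Recognizer,
                 every_n: int = 30, top_k: int = 2,
                 conf_threshold: float = 0.5) -> List[str]:
    """Confidence-weighted label frequency histogram over sampled frames."""
    histogram: Dict[str, float] = defaultdict(float)
    for frame in frames[::every_n]:        # one frame every N frames
        for label, conf in recognize(frame):
            if conf > conf_threshold:      # keep labels above the threshold
                histogram[label] += conf   # confidence as the weight value
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]

# Usage with a stub recognizer; a real one would run the neural network.
def stub_recognizer(frame: Frame) -> Sequence[Tuple[str, float]]:
    return [("beach", 0.9), ("person", 0.7), ("dog", 0.3)]

fake_frames = [object() for _ in range(300)]
print(video_labels(fake_frames, stub_recognizer))  # ['beach', 'person']
```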
CN201810585928.9A 2018-06-08 2018-06-08 Video processing method and device, electronic equipment and computer readable storage medium Active CN108777815B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810585928.9A CN108777815B (en) 2018-06-08 2018-06-08 Video processing method and device, electronic equipment and computer readable storage medium
PCT/CN2019/087557 WO2019233263A1 (en) 2018-06-08 2019-05-20 Method for video processing, electronic device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810585928.9A CN108777815B (en) 2018-06-08 2018-06-08 Video processing method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN108777815A CN108777815A (en) 2018-11-09
CN108777815B true CN108777815B (en) 2021-04-23

Family

ID=64024878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810585928.9A Active CN108777815B (en) 2018-06-08 2018-06-08 Video processing method and device, electronic equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108777815B (en)
WO (1) WO2019233263A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777815B (en) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN109815365A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109871896B (en) * 2019-02-26 2022-03-25 北京达佳互联信息技术有限公司 Data classification method and device, electronic equipment and storage medium
CN109960745B (en) * 2019-03-20 2021-03-23 网易(杭州)网络有限公司 Video classification processing method and device, storage medium and electronic equipment
CN110348367B (en) * 2019-07-08 2021-06-08 北京字节跳动网络技术有限公司 Video classification method, video processing device, mobile terminal and medium
CN110378946B (en) * 2019-07-11 2021-10-01 Oppo广东移动通信有限公司 Depth map processing method and device and electronic equipment
CN110647933B (en) * 2019-09-20 2023-06-20 北京达佳互联信息技术有限公司 Video classification method and device
CN110933462B (en) * 2019-10-14 2022-03-25 咪咕文化科技有限公司 Video processing method, system, electronic device and storage medium
CN111738042A (en) * 2019-10-25 2020-10-02 北京沃东天骏信息技术有限公司 Identification method, device and storage medium
CN110826471B (en) * 2019-11-01 2023-07-14 腾讯科技(深圳)有限公司 Video tag labeling method, device, equipment and computer readable storage medium
CN110889012A (en) * 2019-11-26 2020-03-17 成都品果科技有限公司 Method for generating empty mirror label system based on frame extraction picture
CN111291800A (en) * 2020-01-21 2020-06-16 青梧桐有限责任公司 House decoration type analysis method and system, electronic device and readable storage medium
CN111368138A (en) * 2020-02-10 2020-07-03 北京达佳互联信息技术有限公司 Method and device for sorting video category labels, electronic equipment and storage medium
CN113536823A (en) * 2020-04-10 2021-10-22 天津职业技术师范大学(中国职业培训指导教师进修中心) Video scene label extraction system and method based on deep learning and application thereof
CN111653103A (en) * 2020-05-07 2020-09-11 浙江大华技术股份有限公司 Target object identification method and device
CN111625716B (en) * 2020-05-12 2023-10-31 聚好看科技股份有限公司 Media asset recommendation method, server and display device
CN114118114A (en) * 2020-08-26 2022-03-01 顺丰科技有限公司 Image detection method, device and storage medium thereof
CN112948635B (en) * 2021-02-26 2022-11-08 北京百度网讯科技有限公司 Video analysis method and device, electronic equipment and readable storage medium
CN113014992A (en) * 2021-03-09 2021-06-22 四川长虹电器股份有限公司 Image quality switching method and device for smart television
CN113469033A (en) * 2021-06-30 2021-10-01 北京集创北方科技股份有限公司 Image recognition method and device, electronic equipment and storage medium
CN113569687B (en) * 2021-07-20 2023-10-24 上海明略人工智能(集团)有限公司 Scene classification method, system, equipment and medium based on double-flow network
CN113613065B (en) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 Video editing method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103945088B (en) * 2013-01-21 2017-06-27 华为终端有限公司 scene recognition method and device
CN106469162A (en) * 2015-08-18 2017-03-01 中兴通讯股份有限公司 A kind of picture sort method and corresponding picture storage display device
CN105426883B (en) * 2015-12-25 2019-03-01 中国科学院深圳先进技术研究院 The method and device that visual classification quickly identifies
CN107180074A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of video classification methods and device
CN107403618B (en) * 2017-07-21 2020-05-05 山东师范大学 Audio event classification method based on stacking base sparse representation and computer equipment
CN107895055A (en) * 2017-12-21 2018-04-10 儒安科技有限公司 A kind of photo management method and system
CN108777815B (en) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2019233263A1 (en) 2019-12-12
CN108777815A (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN108777815B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN108764208B (en) Image processing method and device, storage medium and electronic equipment
CN108805103B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108764370B (en) Image processing method, image processing device, computer-readable storage medium and computer equipment
US10990825B2 (en) Image processing method, electronic device and computer readable storage medium
CN108810418B (en) Image processing method, image processing device, mobile terminal and computer readable storage medium
CN108875619B (en) Video processing method and device, electronic equipment and computer readable storage medium
WO2019233393A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN108810413B (en) Image processing method and device, electronic equipment and computer readable storage medium
US10896323B2 (en) Method and device for image processing, computer readable storage medium, and electronic device
CN108900769B (en) Image processing method, image processing device, mobile terminal and computer readable storage medium
CN110276767B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108961302B (en) Image processing method, image processing device, mobile terminal and computer readable storage medium
CN108984657B (en) Image recommendation method and device, terminal and readable storage medium
CN108024107B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN109712177B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN108765033B (en) Advertisement information pushing method and device, storage medium and electronic equipment
CN110473185B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108897786B (en) Recommendation method and device of application program, storage medium and mobile terminal
CN108804658B (en) Image processing method and device, storage medium and electronic equipment
CN108959462B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108848306B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108548539B (en) Navigation method and device based on image recognition, terminal and readable storage medium
CN110365897B (en) Image correction method and device, electronic equipment and computer readable storage medium
CN108898163B (en) Information processing method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant