WO2019233263A1 - Video processing method, electronic device, and computer-readable storage medium - Google Patents

Video processing method, electronic device, and computer-readable storage medium

Info

Publication number
WO2019233263A1
WO2019233263A1 (PCT/CN2019/087557)
Authority
WO
WIPO (PCT)
Prior art keywords
image
label
scene
video
confidence level
Prior art date
Application number
PCT/CN2019/087557
Other languages
English (en)
French (fr)
Inventor
陈岩
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2019233263A1 publication Critical patent/WO2019233263A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • the present application relates to the field of computer technology, and in particular, to a video processing method, an electronic device, and a computer-readable storage medium.
  • A video processing method, an electronic device, and a computer-readable storage medium are provided.
  • a video processing method includes:
  • Extract one frame of image from the video at every preset frame interval, perform scene recognition on the extracted image, and obtain a scene label of the image and a corresponding confidence level;
  • establish a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
  • determine a video label of the video according to the label frequency histogram.
  • An electronic device includes a memory and a processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to perform the following operations:
  • Extract one frame of image from the video at every preset frame interval, perform scene recognition on the extracted image, and obtain a scene label of the image and a corresponding confidence level;
  • establish a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
  • determine a video label of the video according to the label frequency histogram.
  • A computer-readable storage medium has a computer program stored thereon. When the computer program is executed by a processor, the following operations are implemented:
  • Extract one frame of image from the video at every preset frame interval, perform scene recognition on the extracted image, and obtain a scene label of the image and a corresponding confidence level;
  • establish a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
  • determine a video label of the video according to the label frequency histogram.
  • The video processing method, electronic device, and computer-readable storage medium provided in the embodiments of the present application can extract one frame of image from the video at every preset frame interval, perform scene recognition on the extracted image, obtain the scene labels of the image and the corresponding confidence levels, and establish a label frequency histogram according to the scene labels of the images in the video and the corresponding confidence levels to determine the video label of the video. Since the label frequency histogram can be established according to the scene labels of the images in the video to determine the video label, the accuracy of the video label can be improved.
  • FIG. 1 is a schematic diagram of an internal structure of an electronic device in one or more embodiments.
  • FIG. 2 is a flowchart of a video processing method in one or more embodiments.
  • FIG. 3 is a schematic structural diagram of a neural network in one or more embodiments.
  • FIG. 4 is a flowchart of establishing a label frequency histogram in one or more embodiments.
  • FIG. 5 is a flowchart of establishing a label frequency histogram in another embodiment.
  • FIG. 6 is a flowchart of adjusting the confidence level in one or more embodiments.
  • FIG. 7 is a flowchart of a video processing method in one or more embodiments.
  • FIG. 8 is a structural block diagram of a video processing apparatus in one or more embodiments.
  • FIG. 9 is a schematic diagram of an image processing circuit in one or more embodiments.
  • FIG. 1 is a schematic diagram of an internal structure of an electronic device in an embodiment.
  • the electronic device includes a processor, a memory, and a network interface connected through a system bus.
  • the processor is used to provide computing and control capabilities to support the operation of the entire electronic device.
  • the memory is used to store data, programs, and the like. At least one computer program is stored on the memory, and the computer program can be executed by a processor to implement the wireless network communication method applicable to the electronic device provided in the embodiments of the present application.
  • the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the computer program may be executed by a processor to implement a video processing method provided by each of the following embodiments.
  • The internal memory provides a cached running environment for the operating system and computer programs in the non-volatile storage medium.
  • the network interface may be an Ethernet card or a wireless network card, and is used to communicate with external electronic devices.
  • The electronic device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
  • FIG. 2 is a flowchart of a video processing method according to an embodiment.
  • The video processing method in this embodiment is described using the electronic device of FIG. 1 as an example.
  • the video processing method includes operations 202 to 206.
  • Operation 202: One frame of image is extracted from the video at every preset frame interval, and scene recognition is performed on the extracted image to obtain the scene label of the image and the corresponding confidence level.
  • Video is any video on an electronic device.
  • the video may be a video collected by the electronic device through a camera, a video stored locally on the electronic device, or a video downloaded from the network by the electronic device.
  • Video is a continuous picture composed of multiple frames of still images.
  • The preset frame interval can be determined according to actual application requirements. Specifically, it may be determined according to the video frame rate, according to the video duration, or according to a combination of the frame rate and the duration. For example, the preset frame interval may be 0 frames, in which case the electronic device extracts every frame of the video and performs scene recognition on the extracted images.
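  • As an illustrative sketch of this extraction step (not code from the patent), the following assumes OpenCV is available and uses a hypothetical `interval` parameter for the preset frame interval; an interval of 0 keeps every frame:

```python
import cv2  # OpenCV, assumed available for video decoding

def extract_frames(video_path, interval=0):
    """Extract one frame at every `interval` frames; interval=0 keeps every frame."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if index % (interval + 1) == 0:  # keep one frame per (interval + 1) frames
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```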
  • Electronic devices can train scene recognition models based on deep learning algorithms such as VGG (Visual Geometry Group) networks, CNNs (Convolutional Neural Networks), SSD (Single Shot MultiBox Detector), and decision trees, and perform scene recognition on images based on the scene recognition models.
  • Specifically, the electronic device can train a neural network that outputs multiple scene labels. During training, a training image containing multiple training labels, or multiple training images containing different training labels, may be input to the neural network. The neural network performs feature extraction on the training images, detects the extracted image features to obtain the prediction confidence corresponding to each feature in the image, and obtains a loss function according to the prediction confidence and the true confidence of each feature. The parameters of the neural network are adjusted according to the loss function, so that the trained neural network can subsequently identify the scene labels corresponding to multiple features of an image at the same time, thereby obtaining a neural network that outputs multiple scene labels.
  • Confidence is the degree to which the measured value of the measured parameter can be trusted.
  • the true confidence level indicates the confidence level of the specified scene category to which the feature labeled in advance in the training image belongs.
  • the scene of the image can be landscape, beach, blue sky, green grass, snow, fireworks, spotlight, text, portrait, baby, cat, dog, food, etc.
  • The electronic device uses a neural network capable of outputting multiple labels to detect the image. Specifically, the input layer of the neural network receives the input image, a base network (such as a VGG network) extracts the features of the image, and the extracted image features are input to the detection network layer for scene detection. The detection network layer can use an SSD network, a MobileNet network, or the like to detect the features. The output layer uses a softmax classifier to output the confidence level and corresponding position of the category to which each feature belongs, and the target category with the highest confidence that exceeds the confidence threshold is selected as the scene label of that feature in the image, so that the scene label and corresponding confidence of each feature in the image are output.
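  • A minimal sketch of this output stage is given below; it is an illustration rather than the patent's actual network, with `CATEGORIES`, the per-feature logits, and the 0.5 threshold all hypothetical:

```python
import numpy as np

CATEGORIES = ["landscape", "beach", "blue sky", "green grass", "portrait"]  # hypothetical

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def scene_labels(feature_logits, positions, conf_threshold=0.5):
    """For each detected feature, pick the category with the highest softmax
    confidence, keeping it only if the confidence exceeds the threshold."""
    results = []
    for logits, pos in zip(feature_logits, positions):
        probs = softmax(logits)
        best = int(np.argmax(probs))
        if probs[best] > conf_threshold:
            results.append((CATEGORIES[best], float(probs[best]), pos))
    return results
```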
  • Operation 204: A label frequency histogram is established according to the scene label of the image and the corresponding confidence level.
  • a label frequency histogram is a histogram created based on the frequency of each scene label in the video.
  • The frequency of a scene label is determined according to the number of images containing the scene label and the confidence of the scene label in those images. Specifically, the frequency of a scene label may be the ratio of the number of images containing the scene label to the number of all extracted images in the video; alternatively, a weight value of the scene label in each image may be determined according to the confidence of the scene label in the image or the size of the location area corresponding to the scene label, and the weighted sum or weighted average over the images containing the scene label in the video may be used as the frequency of the scene label.
  • In one embodiment, the electronic device establishes the label frequency histogram with the scene labels as the abscissa and the scene label frequencies as the ordinate. The electronic device can then obtain the frequency of each scene label in the video from the label frequency histogram.
  • Operation 206: Determine a video label of the video according to the label frequency histogram.
  • Video labeling refers to marking a video according to the scenes that appear in it; from the video labels, people can get a general idea of the main content of the video.
  • The number of video labels may be one or more, such as two, three, or four, and is not limited thereto.
  • Specifically, the electronic device can obtain the frequency corresponding to each scene label in the video from the label frequency histogram, sort the scene labels by frequency according to a preset rule, and use the scene labels with higher frequencies as the video labels of the video. The electronic device may also preset a label threshold and use the scene labels whose frequencies are greater than the label threshold as the video labels of the video. The electronic device can also cyclically read the scene labels and the corresponding frequencies in the frequency histogram to obtain the scene label with the highest frequency as the video label of the video. The manner in which the electronic device determines the video label of the video according to the label frequency histogram may also be a combination of the foregoing manners or another manner, which is not limited herein.
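  • As a sketch of the threshold-based variant (the 0.3 value is a hypothetical label threshold, not one the text prescribes):

```python
def video_labels_by_threshold(histogram, label_threshold=0.3):
    """histogram maps scene label -> frequency; keep labels above the threshold."""
    return [label for label, freq in histogram.items() if freq > label_threshold]

# e.g. {"baby": 0.6, "blue sky": 0.4, "grass": 0.27} -> ["baby", "blue sky"]
```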
  • The video processing method provided in the embodiments of the present application can extract one frame of image from a video at every preset frame interval, perform scene recognition on the extracted image, and obtain the scene labels of the image and the corresponding confidence levels. Establishing a label frequency histogram according to the scene labels of the images in the video and the corresponding confidence levels to determine the video label of the video can improve the accuracy of the video label.
  • In one embodiment, in the provided video processing method, the process of performing scene recognition on the extracted image and obtaining the scene labels of the image and the corresponding confidence levels may further include: performing scene recognition on the image to obtain a classification label of the image, performing target detection on the image to obtain a target label of the image, and using the classification label and the target label as the scene labels of the image.
  • Specifically, the electronic device may use image classification technology to perform scene recognition on the image. The electronic device may pre-store image feature information corresponding to multiple classification labels, match the image feature information of the image requiring scene recognition against the pre-stored image feature information, and use the classification label corresponding to the successfully matched image feature information as the classification label of the image. Similarly, when performing target detection on the image, the electronic device can match the image feature information of the image against the feature information corresponding to the pre-stored target labels, and use the target label corresponding to the successfully matched feature information as the target label of the image.
  • Pre-stored classification tags in electronic devices can include: landscape, beach, blue sky, green grass, snow scene, night scene, dark, backlight, sunset, fireworks, spotlight, indoor, macro, text, portrait, baby, cat, dog, food, etc.
  • Target tags can include: portrait, baby, cat, dog, food, text, blue sky, green grass, beach, fireworks, etc.
  • The electronic device can use both the classification label and the target label as scene labels of the image, and output the scene labels of the image and the corresponding confidence levels in descending order of confidence.
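  • A sketch of this merge-and-output step, assuming each recognizer returns (label, confidence) pairs and that the higher confidence is kept when a label appears in both lists (the tie-handling is an assumption):

```python
def merge_scene_labels(classification_labels, target_labels):
    """Union of classification and target labels, sorted by confidence descending."""
    merged = {}
    for label, conf in classification_labels + target_labels:
        merged[label] = max(conf, merged.get(label, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# merge_scene_labels([("beach", 0.9)], [("portrait", 0.7), ("beach", 0.6)])
# -> [("beach", 0.9), ("portrait", 0.7)]
```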
  • In one embodiment, in the provided video processing method, the process of performing scene recognition on the extracted image and obtaining the scene labels of the image and the corresponding confidence levels may further include: performing scene classification and target detection on the image to obtain the classification label and the target label of the image, and using the classification label and the target label as the scene labels of the image.
  • the electronic device can train a neural network that can simultaneously implement scene classification and target detection.
  • Specifically, during training of the neural network, a training image including at least one background training target and one foreground training target may be input into the neural network. The neural network performs feature extraction on the background training target and the foreground training target, detects the background training target to obtain a first prediction confidence, obtains a first loss function according to the first prediction confidence and a first true confidence, detects the foreground training target to obtain a second prediction confidence, obtains a second loss function according to the second prediction confidence and a second true confidence, obtains a target loss function according to the first loss function and the second loss function, and adjusts the parameters of the neural network so that the trained neural network can subsequently identify the scene classification and the target classification at the same time, thereby obtaining a neural network that can simultaneously detect the foreground region and the background region of an image.
  • the first true confidence level indicates the confidence level of the specified image category to which the background image previously labeled in the training image belongs.
  • the second true confidence level indicates the confidence level of the specified target category to which the foreground target previously labeled in the training image belongs.
  • In one embodiment, the neural network includes at least one input layer, a base network layer, a classification network layer, a target detection network layer, and two output layers; the two output layers include a first output layer cascaded with the classification network layer and a second output layer cascaded with the target detection network layer. In the training phase, the input layer is used to receive the training image; the first output layer is used to output the first prediction confidence of the specified scene category to which the background image detected by the classification network layer belongs; and the second output layer is used to output, for each preselected default bounding box detected by the target detection network layer, the offset parameters relative to the real bounding box corresponding to the specified target and the second prediction confidence of the specified target category.
  • FIG. 3 is a schematic structural diagram of a neural network in an embodiment.
  • As shown in FIG. 3, the input layer of the neural network receives training images with image category labels, performs feature extraction through a base network (such as a VGG network), and outputs the extracted image features to a feature layer. The feature layer performs category detection on the background to obtain the first loss function, performs target detection on the foreground target according to the image features to obtain the second loss function, and performs position detection on the foreground target to obtain the position loss function. The first loss function, the second loss function, and the position loss function are weighted and summed to obtain the target loss function.
  • the neural network includes a data input layer, a basic network layer, a scene classification network layer, an object detection network layer, and two output layers.
  • the data input layer is used to receive raw image data.
  • the basic network layer performs preprocessing and feature extraction on the image input from the input layer.
  • the pre-processing may include de-averaging, normalization, dimensionality reduction, and whitening processes.
  • De-averaging refers to centering all dimensions of the input data to 0 in order to pull the center of the sample back to the origin of the coordinate system.
  • Normalization is normalizing the amplitude to the same range.
  • Whitening refers to normalizing the amplitude on each characteristic axis of the data.
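  • A minimal numpy sketch of the preprocessing steps just described, under the assumption that each is applied per feature dimension over a batch (the epsilon guard is an added detail):

```python
import numpy as np

def preprocess(batch, eps=1e-8):
    """batch: array of shape (num_samples, num_features)."""
    centered = batch - batch.mean(axis=0)                          # de-averaging: center every dimension at 0
    normalized = centered / (np.abs(centered).max(axis=0) + eps)   # normalization: same amplitude range
    whitened = normalized / (normalized.std(axis=0) + eps)         # whitening: normalize amplitude per feature axis
    return whitened
```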
  • Feature extraction is then performed on the image data; for example, the first five convolutional layers of VGG16 are used to extract features from the original image, and the extracted features are input to the classification network layer and the target detection network layer.
  • In the classification network layer, depthwise convolution and pointwise convolution, as in a MobileNet network, can be used to detect the features, which are then input to the output layer to obtain the first prediction confidence of the specified image category to which the image scene classification belongs; the first loss function is then obtained from the difference between the first prediction confidence and the first true confidence.
  • The target detection network layer can be an SSD network, with convolutional feature layers cascaded after the first five convolutional layers of VGG16. In the convolutional feature layers, a set of convolution filters is used to predict the offset parameters of the preselected default bounding boxes relative to the real bounding box corresponding to the specified target category, and the second prediction confidence corresponding to the specified target category.
  • the area of interest is the area of the preselected default bounding box.
  • a position loss function is constructed according to the offset parameter, and a second loss function is obtained according to a difference between the second predicted confidence level and the second true confidence level.
  • The target loss function is obtained by weighted summation of the first loss function, the second loss function, and the position loss function; the back-propagation algorithm is then used to adjust the parameters of the neural network according to the target loss function, so as to train the neural network.
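  • The weighted combination can be written in one line; the weights w1, w2, w3 are hypothetical hyperparameters, since the text does not specify the weighting:

```python
def target_loss(first_loss, second_loss, position_loss, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the scene-classification loss, the target-detection
    confidence loss, and the bounding-box position loss."""
    return w1 * first_loss + w2 * second_loss + w3 * position_loss
```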
  • When the trained neural network is used to recognize an image, the input layer of the neural network receives the input image, and the features of the image are extracted and input to the classification network layer for image scene recognition.
  • At the first output layer, the softmax classifier outputs the confidence of each specified scene classification label to which the background image belongs, and the image scene with the highest confidence that exceeds the confidence threshold is selected as the classification label of the background image of the image.
  • The extracted image features are also input to the target detection network layer for foreground target detection. At the second output layer, the softmax classifier outputs the confidence and corresponding position of the specified target category to which the foreground target belongs, each target label of the foreground target is output along with the position corresponding to the target label, and the obtained classification labels and target labels are used as the scene labels of the image.
  • the process of establishing a label frequency histogram according to the scene label of the image and the corresponding confidence in the provided video processing method, as shown in FIG. 4, includes:
  • Operation 402: Use the confidence level corresponding to the scene label of the image as the weight value of the scene label in the image.
  • The weight value of a scene label in an image refers to the importance of the scene label in the image to the video label. Given the scene labels and weight values of the other images in the video, the higher the weight value of a scene label in an image, the higher the frequency of that scene label in the video; the lower the weight value of the scene label in the image, the lower the frequency of that scene label in the video.
  • Operation 404: A label frequency histogram is established according to the scene label of the image and the corresponding weight value.
  • the electronic device may obtain a weighted sum of the scene label according to the scene label and the corresponding weight value of the image in the video, and establish a label frequency histogram according to the scene label and the corresponding weighted sum.
  • the electronic device may also obtain a weighted average of the scene label according to the scene label and the corresponding weight value of the image in the video, and establish a label frequency histogram according to the scene label and the corresponding weighted average.
  • For example, suppose the scene labels and corresponding confidences output for the images extracted from a video are: image A: baby 0.9, grass 0.8, blue sky 0.5; image B: food 0.8, baby 0.6; image C: blue sky 0.7, baby 0.3. In the label frequency histogram established from the scene labels and the corresponding weighted averages, the frequency of baby is 0.6, the frequency of blue sky is 0.4, and the frequencies of grass and food are both 0.27. The electronic device may use baby as the video label of this video, or use baby and blue sky as the video labels, and so on.
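  • The worked example can be reproduced in a few lines; the frequencies are weighted averages of the confidences over all three extracted images:

```python
from collections import defaultdict

images = [
    {"baby": 0.9, "grass": 0.8, "blue sky": 0.5},  # image A
    {"food": 0.8, "baby": 0.6},                    # image B
    {"blue sky": 0.7, "baby": 0.3},                # image C
]

sums = defaultdict(float)
for labels in images:
    for label, confidence in labels.items():
        sums[label] += confidence  # the confidence is used as the weight value

histogram = {label: total / len(images) for label, total in sums.items()}
# histogram: baby 0.6, blue sky 0.4, grass and food both approximately 0.27
```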
  • By using the confidence corresponding to the scene label of the image as the weight value of the scene label in the image, and establishing a label frequency histogram according to the scene labels in the video and the corresponding weight values to determine the video label of the video, the accuracy of the video label can be improved.
  • As shown in FIG. 5, in one embodiment, the process of establishing a label frequency histogram according to the scene labels of the image and the corresponding confidence levels in the provided video processing method may further include:
  • Operation 502: Determine the weight value of a scene label according to the rank of its confidence among the confidences of all the scene labels of the image.
  • Specifically, the electronic device may sort the scene labels in the image according to their confidences to obtain the serial number label corresponding to each scene label; that is, the scene label with the highest confidence is used as the first label, the next as the second label, and so on. For example, in one frame of a video, the confidence of beach is 0.6, the confidence of blue sky is 0.9, and the confidence of portrait is 0.8; in this frame, blue sky is the first label, portrait is the second label, and beach is the third label.
  • The electronic device may pre-store weight values corresponding to different serial number labels and determine the weight value of a scene label according to its serial number label. In one embodiment, when the confidence of a scene label is the highest among all the scene labels of the image, the weight value corresponding to that scene label is the highest in the image. Specifically, among the weight values pre-stored for the serial number labels, the weight pre-stored for the first label is the highest, that for the second label is next, and so on.
  • For example, the electronic device may pre-store a weight of 0.8 for the first label, 0.5 for the second label, and 0.2 for the third label. In the example above, the weight of the first label, blue sky, is 0.8; the weight of the second label, portrait, is 0.5; and the weight of the third label, beach, is 0.2.
  • Operation 504: A label frequency histogram is established according to the scene label of the image and the corresponding weight value.
  • the electronic device may obtain a weighted sum of the scene label according to the scene label and the corresponding weight value of the image in the video, and establish a label frequency histogram according to the scene label and the corresponding weighted sum.
  • the electronic device may also obtain a weighted average of the scene label according to the scene label and the corresponding weight value of the image in the video, and establish a label frequency histogram according to the scene label and the corresponding weighted average.
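  • A sketch of the rank-based weighting, using the pre-stored weights 0.8, 0.5, and 0.2 from the example above; assigning weight 0 beyond the third label is an assumption:

```python
RANK_WEIGHTS = [0.8, 0.5, 0.2]  # pre-stored weights for the first, second, third labels

def rank_weights(labels):
    """labels maps scene label -> confidence for one image; return label -> weight."""
    ranked = sorted(labels.items(), key=lambda item: item[1], reverse=True)
    return {label: (RANK_WEIGHTS[i] if i < len(RANK_WEIGHTS) else 0.0)
            for i, (label, _) in enumerate(ranked)}

# rank_weights({"beach": 0.6, "blue sky": 0.9, "portrait": 0.8})
# -> {"blue sky": 0.8, "portrait": 0.5, "beach": 0.2}
```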
  • a label frequency histogram is established according to the scene label in the video and the corresponding weight value, thereby determining the video label of the video, which can improve the accuracy of the video label.
  • In one embodiment, the process of establishing a label frequency histogram according to the scene labels of the image and the corresponding confidence levels in the provided video processing method further includes: establishing the label frequency histogram of the video according to the scene labels whose confidence is greater than a threshold and the corresponding confidences.
  • By establishing the label frequency histogram based on the scene labels whose confidence is greater than the threshold, the electronic device can filter out the scene labels in the image whose confidence is less than the threshold.
  • the threshold can be determined according to actual needs. Specifically, the threshold may be 0.1, 0.15, 0.2, 0.3, etc., and is not limited thereto.
  • The electronic device obtains the scene labels whose confidence is greater than the threshold and the corresponding confidences, determines the corresponding weight values according to the confidences of the scene labels, and establishes a label frequency histogram according to the scene labels of the images and the corresponding weight values to determine the video label of the video, which can reduce the impact of low-confidence scene labels in the images on the video label and improve the accuracy of the scene labels.
  • For example, if the scene labels and confidences in one frame of a video are dog 0.8, cat 0.2, grass 0.7, and food 0.1, and the threshold is 0.3, the two scene labels cat 0.2 and food 0.1 are discarded, and a label frequency histogram is established according to the two scene labels dog 0.8 and grass 0.7.
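  • The filtering step from this example, sketched directly:

```python
def filter_labels(labels, threshold=0.3):
    """Discard scene labels whose confidence does not exceed the threshold."""
    return {label: conf for label, conf in labels.items() if conf > threshold}

filter_labels({"dog": 0.8, "cat": 0.2, "grass": 0.7, "food": 0.1})
# -> {"dog": 0.8, "grass": 0.7}
```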
  • the provided video processing method may further include a process of adjusting the confidence level of scene tags, and the specific operations include:
  • Operation 602: Obtain position information at the time of video shooting.
  • When shooting a video, the electronic device can obtain the address information at the time of shooting through GPS (Global Positioning System), and obtain the position information at the time of shooting according to the address information. For example, when GPS detects that the address of the video shooting is latitude 18.294898 north, longitude 109.408984 east, the electronic device can determine from the address information that the corresponding position is Sanya Bay Beach in Hainan.
  • Operation 604: Adjust the confidence corresponding to the scene labels in the image according to the position information.
  • the electronic device can pre-store scene labels corresponding to different position information and weights corresponding to the scene labels, and adjust the confidence corresponding to the scene labels in the image according to the weight of the scene labels.
  • The weight corresponding to a scene label may be a result obtained from statistical analysis of a large number of image materials, according to which the corresponding scene labels and the weights corresponding to the scene labels are matched to different position information.
  • For example, statistical analysis of a large number of image materials may show that when the position information is "beach", the scene "sand" corresponding to the address "beach" has a weight of 9, "blue sky" has a weight of 8, "scenery" has a weight of 7, "snow" has a weight of -8, and "green grass" has a weight of -7, where the value range of the weights is [-10, 10]. The larger the weight value, the greater the probability of the scene appearing in the image; the smaller the weight value, the smaller the probability of the scene appearing in the image. For each unit the weight value increases from 0, the confidence of the corresponding scene increases by 1%; similarly, for each unit the weight value decreases from 0, the confidence of the corresponding scene decreases by 1%.
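  • A sketch of this adjustment rule under the example's beach weights; interpreting the 1%-per-unit change as a relative scaling and clamping the result to [0, 1] are both assumptions not stated in the text:

```python
LOCATION_WEIGHTS = {  # weights in [-10, 10], taken from the "beach" example
    "beach": {"sand": 9, "blue sky": 8, "scenery": 7, "snow": -8, "green grass": -7},
}

def adjust_confidences(labels, location):
    """Shift each scene label's confidence by 1% per unit of location weight."""
    weights = LOCATION_WEIGHTS.get(location, {})
    adjusted = {}
    for label, conf in labels.items():
        shifted = conf * (1 + weights.get(label, 0) / 100)
        adjusted[label] = min(max(shifted, 0.0), 1.0)  # clamp: an added assumption
    return adjusted
```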
  • By obtaining the position information from the address information of the video shooting, obtaining the weights corresponding to the scene labels under that position information, and adjusting the confidences of the scene labels of the image accordingly, the confidences of the scene labels of the image can be made more accurate, thereby improving the accuracy of the video labels.
  • As shown in FIG. 7, in one embodiment, the provided video processing method includes a process of determining the video label of the video according to the label frequency histogram. The specific operations include:
  • Operation 702: Obtain the frequency corresponding to each scene label according to the label frequency histogram.
  • Specifically, the electronic device can obtain the frequency corresponding to each scene label in the video according to the label frequency histogram, and sort the scene labels by frequency according to a preset rule.
  • Operation 704: Use a preset number of the scene labels with the highest frequencies as the video labels of the video.
  • The preset number can be determined according to the actual application scenario. Specifically, when the electronic device sorts and displays videos according to their video labels, the preset number may be one; when the electronic device uploads the video to a video website, it determines the preset number according to the video website's limit on the number of video labels. The preset number may be one or more, such as two, three, or four, and is not limited thereto. For example, when the electronic device uploads a video to a video website that limits videos to three labels, the preset number may be three.
  • The electronic device can sort the scene labels in descending order of frequency, and then use the preset number of scene labels with the highest frequencies as the video labels of the video.
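  • A sketch of this final selection, with the preset number as a parameter (3 here, matching the three-label video-site example):

```python
def top_video_labels(histogram, preset_count=3):
    """Sort scene labels by frequency, largest first, and take the preset number."""
    ranked = sorted(histogram.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:preset_count]]

# top_video_labels({"baby": 0.6, "blue sky": 0.4, "grass": 0.27, "food": 0.27}, 2)
# -> ["baby", "blue sky"]
```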
  • a video processing method is provided, and specific operations for implementing the method are as follows:
  • First, the electronic device extracts one frame of image from the video at every preset frame interval, performs scene recognition on the extracted image, and obtains the scene labels of the image and the corresponding confidences.
  • Video is a continuous picture composed of multiple frames of still images.
  • the preset frame can be determined according to the actual application requirements. Specifically, the preset frame may be determined according to a video frame rate, or may be determined according to a video duration, and may also be determined according to a combination of a frame rate and a duration.
  • Electronic devices can train scene recognition models based on deep learning algorithms such as VGG, CNN, SSD, and decision trees, and perform scene recognition on images based on the scene recognition models. Specifically, the electronic device detects the image by using a neural network capable of outputting multiple labels, thereby outputting scene labels and corresponding confidence levels of each feature in the image.
  • the electronic device performs scene recognition on the image to obtain a scene label of the image, performs target detection on the image, obtains a target label of the image, and uses the scene label and the target label as a classification label of the image.
  • Optionally, the electronic device may pre-store image feature information corresponding to multiple scene labels, match the image feature information of the image to be recognized against the pre-stored image feature information, and use the scene label corresponding to the successfully matched image feature information as the scene label of the image. When performing target detection on the image, the electronic device can match the image feature information of the image against the feature information corresponding to the pre-stored target labels, and use the target label corresponding to the successfully matched feature information as the target label of the image.
  • the electronic device can use both the scene label and the target label as the classification label of the image, and obtain the confidence level corresponding to the classification label.
  • the electronic device performs scene classification and target detection on the image, obtains a scene label and a target label of the image, and uses the scene label and the target label as a classification label of the image.
  • Optionally, the electronic device can train a neural network that can perform both scene classification and target detection. The base network layer of the neural network is used to extract features from the image, the extracted image features are input to the classification network layer and the target detection network layer, scene detection is performed through the classification network layer to output the confidence of the specified image category to which the background region of the image belongs, and target detection is performed through the target detection network layer to obtain the confidence of the specified target category to which the foreground region belongs and the target position.
  • the electronic device obtains position information during video shooting, and adjusts a confidence level corresponding to a scene label in the image according to the position information.
  • the address information at the time of video shooting can be obtained through GPS, and the position information at the time of video shooting can be obtained according to the address information.
  • the electronic device can pre-store scene labels corresponding to different position information and weights corresponding to the scene labels, and adjust the confidence corresponding to the scene labels in the image according to the weight of the scene labels.
  • a label frequency histogram is a histogram created based on the frequency of each scene label in the video.
  • The frequency of a scene label is determined according to the number of images containing the scene label and the confidence of the scene label in those images. Specifically, the frequency of a scene label may be the ratio of the number of images containing the scene label to the number of all extracted images in the video; alternatively, a weight value of the scene label in each image may be determined according to the confidence of the scene label in the image or the size of the location area corresponding to the scene label, and the weighted sum or weighted average over the images containing the scene label in the video may be used as the frequency of the scene label.
  • the electronic device uses the confidence level corresponding to the scene label of the image as the weight value of the scene label in the image.
  • a label frequency histogram is established based on the scene label of the image and the corresponding weight value.
  • the weight of scene tags in an image refers to the importance of scene tags in the image to video tags.
  • the electronic device determines the weight value of the scene label according to the confidence level of all the scene labels of the image, and establishes a label frequency histogram according to the scene label of the image and the corresponding weight value.
  • Optionally, the electronic device can sort the scene labels in the image according to their confidences to obtain the serial number labels corresponding to the scene labels; that is, the scene label with the highest confidence is used as the first label, the next as the second label, and so on.
  • the electronic device can pre-store weight values corresponding to different serial number tags, determine the weight value of the scene label according to the serial number tag of the scene label, and establish a label frequency histogram according to the scene label of the image and the corresponding weight value.
  • In one embodiment, when the confidence of a scene label is the highest among all the scene labels of the image, the weight value corresponding to that scene label is the highest in the image.
  • Specifically, among the weight values pre-stored for the serial number labels, the weight pre-stored for the first label is the highest, that for the second label is next, and so on.
  • the electronic device establishes a label frequency histogram of the video according to a scene label whose confidence is greater than a threshold and a corresponding confidence.
  • the electronic device obtains a scene label with a confidence level greater than a threshold and a corresponding confidence level, determines a corresponding weight value according to the confidence level of the scene label, and establishes a label frequency histogram based on the image scene label and the corresponding weight value to determine the video label of the video.
  • Video labeling refers to labeling videos based on the scenes in the videos. According to the video labels, people can roughly understand the main content of the video.
  • the electronic device can obtain the frequency corresponding to each scene tag in the video according to the tag frequency histogram, sort the scene tags according to the preset rules according to the frequency, and use the scene tags with higher frequency as the video tags of the video.
  • the electronic device obtains the frequency corresponding to the scene tag according to the tag frequency histogram, and uses a preset number of scene tags with a high frequency as the video tag of the video.
  • The preset number can be determined according to the actual application scenario. Specifically, when the electronic device sorts and displays videos according to their video labels, the preset number may be one; when the electronic device uploads the video to a video website, it determines the preset number according to the video website's limit on the number of video labels.
  • Optionally, the electronic device can sort the scene labels in descending order of frequency, and then use the preset number of scene labels with the highest frequencies as the video labels of the video.
  • It should be understood that although the operations in the flowcharts of FIGS. 2 and 4-7 are displayed sequentially as indicated by the arrows, these operations are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restricting the execution of these operations, and they may be performed in other orders. Moreover, at least some of the operations in FIGS. 2 and 4-7 may include multiple sub-operations or multiple stages. These sub-operations or stages are not necessarily completed at the same time, but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other operations or with at least some of the sub-operations or stages of other operations.
  • FIG. 8 is a structural block diagram of a video processing apparatus according to an embodiment.
  • In one embodiment, a video processing apparatus includes a scene recognition module 802, a histogram establishment module 804, and a video label determination module 806, wherein:
  • The scene recognition module 802 is configured to extract one frame of image from a video at every preset frame interval, perform scene recognition on the extracted image, and obtain the scene labels of the image and the corresponding confidences.
  • a histogram establishment module 804 is configured to establish a label frequency histogram according to the scene label of the image and the corresponding confidence level.
  • the video label determining module 806 is configured to determine a video label of a video according to a label frequency histogram.
  • In one embodiment, the histogram establishment module 804 may be further configured to use the confidence corresponding to the scene label of the image as the weight value of the scene label in the image, and to establish a label frequency histogram according to the scene labels of the image and the corresponding weight values.
  • In one embodiment, the histogram establishment module 804 may be further configured to determine the weight value of a scene label according to the rank of its confidence among the confidences of all the scene labels of the image, and to establish a label frequency histogram according to the scene labels of the image and the corresponding weight values.
  • In one embodiment, the histogram establishment module 804 may be further configured so that, when the confidence of a scene label is the highest among all the scene labels of the image, the weight value corresponding to that scene label is the highest in the image.
  • In one embodiment, the histogram establishment module 804 may be further configured to establish the label frequency histogram of the video according to the scene labels whose confidence is greater than a threshold and the corresponding confidences.
  • In one embodiment, the provided video processing apparatus may further include a confidence adjustment module 808, which is configured to obtain position information at the time of video shooting and adjust the confidences corresponding to the scene labels in the image according to the position information.
  • the video tag determination module 806 may be further configured to obtain the frequency corresponding to the scene tag according to the tag frequency histogram, and use a preset number of scene tags with a high frequency as the video tag of the video.
  • each module in the above-mentioned video processing device is only for illustration. In other embodiments, the video processing device may be divided into different modules as required to complete all or part of the functions of the above-mentioned video processing device.
  • Each module in the video processing device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the hardware in or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • each module in the video processing apparatus may be in the form of a computer program.
  • the computer program can be run on a terminal or a server.
  • the program module constituted by the computer program can be stored in the memory of the terminal or server.
  • the computer program is executed by a processor, the operations of the method described in the embodiments of the present application are implemented.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform operations of a video processing method.
  • a computer program product containing instructions that, when run on a computer, causes the computer to perform a video processing method.
  • An embodiment of the present application further provides an electronic device.
  • the above electronic device includes an image processing circuit.
  • The image processing circuit may be implemented by hardware and/or software components, and may include various processing units that define an ISP (Image Signal Processing) pipeline.
  • FIG. 9 is a schematic diagram of an image processing circuit in one embodiment. As shown in FIG. 9, for ease of description, only aspects of the image processing technology related to the embodiments of the present application are shown.
  • the image processing circuit includes an ISP processor 940 and a control logic 950.
  • The image data captured by the imaging device 910 is first processed by the ISP processor 940, which analyzes the image data to capture image statistics that can be used to determine one or more control parameters of the imaging device 910.
  • the imaging device 910 may include a camera having one or more lenses 912 and an image sensor 914.
  • The image sensor 914 may include a color filter array (such as a Bayer filter). The image sensor 914 may obtain the light intensity and wavelength information captured by each imaging pixel of the image sensor 914, and provide a set of raw image data that can be processed by the ISP processor 940.
  • the sensor 920 may provide parameters (such as image stabilization parameters) of the acquired image processing to the ISP processor 940 based on the interface type of the sensor 920.
  • the sensor 920 interface may use a SMIA (Standard Mobile Imaging Architecture) interface, other serial or parallel camera interfaces, or a combination of the foregoing interfaces.
  • the image sensor 914 may also send the original image data to the sensor 920, and the sensor 920 may provide the original image data to the ISP processor 940 based on the interface type of the sensor 920, or the sensor 920 stores the original image data in the image memory 930.
  • the ISP processor 940 processes the original image data pixel by pixel in a variety of formats.
  • Each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 940 may perform one or more image processing operations on the raw image data and collect statistical information about the image data.
  • the image processing operations may be performed with the same or different bit depth accuracy.
  • the ISP processor 940 may also receive image data from the image memory 930.
  • the sensor 920 interface sends the original image data to the image memory 930, and the original image data in the image memory 930 is then provided to the ISP processor 940 for processing.
  • the image memory 930 may be a part of a memory device, a storage device, or a separate dedicated memory in an electronic device, and may include a DMA (Direct Memory Access) feature.
  • the ISP processor 940 may perform one or more image processing operations, such as time-domain filtering.
  • the processed image data may be sent to the image memory 930 for further processing before being displayed.
  • The ISP processor 940 receives the processing data from the image memory 930 and performs image data processing on the processing data in the raw domain and in the RGB and YCbCr color spaces.
  • The image data processed by the ISP processor 940 may be output to the display 970 for viewing by the user and/or further processed by a graphics engine or a GPU (Graphics Processing Unit).
  • the output of the ISP processor 940 can also be sent to the image memory 930, and the display 970 can read image data from the image memory 930.
  • the image memory 930 may be configured to implement one or more frame buffers.
  • the output of the ISP processor 940 may be sent to an encoder / decoder 960 to encode / decode image data.
  • The encoded image data can be saved, and decompressed before being displayed on the display 970.
  • the encoder / decoder 960 may be implemented by a CPU or a GPU or a coprocessor.
  • the statistical data determined by the ISP processor 940 may be sent to the control logic 950 unit.
  • the statistical data may include image information of the image sensor 914 such as auto exposure, auto white balance, auto focus, flicker detection, black level compensation, and lens 912 shading correction.
  • The control logic 950 may include a processor and/or a microcontroller that executes one or more routines (such as firmware), and the one or more routines may determine the control parameters of the imaging device 910 and the control parameters of the ISP processor 940 according to the received statistical data.
  • Control parameters of the imaging device 910 may include sensor 920 control parameters (such as gain and integration time for exposure control, image stabilization parameters, etc.), camera flash control parameters, lens 912 control parameters (such as focal length for focusing or zooming), or a combination of these parameters.
  • ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (eg, during RGB processing), and lens 912 shading correction parameters.
  • the electronic device can implement the video processing method described in the embodiment of the present application according to the image processing technology.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM), which is used as external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

A video processing method includes: extracting one frame of image from a video at every preset frame interval; performing scene recognition on the extracted image to obtain scene labels of the image and corresponding confidence levels; establishing a label frequency histogram according to the scene labels of the image and the corresponding confidence levels; and determining a video label of the video according to the label frequency histogram.

Description

Video processing method, electronic device, and computer-readable storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 2018105859289, filed with the China Patent Office on June 8, 2018 and entitled "Video processing method and apparatus, electronic device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular, to a video processing method, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of internet technology, video has become one of the important forms of entertainment in people's daily lives. People can browse different videos on an electronic device according to video labels, and when uploading a video to a video website, people need to classify the video and add video labels; an electronic device can obtain the video labels of a video by recognizing the video. However, the conventional technology has the problem that the obtained video labels are inaccurate.
Summary
According to various embodiments of the present application, a video processing method, an electronic device, and a computer-readable storage medium are provided.
A video processing method includes:
extracting one frame of image from a video at every preset frame interval, performing scene recognition on the extracted image, and obtaining a scene label of the image and a corresponding confidence level;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
determining a video label of the video according to the label frequency histogram.
An electronic device includes a memory and a processor; the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to perform the following operations:
extracting one frame of image from a video at every preset frame interval, performing scene recognition on the extracted image, and obtaining a scene label of the image and a corresponding confidence level;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
determining a video label of the video according to the label frequency histogram.
A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the following operations are implemented:
extracting one frame of image from a video at every preset frame interval, performing scene recognition on the extracted image, and obtaining a scene label of the image and a corresponding confidence level;
establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
determining a video label of the video according to the label frequency histogram.
The video processing method, electronic device, and computer-readable storage medium provided in the embodiments of the present application can extract one frame of image from a video at every preset frame interval, perform scene recognition on the extracted image, obtain the scene labels of the image and the corresponding confidence levels, and establish a label frequency histogram according to the scene labels of the images in the video and the corresponding confidence levels to determine the video label of the video. Since the label frequency histogram can be established according to the scene labels of the images in the video to determine the video label, the accuracy of the video label can be improved.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objectives, and advantages of the present invention will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the internal structure of an electronic device in one or more embodiments.
FIG. 2 is a flowchart of a video processing method in one or more embodiments.
FIG. 3 is a schematic structural diagram of a neural network in one or more embodiments.
FIG. 4 is a flowchart of establishing a label frequency histogram in one or more embodiments.
FIG. 5 is a flowchart of establishing a label frequency histogram in another embodiment.
FIG. 6 is a flowchart of adjusting the confidence level in one or more embodiments.
FIG. 7 is a flowchart of a video processing method in one or more embodiments.
FIG. 8 is a structural block diagram of a video processing apparatus in one or more embodiments.
FIG. 9 is a schematic diagram of an image processing circuit in one or more embodiments.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit the present application.
图1为一个实施例中电子设备的内部结构示意图。如图1所示,该电子设备包括通过系统总线连接的处理器、存储器和网络接口。其中,该处理器用于提供计算和控制能力,支撑整个电子设备的运行。存储器用于存储数据、程序等,存储器上存储至少一个计算机程序,该计算机程序可被处理器执行,以实现本申请实施例中提供的适用于电子设备的无线网络通信方法。存储器可包括非易失性存储介质及内存储器。非易失性存储介质存储有操作系统和计算机程序。该计算机程序可被处理器所执行,以用于实现以下各个实施例所提供的一种视频处理方法。内存储器为非易失性存储介质中的操作系统计算机程序提供高速缓存的运行环境。网络接口可以是以太网卡或无线网卡等,用于与外部的电子设备进行通信。该电子设备可以是手机、平板电脑或者个人数字助理或穿戴式设备等。
图2为一个实施例中视频处理方法的流程图。本实施例中的视频处理方法,以运行于图1中的电子设备上为例进行描述。如图2所示,视频处理方法包括操作202至操作206。
操作202,从视频中每间隔预设帧提取一帧图像,对提取的图像进行场景识别,得到图像的场景标签及对应的置信度。
视频是指电子设备上的任意视频。具体地,视频可以是电子设备通过摄像头采集的视频,也可以是存储在电子设备本地的视频,还可以是电子设备从网络下载的视频等。视频是由多帧静态图像组成的连续画面。预设帧可以根据实际应用需求来确定。具体地,预设帧可以根据视频帧率来确定,也可以根据视频时长来确定,还可以根据帧率和时长二者结合来确定。例如,预设帧可以为0帧,此时电子设备可以提取视频中的每一帧图像,对提取的图像进行场景识别。
电子设备可以根据VGG(Visual Geometry Group)、CNN(Convolutional Neural Network)、SSD(single shot multibox detector)、决策树(Decision Tree)等深度学习算法训练场景识别模型,根据场景识别模型对图像进行场景识别。具体地,电子设备可以训练可输出多个场景标签的神经网络。具体地,在神经网络训练过程中,可以将包含多个训练标签的训练图像或多张包含不同训练标签的训练图像输入到神经网络中,神经网络对训练图像进行特征提取,对提取的图像特征进行检测得到图像中各个特征对应的预测置信度,根据特征的预测置信度和真实置信度得到损失函数,根据损失函数对神经网络的参 数进行调整,使得训练的神经网络后续可同时识别图像的多个特征对应的场景标签,从而得到输出多个场景标签的神经网络。置信度是被测量参数的测量值的可信程度。真实置信度表示在该训练图像中预先标注的特征所属指定场景类别的置信度。图像的场景可以是风景、海滩、蓝天、绿草、雪景、烟火、聚光灯、文本、人像、婴儿、猫、狗、美食等。
电子设备采用可输出多标签的神经网络对图像进行检测,具体地,神经网络输入层接收输入的图像,通过基础网路(如VGG网络)提取图像的特征,将提取的图像的特征输入到检测网络层进行场景检测,检测网络层可采用SSD网络、Mobilenet网络等对特征进行检测,在输出层通过softmax分类器输出特征所属类别的置信度及对应的位置,选取置信度最高且超过置信度阈值的目标类别作为图像中该特征所属的场景标签,从而输出图像中各个特征的场景标签及对应的置信度。
操作204,根据图像的场景标签及对应的置信度建立标签频率直方图。
标签频率直方图是指根据视频中各场景标签的频率建立的直方图。场景标签的频率是根据包含场景标签的图像数量及图像中该场景标签的置信度来确定的。具体地,场景标签的频率可以是视频中包含该场景标签的图像数量与所有提取的图像数量的比值,也可以是根据图像中场景标签的置信度或场景标签对应的位置区域的大小确定该图像中场景标签的权重值,将视频中包含该场景标签的图像的加权和或加权平均值作为该场景标签的频率等。在一个实施例中,电子设备以场景标签作为标签频率直方图的横坐标,场景标签的频率作为标签频率直方图的纵坐标建立标签频率直方图,则电子设备根据标签频率直方图可以得出视频中各场景标签的频率。
操作206,根据标签频率直方图确定视频的视频标签。
视频标签是指根据视频中出现的场景对视频进行标记,根据视频标签人们可以大概了解到视频的主要内容。视频标签可以是1个,也可以是多个如2个、3个、4个等不限于此。具体地,电子设备可以根据标签频率直方图获取视频中各场景标签对应的频率,根据频率将场景标签按预设规则进行排序,将频率较大的场景标签作为视频的视频标签。电子设备也可以预设标签阈值,将频率大于标签阈值的场景标签作为视频的视频标签。电子设备还可以循环读取频率直方图中场景标签及对应的频率,从而获取频率最大的场景标签作为视频的视频标签。电子设备根据标签频率直方图确定视频的视频标签的方式还可以是上述各种方式的结合或其他方式,在此不做限定。
本申请实施例提供的视频处理方法,可以从视频中每个预设帧提取一帧图像,对提取的图像进行场景识别,得到图像的场景标签及对应的置信度,根据视频中各图像的场景标签及对应的置信度建立标签频率直方图来确定视频的视频标签,可以提高视频标签的准确性。
在一个实施例中,提供的视频处理方法中对提取的图像进行场景识别,得到场景的场景标签及对应的置信度的过程还可以包括:对图像进行场景识别,得到图像的分类标签,对图像进行目标检测,得到图像的目标标签,将分类标签和目标标签作为图像的场景标签。
具体地,电子设备可以采用图像分类技术对图像进行场景识别。电子设备可预存有多个分类标签对应的图像特征信息,将需要进行场景识别的图像中的图像特征信息与预存的图像特征信息进行匹配,获取匹配成功的图像特征信息对应的分类标签作为图像的分类标签。相似地,电子设备对图像进行目标检测,可将图像中图像特征信息与预存的目标标签对应的特征信息进行匹配,获取匹配成功的特征信息对应的目标标签作为图像的目标标签。电子设备中预存的分类标签可包括:风景、海滩、蓝天、绿草、雪景、夜景、黑暗、逆光、日落、烟火、聚光灯、室内、微距、文本、人像、婴儿、猫、狗、美食等;目标标签可包括:人像、婴儿、猫、狗、美食、文本、蓝天、绿草、沙滩、烟火等。电子设备可以将分类标签和目标标签均作为图像的场景标签,并按照置信度的大小依次输出图像的场景标签及对应的置信度。
在一个实施例中,在一个实施例中,提供的视频处理方法中对提取的图像进行场景识别,得到场景的场景标签及对应的置信度的过程还可以包括:对图像进行场景分类和目标检测,得到图像的分类标签和目标标签,将分类标签和目标标签作为图像的场景标签。
具体地,电子设备可以训练可同时实现场景分类和目标检测的神经网络。具体地,在神经网络训练过程中,可以将包含有至少一个背景训练目标和前景训练目标的训练图像输入到神经网络中,神经网络根据背景训练目标和前景训练目标进行特征提取,对背景训练目标进行检测得到第一预测置信度,根据第一预测置信度和第一真实置信度得到第一损失函数,对前景训练目标进行检测得到第二预测置信度,根据第二预测置信度和第二真实置信度得到第二损失函数,根据第一损失函数和第二损失函数得到目标损失函数,对神经网络的参数进行调整,使得训练的神经网络后续可同时识别出场景分类和目标分类,从而得到可以同时对图像的前景区域和背景区域进行检测神经网络。该第一真实置信度表示在该训练图像中预先标注的背景图像所属指定图像类别的置信度。第二真实置信度表示在该训练图像中预先标注的前景目标所属指定目标类别的置信度。
在一个实施例中,上述神经网络包括至少一个输入层、基础网络层、分类网络层、目标检测网络层和两个输出层,该两个输出层包括与该分类网络层级联的第一输出层和与该目标检测网络层级联的第二输出层;其中,在训练阶段,该输入层用于接收该训练图像,该第一输出层用于输出该分类网络层检测的背景图像所属指定场景类别的第一预测置信度;该第二输出层用于输出该目标检测网络层检测的每个预选的默认边界框所属相对于指定目标所对应的真实边界框的偏移量参数和所属指定目标类别的第二预测置信度。图3为一个实施例中神经网络的架构示意图。如图3所示,神经网络的输入层接收带有图像类别标签的训练图像,通过基础网络(如VGG网络)进行特征提取,并将提取的图像特征输出给特征层,由该特征层对图像进行类别检测得到第一损失函数,对前景目标根据图像特征进行目标检测得到第二损失函数,对前景目标根据前景目标进行位置检测得到位置损失函数,将第一损失函数、第二损失函数和位置损失函数进行加权求和得到目标损失函数。神经网络包括数据输入层、基础网络层、场景分类网络层、目标检测网络层和两个输出层。数据输入层用于接收原始图像数据。基础网络层对输入层输入的图像进行预处理以及特征提取。该预处理可包括去均值、归一化、降维和白化处理。去均值是指将输入数据各个维度都中心化为0,目的是将样本的中心拉回到坐标系原点上。归一化是将幅度归一化到同样的范围。白化是指对数据各个特征轴上的幅度归一化。图像数据进行特征提取,例如利用VGG16的前5层卷积层对原始图像进行特征提取,再将提取的特征输入到分类网络层和目标检测网络层。在分类网络层可采用如Mobilenet网络的深度卷积、点卷积对特征进行检测,然后输入到输出层得到图像场景分类所属指定图像类别的第一预测置信度,然后根据第一预测置信度与第一真实置信度求差得到第一损失函数;在目标检测网络层可采用如SSD网络,在VGG16的前5层的卷积层后级联卷积特征层,在卷积特征层使用一组卷积滤波器来预测指定目标类别所对应的预选默认边界框相对于真实边界框的偏移量参数和指定目标类别所对应的第二预测置信度。感兴趣区域为预选默认边界框的区域。根据偏移量参数构建位置损失函数,根据第二预测置信度与第二真实置信度的差异得到第二损失函数。将第一损失函数、第二损失函数和位置损失函数加权求和得到目标损失函数,根据目标损失函数采用反向传播算法调整神经网络的参数,对神经网络进行训练。
When the trained neural network is used to recognize an image, the input layer of the neural network receives the input image and extracts its features. The features are fed into the classification network layer for image scene recognition, and at the first output layer a softmax classifier outputs the confidence level of each specified scene classification label to which the background image may belong; the image scene with the highest confidence level that also exceeds the confidence threshold is selected as the classification label of the background of the image. The extracted image features are also fed into the target detection network layer for foreground target detection; at the second output layer a softmax classifier outputs the confidence levels of the specified target categories to which the foreground targets belong, together with the corresponding positions, outputting each target label of the foreground targets and the position corresponding to each target label. The obtained classification labels and target labels are taken as the scene labels of the image.

In one embodiment, in the provided video processing method, the process of establishing the label frequency histogram according to the scene labels of the images and the corresponding confidence levels, as shown in FIG. 4, includes:

Operation 402: take the confidence level corresponding to a scene label of the image as the weight value of that scene label in the image.

The weight value of a scene label in an image indicates the importance of that scene label in the image for the video label. With the scene labels and weight values of the other images in the video fixed, the higher the weight value of the scene label in the image, the higher the frequency of that scene label in the video; the lower the weight value of the scene label in the image, the lower the frequency of that scene label in the video.

Operation 404: establish the label frequency histogram according to the scene labels of the images and the corresponding weight values.
Specifically, the electronic device may compute the weighted sum of each scene label from the scene labels of the images in the video and the corresponding weight values, and establish the label frequency histogram from the scene labels and their weighted sums. The electronic device may also compute the weighted average of each scene label from the scene labels of the images in the video and the corresponding weight values, and establish the label frequency histogram from the scene labels and their weighted averages. For example, suppose the scene labels and corresponding confidence levels output for the images of a video are: image A: baby 0.9, green grass 0.8, blue sky 0.5; image B: food 0.8, baby 0.6; image C: blue sky 0.7, baby 0.3. In the label frequency histogram established from the weighted averages, the frequency of baby is 0.6, the frequency of blue sky is 0.4, and the frequencies of green grass and food are both about 0.27. The electronic device may then take baby as the video label of the video, or take baby and blue sky as the video labels of the video, and so on.
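A sketch reproducing this worked example; it interprets the weighted average as each label's summed confidence divided by the total number of extracted images, which matches the figures above:

```python
from collections import defaultdict


def weighted_average_histogram(per_image_labels):
    """per_image_labels: list of {scene_label: confidence} dicts, one per image.
    Each confidence acts as the label's weight in that image; the frequency is
    the summed weight divided by the total number of extracted images."""
    totals = defaultdict(float)
    for labels in per_image_labels:
        for label, confidence in labels.items():
            totals[label] += confidence
    n = len(per_image_labels)
    return {label: total / n for label, total in totals.items()}


images = [
    {"baby": 0.9, "green_grass": 0.8, "blue_sky": 0.5},  # image A
    {"food": 0.8, "baby": 0.6},                          # image B
    {"blue_sky": 0.7, "baby": 0.3},                      # image C
]
print(weighted_average_histogram(images))
# baby ≈ 0.6, blue_sky ≈ 0.4, green_grass and food ≈ 0.27 (up to float rounding)
```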
By taking the confidence level corresponding to each scene label of an image as the weight value of that scene label in the image, and establishing the label frequency histogram from the scene labels and corresponding weight values in the video to determine the video label of the video, the accuracy of the video label can be improved.

As shown in FIG. 5, in one embodiment, in the provided video processing method, the process of establishing the label frequency histogram according to the scene labels of the images and the corresponding confidence levels may further include:

Operation 502: determine the weight value of a scene label according to the rank of its confidence level among the confidence levels of all scene labels of the image.

Specifically, the electronic device may sort the scene labels in the image by confidence level to obtain an ordinal label for each scene label, that is, the scene label with the highest confidence level becomes the first label, the next becomes the second label, and so on. For example, if in one frame of a video the confidence level of beach is 0.6, the confidence level of blue sky is 0.9, and the confidence level of portrait is 0.8, then in that frame blue sky is the first label, portrait is the second label, and beach is the third label. The electronic device may prestore weight values corresponding to the different ordinal labels and determine the weight value of each scene label from its ordinal label.
In one embodiment, when the confidence level of a scene label is the highest among the confidence levels of all scene labels of the image, the weight value corresponding to that scene label is the highest in the image. Specifically, among the weight values prestored for the ordinal labels, the weight value prestored for the first label is the highest, the second label next, and so on. For example, the electronic device may prestore a weight value of 0.8 for the first label, 0.5 for the second label, and 0.2 for the third label; in the above example, the first label blue sky then has a weight value of 0.8, the second label portrait a weight value of 0.5, and the third label beach a weight value of 0.2.
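A sketch of this rank-based weighting with the prestored weights 0.8/0.5/0.2 from the example (assigning weight 0 beyond the third label is an assumption the embodiment leaves open):

```python
ORDINAL_WEIGHTS = [0.8, 0.5, 0.2]  # prestored weights for the 1st, 2nd, 3rd labels


def rank_based_weights(label_confidences):
    """Sort an image's scene labels by confidence and assign each the
    prestored weight of its rank; labels beyond the list get weight 0."""
    ranked = sorted(label_confidences.items(), key=lambda kv: kv[1], reverse=True)
    return {
        label: (ORDINAL_WEIGHTS[i] if i < len(ORDINAL_WEIGHTS) else 0.0)
        for i, (label, _) in enumerate(ranked)
    }


frame = {"beach": 0.6, "blue_sky": 0.9, "portrait": 0.8}
print(rank_based_weights(frame))
# {'blue_sky': 0.8, 'portrait': 0.5, 'beach': 0.2}
```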
Operation 504: establish the label frequency histogram according to the scene labels of the images and the corresponding weight values.

Specifically, the electronic device may compute the weighted sum of each scene label from the scene labels of the images in the video and the corresponding weight values, and establish the label frequency histogram from the scene labels and their weighted sums. The electronic device may also compute the weighted average of each scene label from the scene labels of the images in the video and the corresponding weight values, and establish the label frequency histogram from the scene labels and their weighted averages.

By determining the weight value of each scene label in an image from the scene label of the image and the corresponding confidence level, and establishing the label frequency histogram from the scene labels and corresponding weight values in the video to determine the video label of the video, the accuracy of the video label can be improved.

In one embodiment, in the provided video processing method, the process of establishing the label frequency histogram according to the scene labels of the images and the corresponding confidence levels further includes: establishing the label frequency histogram of the video according to the scene labels whose confidence levels exceed a threshold and the corresponding confidence levels.
By establishing the label frequency histogram from the scene labels whose confidence levels exceed the threshold, the electronic device can filter out the scene labels in the image whose confidence levels fall below the threshold. The threshold may be determined according to actual requirements; specifically, it may be 0.1, 0.15, 0.2, 0.3, or the like, without limitation. The electronic device obtains the scene labels whose confidence levels exceed the threshold together with the corresponding confidence levels, determines the corresponding weight values from the confidence levels of the scene labels, and establishes the label frequency histogram from the scene labels of the images and the corresponding weight values to determine the video label of the video, which reduces the influence of low-confidence scene labels in the images on the video label and improves the accuracy of the scene labels. For example, if the scene labels and confidence levels in one frame of a video are dog 0.8, cat 0.2, green grass 0.7, and food 0.1, and the threshold is 0.3, then the two scene labels cat 0.2 and food 0.1 are discarded, and the label frequency histogram is established from the two scene labels dog 0.8 and green grass 0.7.
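A sketch of this confidence filter, using the threshold and labels from the example:

```python
def filter_by_threshold(label_confidences, threshold=0.3):
    """Keep only the scene labels whose confidence exceeds the threshold."""
    return {
        label: conf
        for label, conf in label_confidences.items()
        if conf > threshold
    }


frame = {"dog": 0.8, "cat": 0.2, "green_grass": 0.7, "food": 0.1}
print(filter_by_threshold(frame))  # {'dog': 0.8, 'green_grass': 0.7}
```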
As shown in FIG. 6, in one embodiment, the provided video processing method may further include a process of adjusting the confidence levels of the scene labels, specifically including:

Operation 602: obtain location information of the place where the video was shot.

When shooting a video, the electronic device can obtain the geographic coordinates of the shooting location through GPS (Global Positioning System), and the location information can be derived from these coordinates. For example, when GPS detects that the video was shot at latitude 18.294898 north and longitude 109.408984 east, the electronic device can determine from these coordinates that the corresponding location is the beach of Sanya Bay, Hainan.

Operation 604: adjust the confidence levels corresponding to the scene labels in the image according to the location information.
The electronic device may prestore the scene labels corresponding to different location information and the weights corresponding to those scene labels, and adjust the confidence levels corresponding to the scene labels in the image according to the weights of the scene labels. Specifically, the weights corresponding to the scene labels may be derived from a statistical analysis of a large corpus of image material, with each location matched to its corresponding scene labels and scene-label weights accordingly. For example, statistical analysis of a large corpus of images may show that when the location information is "beach", the scene "beach" corresponding to that location has a weight of 9, "blue sky" a weight of 8, "landscape" a weight of 7, "snow" a weight of -8, and "green grass" a weight of -7, the weights taking values in [-10, 10]. The larger the weight, the higher the probability that the scene appears in the image; the smaller the weight, the lower that probability. For every increase of 1 in the weight from 0, the confidence level of the corresponding scene increases by 1%; likewise, for every decrease of 1 in the weight from 0, the confidence level of the corresponding scene decreases by 1%.
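A sketch of this location-based adjustment. Reading "increases by 1%" as a relative scaling of the confidence is one possible interpretation (an additive percentage-point reading is also conceivable); the weight table reuses the beach example, and capping at 1.0 is an added assumption:

```python
# Prestored location -> {scene label: weight in [-10, 10]} table (example values).
LOCATION_WEIGHTS = {
    "beach": {"beach": 9, "blue_sky": 8, "landscape": 7,
              "snow": -8, "green_grass": -7},
}


def adjust_confidences(label_confidences, location):
    """Scale each label's confidence by 1% per unit of its location weight,
    capped at 1.0; labels without a prestored weight are left unchanged."""
    weights = LOCATION_WEIGHTS.get(location, {})
    return {
        label: min(conf * (1 + weights.get(label, 0) / 100.0), 1.0)
        for label, conf in label_confidences.items()
    }


frame = {"beach": 0.80, "snow": 0.50}
print(adjust_confidences(frame, "beach"))
# beach ≈ 0.872 (weight +9), snow ≈ 0.46 (weight -8)
```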
By deriving the location information from the geographic coordinates of the shooting location, obtaining the weights of the scene labels corresponding to that location, and adjusting the confidence levels of the scene labels of the image accordingly, the confidence levels of the scene labels of the image become more accurate, which improves the accuracy of the video label.

As shown in FIG. 7, in one embodiment, in the provided video processing method, the process of determining the video label of the video according to the label frequency histogram specifically includes:

Operation 702: obtain the frequency corresponding to each scene label from the label frequency histogram.

Specifically, the electronic device may obtain the frequency corresponding to each scene label in the video from the label frequency histogram and sort the scene labels by frequency according to a preset rule.

Operation 704: take a preset number of scene labels with the highest frequencies as the video labels of the video.
The preset number may be determined according to the actual application scenario. Specifically, when the electronic device displays videos by category according to their video labels, the preset number may be 1; when the electronic device uploads the video to a video website, the electronic device determines the preset number according to the website's limit on the number of video labels. The preset number may be 1, or multiple such as 2, 3, or 4, without limitation. For example, when the electronic device uploads the video to a video website that limits videos to 3 labels, the preset number may be 3. The electronic device may sort the scene labels from highest to lowest frequency according to the frequencies corresponding to the scene labels, and take the preset number of scene labels with the highest frequencies, in order, as the video labels of the video.
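A sketch of this top-k selection, reusing the weighted-average histogram from the earlier example with an assumed preset number of 2:

```python
def top_k_video_labels(histogram, k=2):
    """Sort scene labels by frequency, highest first, and keep the top k."""
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:k]]


histogram = {"baby": 0.6, "blue_sky": 0.4, "green_grass": 0.27, "food": 0.27}
print(top_k_video_labels(histogram))  # ['baby', 'blue_sky']
```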
In one embodiment, a video processing method is provided; the specific operations for implementing the method are as follows:

First, the electronic device extracts one frame of image every preset number of frames from the video and performs scene recognition on the extracted images to obtain the scene labels of the images and the corresponding confidence levels. A video is a continuous sequence of pictures composed of multiple frames of static images. The preset number of frames may be determined according to actual application requirements; specifically, it may be determined according to the video frame rate, according to the video duration, or according to a combination of both. The electronic device may train a scene recognition model using deep learning algorithms such as VGG, CNN, SSD, or decision trees, and perform scene recognition on the images according to the scene recognition model. Specifically, the electronic device detects the images with a neural network capable of outputting multiple labels, thereby outputting the scene label and corresponding confidence level of each feature in the image.

Optionally, the electronic device performs scene recognition on the image to obtain a scene label of the image, performs target detection on the image to obtain a target label of the image, and takes the scene label and the target label as the classification labels of the image. The electronic device may prestore image feature information corresponding to multiple scene labels, match the image feature information of the image to be recognized against the prestored image feature information, and take the scene label corresponding to the successfully matched image feature information as the scene label of the image. When performing target detection on the image, the electronic device may match the image feature information in the image against the feature information corresponding to prestored target labels and take the target label corresponding to the successfully matched feature information as the target label of the image. The electronic device may take both the scene labels and the target labels as the classification labels of the image and obtain the confidence levels corresponding to the classification labels.

Optionally, the electronic device performs scene classification and target detection on the image to obtain the scene label and the target label of the image, and takes the scene label and the target label as the classification labels of the image. The electronic device may train a neural network capable of performing scene classification and target detection simultaneously, use the base network layer of the neural network to extract features from the image, and feed the extracted image features into the classification network layer and the target detection network layer. The classification network performs scene detection and outputs the confidence level of the specified image category to which the background region of the image belongs; the target detection network layer performs target detection and obtains the confidence level of the specified target category to which the foreground region belongs, thereby outputting the scene label of the image and its corresponding confidence level, as well as the target labels with their corresponding confidence levels and target positions.

Optionally, the electronic device obtains the location information of the place where the video was shot and adjusts the confidence levels corresponding to the scene labels in the image according to the location information. When shooting a video, the electronic device can obtain the geographic coordinates of the shooting location through GPS and derive the location information from the coordinates. The electronic device may prestore the scene labels corresponding to different location information and the weights corresponding to those scene labels, and adjust the confidence levels corresponding to the scene labels in the image according to the weights of the scene labels.
Next, the electronic device establishes a label frequency histogram according to the scene labels of the images and the corresponding confidence levels. The label frequency histogram is a histogram established according to the frequency of each scene label in the video. The frequency of a scene label is determined by the number of images containing that scene label and the confidence levels of the scene label in those images. Specifically, the frequency of a scene label may be the ratio of the number of images in the video containing the scene label to the total number of extracted images; alternatively, the weight value of the scene label in each image may be determined from the confidence level of the scene label or from the size of the position region corresponding to the scene label, and the weighted sum or weighted average over the images in the video containing the scene label may be taken as the frequency of the scene label.

Optionally, the electronic device takes the confidence level corresponding to a scene label of an image as the weight value of that scene label in the image, and establishes the label frequency histogram according to the scene labels of the images and the corresponding weight values. The weight value of a scene label in an image indicates the importance of that scene label in the image for the video label. By taking the scene labels of the images and their corresponding confidence levels as the weight values of the scene labels in the images, and establishing the label frequency histogram from the scene labels and corresponding weight values in the video to determine the video label of the video, the accuracy of the video label can be improved.

Optionally, the electronic device determines the weight value of each scene label according to the rank of its confidence level among the confidence levels of all scene labels of the image, and establishes the label frequency histogram according to the scene labels of the images and the corresponding weight values. The electronic device may sort the scene labels in the image by confidence level to obtain an ordinal label for each scene label, taking the scene label with the highest confidence level as the first label, the next as the second label, and so on. The electronic device may prestore the weight values corresponding to the different ordinal labels, determine the weight value of each scene label from its ordinal label, and establish the label frequency histogram according to the scene labels of the images and the corresponding weight values.

Optionally, when the confidence level of a scene label is the highest among the confidence levels of all scene labels of the image, the weight value corresponding to that scene label is the highest in the image. Specifically, among the weight values prestored for the ordinal labels, the weight value prestored for the first label is the highest, the second label next, and so on.

Optionally, the electronic device establishes the label frequency histogram of the video according to the scene labels whose confidence levels exceed the threshold and the corresponding confidence levels. The electronic device obtains the scene labels whose confidence levels exceed the threshold together with the corresponding confidence levels, determines the corresponding weight values from the confidence levels of the scene labels, and establishes the label frequency histogram from the scene labels of the images and the corresponding weight values to determine the video label of the video, which reduces the influence of low-confidence scene labels in the images on the video label and improves the accuracy of the scene labels.

Next, the electronic device determines the video label of the video according to the label frequency histogram. A video label marks the video according to the scenes appearing in it; from the video label, a viewer can get a rough idea of the main content of the video. The electronic device may obtain the frequency corresponding to each scene label in the video from the label frequency histogram, sort the scene labels by frequency according to a preset rule, and take the scene labels with higher frequencies as the video labels of the video.
Optionally, the electronic device obtains the frequency corresponding to each scene label from the label frequency histogram and takes a preset number of scene labels with the highest frequencies as the video labels of the video. The preset number may be determined according to the actual application scenario. Specifically, when the electronic device displays videos by category according to their video labels, the preset number may be 1; when the electronic device uploads the video to a video website, the electronic device determines the preset number according to the website's limit on the number of video labels. The electronic device may sort the scene labels from highest to lowest frequency according to the frequencies corresponding to the scene labels, and take the preset number of scene labels with the highest frequencies, in order, as the video labels of the video.
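Putting these optional steps together, a minimal self-contained sketch of the overall flow of this embodiment (the 0.3 threshold, the preset number k = 2, and the per-image labels are all illustrative assumptions):

```python
from collections import defaultdict


def video_labels(per_image_labels, threshold=0.3, k=2):
    """End-to-end sketch: drop low-confidence labels per image, build the
    weighted-average label frequency histogram, and pick the top-k labels."""
    totals = defaultdict(float)
    for labels in per_image_labels:
        for label, conf in labels.items():
            if conf > threshold:       # per-image confidence filter
                totals[label] += conf  # confidence used as the label's weight
    n = len(per_image_labels)
    histogram = {label: total / n for label, total in totals.items()}
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:k]]


images = [
    {"baby": 0.9, "green_grass": 0.8, "blue_sky": 0.5},
    {"food": 0.8, "baby": 0.6},
    {"blue_sky": 0.7, "baby": 0.3},  # baby 0.3 is filtered (not > 0.3)
]
print(video_labels(images))  # ['baby', 'blue_sky']
```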
It should be understood that, although the operations in the flowcharts of FIG. 2 and FIGS. 4-7 are displayed in sequence as indicated by the arrows, these operations are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these operations, and they may be executed in other orders. Moreover, at least some of the operations in FIG. 2 and FIGS. 4-7 may include multiple sub-operations or stages, which are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; rather, they may be executed in turn or alternately with other operations or with at least some of the sub-operations or stages of other operations.
FIG. 8 is a structural block diagram of a video processing apparatus according to an embodiment. As shown in FIG. 8, a video processing apparatus includes a scene recognition module 802, a histogram establishing module 804, and a video label determining module 806, wherein:

the scene recognition module 802 is configured to extract one frame of image every preset number of frames from the video and perform scene recognition on the extracted image to obtain the scene label of the image and the corresponding confidence level;

the histogram establishing module 804 is configured to establish a label frequency histogram according to the scene labels of the images and the corresponding confidence levels; and

the video label determining module 806 is configured to determine the video label of the video according to the label frequency histogram.
In one embodiment, the histogram establishing module 804 may further be configured to take the confidence level corresponding to a scene label of the image as the weight value of that scene label in the image, and establish the label frequency histogram according to the scene labels of the images and the corresponding weight values.

In one embodiment, the histogram establishing module 804 may further be configured to determine the weight value of a scene label according to the rank of its confidence level among the confidence levels of all scene labels of the image, and establish the label frequency histogram according to the scene labels of the images and the corresponding weight values.

In one embodiment, the histogram establishing module 804 may further be configured to determine the weight value of a scene label according to the rank of its confidence level among the confidence levels of all scene labels of the image, wherein, when the confidence level of a scene label is the highest among the confidence levels of all scene labels of the image, the weight value corresponding to that scene label is the highest in the image.

In one embodiment, the histogram establishing module 804 may further be configured to establish the label frequency histogram of the video according to the scene labels whose confidence levels exceed the threshold and the corresponding confidence levels.

In one embodiment, the provided video processing apparatus may further include a confidence adjusting module 808, configured to obtain the location information of the place where the video was shot and adjust the confidence levels corresponding to the scene labels in the image according to the location information.

In one embodiment, the video label determining module 806 may further be configured to obtain the frequency corresponding to each scene label from the label frequency histogram and take a preset number of scene labels with the highest frequencies as the video labels of the video.
The division of the modules in the above video processing apparatus is for illustration only; in other embodiments, the video processing apparatus may be divided into different modules as required to complete all or part of the functions of the above video processing apparatus.

For the specific limitations of the video processing apparatus, reference may be made to the limitations of the video processing method above, which are not repeated here. Each module in the above video processing apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

The implementation of each module in the video processing apparatus provided in the embodiments of the present application may take the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the terminal or server. When executed by a processor, the computer program implements the operations of the methods described in the embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium. One or more non-volatile computer-readable storage media contain computer-executable instructions that, when executed by one or more processors, cause the processors to perform the operations of the video processing method.

A computer program product containing instructions, which, when run on a computer, causes the computer to perform the video processing method.
The embodiments of the present application also provide an electronic device. The electronic device includes an image processing circuit, which may be implemented by hardware and/or software components and may include various processing units defining an ISP (Image Signal Processing) pipeline. FIG. 9 is a schematic diagram of an image processing circuit in one embodiment. As shown in FIG. 9, for ease of description, only the aspects of the image processing technology related to the embodiments of the present application are shown.

As shown in FIG. 9, the image processing circuit includes an ISP processor 940 and control logic 950. Image data captured by an imaging device 910 is first processed by the ISP processor 940, which analyzes the image data to capture image statistics that can be used to determine one or more control parameters of the ISP processor 940 and/or the imaging device 910. The imaging device 910 may include a camera with one or more lenses 912 and an image sensor 914. The image sensor 914 may include a color filter array (such as a Bayer filter); the image sensor 914 may acquire the light intensity and wavelength information captured by each imaging pixel and provide a set of raw image data that can be processed by the ISP processor 940. A sensor 920 (such as a gyroscope) may provide image-processing parameters (such as anti-shake parameters) to the ISP processor 940 based on the sensor 920 interface type. The sensor 920 interface may be an SMIA (Standard Mobile Imaging Architecture) interface, another serial or parallel camera interface, or a combination of the above interfaces.

In addition, the image sensor 914 may also send the raw image data to the sensor 920; the sensor 920 may provide the raw image data to the ISP processor 940 based on the sensor 920 interface type, or store the raw image data in an image memory 930.
The ISP processor 940 processes the raw image data pixel by pixel in multiple formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 940 may perform one or more image processing operations on the raw image data and collect statistics about the image data. The image processing operations may be performed with the same or different bit-depth precision.

The ISP processor 940 may also receive image data from the image memory 930. For example, the sensor 920 interface sends raw image data to the image memory 930, and the raw image data in the image memory 930 is then provided to the ISP processor 940 for processing. The image memory 930 may be part of a memory device, a storage device, or a separate dedicated memory within the electronic device, and may include DMA (Direct Memory Access) features.
Upon receiving raw image data from the image sensor 914 interface, the sensor 920 interface, or the image memory 930, the ISP processor 940 may perform one or more image processing operations, such as temporal filtering. The processed image data may be sent to the image memory 930 for additional processing before being displayed. The ISP processor 940 receives the processed data from the image memory 930 and performs image data processing on it in the raw domain and in the RGB and YCbCr color spaces. The image data processed by the ISP processor 940 may be output to a display 970 for viewing by a user and/or for further processing by a graphics engine or GPU (Graphics Processing Unit). In addition, the output of the ISP processor 940 may also be sent to the image memory 930, and the display 970 may read image data from the image memory 930. In one embodiment, the image memory 930 may be configured to implement one or more frame buffers. Furthermore, the output of the ISP processor 940 may be sent to an encoder/decoder 960 for encoding/decoding the image data. The encoded image data may be saved and decompressed before being displayed on the display 970. The encoder/decoder 960 may be implemented by a CPU, a GPU, or a coprocessor.

The statistics determined by the ISP processor 940 may be sent to the control logic 950 unit. For example, the statistics may include image sensor 914 statistics such as auto exposure, auto white balance, auto focus, flicker detection, black level compensation, and lens 912 shading correction. The control logic 950 may include a processor and/or microcontroller executing one or more routines (such as firmware); the one or more routines may determine the control parameters of the imaging device 910 and the control parameters of the ISP processor 940 based on the received statistics. For example, the control parameters of the imaging device 910 may include sensor 920 control parameters (such as gain, integration time for exposure control, and anti-shake parameters), camera flash control parameters, lens 912 control parameters (such as focal length for focusing or zooming), or combinations of these parameters. The ISP control parameters may include gain levels and color correction matrices for auto white balance and color adjustment (for example, during RGB processing), as well as lens 912 shading correction parameters.
The electronic device can implement the video processing method described in the embodiments of the present application according to the above image processing technology.

Any reference to memory, storage, a database, or other media used in the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which serves as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent of the present application. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of the present application shall be subject to the appended claims.

Claims (17)

  1. A video processing method, comprising:
    extracting one frame of image every preset number of frames from a video, performing scene recognition on the extracted image, and obtaining a scene label of the image and a corresponding confidence level;
    establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
    determining a video label of the video according to the label frequency histogram.
  2. The method according to claim 1, wherein the establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level comprises:
    taking the confidence level corresponding to the scene label of the image as a weight value of the scene label in the image; and
    establishing the label frequency histogram according to the scene label of the image and the corresponding weight value.
  3. The method according to claim 1, wherein the establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level comprises:
    determining a weight value of the scene label according to the rank of the confidence level of the scene label among the confidence levels of all scene labels of the image; and
    establishing the label frequency histogram according to the scene label of the image and the corresponding weight value.
  4. The method according to claim 3, further comprising:
    when the confidence level of the scene label of the image is the highest among the confidence levels of all scene labels of the image, the weight value corresponding to the scene label being the highest in the image.
  5. The method according to any one of claims 1 to 4, further comprising:
    establishing the label frequency histogram of the video according to scene labels whose confidence levels exceed a threshold and the corresponding confidence levels.
  6. The method according to claim 1, further comprising:
    obtaining location information of the place where the video was shot; and
    adjusting the confidence level corresponding to the scene label in the image according to the location information.
  7. The method according to claim 1, further comprising:
    obtaining the frequency corresponding to the scene label according to the label frequency histogram; and
    taking a preset number of scene labels with the highest frequencies as the video labels of the video.
  8. The method according to claim 1, wherein the performing scene recognition on the extracted image and obtaining a scene label of the image and a corresponding confidence level comprises:
    inputting the extracted image into a neural network;
    performing image scene recognition on the background of the image through a classification network layer of the neural network to obtain the confidence level of each classification label to which the background of the image belongs;
    performing foreground target detection on the image through a target detection network layer of the neural network to obtain the confidence level of each target label to which the foreground target of the image belongs; and
    obtaining the scene label of the image and the corresponding confidence level according to the confidence level of each classification label and the confidence level of each target label.
  9. An electronic device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following operations:
    extracting one frame of image every preset number of frames from a video, performing scene recognition on the extracted image, and obtaining a scene label of the image and a corresponding confidence level;
    establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level; and
    determining a video label of the video according to the label frequency histogram.
  10. The electronic device according to claim 9, wherein, when executing the establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level, the processor further performs the following operations:
    taking the confidence level corresponding to the scene label of the image as a weight value of the scene label in the image; and
    establishing the label frequency histogram according to the scene label of the image and the corresponding weight value.
  11. The electronic device according to claim 9, wherein, when executing the establishing a label frequency histogram according to the scene label of the image and the corresponding confidence level, the processor further performs the following operations:
    determining a weight value of the scene label according to the rank of the confidence level of the scene label among the confidence levels of all scene labels of the image; and
    establishing the label frequency histogram according to the scene label of the image and the corresponding weight value.
  12. The electronic device according to claim 11, wherein the computer program, when executed by the processor, causes the processor to further perform the following operation:
    when the confidence level of the scene label of the image is the highest among the confidence levels of all scene labels of the image, the weight value corresponding to the scene label being the highest in the image.
  13. The electronic device according to any one of claims 9 to 12, wherein the computer program, when executed by the processor, causes the processor to further perform the following operation:
    establishing the label frequency histogram of the video according to scene labels whose confidence levels exceed a threshold and the corresponding confidence levels.
  14. The electronic device according to claim 9, wherein the computer program, when executed by the processor, causes the processor to further perform the following operations:
    obtaining location information of the place where the video was shot; and
    adjusting the confidence level corresponding to the scene label in the image according to the location information.
  15. The electronic device according to claim 9, wherein the computer program, when executed by the processor, causes the processor to further perform the following operations:
    obtaining the frequency corresponding to the scene label according to the label frequency histogram; and
    taking a preset number of scene labels with the highest frequencies as the video labels of the video.
  16. The electronic device according to claim 9, wherein, when executing the performing scene recognition on the extracted image and obtaining a scene label of the image and a corresponding confidence level, the processor further performs the following operations:
    inputting the extracted image into a neural network;
    performing image scene recognition on the background of the image through a classification network layer of the neural network to obtain the confidence level of each classification label to which the background of the image belongs;
    performing foreground target detection on the image through a target detection network layer of the neural network to obtain the confidence level of each target label to which the foreground target of the image belongs; and
    obtaining the scene label of the image and the corresponding confidence level according to the confidence level of each classification label and the confidence level of each target label.
  17. A computer-readable storage medium on which a computer program is stored, wherein, when executed by a processor, the computer program implements the operations of the method according to any one of claims 1 to 8.
PCT/CN2019/087557 2018-06-08 2019-05-20 Video processing method, electronic device, and computer-readable storage medium WO2019233263A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810585928.9A CN108777815B (zh) 2018-06-08 2018-06-08 视频处理方法和装置、电子设备、计算机可读存储介质
CN201810585928.9 2018-06-08

Publications (1)

Publication Number Publication Date
WO2019233263A1 true WO2019233263A1 (zh) 2019-12-12

Family

ID=64024878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087557 WO2019233263A1 (zh) 2018-06-08 2019-05-20 视频处理方法、电子设备、计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN108777815B (zh)
WO (1) WO2019233263A1 (zh)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777815B (zh) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 视频处理方法和装置、电子设备、计算机可读存储介质
CN109815365A (zh) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 用于处理视频的方法和装置
CN109871896B (zh) * 2019-02-26 2022-03-25 北京达佳互联信息技术有限公司 数据分类方法、装置、电子设备及存储介质
CN109960745B (zh) * 2019-03-20 2021-03-23 网易(杭州)网络有限公司 视频分类处理方法及装置、存储介质和电子设备
CN110348367B (zh) * 2019-07-08 2021-06-08 北京字节跳动网络技术有限公司 视频分类方法、视频处理方法、装置、移动终端及介质
CN110378946B (zh) * 2019-07-11 2021-10-01 Oppo广东移动通信有限公司 深度图处理方法、装置以及电子设备
CN110647933B (zh) * 2019-09-20 2023-06-20 北京达佳互联信息技术有限公司 一种视频的分类方法及装置
CN110933462B (zh) * 2019-10-14 2022-03-25 咪咕文化科技有限公司 视频处理方法、系统、电子设备及存储介质
CN111738042A (zh) * 2019-10-25 2020-10-02 北京沃东天骏信息技术有限公司 识别方法、设备及存储介质
CN110826471B (zh) * 2019-11-01 2023-07-14 腾讯科技(深圳)有限公司 视频标签的标注方法、装置、设备及计算机可读存储介质
CN110889012A (zh) * 2019-11-26 2020-03-17 成都品果科技有限公司 一种基于抽帧图片生成空镜标签系统的方法
CN111291800A (zh) * 2020-01-21 2020-06-16 青梧桐有限责任公司 房屋装修类型分析方法、系统、电子设备及可读存储介质
CN111368138A (zh) * 2020-02-10 2020-07-03 北京达佳互联信息技术有限公司 视频类别标签的排序方法、装置、电子设备及存储介质
CN113536823A (zh) * 2020-04-10 2021-10-22 天津职业技术师范大学(中国职业培训指导教师进修中心) 一种基于深度学习的视频场景标签提取系统、方法及其应用
CN111653103A (zh) * 2020-05-07 2020-09-11 浙江大华技术股份有限公司 一种目标对象的识别方法及装置
CN111625716B (zh) * 2020-05-12 2023-10-31 聚好看科技股份有限公司 媒资推荐方法、服务器及显示设备
CN114118114A (zh) * 2020-08-26 2022-03-01 顺丰科技有限公司 一种图像检测方法、装置及其存储介质
CN112948635B (zh) * 2021-02-26 2022-11-08 北京百度网讯科技有限公司 视频分析方法、装置、电子设备及可读存储介质
CN113014992A (zh) * 2021-03-09 2021-06-22 四川长虹电器股份有限公司 智能电视的画质切换方法及装置
CN113469033A (zh) * 2021-06-30 2021-10-01 北京集创北方科技股份有限公司 图像识别方法、装置、电子设备及存储介质
CN113569687B (zh) * 2021-07-20 2023-10-24 上海明略人工智能(集团)有限公司 基于双流网络的场景分类方法、系统、设备及介质
CN113613065B (zh) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 视频编辑方法、装置、电子设备以及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103945088A (zh) * 2013-01-21 2014-07-23 华为终端有限公司 场景识别方法及装置
CN105426883A (zh) * 2015-12-25 2016-03-23 中国科学院深圳先进技术研究院 视频分类快速识别的方法及装置
CN107180074A (zh) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 一种视频分类方法及装置
CN107403618A (zh) * 2017-07-21 2017-11-28 山东师范大学 基于堆叠基稀疏表示的音频事件分类方法及计算机设备
CN108777815A (zh) * 2018-06-08 2018-11-09 Oppo广东移动通信有限公司 视频处理方法和装置、电子设备、计算机可读存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469162A (zh) * 2015-08-18 2017-03-01 中兴通讯股份有限公司 一种图片排序方法和相应的图片存储显示设备
CN107895055A (zh) * 2017-12-21 2018-04-10 儒安科技有限公司 一种照片管理方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103945088A (zh) * 2013-01-21 2014-07-23 华为终端有限公司 场景识别方法及装置
CN105426883A (zh) * 2015-12-25 2016-03-23 中国科学院深圳先进技术研究院 视频分类快速识别的方法及装置
CN107180074A (zh) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 一种视频分类方法及装置
CN107403618A (zh) * 2017-07-21 2017-11-28 山东师范大学 基于堆叠基稀疏表示的音频事件分类方法及计算机设备
CN108777815A (zh) * 2018-06-08 2018-11-09 Oppo广东移动通信有限公司 视频处理方法和装置、电子设备、计算机可读存储介质

Also Published As

Publication number Publication date
CN108777815A (zh) 2018-11-09
CN108777815B (zh) 2021-04-23

Similar Documents

Publication Publication Date Title
WO2019233263A1 (zh) Video processing method, electronic device, computer-readable storage medium
CN108764208B (zh) Image processing method and apparatus, storage medium, electronic device
WO2020001197A1 (zh) Image processing method, electronic device, computer-readable storage medium
WO2019233262A1 (zh) Video processing method, electronic device, computer-readable storage medium
US10990825B2 (en) Image processing method, electronic device and computer readable storage medium
WO2019233393A1 (zh) Image processing method and apparatus, storage medium, electronic device
WO2019233341A1 (zh) Image processing method and apparatus, computer-readable storage medium and computer device
US11233933B2 (en) Method and device for processing image, and mobile terminal
WO2019237887A1 (zh) Image processing method, electronic device, computer-readable storage medium
US10896323B2 (en) Method and device for image processing, computer readable storage medium, and electronic device
WO2019233297A1 (zh) Dataset construction method, mobile terminal, readable storage medium
US20200412937A1 (en) Focusing method and device, electronic device and computer-readable storage medium
WO2019233266A1 (zh) Image processing method, computer-readable storage medium and electronic device
CN108875619B (zh) Video processing method and apparatus, electronic device, computer-readable storage medium
WO2019233392A1 (zh) Image processing method and apparatus, electronic device and computer-readable storage medium
CN108961302B (zh) Image processing method and apparatus, mobile terminal and computer-readable storage medium
WO2019233271A1 (zh) Image processing method, computer-readable storage medium and electronic device
CN109712177B (zh) Image processing method and apparatus, electronic device and computer-readable storage medium
WO2020001196A1 (zh) Image processing method, electronic device, computer-readable storage medium
WO2019233260A1 (zh) Advertisement information pushing method and apparatus, storage medium, electronic device
CN108897786B (zh) Application program recommendation method and apparatus, storage medium and mobile terminal
CN108959462B (zh) Image processing method and apparatus, electronic device, computer-readable storage medium
CN108804658B (zh) Image processing method and apparatus, storage medium, electronic device
WO2019223513A1 (zh) Image recognition method, electronic device and storage medium
CN108848306B (zh) Image processing method and apparatus, electronic device, computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19814019

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19814019

Country of ref document: EP

Kind code of ref document: A1